2015-05-12 09:43:12

by Konstantin Khlebnikov

Subject: [PATCH RFC 0/3] pagemap: make useable for non-privilege users

This patchset makes pagemap usable again in a safe way.
The first patch adds an 'mmap-exclusive' bit, which is set if the page is
mapped only here. The second patch restores access for non-privileged users
but hides the PFN if the task lacks the CAP_SYS_ADMIN capability. The third
patch removes the page-shift bits and completes the migration to the new
pagemap format (the soft-dirty and mmap-exclusive flags are available only
in the new format).

---

Konstantin Khlebnikov (3):
pagemap: add mmap-exclusive bit for marking pages mapped only here
pagemap: hide physical addresses from non-privileged users
pagemap: switch to the new format and do some cleanup


Documentation/vm/pagemap.txt | 3 -
fs/proc/task_mmu.c | 178 +++++++++++++++++-------------------------
tools/vm/page-types.c | 35 ++++----
3 files changed, 91 insertions(+), 125 deletions(-)


2015-05-12 09:43:30

by Konstantin Khlebnikov

Subject: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here

This patch sets bit 56 in pagemap if this page is mapped only once.
It allows detecting exclusively used pages without exposing the PFN:

present file exclusive   state
0       0    0           non-present
1       1    0           file page mapped somewhere else
1       1    1           file page mapped only here
1       0    0           anon non-CoWed page (shared with parent/child)
1       0    1           anon CoWed page (or never forked)

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Link: lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

---

v2:
* handle transparent huge pages
* invert bit and rename shared -> exclusive (less confusing name)
---
Documentation/vm/pagemap.txt | 3 ++-
fs/proc/task_mmu.c | 10 ++++++++++
tools/vm/page-types.c | 12 ++++++++++++
3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..3cfbbb333ea1 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -16,7 +16,8 @@ There are three components to pagemap:
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
- * Bits 56-60 zero
+ * Bit 56 page exclusively mapped
+ * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d013ff..29febec65de4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -982,6 +982,7 @@ struct pagemapread {
#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))

#define __PM_SOFT_DIRTY (1LL)
+#define __PM_MMAP_EXCLUSIVE (2LL)
#define PM_PRESENT PM_STATUS(4LL)
#define PM_SWAP PM_STATUS(2LL)
#define PM_FILE PM_STATUS(1LL)
@@ -1074,6 +1075,8 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,

if (page && !PageAnon(page))
flags |= PM_FILE;
+ if (page && page_mapcount(page) == 1)
+ flags2 |= __PM_MMAP_EXCLUSIVE;
if ((vma->vm_flags & VM_SOFTDIRTY))
flags2 |= __PM_SOFT_DIRTY;

@@ -1119,6 +1122,13 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
else
pmd_flags2 = 0;

+ if (pmd_present(*pmd)) {
+ struct page *page = pmd_page(*pmd);
+
+ if (page_mapcount(page) == 1)
+ pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
+ }
+
for (; addr != end; addr += PAGE_SIZE) {
unsigned long offset;
pagemap_entry_t pme;
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 8bdf16b8ba60..3a9f193526ee 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -70,9 +70,12 @@
#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)

#define __PM_SOFT_DIRTY (1LL)
+#define __PM_MMAP_EXCLUSIVE (2LL)
#define PM_PRESENT PM_STATUS(4LL)
#define PM_SWAP PM_STATUS(2LL)
+#define PM_FILE PM_STATUS(1LL)
#define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
+#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)


/*
@@ -100,6 +103,8 @@
#define KPF_SLOB_FREE 49
#define KPF_SLUB_FROZEN 50
#define KPF_SLUB_DEBUG 51
+#define KPF_FILE 62
+#define KPF_MMAP_EXCLUSIVE 63

#define KPF_ALL_BITS ((uint64_t)~0ULL)
#define KPF_HACKERS_BITS (0xffffULL << 32)
@@ -149,6 +154,9 @@ static const char * const page_flag_names[] = {
[KPF_SLOB_FREE] = "P:slob_free",
[KPF_SLUB_FROZEN] = "A:slub_frozen",
[KPF_SLUB_DEBUG] = "E:slub_debug",
+
+ [KPF_FILE] = "F:file",
+ [KPF_MMAP_EXCLUSIVE] = "1:mmap_exclusive",
};


@@ -452,6 +460,10 @@ static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)

if (pme & PM_SOFT_DIRTY)
flags |= BIT(SOFTDIRTY);
+ if (pme & PM_FILE)
+ flags |= BIT(FILE);
+ if (pme & PM_MMAP_EXCLUSIVE)
+ flags |= BIT(MMAP_EXCLUSIVE);

return flags;
}

2015-05-12 09:43:25

by Konstantin Khlebnikov

Subject: [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users

This patch makes pagemap readable for normal users again but hides physical
addresses from them. For some use cases the PFN isn't required at all: the
flags give information about presence, page type (anon/file/swap), the
soft-dirty mark, and a hint about the page's mapcount state: exclusive
(mapcount == 1) or shared (mapcount > 1).

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
Link: lkml.kernel.org/r/[email protected]
---
fs/proc/task_mmu.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 29febec65de4..0b7a8ffec95f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -962,6 +962,7 @@ struct pagemapread {
int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
pagemap_entry_t *buffer;
bool v2;
+ bool show_pfn;
};

#define PAGEMAP_WALK_SIZE (PMD_SIZE)
@@ -1046,12 +1047,13 @@ out:
static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
- u64 frame, flags;
+ u64 frame = 0, flags;
struct page *page = NULL;
int flags2 = 0;

if (pte_present(pte)) {
- frame = pte_pfn(pte);
+ if (pm->show_pfn)
+ frame = pte_pfn(pte);
flags = PM_PRESENT;
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
@@ -1087,15 +1089,19 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
pmd_t pmd, int offset, int pmd_flags2)
{
+ u64 frame = 0;
+
/*
* Currently pmd for thp is always present because thp can not be
* swapped-out, migrated, or HWPOISONed (split in such cases instead.)
* This if-check is just to prepare for future implementation.
*/
- if (pmd_present(pmd))
- *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
- | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
- else
+ if (pmd_present(pmd)) {
+ if (pm->show_pfn)
+ frame = pmd_pfn(pmd) + offset;
+ *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
+ PM_STATUS2(pm->v2, pmd_flags2));
+ } else
*pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
}
#else
@@ -1171,11 +1177,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
pte_t pte, int offset, int flags2)
{
- if (pte_present(pte))
- *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset) |
- PM_STATUS2(pm->v2, flags2) |
- PM_PRESENT);
- else
+ u64 frame = 0;
+
+ if (pte_present(pte)) {
+ if (pm->show_pfn)
+ frame = pte_pfn(pte) + offset;
+ *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
+ PM_STATUS2(pm->v2, flags2));
+ } else
*pme = make_pme(PM_NOT_PRESENT(pm->v2) |
PM_STATUS2(pm->v2, flags2));
}
@@ -1260,6 +1269,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!count)
goto out_task;

+ /* do not disclose physical addresses: attack vector */
+ pm.show_pfn = capable(CAP_SYS_ADMIN);
pm.v2 = soft_dirty_cleared;
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
@@ -1335,9 +1346,6 @@ out:

static int pagemap_open(struct inode *inode, struct file *file)
{
- /* do not disclose physical addresses: attack vector */
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
"to stop being page-shift some time soon. See the "
"linux/Documentation/vm/pagemap.txt for details.\n");

2015-05-12 09:43:20

by Konstantin Khlebnikov

Subject: [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup

This patch removes the page-shift bits (scheduled for removal since 3.11)
and completes the migration to the new bit layout. It also cleans up the
messy macros.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/proc/task_mmu.c | 152 ++++++++++++++++---------------------------------
tools/vm/page-types.c | 29 +++------
2 files changed, 58 insertions(+), 123 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0b7a8ffec95f..66bc7207ce90 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
.release = proc_map_release,
};

-/*
- * We do not want to have constant page-shift bits sitting in
- * pagemap entries and are about to reuse them some time soon.
- *
- * Here's the "migration strategy":
- * 1. when the system boots these bits remain what they are,
- * but a warning about future change is printed in log;
- * 2. once anyone clears soft-dirty bits via clear_refs file,
- * these flag is set to denote, that user is aware of the
- * new API and those page-shift bits change their meaning.
- * The respective warning is printed in dmesg;
- * 3. In a couple of releases we will remove all the mentions
- * of page-shift in pagemap entries.
- */
-
-static bool soft_dirty_cleared __read_mostly;
-
enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
@@ -887,13 +870,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
return -EINVAL;

- if (type == CLEAR_REFS_SOFT_DIRTY) {
- soft_dirty_cleared = true;
- pr_warn_once("The pagemap bits 55-60 has changed their meaning!"
- " See the linux/Documentation/vm/pagemap.txt for "
- "details.\n");
- }
-
task = get_proc_task(file_inode(file));
if (!task)
return -ESRCH;
@@ -961,38 +937,26 @@ typedef struct {
struct pagemapread {
int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
pagemap_entry_t *buffer;
- bool v2;
bool show_pfn;
};

#define PAGEMAP_WALK_SIZE (PMD_SIZE)
#define PAGEMAP_WALK_MASK (PMD_MASK)

-#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-/* in "new" pagemap pshift bits are occupied with more status bits */
-#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
-
-#define __PM_SOFT_DIRTY (1LL)
-#define __PM_MMAP_EXCLUSIVE (2LL)
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-#define PM_FILE PM_STATUS(1LL)
-#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0)
+#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
+#define PM_PFRAME_BITS 55
+#define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
+#define PM_SOFT_DIRTY BIT_ULL(55)
+#define PM_MMAP_EXCLUSIVE BIT_ULL(56)
+#define PM_FILE BIT_ULL(61)
+#define PM_SWAP BIT_ULL(62)
+#define PM_PRESENT BIT_ULL(63)
+
#define PM_END_OF_BUFFER 1

-static inline pagemap_entry_t make_pme(u64 val)
+static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
{
- return (pagemap_entry_t) { .pme = val };
+ return (pagemap_entry_t) { .pme = (frame & PM_PFRAME_MASK) | flags };
}

static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
@@ -1013,7 +977,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,

while (addr < end) {
struct vm_area_struct *vma = find_vma(walk->mm, addr);
- pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+ pagemap_entry_t pme = make_pme(0, 0);
/* End of address space hole, which we mark as non-present. */
unsigned long hole_end;

@@ -1033,7 +997,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,

/* Addresses in the VMA. */
if (vma->vm_flags & VM_SOFTDIRTY)
- pme.pme |= PM_STATUS2(pm->v2, __PM_SOFT_DIRTY);
+ pme = make_pme(0, PM_SOFT_DIRTY);
for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
err = add_to_pagemap(addr, &pme, pm);
if (err)
@@ -1044,50 +1008,44 @@ out:
return err;
}

-static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
+static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
- u64 frame = 0, flags;
+ u64 frame = 0, flags = 0;
struct page *page = NULL;
- int flags2 = 0;

if (pte_present(pte)) {
if (pm->show_pfn)
frame = pte_pfn(pte);
- flags = PM_PRESENT;
+ flags |= PM_PRESENT;
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_SOFT_DIRTY;
} else if (is_swap_pte(pte)) {
swp_entry_t entry;
if (pte_swp_soft_dirty(pte))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_SOFT_DIRTY;
entry = pte_to_swp_entry(pte);
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
- flags = PM_SWAP;
+ flags |= PM_SWAP;
if (is_migration_entry(entry))
page = migration_entry_to_page(entry);
- } else {
- if (vma->vm_flags & VM_SOFTDIRTY)
- flags2 |= __PM_SOFT_DIRTY;
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
- return;
}

if (page && !PageAnon(page))
flags |= PM_FILE;
if (page && page_mapcount(page) == 1)
- flags2 |= __PM_MMAP_EXCLUSIVE;
- if ((vma->vm_flags & VM_SOFTDIRTY))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_MMAP_EXCLUSIVE;
+ if (vma->vm_flags & VM_SOFTDIRTY)
+ flags |= PM_SOFT_DIRTY;

- *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
+ return make_pme(frame, flags);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+ pmd_t pmd, int offset, u64 flags)
{
u64 frame = 0;

@@ -1099,15 +1057,16 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p
if (pmd_present(pmd)) {
if (pm->show_pfn)
frame = pmd_pfn(pmd) + offset;
- *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
- PM_STATUS2(pm->v2, pmd_flags2));
- } else
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
+ flags |= PM_PRESENT;
+ }
+
+ return make_pme(frame, flags);
}
#else
-static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+ pmd_t pmd, int offset, u64 flags)
{
+ return make_pme(0, 0);
}
#endif

@@ -1121,18 +1080,16 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
int err = 0;

if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- int pmd_flags2;
+ u64 flags = 0;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
- pmd_flags2 = __PM_SOFT_DIRTY;
- else
- pmd_flags2 = 0;
+ flags |= PM_SOFT_DIRTY;

if (pmd_present(*pmd)) {
struct page *page = pmd_page(*pmd);

if (page_mapcount(page) == 1)
- pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
+ flags |= PM_MMAP_EXCLUSIVE;
}

for (; addr != end; addr += PAGE_SIZE) {
@@ -1141,7 +1098,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,

offset = (addr & ~PAGEMAP_WALK_MASK) >>
PAGE_SHIFT;
- thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
+ pme = thp_pmd_to_pagemap_entry(pm, *pmd, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -1161,7 +1118,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
for (; addr < end; pte++, addr += PAGE_SIZE) {
pagemap_entry_t pme;

- pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
+ pme = pte_to_pagemap_entry(pm, vma, addr, *pte);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -1174,19 +1131,18 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pte_t pte, int offset, int flags2)
+static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
+ pte_t pte, int offset, u64 flags)
{
u64 frame = 0;

if (pte_present(pte)) {
if (pm->show_pfn)
frame = pte_pfn(pte) + offset;
- *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
- PM_STATUS2(pm->v2, flags2));
- } else
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) |
- PM_STATUS2(pm->v2, flags2));
+ flags |= PM_PRESENT;
+ }
+
+ return make_pme(frame, flags);
}

/* This function walks within one hugetlb entry in the single call */
@@ -1197,17 +1153,15 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
struct pagemapread *pm = walk->private;
struct vm_area_struct *vma = walk->vma;
int err = 0;
- int flags2;
+ u64 flags = 0;
pagemap_entry_t pme;

if (vma->vm_flags & VM_SOFTDIRTY)
- flags2 = __PM_SOFT_DIRTY;
- else
- flags2 = 0;
+ flags |= PM_SOFT_DIRTY;

for (; addr != end; addr += PAGE_SIZE) {
int offset = (addr & ~hmask) >> PAGE_SHIFT;
- huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
+ pme = huge_pte_to_pagemap_entry(pm, *pte, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
return err;
@@ -1228,7 +1182,9 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
* Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
- * Bits 55-60 page shift (page size = 1<<page shift)
+ * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
+ * Bit 56 page exclusively mapped
+ * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
@@ -1271,7 +1227,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,

/* do not disclose physical addresses: attack vector */
pm.show_pfn = capable(CAP_SYS_ADMIN);
- pm.v2 = soft_dirty_cleared;
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
ret = -ENOMEM;
@@ -1344,18 +1299,9 @@ out:
return ret;
}

-static int pagemap_open(struct inode *inode, struct file *file)
-{
- pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
- "to stop being page-shift some time soon. See the "
- "linux/Documentation/vm/pagemap.txt for details.\n");
- return 0;
-}
-
const struct file_operations proc_pagemap_operations = {
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
- .open = pagemap_open,
};
#endif /* CONFIG_PROC_PAGE_MONITOR */

diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 3a9f193526ee..1fa872e58238 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -57,26 +57,15 @@
* pagemap kernel ABI bits
*/

-#define PM_ENTRY_BYTES sizeof(uint64_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x) (((uint64_t) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-
-#define __PM_SOFT_DIRTY (1LL)
-#define __PM_MMAP_EXCLUSIVE (2LL)
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-#define PM_FILE PM_STATUS(1LL)
-#define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
-#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
-
+#define PM_ENTRY_BYTES 8
+#define PM_PFRAME_BITS 55
+#define PM_PFRAME_MASK ((1LL << PM_PFRAME_BITS) - 1)
+#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
+#define PM_SOFT_DIRTY (1ULL << 55)
+#define PM_MMAP_EXCLUSIVE (1ULL << 56)
+#define PM_FILE (1ULL << 61)
+#define PM_SWAP (1ULL << 62)
+#define PM_PRESENT (1ULL << 63)

/*
* kernel page flags

2015-05-12 10:42:04

by Kirill A. Shutemov

Subject: Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here

On Tue, May 12, 2015 at 12:43:03PM +0300, Konstantin Khlebnikov wrote:
> This patch sets bit 56 in pagemap if this page is mapped only once.
> It allows detecting exclusively used pages without exposing the PFN:
>
> present file exclusive   state
> 0       0    0           non-present
> 1       1    0           file page mapped somewhere else
> 1       1    1           file page mapped only here
> 1       0    0           anon non-CoWed page (shared with parent/child)
> 1       0    1           anon CoWed page (or never forked)

Probably worth noting that file-private pages are anon in this context.

--
Kirill A. Shutemov

2015-05-12 10:55:34

by Kirill A. Shutemov

Subject: Re: [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup

On Tue, May 12, 2015 at 12:43:06PM +0300, Konstantin Khlebnikov wrote:
> This patch removes the page-shift bits (scheduled for removal since 3.11)
> and completes the migration to the new bit layout. It also cleans up the
> messy macros.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> ---
> fs/proc/task_mmu.c | 152 ++++++++++++++++---------------------------------
> tools/vm/page-types.c | 29 +++------
> 2 files changed, 58 insertions(+), 123 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 0b7a8ffec95f..66bc7207ce90 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
> .release = proc_map_release,
> };
>
> -/*
> - * We do not want to have constant page-shift bits sitting in
> - * pagemap entries and are about to reuse them some time soon.
> - *
> - * Here's the "migration strategy":
> - * 1. when the system boots these bits remain what they are,
> - * but a warning about future change is printed in log;
> - * 2. once anyone clears soft-dirty bits via clear_refs file,
> - * these flag is set to denote, that user is aware of the
> - * new API and those page-shift bits change their meaning.
> - * The respective warning is printed in dmesg;
> - * 3. In a couple of releases we will remove all the mentions
> - * of page-shift in pagemap entries.
> - */

Wouldn't it be better to just have v2=1 by default for a couple of releases
to see if anything breaks? That way we can revert easily if a regression is
reported. I guess someone could miss this change coming if they didn't touch
clear_refs.

--
Kirill A. Shutemov

2015-05-12 11:13:34

by Mark Williamson

Subject: Re: [PATCH RFC 0/3] pagemap: make useable for non-privilege users

Hi Konstantin,

Thanks very much for continuing to look at this! It's very much
appreciated. I've been investigating from our end but got caught up
in some gnarly details of our pagemap-consuming code.

I like the approach and it seems like the information you're exposing
will be useful for our application. I'll test the patch and see if it
works for us as-is.

Will follow up with any comments on the individual patches.

Thanks,
Mark

On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> This patchset makes pagemap usable again in a safe way.
> The first patch adds an 'mmap-exclusive' bit, which is set if the page is
> mapped only here. The second patch restores access for non-privileged users
> but hides the PFN if the task lacks the CAP_SYS_ADMIN capability. The third
> patch removes the page-shift bits and completes the migration to the new
> pagemap format (the soft-dirty and mmap-exclusive flags are available only
> in the new format).
>
> ---
>
> Konstantin Khlebnikov (3):
> pagemap: add mmap-exclusive bit for marking pages mapped only here
> pagemap: hide physical addresses from non-privileged users
> pagemap: switch to the new format and do some cleanup
>
>
> Documentation/vm/pagemap.txt | 3 -
> fs/proc/task_mmu.c | 178 +++++++++++++++++-------------------------
> tools/vm/page-types.c | 35 ++++----
> 3 files changed, 91 insertions(+), 125 deletions(-)

2015-05-12 11:22:33

by Mark Williamson

Subject: Re: [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users

Hi Konstantin,

Comments inline...

On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> This patch makes pagemap readable for normal users back but hides physical
> addresses from them. For some use cases PFN isn't required at all: flags
> give information about presence, page type (anon/file/swap), soft-dirty mark,
> and hint about page mapcount state: exclusive(mapcount = 1) or (mapcount > 1).
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
> Link: lkml.kernel.org/r/[email protected]
> ---
> fs/proc/task_mmu.c | 36 ++++++++++++++++++++++--------------
> 1 file changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 29febec65de4..0b7a8ffec95f 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -962,6 +962,7 @@ struct pagemapread {
> int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
> pagemap_entry_t *buffer;
> bool v2;
> + bool show_pfn;
> };
>
> #define PAGEMAP_WALK_SIZE (PMD_SIZE)
> @@ -1046,12 +1047,13 @@ out:
> static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> {
> - u64 frame, flags;
> + u64 frame = 0, flags;
> struct page *page = NULL;
> int flags2 = 0;
>
> if (pte_present(pte)) {
> - frame = pte_pfn(pte);
> + if (pm->show_pfn)
> + frame = pte_pfn(pte);
> flags = PM_PRESENT;
> page = vm_normal_page(vma, addr, pte);
> if (pte_soft_dirty(pte))
> @@ -1087,15 +1089,19 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> pmd_t pmd, int offset, int pmd_flags2)
> {
> + u64 frame = 0;
> +
> /*
> * Currently pmd for thp is always present because thp can not be
> * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
> * This if-check is just to prepare for future implementation.
> */
> - if (pmd_present(pmd))
> - *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
> - | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
> - else
> + if (pmd_present(pmd)) {
> + if (pm->show_pfn)
> + frame = pmd_pfn(pmd) + offset;
> + *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> + PM_STATUS2(pm->v2, pmd_flags2));
> + } else
> *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
> }
> #else
> @@ -1171,11 +1177,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> pte_t pte, int offset, int flags2)
> {
> - if (pte_present(pte))
> - *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset) |
> - PM_STATUS2(pm->v2, flags2) |
> - PM_PRESENT);
> - else
> + u64 frame = 0;
> +
> + if (pte_present(pte)) {
> + if (pm->show_pfn)
> + frame = pte_pfn(pte) + offset;
> + *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> + PM_STATUS2(pm->v2, flags2));
> + } else
> *pme = make_pme(PM_NOT_PRESENT(pm->v2) |
> PM_STATUS2(pm->v2, flags2));
> }
> @@ -1260,6 +1269,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> if (!count)
> goto out_task;
>
> + /* do not disclose physical addresses: attack vector */
> + pm.show_pfn = capable(CAP_SYS_ADMIN);

If I understood correctly, Linus recommended to me that we use the
open-time capabilities of the file descriptor rather than the current
capability state (to mitigate against an attacker passing an FD to a
setuid process, I think).

FWIW, I knocked up a quick internal patch (less comprehensive than
yours!) and used file_ns_capable() successfully, i.e:
pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);

It looked promising to me, but I've not done the checking to verify
that this is strictly correct; the capabilities code is not an area
of the kernel I'm familiar with.

> pm.v2 = soft_dirty_cleared;
> pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
> @@ -1335,9 +1346,6 @@ out:
>
> static int pagemap_open(struct inode *inode, struct file *file)
> {
> - /* do not disclose physical addresses: attack vector */
> - if (!capable(CAP_SYS_ADMIN))
> - return -EPERM;
> pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
> "to stop being page-shift some time soon. See the "
> "linux/Documentation/vm/pagemap.txt for details.\n");
>

No other comments on this, looks like it would help us.

Thanks,
Mark

2015-05-12 12:05:50

by Mark Williamson

Subject: Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here

Hi Konstantin,

I hope you won't mind me thinking out loud here on the idea of adding
a flag to the v2 pagemap fields... From a kernel PoV, I agree that
this seems like the cleanest approach. However, with my application
developer hat on:

1. I was hoping we'd be able to backport a compatible fix to older
kernels that might adopt the pagemap permissions change. Using the V2
format flags rules out doing this for kernels that are too old to have
soft-dirty, I think.
2. From our software's PoV, I feel it's worth noting that it doesn't
strictly fix ABI compatibility, though I realise that's probably not
your primary concern here. We'll need to modify our code to write the
clear_refs file but that change is OK for us if it's the preferred
solution.

In the patches I've been playing with, I was considering putting the
Exclusive flag in the now-unused PFN field of the pagemap entries.
Since we're specifically trying to work around for the lack of PFN
information, would there be any appetite for mirroring this flag
unconditionally into the now-empty PFN field (i.e. whether using v1 or
v2 flags) when accessed by an unprivileged process?

I realise it's ugly from a kernel PoV and I feel a little bad for
suggesting it - but it would address points 1 and 2 for us (our
existing code just looks for changes in the pagemap entry, so sticking
the flag in there would cause it to do the right thing).

I'm sorry to raise application-specific issues at this point; I
appreciate that your primary concern is to improve the kernel and
technically I like the approach that you've taken! I'll try and
provide more code-oriented feedback once I've tried out the changes.

Thanks,
Mark

On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> This patch sets bit 56 in pagemap if this page is mapped only once.
> It allows detecting exclusively used pages without exposing the PFN:
>
> present file exclusive   state
> 0       0    0           non-present
> 1       1    0           file page mapped somewhere else
> 1       1    1           file page mapped only here
> 1       0    0           anon non-CoWed page (shared with parent/child)
> 1       0    1           anon CoWed page (or never forked)
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Link: lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com
>
> ---
>
> v2:
> * handle transparent huge pages
> * invert bit and rename shared -> exclusive (less confusing name)
> ---
> Documentation/vm/pagemap.txt | 3 ++-
> fs/proc/task_mmu.c | 10 ++++++++++
> tools/vm/page-types.c | 12 ++++++++++++
> 3 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index 6bfbc172cdb9..3cfbbb333ea1 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -16,7 +16,8 @@ There are three components to pagemap:
> * Bits 0-4 swap type if swapped
> * Bits 5-54 swap offset if swapped
> * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
> - * Bits 56-60 zero
> + * Bit 56 page exclusively mapped
> + * Bits 57-60 zero
> * Bit 61 page is file-page or shared-anon
> * Bit 62 page swapped
> * Bit 63 page present
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6dee68d013ff..29febec65de4 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -982,6 +982,7 @@ struct pagemapread {
> #define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
>
> #define __PM_SOFT_DIRTY (1LL)
> +#define __PM_MMAP_EXCLUSIVE (2LL)
> #define PM_PRESENT PM_STATUS(4LL)
> #define PM_SWAP PM_STATUS(2LL)
> #define PM_FILE PM_STATUS(1LL)
> @@ -1074,6 +1075,8 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
>
> if (page && !PageAnon(page))
> flags |= PM_FILE;
> + if (page && page_mapcount(page) == 1)
> + flags2 |= __PM_MMAP_EXCLUSIVE;
> if ((vma->vm_flags & VM_SOFTDIRTY))
> flags2 |= __PM_SOFT_DIRTY;
>
> @@ -1119,6 +1122,13 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> else
> pmd_flags2 = 0;
>
> + if (pmd_present(*pmd)) {
> + struct page *page = pmd_page(*pmd);
> +
> + if (page_mapcount(page) == 1)
> + pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
> + }
> +
> for (; addr != end; addr += PAGE_SIZE) {
> unsigned long offset;
> pagemap_entry_t pme;
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 8bdf16b8ba60..3a9f193526ee 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -70,9 +70,12 @@
> #define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
>
> #define __PM_SOFT_DIRTY (1LL)
> +#define __PM_MMAP_EXCLUSIVE (2LL)
> #define PM_PRESENT PM_STATUS(4LL)
> #define PM_SWAP PM_STATUS(2LL)
> +#define PM_FILE PM_STATUS(1LL)
> #define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
> +#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
>
>
> /*
> @@ -100,6 +103,8 @@
> #define KPF_SLOB_FREE 49
> #define KPF_SLUB_FROZEN 50
> #define KPF_SLUB_DEBUG 51
> +#define KPF_FILE 62
> +#define KPF_MMAP_EXCLUSIVE 63
>
> #define KPF_ALL_BITS ((uint64_t)~0ULL)
> #define KPF_HACKERS_BITS (0xffffULL << 32)
> @@ -149,6 +154,9 @@ static const char * const page_flag_names[] = {
> [KPF_SLOB_FREE] = "P:slob_free",
> [KPF_SLUB_FROZEN] = "A:slub_frozen",
> [KPF_SLUB_DEBUG] = "E:slub_debug",
> +
> + [KPF_FILE] = "F:file",
> + [KPF_MMAP_EXCLUSIVE] = "1:mmap_exclusive",
> };
>
>
> @@ -452,6 +460,10 @@ static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
>
> if (pme & PM_SOFT_DIRTY)
> flags |= BIT(SOFTDIRTY);
> + if (pme & PM_FILE)
> + flags |= BIT(FILE);
> + if (pme & PM_MMAP_EXCLUSIVE)
> + flags |= BIT(MMAP_EXCLUSIVE);
>
> return flags;
> }
>

2015-05-12 15:06:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users

On Tue, May 12, 2015 at 2:43 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> @@ -1260,6 +1269,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> if (!count)
> goto out_task;
>
> + /* do not disclose physical addresses: attack vector */
> + pm.show_pfn = capable(CAP_SYS_ADMIN);
> pm.v2 = soft_dirty_cleared;
> pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);

NO! Dammit, no, no, no!

How many times must people do this major security faux-pas before we learn?

WE DO NOT CHECK CURRENT CAPABILITIES AT READ/WRITE TIME!

It's a bug. It's a security issue. It's not how Unix capabilities work!

Capabilities are checked at open time:

> @@ -1335,9 +1346,6 @@ out:
>
> static int pagemap_open(struct inode *inode, struct file *file)
> {
> - /* do not disclose physical addresses: attack vector */
> - if (!capable(CAP_SYS_ADMIN))
> - return -EPERM;

THIS is where you are supposed to check for capabilities. The place
where you removed it!

The reason we check capabilities at open time, and open time ONLY is
because that is really very integral to the whole Unix security model.
Otherwise, you get into this situation:

- unprivileged process opens file

- unprivileged process tricks suid process to do the actual access for it

where the traditional model is to just force a "write()" by opening
the file as stderr, and then executing a suid process (traditionally
"sudo") that writes an error message to it.

So *don't* do permission checks using read/write time credentials.
They are wrong.

Now, if there is some reason that you really can't do it when opening
the file, and you actually need to use capability information at
read/write time, you use the "file->f_cred" field, which is the
open-time capabilities. So you _can_ do permission checks at
read/write time, but you have to use the credentials of the opener,
not "current".

So in this case, I guess you could use

pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);

if you really need to do this at read time, and cannot fill in that
"show_pfn" at open-time.

Linus

2015-05-12 15:41:40

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users

On 12.05.2015 18:06, Linus Torvalds wrote:
> On Tue, May 12, 2015 at 2:43 AM, Konstantin Khlebnikov
> <[email protected]> wrote:
>> @@ -1260,6 +1269,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
>> if (!count)
>> goto out_task;
>>
>> + /* do not disclose physical addresses: attack vector */
>> + pm.show_pfn = capable(CAP_SYS_ADMIN);
>> pm.v2 = soft_dirty_cleared;
>> pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>> pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
>
> NO! Dammit, no, no, no!
>
> How many times must people do this major security faux-pas before we learn?

Oops. Sorry. I guess everybody must make that mistake at least once.
That's my first time. =)


So, in this case the existing call of mm_access() from pagemap_read()
is a bug too, because it checks CAP_SYS_PTRACE against the current task.

I'll rework it in the same way as /proc/*/[s]maps.

>
> WE DO NOT CHECK CURRENT CAPABILITIES AT READ/WRITE TIME!
>
> It's a bug. It's a security issue. It's not how Unix capabilities work!
>
> Capabilities are checked at open time:
>
>> @@ -1335,9 +1346,6 @@ out:
>>
>> static int pagemap_open(struct inode *inode, struct file *file)
>> {
>> - /* do not disclose physical addresses: attack vector */
>> - if (!capable(CAP_SYS_ADMIN))
>> - return -EPERM;
>
> THIS is where you are supposed to check for capabilities. The place
> where you removed it!
>
> The reason we check capabilities at open time, and open time ONLY is
> because that is really very integral to the whole Unix security model.
> Otherwise, you get into this situation:
>
> - unprivileged process opens file
>
> - unprivileged process tricks suid process to do the actual access for it
>
> where the traditional model is to just force a "write()" by opening
> the file as stderr, and then executing a suid process (traditionally
> "sudo") that writes an error message to it.
>
> So *don't* do permission checks using read/write time credentials.
> They are wrong.
>
> Now, if there is some reason that you really can't do it when opening
> the file, and you actually need to use capability information at
> read/write time, you use the "file->f_cred" field, which is the
> open-time capabilities. So you _can_ do permission checks at
> read/write time, but you have to use the credentials of the opener,
> not "current".
>
> So in this case, I guess you could use
>
> pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
>
> if you really need to do this at read time, and cannot fill in that
> "show_pfn" at open-time.
>
> Linus
>


--
Konstantin

2015-05-13 10:51:35

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here

On 12.05.2015 15:05, Mark Williamson wrote:
> Hi Konstantin,
>
> I hope you won't mind me thinking out loud here on the idea of adding
> a flag to the v2 pagemap fields... From a kernel PoV, I agree that
> this seems like the cleanest approach. However, with my application
> developer hat on:
>
> 1. I was hoping we'd be able to backport a compatible fix to older
> kernels that might adopt the pagemap permissions change. Using the V2
> format flags rules out doing this for kernels that are too old to have
> soft-dirty, I think.
>
> 2. From our software's PoV, I feel it's worth noting that it doesn't
> strictly fix ABI compatibility, though I realise that's probably not
> your primary concern here. We'll need to modify our code to write the
> clear_refs file but that change is OK for us if it's the preferred
> solution.
>
> In the patches I've been playing with, I was considering putting the
> Exclusive flag in the now-unused PFN field of the pagemap entries.
> Since we're specifically trying to work around the lack of PFN
> information, would there be any appetite for mirroring this flag
> unconditionally into the now-empty PFN field (i.e. whether using v1 or
> v2 flags) when accessed by an unprivileged process?
>
> I realise it's ugly from a kernel PoV and I feel a little bad for
> suggesting it - but it would address points 1 and 2 for us (our
> existing code just looks for changes in the pagemap entry, so sticking
> the flag in there would cause it to do the right thing).
>
> I'm sorry to raise application-specific issues at this point; I
> appreciate that your primary concern is to improve the kernel and
> technically I like the approach that you've taken! I'll try and
> provide more code-oriented feedback once I've tried out the changes.

I'd prefer to backport the v2 format (except the soft-dirty bit and
clear_refs) into older kernels. The page-shift bits are barely used,
so nobody will see the difference.

--
Konstantin

2015-05-13 10:59:51

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here

On 12.05.2015 13:40, Kirill A. Shutemov wrote:
> On Tue, May 12, 2015 at 12:43:03PM +0300, Konstantin Khlebnikov wrote:
>> This patch sets bit 56 in pagemap if this page is mapped only once.
>> It allows to detect exclusively used pages without exposing PFN:
>>
>> present file exclusive state
>> 0 0 0 non-present
>> 1 1 0 file page mapped somewhere else
>> 1 1 1 file page mapped only here
>> 1 0 0 anon non-CoWed page (shared with parent/child)
>> 1 0 1 anon CoWed page (or never forked)
>
> Probably, worth noting that file-private pages are anon in this context.
>

You mean there's another kind of CoW page? Yep, but from the kernel's
point of view these pages are the same. Anyway, userspace could look
into /proc/*/maps and see whether there is any file behind the anon vma.

--
Konstantin

2015-05-13 11:40:18

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup

On 12.05.2015 13:54, Kirill A. Shutemov wrote:
> On Tue, May 12, 2015 at 12:43:06PM +0300, Konstantin Khlebnikov wrote:
>> This patch removes page-shift bits (scheduled to remove since 3.11) and
>> completes migration to the new bit layout. Also it cleans messy macro.
>>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>> ---
>> fs/proc/task_mmu.c | 152 ++++++++++++++++---------------------------------
>> tools/vm/page-types.c | 29 +++------
>> 2 files changed, 58 insertions(+), 123 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 0b7a8ffec95f..66bc7207ce90 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
>> .release = proc_map_release,
>> };
>>
>> -/*
>> - * We do not want to have constant page-shift bits sitting in
>> - * pagemap entries and are about to reuse them some time soon.
>> - *
>> - * Here's the "migration strategy":
>> - * 1. when the system boots these bits remain what they are,
>> - * but a warning about future change is printed in log;
>> - * 2. once anyone clears soft-dirty bits via clear_refs file,
>> - * these flag is set to denote, that user is aware of the
>> - * new API and those page-shift bits change their meaning.
>> - * The respective warning is printed in dmesg;
>> - * 3. In a couple of releases we will remove all the mentions
>> - * of page-shift in pagemap entries.
>> - */
>
> Wouldn't it be better to just have v2=1 by default for couple releases to
> see if anything breaks? This way we can revert easily if regression reported.
> I guess someone could miss this change coming if he didn't touch clear_refs.
>

I don't believe the constant PAGE_SHIFT bits are used by anybody.
The recent change of permissions was much more destructive, and there
was just one report about that. The kernel has printed a message at
the first pagemap open for ten releases. I think that's enough.

--
Konstantin

2015-05-14 18:40:35

by Mark Williamson

[permalink] [raw]
Subject: Re: [PATCH RFC 0/3] pagemap: make useable for non-privilege users

Hi Konstantin,

I modified our code to check for the map-exclusive flag where it used
to compare pageframe numbers. First tests look pretty promising, so
this patch looks like a viable approach for us.

Is there anything further we can do to help?

Thanks,
Mark

On Tue, May 12, 2015 at 12:13 PM, Mark Williamson
<[email protected]> wrote:
> Hi Konstantin,
>
> Thanks very much for continuing to look at this! It's very much
> appreciated. I've been investigating from our end but got caught up
> in some gnarly details of our pagemap-consuming code.
>
> I like the approach and it seems like the information you're exposing
> will be useful for our application. I'll test the patch and see if it
> works for us as-is.
>
> Will follow up with any comments on the individual patches.
>
> Thanks,
> Mark
>
> On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
> <[email protected]> wrote:
>> This patchset tries to make pagemap useable again in the safe way.
>> First patch adds bit 'map-exlusive' which is set if page is mapped only here.
>> Second patch restores access for non-privileged users but hides pfn if task
>> has no capability CAP_SYS_ADMIN. Third patch removes page-shift bits and
>> completes migration to the new pagemap format (flags soft-dirty and
>> mmap-exlusive are available only in the new format).
>>
>> ---
>>
>> Konstantin Khlebnikov (3):
>> pagemap: add mmap-exclusive bit for marking pages mapped only here
>> pagemap: hide physical addresses from non-privileged users
>> pagemap: switch to the new format and do some cleanup
>>
>>
>> Documentation/vm/pagemap.txt | 3 -
>> fs/proc/task_mmu.c | 178 +++++++++++++++++-------------------------
>> tools/vm/page-types.c | 35 ++++----
>> 3 files changed, 91 insertions(+), 125 deletions(-)

2015-05-14 18:50:34

by Mark Williamson

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here

Hi Konstantin,

On Wed, May 13, 2015 at 11:51 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> On 12.05.2015 15:05, Mark Williamson wrote:
<snip>
>> 1. I was hoping we'd be able to backport a compatible fix to older
>> kernels that might adopt the pagemap permissions change. Using the V2
>> format flags rules out doing this for kernels that are too old to have
>> soft-dirty, I think.
>>
>> 2. From our software's PoV, I feel it's worth noting that it doesn't
>> strictly fix ABI compatibility, though I realise that's probably not
>> your primary concern here. We'll need to modify our code to write the
>> clear_refs file but that change is OK for us if it's the preferred
>> solution.
<snip>
> I'd prefer to backport the v2 format (except the soft-dirty bit and
> clear_refs) into older kernels. The page-shift bits are barely used,
> so nobody will see the difference.

My concern was whether a change to format would be acceptable to
include in the various -stable kernels; they are already including the
additional protections on pagemap, so we're starting to need our
fallback mode in distributions. Do you think that such a patch would
be acceptable there?

(As an application vendor we're likely to be particularly stuck with
what the commercial distributions decide to ship, which is why I'm
trying to keep an eye on this)

I appreciate that this is a slightly administrative concern! I
definitely like the technical approach of this code and it seems to
work fine for us.

Thanks,
Mark

2015-05-15 09:39:16

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here

On 14.05.2015 21:50, Mark Williamson wrote:
> Hi Konstantin,
>
> On Wed, May 13, 2015 at 11:51 AM, Konstantin Khlebnikov
> <[email protected]> wrote:
>> On 12.05.2015 15:05, Mark Williamson wrote:
> <snip>
>>> 1. I was hoping we'd be able to backport a compatible fix to older
>>> kernels that might adopt the pagemap permissions change. Using the V2
>>> format flags rules out doing this for kernels that are too old to have
>>> soft-dirty, I think.
>>>
>>> 2. From our software's PoV, I feel it's worth noting that it doesn't
>>> strictly fix ABI compatibility, though I realise that's probably not
>>> your primary concern here. We'll need to modify our code to write the
>>> clear_refs file but that change is OK for us if it's the preferred
>>> solution.
> <snip>
>> I'd prefer to backport the v2 format (except the soft-dirty bit and
>> clear_refs) into older kernels. The page-shift bits are barely used,
>> so nobody will see the difference.
>
> My concern was whether a change to format would be acceptable to
> include in the various -stable kernels; they are already including the
> additional protections on pagemap, so we're starting to need our
> fallback mode in distributions. Do you think that such a patch would
> be acceptable there?
>
> (As an application vendor we're likely to be particularly stuck with
> what the commercial distributions decide to ship, which is why I'm
> trying to keep an eye on this)
>
> I appreciate that this is a slightly administrative concern! I
> definitely like the technical approach of this code and it seems to
> work fine for us.

I can't guarantee that the v2 format will be accepted into stable
kernels and into distributions. I'm not the gatekeeper.

As a fallback, you could probably invent some kind of suid helper
which gives you access to the required information without exposing
PFNs. For example: it takes pids and memory ranges as arguments and
prints a bitmap of CoWed pages to stdout.

--
Konstantin

2015-06-08 12:53:46

by Mark Williamson

[permalink] [raw]
Subject: Re: [PATCH RFC 0/3] pagemap: make useable for non-privilege users

Hi Konstantin,

Are you still intending to re-submit this patch? We'd be quite
keen to assist, so if there's anything further I can do, please let me
know!

Just to re-confirm - we do think that the patch will solve our problem
(relatively minor changes required on our side).

Thanks,
Mark

On Thu, May 14, 2015 at 7:40 PM, Mark Williamson
<[email protected]> wrote:
> Hi Konstantin,
>
> I modified our code to check for the map-exclusive flag where it used
> to compare pageframe numbers. First tests look pretty promising, so
> this patch looks like a viable approach for us.
>
> Is there anything further we can do to help?
>
> Thanks,
> Mark
>
> On Tue, May 12, 2015 at 12:13 PM, Mark Williamson
> <[email protected]> wrote:
>> Hi Konstantin,
>>
>> Thanks very much for continuing to look at this! It's very much
>> appreciated. I've been investigating from our end but got caught up
>> in some gnarly details of our pagemap-consuming code.
>>
>> I like the approach and it seems like the information you're exposing
>> will be useful for our application. I'll test the patch and see if it
>> works for us as-is.
>>
>> Will follow up with any comments on the individual patches.
>>
>> Thanks,
>> Mark
>>
>> On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
>> <[email protected]> wrote:
>>> This patchset tries to make pagemap useable again in the safe way.
>>> First patch adds bit 'map-exlusive' which is set if page is mapped only here.
>>> Second patch restores access for non-privileged users but hides pfn if task
>>> has no capability CAP_SYS_ADMIN. Third patch removes page-shift bits and
>>> completes migration to the new pagemap format (flags soft-dirty and
>>> mmap-exlusive are available only in the new format).
>>>
>>> ---
>>>
>>> Konstantin Khlebnikov (3):
>>> pagemap: add mmap-exclusive bit for marking pages mapped only here
>>> pagemap: hide physical addresses from non-privileged users
>>> pagemap: switch to the new format and do some cleanup
>>>
>>>
>>> Documentation/vm/pagemap.txt | 3 -
>>> fs/proc/task_mmu.c | 178 +++++++++++++++++-------------------------
>>> tools/vm/page-types.c | 35 ++++----
>>> 3 files changed, 91 insertions(+), 125 deletions(-)