2019-07-23 04:11:29

by Andrew Morton

Subject: Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

On Mon, 22 Jul 2019 17:32:04 -0400 "Joel Fernandes (Google)" <[email protected]> wrote:

> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> This is quite cumbersome and can be error-prone too. If between
> accessing the per-PID pagemap and the global page_idle bitmap, if
> something changes with the page then the information is not accurate.

Well, it's never going to be "accurate" - something could change one
nanosecond after userspace has read the data...

Presumably with this approach the data will be "more" accurate. How
big a problem has this inaccuracy proven to be in real-world usage?

> More over looking up PFN from pagemap in Android devices is not
> supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> the PFN.
>
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file. This
> eliminates the need for userspace to calculate the mapping of the page.
> It follows the exact same semantics as the global
> /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> where looking up PFN is not needed and also does not require SYS_ADMIN.
> It ended up simplifying userspace code, solving the security issue
> mentioned and works quite well. SELinux does not need to be turned off
> since no pagemap look up is needed.
>
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time.
>
> Documentation material:
> The idle page tracking API for virtual address indexing using virtual page
> frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> except that it uses virtual instead of physical frame numbers.
>
> This idle page tracking API can be simpler to use than physical address
> indexing, since the pagemap for a process does not need to be looked up
> to mark or read a page's idle bit. It is also more accurate than
> physical address indexing since in physical address indexing, address
> space changes can occur between reading the pagemap and reading the
> bitmap. In virtual address indexing, the process's mmap_sem is held for
> the duration of the access.
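
A minimal userspace sketch of the indexing described above, assuming the same 8-byte chunk layout as the quoted BITMAP_CHUNK_SIZE and BITMAP_CHUNK_BITS. The helper name idle_bit_for_addr() and the struct are hypothetical and not part of the patch; the sketch only shows which file offset and bit a given virtual address would map to.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BYTES 8u                  /* sizeof(u64), like BITMAP_CHUNK_SIZE */
#define CHUNK_BITS  (CHUNK_BYTES * 8u)  /* like BITMAP_CHUNK_BITS */

/* Hypothetical result type: where the idle bit for one page lives. */
struct idle_bit_pos {
	uint64_t offset;   /* byte offset of the 8-byte chunk in the file */
	unsigned int bit;  /* bit index inside that chunk */
};

/*
 * Hypothetical helper: map a virtual address to its chunk offset and bit.
 * The virtual frame number (VFN) is the address shifted by the page shift;
 * each 8-byte chunk of the bitmap covers 64 consecutive frames.
 */
static struct idle_bit_pos idle_bit_for_addr(uint64_t addr,
					     unsigned int page_shift)
{
	assert(page_shift < 64);

	uint64_t vfn = addr >> page_shift;
	struct idle_bit_pos p = {
		.offset = (vfn / CHUNK_BITS) * CHUNK_BYTES,
		.bit = (unsigned int)(vfn % CHUNK_BITS),
	};
	return p;
}
```

For example, with 4 KiB pages (page_shift = 12), address 0x12000 is virtual frame 18, so its idle bit would be bit 18 of the chunk at file offset 0.
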
>
> ...
>
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -11,6 +11,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/page_ext.h>
> #include <linux/page_idle.h>
> +#include <linux/sched/mm.h>
>
> #define BITMAP_CHUNK_SIZE sizeof(u64)
> #define BITMAP_CHUNK_BITS (BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> @@ -28,15 +29,12 @@
> *
> * This function tries to get a user memory page by pfn as described above.
> */

Above comment needs updating or moving?

> -static struct page *page_idle_get_page(unsigned long pfn)
> +static struct page *page_idle_get_page(struct page *page_in)
> {
> struct page *page;
> pg_data_t *pgdat;
>
> - if (!pfn_valid(pfn))
> - return NULL;
> -
> - page = pfn_to_page(pfn);
> + page = page_in;
> if (!page || !PageLRU(page) ||
> !get_page_unless_zero(page))
> return NULL;
>
> ...
>
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> + unsigned long *start, unsigned long *end)
> +{
> + unsigned long max_frame;
> +
> + /* If an mm is not given, assume we want physical frames */
> + max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> + if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> + return -EINVAL;
> +
> + *start = pos * BITS_PER_BYTE;
> + if (*start >= max_frame)
> + return -ENXIO;

ENXIO is said to mean "The system tried to use the device represented by a
file you specified, and it couldn't find the device. This can mean that
the device file was installed incorrectly, or that the physical device
is missing or not correctly attached to the computer."

This doesn't seem appropriate in this usage and is hence possibly
misleading. Someone whose application fails with ENXIO will be
scratching their heads.

> + *end = *start + count * BITS_PER_BYTE;
> + if (*end > max_frame)
> + *end = max_frame;
> + return 0;
> +}
> +
>
> ...
>
> +static void add_page_idle_list(struct page *page,
> + unsigned long addr, struct mm_walk *walk)
> +{
> + struct page *page_get;
> + struct page_node *pn;
> + int bit;
> + unsigned long frames;
> + struct page_idle_proc_priv *priv = walk->private;
> + u64 *chunk = (u64 *)priv->buffer;
> +
> + if (priv->write) {
> + /* Find whether this page was asked to be marked */
> + frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> + bit = frames % BITMAP_CHUNK_BITS;
> + chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> + if (((*chunk >> bit) & 1) == 0)
> + return;
> + }
> +
> + page_get = page_idle_get_page(page);
> + if (!page_get)
> + return;
> +
> + pn = kmalloc(sizeof(*pn), GFP_ATOMIC);

I'm not liking this GFP_ATOMIC. If I'm reading the code correctly,
userspace can ask for an arbitrarily large number of GFP_ATOMIC
allocations by doing a large read. This can potentially exhaust page
reserves which things like networking Rx interrupts need and can make
this whole feature less reliable.

> + if (!pn)
> + return;
> +
> + pn->page = page_get;
> + pn->addr = addr;
> + list_add(&pn->list, &idle_page_list);
> +}
> +
> +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end,
> + struct mm_walk *walk)
> +{
> + struct vm_area_struct *vma = walk->vma;
> + pte_t *pte;
> + spinlock_t *ptl;
> + struct page *page;
> +
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (ptl) {
> + if (pmd_present(*pmd)) {
> + page = follow_trans_huge_pmd(vma, addr, pmd,
> + FOLL_DUMP|FOLL_WRITE);
> + if (!IS_ERR_OR_NULL(page))
> + add_page_idle_list(page, addr, walk);
> + }
> + spin_unlock(ptl);
> + return 0;
> + }
> +
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> + for (; addr != end; pte++, addr += PAGE_SIZE) {
> + if (!pte_present(*pte))
> + continue;
> +
> + page = vm_normal_page(vma, addr, *pte);
> + if (page)
> + add_page_idle_list(page, addr, walk);
> + }
> +
> + pte_unmap_unlock(pte - 1, ptl);
> + return 0;
> +}
> +
> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> + size_t count, loff_t *pos,
> + struct task_struct *tsk, int write)
> +{
> + int ret;
> + char *buffer;
> + u64 *out;
> + unsigned long start_addr, end_addr, start_frame, end_frame;
> + struct mm_struct *mm = file->private_data;
> + struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> + struct page_node *cur, *next;
> + struct page_idle_proc_priv priv;
> + bool walk_error = false;
> +
> + if (!mm || !mmget_not_zero(mm))
> + return -EINVAL;
> +
> + if (count > PAGE_SIZE)
> + count = PAGE_SIZE;
> +
> + buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!buffer) {
> + ret = -ENOMEM;
> + goto out_mmput;
> + }
> + out = (u64 *)buffer;
> +
> + if (write && copy_from_user(buffer, ubuff, count)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> + if (ret)
> + goto out;
> +
> + start_addr = (start_frame << PAGE_SHIFT);
> + end_addr = (end_frame << PAGE_SHIFT);
> + priv.buffer = buffer;
> + priv.start_addr = start_addr;
> + priv.write = write;
> + walk.private = &priv;
> + walk.mm = mm;
> +
> + down_read(&mm->mmap_sem);
> +
> + /*
> + * Protects the idle_page_list which is needed because
> + * walk_page_vma() holds ptlock which deadlocks with
> + * page_idle_clear_pte_refs(). So we have to collect all
> + * pages first, and then call page_idle_clear_pte_refs().
> + */
> + spin_lock(&idle_page_list_lock);
> + ret = walk_page_range(start_addr, end_addr, &walk);
> + if (ret)
> + walk_error = true;
> +
> + list_for_each_entry_safe(cur, next, &idle_page_list, list) {
> + int bit, index;
> + unsigned long off;
> + struct page *page = cur->page;
> +
> + if (unlikely(walk_error))
> + goto remove_page;
> +
> + if (write) {
> + page_idle_clear_pte_refs(page);
> + set_page_idle(page);
> + } else {
> + if (page_really_idle(page)) {
> + off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> + bit = off % BITMAP_CHUNK_BITS;
> + index = off / BITMAP_CHUNK_BITS;
> + out[index] |= 1ULL << bit;
> + }
> + }
> +remove_page:
> + put_page(page);
> + list_del(&cur->list);
> + kfree(cur);
> + }
> + spin_unlock(&idle_page_list_lock);
> +
> + if (!write && !walk_error)
> + ret = copy_to_user(ubuff, buffer, count);
> +
> + up_read(&mm->mmap_sem);
> +out:
> + kfree(buffer);
> +out_mmput:
> + mmput(mm);
> + if (!ret)
> + ret = count;
> + return ret;
> +
> +}
> +
> +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> + size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> + return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> +}
> +
> +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> + size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> + return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> +}
> +
> static int __init page_idle_init(void)
> {
> int err;
>
> + INIT_LIST_HEAD(&idle_page_list);
> +
> err = sysfs_create_group(mm_kobj, &page_idle_attr_group);
> if (err) {
> pr_err("page_idle: register sysfs failed\n");
> --
>
> ...
>


2019-07-24 00:16:53

by Joel Fernandes

Subject: Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

On Mon, Jul 22, 2019 at 03:06:39PM -0700, Andrew Morton wrote:
> On Mon, 22 Jul 2019 17:32:04 -0400 "Joel Fernandes (Google)" <[email protected]> wrote:
>
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
>
> Well, it's never going to be "accurate" - something could change one
> nanosecond after userspace has read the data...
>
> Presumably with this approach the data will be "more" accurate. How
> big a problem has this inaccuracy proven to be in real-world usage?

It has proven to be quite a thorn. But the security issue is the main problem..

> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.

..as mentioned here.

I should have emphasized the security issue more; I will do so in the next
revision.

> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > It ended up simplifying userspace code, solving the security issue
> > mentioned and works quite well. SELinux does not need to be turned off
> > since no pagemap look up is needed.
> >
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time.
> >
> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> >
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> >
> > ...
> >
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -11,6 +11,7 @@
> > #include <linux/mmu_notifier.h>
> > #include <linux/page_ext.h>
> > #include <linux/page_idle.h>
> > +#include <linux/sched/mm.h>
> >
> > #define BITMAP_CHUNK_SIZE sizeof(u64)
> > #define BITMAP_CHUNK_BITS (BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> > @@ -28,15 +29,12 @@
> > *
> > * This function tries to get a user memory page by pfn as described above.
> > */
>
> Above comment needs updating or moving?
>
> > -static struct page *page_idle_get_page(unsigned long pfn)
> > +static struct page *page_idle_get_page(struct page *page_in)
> > {
> > struct page *page;
> > pg_data_t *pgdat;
> >
> > - if (!pfn_valid(pfn))
> > - return NULL;
> > -
> > - page = pfn_to_page(pfn);
> > + page = page_in;
> > if (!page || !PageLRU(page) ||
> > !get_page_unless_zero(page))
> > return NULL;
> >
> > ...
> >
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > + unsigned long *start, unsigned long *end)
> > +{
> > + unsigned long max_frame;
> > +
> > + /* If an mm is not given, assume we want physical frames */
> > + max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > + if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > + return -EINVAL;
> > +
> > + *start = pos * BITS_PER_BYTE;
> > + if (*start >= max_frame)
> > + return -ENXIO;
>
> Is said to mean "The system tried to use the device represented by a
> file you specified, and it couldnt find the device. This can mean that
> the device file was installed incorrectly, or that the physical device
> is missing or not correctly attached to the computer."
>
> This doesn't seem appropriate in this usage and is hence possibly
> misleading. Someone whose application fails with ENXIO will be
> scratching their heads.

This actually keeps it consistent with the current code. I refactored that
code a bit and I'm reusing parts of it to keep the line count down. See
page_idle_bitmap_write(), which returns -ENXIO in current upstream.

However, note that the read path actually returns 0 when
page_idle_get_frames() returns -ENXIO:

+ ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+ if (ret == -ENXIO)
+ return 0; /* Reads beyond max_pfn do nothing */

I do it this way because page_idle_get_frames() is shared by the old code
and the new code. A bit confusing, I know, but it is the cleanest way I
could find to keep this code common.
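
To make the shared-helper behaviour concrete, here is a rough userspace model, not the kernel code itself. get_frames() mirrors the quoted page_idle_get_frames() but takes max_frame as a plain argument; model_read() and model_write() are hypothetical callers that show a read past the end turning -ENXIO into 0 while a write passes it through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define BITMAP_CHUNK_SIZE sizeof(uint64_t)
#define BITS_PER_BYTE 8

/* Model of the quoted helper; max_frame stands in for task_size or max_pfn. */
static int get_frames(uint64_t pos, size_t count, uint64_t max_frame,
		      uint64_t *start, uint64_t *end)
{
	assert(start && end);

	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
		return -EINVAL;

	*start = pos * BITS_PER_BYTE;
	if (*start >= max_frame)
		return -ENXIO;

	*end = *start + count * BITS_PER_BYTE;
	if (*end > max_frame)
		*end = max_frame;
	return 0;
}

/* Bytes of bitmap needed to cover [start, end), in whole 8-byte chunks. */
static long covered_bytes(uint64_t start, uint64_t end)
{
	return (long)((end - start + 63) / 64 * sizeof(uint64_t));
}

/* Hypothetical read-side caller: an out-of-range read is an empty read. */
static long model_read(uint64_t pos, size_t count, uint64_t max_frame)
{
	uint64_t start, end;
	int ret = get_frames(pos, count, max_frame, &start, &end);

	if (ret == -ENXIO)
		return 0;
	return ret ? ret : covered_bytes(start, end);
}

/* Hypothetical write-side caller: -ENXIO is reported to the caller. */
static long model_write(uint64_t pos, size_t count, uint64_t max_frame)
{
	uint64_t start, end;
	int ret = get_frames(pos, count, max_frame, &start, &end);

	return ret ? ret : covered_bytes(start, end);
}
```

With max_frame = 100, a read at byte offset 16 starts at frame 128, so the model returns 0 for the read and -ENXIO for the write.
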

> > + *end = *start + count * BITS_PER_BYTE;
> > + if (*end > max_frame)
> > + *end = max_frame;
> > + return 0;
> > +}
> > +
> >
> > ...
> >
> > +static void add_page_idle_list(struct page *page,
> > + unsigned long addr, struct mm_walk *walk)
> > +{
> > + struct page *page_get;
> > + struct page_node *pn;
> > + int bit;
> > + unsigned long frames;
> > + struct page_idle_proc_priv *priv = walk->private;
> > + u64 *chunk = (u64 *)priv->buffer;
> > +
> > + if (priv->write) {
> > + /* Find whether this page was asked to be marked */
> > + frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > + bit = frames % BITMAP_CHUNK_BITS;
> > + chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > + if (((*chunk >> bit) & 1) == 0)
> > + return;
> > + }
> > +
> > + page_get = page_idle_get_page(page);
> > + if (!page_get)
> > + return;
> > +
> > + pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
>
> I'm not liking this GFP_ATOMIC. If I'm reading the code correctly,
> userspace can ask for an arbitrarily large number of GFP_ATOMIC
> allocations by doing a large read. This can potentially exhaust page
> reserves which things like networking Rx interrupts need and can make
> this whole feature less reliable.

Ok, I will look into this more and possibly do the allocation another way.
Spinlocks are held, hence the GFP_ATOMIC.

thanks,

- Joel

2019-07-24 20:30:22

by Joel Fernandes

Subject: Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

On Mon, Jul 22, 2019 at 03:06:39PM -0700, Andrew Morton wrote:
[snip]
> > + *end = *start + count * BITS_PER_BYTE;
> > + if (*end > max_frame)
> > + *end = max_frame;
> > + return 0;
> > +}
> > +
> >
> > ...
> >
> > +static void add_page_idle_list(struct page *page,
> > + unsigned long addr, struct mm_walk *walk)
> > +{
> > + struct page *page_get;
> > + struct page_node *pn;
> > + int bit;
> > + unsigned long frames;
> > + struct page_idle_proc_priv *priv = walk->private;
> > + u64 *chunk = (u64 *)priv->buffer;
> > +
> > + if (priv->write) {
> > + /* Find whether this page was asked to be marked */
> > + frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > + bit = frames % BITMAP_CHUNK_BITS;
> > + chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > + if (((*chunk >> bit) & 1) == 0)
> > + return;
> > + }
> > +
> > + page_get = page_idle_get_page(page);
> > + if (!page_get)
> > + return;
> > +
> > + pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
>
> I'm not liking this GFP_ATOMIC. If I'm reading the code correctly,
> userspace can ask for an arbitrarily large number of GFP_ATOMIC
> allocations by doing a large read. This can potentially exhaust page
> reserves which things like networking Rx interrupts need and can make
> this whole feature less reliable.

For the next revision, I will pre-allocate the page nodes so that no
allocation is needed here. The diff on top of this patch is below. Let me
know if you have any comments, thanks.

Btw, I also dropped the idle_page_list_lock by putting the idle_page_list
list_head on the stack instead of heap.
---8<-----------------------

From: "Joel Fernandes (Google)" <[email protected]>
Subject: [PATCH] mm/page_idle: Avoid need for GFP_ATOMIC

GFP_ATOMIC can harm allocations done by other users that are in
need of reserves and the like. Pre-allocate the node list so that the
spinlocked region can just use it.

Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
mm/page_idle.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 874a60c41fef..b9c790721f16 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -266,6 +266,10 @@ struct page_idle_proc_priv {
unsigned long start_addr;
char *buffer;
int write;
+
+ /* Pre-allocate and provide nodes to add_page_idle_list() */
+ struct page_node *page_nodes;
+ int cur_page_node;
};

static void add_page_idle_list(struct page *page,
@@ -291,10 +295,7 @@ static void add_page_idle_list(struct page *page,
if (!page_get)
return;

- pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
- if (!pn)
- return;
-
+ pn = &(priv->page_nodes[priv->cur_page_node++]);
pn->page = page_get;
pn->addr = addr;
list_add(&pn->list, &idle_page_list);
@@ -379,6 +380,15 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
priv.buffer = buffer;
priv.start_addr = start_addr;
priv.write = write;
+
+ priv.cur_page_node = 0;
+ priv.page_nodes = kzalloc(sizeof(struct page_node) * (end_frame - start_frame),
+ GFP_KERNEL);
+ if (!priv.page_nodes) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
walk.private = &priv;
walk.mm = mm;

@@ -425,6 +435,7 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
ret = copy_to_user(ubuff, buffer, count);

up_read(&mm->mmap_sem);
+ kfree(priv.page_nodes);
out:
kfree(buffer);
out_mmput:
--
2.22.0.657.g960e92d24f-goog
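
The pre-allocation pattern in the diff above can be modelled in userspace roughly as follows. This is only a sketch: struct node, struct node_pool and the pool_*() helpers are hypothetical stand-ins for struct page_node and the page_nodes/cur_page_node fields, and calloc() stands in for the GFP_KERNEL allocation. Unlike the diff, the model bounds-checks the index before handing out a node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's struct page_node. */
struct node {
	void *page;
	unsigned long addr;
	struct node *next;
};

struct node_pool {
	struct node *nodes;  /* allocated once, before any lock is taken */
	size_t capacity;     /* one slot per frame in the requested range */
	size_t used;         /* plays the role of cur_page_node */
};

/* May sleep in the kernel analogue; here calloc() also checks overflow. */
static int pool_init(struct node_pool *p, size_t nr_frames)
{
	assert(p);
	p->nodes = calloc(nr_frames, sizeof(*p->nodes));
	p->capacity = p->nodes ? nr_frames : 0;
	p->used = 0;
	return p->nodes ? 0 : -1;
}

/* No allocation here, so it is usable where sleeping is not allowed. */
static struct node *pool_take(struct node_pool *p)
{
	assert(p);
	if (p->used >= p->capacity)
		return NULL;
	return &p->nodes[p->used++];
}

static void pool_destroy(struct node_pool *p)
{
	free(p->nodes);
	p->nodes = NULL;
	p->capacity = p->used = 0;
}

/* Each byte of bitmap covers 8 frames, so this bounds the nodes per call. */
static size_t max_nodes_for_count(size_t count)
{
	return count * 8;
}

/* Usage: take nodes until the pool is exhausted, return how many we got. */
static size_t pool_fill(size_t nr_frames)
{
	struct node_pool pool;
	size_t taken = 0;

	if (pool_init(&pool, nr_frames))
		return 0;
	while (pool_take(&pool))
		taken++;
	pool_destroy(&pool);
	return taken;
}
```

Since count is capped at PAGE_SIZE in page_idle_proc_generic(), one call covers at most 32768 frames assuming 4 KiB pages. At a hypothetical 32 bytes per node that is about 1 MiB for the pre-allocated array, so the size of this single allocation may be worth a look. In the kernel, kcalloc() is the overflow-checked way to allocate such an array.
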