2011-06-22 12:06:39

by Michal Hocko

[permalink] [raw]
Subject: [PATCH] mm: Do not keep page locked during page fault while charging it for memcg

Currently we keep the faulted page locked throughout the whole __do_fault
call (except for the page_mkwrite code path). If we do an early COW we
allocate a new page which has to be charged to a memcg
(mem_cgroup_newpage_charge). This function, however, might block for an
unbounded amount of time if the memcg oom killer is disabled, because the
only way out of the OOM situation is either an external event (killing a
process from the group or resizing the group's hard limit) or an internal
event (one that would get us under the limit). Many times the external
event is the only chance to move forward, though.
In the end we keep the faulted page locked, blocking other processes from
faulting it in, which is not good at all because we are potentially
punishing an unrelated process for an OOM condition in a different group
(I have seen a stuck system because ld-2.11.1.so was locked).

Let's unlock the faulted page while we charge the new page and then
recheck whether it wasn't truncated in the meantime. We should retry the
fault in that case.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memory.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 87d9353..12e7ccc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3177,7 +3177,23 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = VM_FAULT_OOM;
goto out;
}
- if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+
+ /* We have to drop the page lock here because memcg
+ * charging might block for unbound time if memcg oom
+ * killer is disabled.
+ */
+ unlock_page(vmf.page);
+ ret = mem_cgroup_newpage_charge(page, mm, GFP_KERNEL);
+ lock_page(vmf.page);
+
+ if (!vmf.page->mapping) {
+ if (!ret)
+ mem_cgroup_uncharge_page(page);
+ page_cache_release(page);
+ ret = 0; /* retry the fault */
+ goto out;
+ }
+ if (ret) {
ret = VM_FAULT_OOM;
page_cache_release(page);
goto out;
--
1.7.5.4

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic


2011-06-22 12:15:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] mm: Do not keep page locked during page fault while charging it for memcg

> +
> + /* We have to drop the page lock here because memcg
> + * charging might block for unbound time if memcg oom
> + * killer is disabled.
> + */
> + unlock_page(vmf.page);
> + ret = mem_cgroup_newpage_charge(page, mm, GFP_KERNEL);
> + lock_page(vmf.page);

This introduces a completely pointless unlock/lock cycle for non-memcg
pagefaults. Please make sure it only happens when actually needed.

2011-06-22 12:32:08

by Michal Hocko

[permalink] [raw]
Subject: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg

On Wed 22-06-11 08:15:16, Christoph Hellwig wrote:
> > +
> > + /* We have to drop the page lock here because memcg
> > + * charging might block for unbound time if memcg oom
> > + * killer is disabled.
> > + */
> > + unlock_page(vmf.page);
> > + ret = mem_cgroup_newpage_charge(page, mm, GFP_KERNEL);
> > + lock_page(vmf.page);
>
> This introduces a completely poinless unlock/lock cycle for non-memcg
> pagefaults. Please make sure it only happens when actually needed.

Fair point. Thanks!
What about the following?
I realize that pushing more memcg logic into mm/memory.c is not nice, but
I found it better than pushing the old page into mem_cgroup_newpage_charge.
We could also check whether the old page is in the root cgroup, because the
memcg oom killer is not active there, but that would add more code to this
hot path, so I guess it is not worth it.

Changes since v1
- do not unlock page when memory controller is disabled.

8<------
From 82d2b5ce6c38ad3d6df7ccf7010084c2d6658634 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 22 Jun 2011 13:56:54 +0200
Subject: [PATCH] mm: Do not keep page locked during page fault while charging
it for memcg

Currently we keep the faulted page locked throughout the whole __do_fault
call (except for the page_mkwrite code path). If we do an early COW we
allocate a new page which has to be charged to a memcg
(mem_cgroup_newpage_charge). This function, however, might block for an
unbounded amount of time if the memcg oom killer is disabled, because the
only way out of the OOM situation is either an external event (killing a
process from the group or resizing the group's hard limit) or an internal
event (one that would get us under the limit). Many times the external
event is the only chance to move forward, though.
In the end we keep the faulted page locked, blocking other processes from
faulting it in, which is not good at all because we are potentially
punishing an unrelated process for an OOM condition in a different group
(I have seen a stuck system because ld-2.11.1.so was locked).

Let's unlock the faulted page while we charge the new page and then
recheck whether it wasn't truncated in the meantime. We should retry the
fault in that case.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memory.c | 21 ++++++++++++++++++++-
1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 87d9353..627eb6a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3177,7 +3177,26 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = VM_FAULT_OOM;
goto out;
}
- if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+
+ /* We have to drop the page lock here because memcg
+ * charging might block for unbound time if memcg oom
+ * killer is disabled.
+ */
+ if (!mem_cgroup_disabled())
+ unlock_page(vmf.page);
+ ret = mem_cgroup_newpage_charge(page, mm, GFP_KERNEL);
+ if (!mem_cgroup_disabled()) {
+ lock_page(vmf.page);
+
+ if (!vmf.page->mapping) {
+ if (!ret)
+ mem_cgroup_uncharge_page(page);
+ page_cache_release(page);
+ ret = 0; /* retry the fault */
+ goto out;
+ }
+ }
+ if (ret) {
ret = VM_FAULT_OOM;
page_cache_release(page);
goto out;
--
1.7.5.4


--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 06:16:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH] memcg: unlock page before charging it. (WasRe: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg

On Wed, 22 Jun 2011 14:32:04 +0200
Michal Hocko <[email protected]> wrote:

> On Wed 22-06-11 08:15:16, Christoph Hellwig wrote:
> > > +
> > > + /* We have to drop the page lock here because memcg
> > > + * charging might block for unbound time if memcg oom
> > > + * killer is disabled.
> > > + */
> > > + unlock_page(vmf.page);
> > > + ret = mem_cgroup_newpage_charge(page, mm, GFP_KERNEL);
> > > + lock_page(vmf.page);
> >
> > This introduces a completely poinless unlock/lock cycle for non-memcg
> > pagefaults. Please make sure it only happens when actually needed.
>
> Fair point. Thanks!
> What about the following?
> I realize that pushing more memcg logic into mm/memory.c is not nice but
> I found it better than pushing the old page into mem_cgroup_newpage_charge.
> We could also check whether the old page is in the root cgroup because
> memcg oom killer is not active there but that would add more code into
> this hot path so I guess it is not worth it.
>
> Changes since v1
> - do not unlock page when memory controller is disabled.
>

Great work. I confirmed that this fixes Lutz's problem.

But I like the following style better than the additional lock/unlock.
What do you think? I tested this on the latest git tree and confirmed
that Lutz's livelock problem is fixed. I also think this should go to the
stable tree.


==
From 7e9250da9ff529958d4c1ff511458dbdac8e4b81 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 23 Jun 2011 15:05:57 +0900
Subject: [PATCH] memcg: unlock page before charging it.

Currently we keep the faulted page locked throughout the whole __do_fault
call (except for the page_mkwrite code path). If we do an early COW we
allocate a new page which has to be charged to a memcg
(mem_cgroup_newpage_charge).

This function, however, might block for an unbounded amount of time if the
memcg oom killer is disabled or a fork-bomb is running, because the only
way out of the OOM situation is either an external event or a fix of the
OOM situation itself.

In the end we keep the faulted page locked, blocking other processes from
faulting it in, which is not good at all because we are potentially
punishing an unrelated process for an OOM condition in a different group
(I have seen a stuck system because ld-2.11.1.so was locked).

We can test this easily:
% cgcreate -g memory:A
% cgset -r memory.limit_in_bytes=64M A
% cgset -r memory.memsw.limit_in_bytes=64M A
% cd kernel_dir; cgexec -g memory:A make -j

Then the whole system is live-locked until you kill 'make -j'
by hand (or push reboot...). This is because some important
pages in a shared library are locked and never released because of the
fork-bomb.

This patch delays the "charge" until unlock_page() is called. There is
no problem as long as we keep a reference on the page
(memcg doesn't require page_lock()).

Then, above livelock disappears.

Reported-by: Lutz Vieweg <[email protected]>
Original-idea-by: Michal Hocko <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memory.c | 28 +++++++++++++++++++---------
1 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 87d9353..66442da 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3129,7 +3129,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
pte_t entry;
int anon = 0;
- int charged = 0;
+ struct page *need_charge = NULL;
struct page *dirty_page = NULL;
struct vm_fault vmf;
int ret;
@@ -3177,12 +3177,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = VM_FAULT_OOM;
goto out;
}
- if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
- ret = VM_FAULT_OOM;
- page_cache_release(page);
- goto out;
- }
- charged = 1;
+ need_charge = page;
copy_user_highpage(page, vmf.page, address, vma);
__SetPageUptodate(page);
} else {
@@ -3251,12 +3246,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, address, page_table);
} else {
- if (charged)
- mem_cgroup_uncharge_page(page);
if (anon)
page_cache_release(page);
else
anon = 1; /* no anon but release faulted_page */
+ need_charge = NULL;
}

pte_unmap_unlock(page_table, ptl);
@@ -3268,6 +3262,17 @@ out:
if (set_page_dirty(dirty_page))
page_mkwrite = 1;
unlock_page(dirty_page);
+ if (need_charge) {
+ /*
+ * charge this page before we drop refcnt.
+ * memory cgroup returns OOM condition when
+ * this task is killed. So, it's not necesasry
+ * to undo.
+ */
+ if (mem_cgroup_newpage_charge(need_charge,
+ mm, GFP_KERNEL))
+ ret = VM_FAULT_OOM;
+ }
put_page(dirty_page);
if (page_mkwrite && mapping) {
/*
@@ -3282,6 +3287,11 @@ out:
file_update_time(vma->vm_file);
} else {
unlock_page(vmf.page);
+ if (need_charge) {
+ if (mem_cgroup_newpage_charge(need_charge,
+ mm, GFP_KERNEL))
+ ret = VM_FAULT_OOM;
+ }
if (anon)
page_cache_release(vmf.page);
}
--
1.7.4.1



2011-06-23 07:39:20

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH] memcg: unlock page before charging it. (WasRe: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg

On Thu, 23 Jun 2011 15:08:42 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Wed, 22 Jun 2011 14:32:04 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Wed 22-06-11 08:15:16, Christoph Hellwig wrote:
> > > > +
> > > > + /* We have to drop the page lock here because memcg
> > > > + * charging might block for unbound time if memcg oom
> > > > + * killer is disabled.
> > > > + */
> > > > + unlock_page(vmf.page);
> > > > + ret = mem_cgroup_newpage_charge(page, mm, GFP_KERNEL);
> > > > + lock_page(vmf.page);
> > >
> > > This introduces a completely poinless unlock/lock cycle for non-memcg
> > > pagefaults. Please make sure it only happens when actually needed.
> >
> > Fair point. Thanks!
> > What about the following?
> > I realize that pushing more memcg logic into mm/memory.c is not nice but
> > I found it better than pushing the old page into mem_cgroup_newpage_charge.
> > We could also check whether the old page is in the root cgroup because
> > memcg oom killer is not active there but that would add more code into
> > this hot path so I guess it is not worth it.
> >
> > Changes since v1
> > - do not unlock page when memory controller is disabled.
> >
>
> Great work. Then I confirmed Lutz' problem is fixed.
>
> But I like following style rather than additional lock/unlock.
> How do you think ? I tested this on the latest git tree and confirmed
> the Lutz's livelock problem is fixed. And I think this should go stable tree.
>
I vote for this one.

One comment is inlined below.

>
> ==
> From 7e9250da9ff529958d4c1ff511458dbdac8e4b81 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 23 Jun 2011 15:05:57 +0900
> Subject: [PATCH] memcg: unlock page before charging it.
>
> Currently we are keeping faulted page locked throughout whole __do_fault
> call (except for page_mkwrite code path). If we do early COW we allocate a
> new page which has to be charged for a memcg (mem_cgroup_newpage_charge).
>
> This function, however, might block for unbounded amount of time if memcg
> oom killer is disabled or fork-bomb is running because the only way out of
> the OOM situation is either an external event or OOM-situation fix.
>
> processes from faulting it in which is not good at all because we are
> basically punishing potentially an unrelated process for OOM condition
> in a different group (I have seen stuck system because of ld-2.11.1.so being
> locked).
>
> We can do test easily.
> % cgcreate -g memory:A
> % cgset -r memory.limit_in_bytes=64M A
> % cgset -r memory.memsw.limit_in_bytes=64M A
> % cd kernel_dir; cgexec -g memory:A make -j
>
> Then, the whole system will live-locked until you kill 'make -j'
> by hands (or push reboot...) This is because some important
> page in a shared library are locked and never released bcause of fork-bomb.
>
> This patch delays "charge" until unlock_page() called. There is
> no problem as far as we keep reference on a page.
> (memcg doesn't require page_lock()).
>
> Then, above livelock disappears.
>
> Reported-by: Lutz Vieweg <[email protected]>
> Original-idea-by: Michal Hocko <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memory.c | 28 +++++++++++++++++++---------
> 1 files changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 87d9353..66442da 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3129,7 +3129,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> struct page *page;
> pte_t entry;
> int anon = 0;
> - int charged = 0;
> + struct page *need_charge = NULL;
> struct page *dirty_page = NULL;
> struct vm_fault vmf;
> int ret;
> @@ -3177,12 +3177,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> ret = VM_FAULT_OOM;
> goto out;
> }
> - if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
> - ret = VM_FAULT_OOM;
> - page_cache_release(page);
> - goto out;
> - }
> - charged = 1;
> + need_charge = page;
> copy_user_highpage(page, vmf.page, address, vma);
> __SetPageUptodate(page);
> } else {
> @@ -3251,12 +3246,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> /* no need to invalidate: a not-present page won't be cached */
> update_mmu_cache(vma, address, page_table);
> } else {
> - if (charged)
> - mem_cgroup_uncharge_page(page);
> if (anon)
> page_cache_release(page);
> else
> anon = 1; /* no anon but release faulted_page */
> + need_charge = NULL;
> }
>
> pte_unmap_unlock(page_table, ptl);
> @@ -3268,6 +3262,17 @@ out:
> if (set_page_dirty(dirty_page))
> page_mkwrite = 1;
> unlock_page(dirty_page);
> + if (need_charge) {
> + /*
> + * charge this page before we drop refcnt.
> + * memory cgroup returns OOM condition when
> + * this task is killed. So, it's not necesasry
> + * to undo.
> + */
> + if (mem_cgroup_newpage_charge(need_charge,
> + mm, GFP_KERNEL))
> + ret = VM_FAULT_OOM;
> + }
> put_page(dirty_page);
> if (page_mkwrite && mapping) {
> /*
Hmm, if I read the code correctly, we never reach this path, because
"dirty_page" is set only in the "anon == 0" case, and when we set
"need_charge" we also set "anon".
So can we do mem_cgroup_newpage_charge(need_charge) outside of the
"if (dirty_page) ... else ..." block?


Thanks,
Daisuke Nishimura.

> @@ -3282,6 +3287,11 @@ out:
> file_update_time(vma->vm_file);
> } else {
> unlock_page(vmf.page);
> + if (need_charge) {
> + if (mem_cgroup_newpage_charge(need_charge,
> + mm, GFP_KERNEL))
> + ret = VM_FAULT_OOM;
> + }
> if (anon)
> page_cache_release(vmf.page);
> }
> --
> 1.7.4.1
>
>
>
>

2011-06-23 07:41:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: unlock page before charging it. (WasRe: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg

On Thu 23-06-11 15:08:42, KAMEZAWA Hiroyuki wrote:
> On Wed, 22 Jun 2011 14:32:04 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Wed 22-06-11 08:15:16, Christoph Hellwig wrote:
> > > > +
> > > > + /* We have to drop the page lock here because memcg
> > > > + * charging might block for unbound time if memcg oom
> > > > + * killer is disabled.
> > > > + */
> > > > + unlock_page(vmf.page);
> > > > + ret = mem_cgroup_newpage_charge(page, mm, GFP_KERNEL);
> > > > + lock_page(vmf.page);
> > >
> > > This introduces a completely poinless unlock/lock cycle for non-memcg
> > > pagefaults. Please make sure it only happens when actually needed.
> >
> > Fair point. Thanks!
> > What about the following?
> > I realize that pushing more memcg logic into mm/memory.c is not nice but
> > I found it better than pushing the old page into mem_cgroup_newpage_charge.
> > We could also check whether the old page is in the root cgroup because
> > memcg oom killer is not active there but that would add more code into
> > this hot path so I guess it is not worth it.
> >
> > Changes since v1
> > - do not unlock page when memory controller is disabled.
> >
>
> Great work. Then I confirmed Lutz' problem is fixed.
>
> But I like following style rather than additional lock/unlock.
> How do you think ?

Yes, I like it much more than the hairy way I did it. See comments below.

> I tested this on the latest git tree and confirmed
> the Lutz's livelock problem is fixed. And I think this should go stable tree.
>
>
> ==
> From 7e9250da9ff529958d4c1ff511458dbdac8e4b81 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 23 Jun 2011 15:05:57 +0900
> Subject: [PATCH] memcg: unlock page before charging it.
>
> Currently we are keeping faulted page locked throughout whole __do_fault
> call (except for page_mkwrite code path). If we do early COW we allocate a
> new page which has to be charged for a memcg (mem_cgroup_newpage_charge).
>
> This function, however, might block for unbounded amount of time if memcg
> oom killer is disabled or fork-bomb is running because the only way out of
> the OOM situation is either an external event or OOM-situation fix.
>
> processes from faulting it in which is not good at all because we are

Missing the beginning of the sentence?

> basically punishing potentially an unrelated process for OOM condition
> in a different group (I have seen stuck system because of ld-2.11.1.so being
> locked).
>
> We can do test easily.
> % cgcreate -g memory:A
> % cgset -r memory.limit_in_bytes=64M A
> % cgset -r memory.memsw.limit_in_bytes=64M A
> % cd kernel_dir; cgexec -g memory:A make -j
>
> Then, the whole system will live-locked until you kill 'make -j'
> by hands (or push reboot...) This is because some important
> page in a shared library are locked and never released bcause of fork-bomb.
>
> This patch delays "charge" until unlock_page() called. There is
> no problem as far as we keep reference on a page.
> (memcg doesn't require page_lock()).
>
> Then, above livelock disappears.
>
> Reported-by: Lutz Vieweg <[email protected]>
> Original-idea-by: Michal Hocko <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memory.c | 28 +++++++++++++++++++---------
> 1 files changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 87d9353..66442da 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3129,7 +3129,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> struct page *page;
> pte_t entry;
> int anon = 0;
> - int charged = 0;
> + struct page *need_charge = NULL;
> struct page *dirty_page = NULL;
> struct vm_fault vmf;
> int ret;
> @@ -3177,12 +3177,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> ret = VM_FAULT_OOM;
> goto out;
> }
> - if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
> - ret = VM_FAULT_OOM;
> - page_cache_release(page);
> - goto out;
> - }
> - charged = 1;
> + need_charge = page;
> copy_user_highpage(page, vmf.page, address, vma);
> __SetPageUptodate(page);
> } else {
> @@ -3251,12 +3246,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> /* no need to invalidate: a not-present page won't be cached */
> update_mmu_cache(vma, address, page_table);
> } else {
> - if (charged)
> - mem_cgroup_uncharge_page(page);
> if (anon)
> page_cache_release(page);
> else
> anon = 1; /* no anon but release faulted_page */
> + need_charge = NULL;
> }
>
> pte_unmap_unlock(page_table, ptl);
> @@ -3268,6 +3262,17 @@ out:
> if (set_page_dirty(dirty_page))
> page_mkwrite = 1;
> unlock_page(dirty_page);
> + if (need_charge) {
> + /*
> + * charge this page before we drop refcnt.
> + * memory cgroup returns OOM condition when
> + * this task is killed. So, it's not necesasry
> + * to undo.
> + */
> + if (mem_cgroup_newpage_charge(need_charge,
> + mm, GFP_KERNEL))
> + ret = VM_FAULT_OOM;
> + }

We do not need this hunk, do we? dirty_page is set only if !anon, so
we never get to this path from COW.

Other than that:
Reviewed-by: Michal Hocko <[email protected]>

Thanks!
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 08:15:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] memcg: unlock page before charging it. (WasRe: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg

On Thu, 23 Jun 2011 09:41:33 +0200
Michal Hocko <[email protected]> wrote:

> On Thu 23-06-11 15:08:42, KAMEZAWA Hiroyuki wrote:
> > On Wed, 22 Jun 2011 14:32:04 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Wed 22-06-11 08:15:16, Christoph Hellwig wrote:
> > > > > +
> > > > > + /* We have to drop the page lock here because memcg
> > > > > + * charging might block for unbound time if memcg oom
> > > > > + * killer is disabled.
> > > > > + */
> > > > > + unlock_page(vmf.page);
> > > > > + ret = mem_cgroup_newpage_charge(page, mm, GFP_KERNEL);
> > > > > + lock_page(vmf.page);
> > > >
> > > > This introduces a completely poinless unlock/lock cycle for non-memcg
> > > > pagefaults. Please make sure it only happens when actually needed.
> > >
> > > Fair point. Thanks!
> > > What about the following?
> > > I realize that pushing more memcg logic into mm/memory.c is not nice but
> > > I found it better than pushing the old page into mem_cgroup_newpage_charge.
> > > We could also check whether the old page is in the root cgroup because
> > > memcg oom killer is not active there but that would add more code into
> > > this hot path so I guess it is not worth it.
> > >
> > > Changes since v1
> > > - do not unlock page when memory controller is disabled.
> > >
> >
> > Great work. Then I confirmed Lutz' problem is fixed.
> >
> > But I like following style rather than additional lock/unlock.
> > How do you think ?
>
> Yes, I like it much more than the hairy way I did it. See comments bellow.
>
> > I tested this on the latest git tree and confirmed
> > the Lutz's livelock problem is fixed. And I think this should go stable tree.
> >
> >
> > ==
> > From 7e9250da9ff529958d4c1ff511458dbdac8e4b81 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Thu, 23 Jun 2011 15:05:57 +0900
> > Subject: [PATCH] memcg: unlock page before charging it.
> >
> > Currently we are keeping faulted page locked throughout whole __do_fault
> > call (except for page_mkwrite code path). If we do early COW we allocate a
> > new page which has to be charged for a memcg (mem_cgroup_newpage_charge).
> >
> > This function, however, might block for unbounded amount of time if memcg
> > oom killer is disabled or fork-bomb is running because the only way out of
> > the OOM situation is either an external event or OOM-situation fix.
> >
> > processes from faulting it in which is not good at all because we are
>
> Missing the beginning of the sentence?
>

Ah, yes...

> > basically punishing potentially an unrelated process for OOM condition
> > in a different group (I have seen stuck system because of ld-2.11.1.so being
> > locked).
> >
> > We can do test easily.
> > % cgcreate -g memory:A
> > % cgset -r memory.limit_in_bytes=64M A
> > % cgset -r memory.memsw.limit_in_bytes=64M A
> > % cd kernel_dir; cgexec -g memory:A make -j
> >
> > Then, the whole system will live-locked until you kill 'make -j'
> > by hands (or push reboot...) This is because some important
> > page in a shared library are locked and never released bcause of fork-bomb.
> >
> > This patch delays "charge" until unlock_page() called. There is
> > no problem as far as we keep reference on a page.
> > (memcg doesn't require page_lock()).
> >
> > Then, above livelock disappears.
> >
> > Reported-by: Lutz Vieweg <[email protected]>
> > Original-idea-by: Michal Hocko <[email protected]>
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > mm/memory.c | 28 +++++++++++++++++++---------
> > 1 files changed, 19 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 87d9353..66442da 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3129,7 +3129,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > struct page *page;
> > pte_t entry;
> > int anon = 0;
> > - int charged = 0;
> > + struct page *need_charge = NULL;
> > struct page *dirty_page = NULL;
> > struct vm_fault vmf;
> > int ret;
> > @@ -3177,12 +3177,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > ret = VM_FAULT_OOM;
> > goto out;
> > }
> > - if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
> > - ret = VM_FAULT_OOM;
> > - page_cache_release(page);
> > - goto out;
> > - }
> > - charged = 1;
> > + need_charge = page;
> > copy_user_highpage(page, vmf.page, address, vma);
> > __SetPageUptodate(page);
> > } else {
> > @@ -3251,12 +3246,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > /* no need to invalidate: a not-present page won't be cached */
> > update_mmu_cache(vma, address, page_table);
> > } else {
> > - if (charged)
> > - mem_cgroup_uncharge_page(page);
> > if (anon)
> > page_cache_release(page);
> > else
> > anon = 1; /* no anon but release faulted_page */
> > + need_charge = NULL;
> > }
> >
> > pte_unmap_unlock(page_table, ptl);
> > @@ -3268,6 +3262,17 @@ out:
> > if (set_page_dirty(dirty_page))
> > page_mkwrite = 1;
> > unlock_page(dirty_page);
> > + if (need_charge) {
> > + /*
> > + * charge this page before we drop refcnt.
> > + * memory cgroup returns OOM condition when
> > + * this task is killed. So, it's not necesasry
> > + * to undo.
> > + */
> > + if (mem_cgroup_newpage_charge(need_charge,
> > + mm, GFP_KERNEL))
> > + ret = VM_FAULT_OOM;
> > + }
>
> We do not need this hunk, don't we? dirty_page is set only if !anon so
> we never get to this path from COW.
>
You're right. (Nishimura pointed this out, too.)

> Other than that:
> Reviewed-by: Michal Hocko <[email protected]>
>

I found that the page is added to the LRU before charging. (In this case,
memcg's LRU is ignored.) I'll post a new version with a fix.

Thanks,
-Kame



2011-06-23 09:02:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: unlock page before charging it. (WasRe: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg

On Thu 23-06-11 17:08:11, KAMEZAWA Hiroyuki wrote:
> On Thu, 23 Jun 2011 09:41:33 +0200
> Michal Hocko <[email protected]> wrote:
[...]
> > Other than that:
> > Reviewed-by: Michal Hocko <[email protected]>
> >
>
> I found the page is added to LRU before charging. (In this case,
> memcg's LRU is ignored.) I'll post a new version with a fix.

Yes, you are right, I missed that.
This means that we might race with reclaim, which could evict the COWed
page, which in turn would uncharge that page even though we haven't
charged it yet.

Can we postpone page_add_new_anon_rmap to the charging path, or would it
just race somewhere else?

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 10:09:14

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH] mm: preallocate page before lock_page at filemap COW. (WasRe: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg

On Thu, 23 Jun 2011 11:02:04 +0200
Michal Hocko <[email protected]> wrote:

> On Thu 23-06-11 17:08:11, KAMEZAWA Hiroyuki wrote:
> > On Thu, 23 Jun 2011 09:41:33 +0200
> > Michal Hocko <[email protected]> wrote:
> [...]
> > > Other than that:
> > > Reviewed-by: Michal Hocko <[email protected]>
> > >
> >
> > I found the page is added to LRU before charging. (In this case,
> > memcg's LRU is ignored.) I'll post a new version with a fix.
>
> Yes, you are right. I have missed that.
> This means that we might race with reclaim which could evict the COWed
> page wich in turn would uncharge that page even though we haven't
> charged it yet.
>
> Can we postpone page_add_new_anon_rmap to the charging path or it would
> just race somewhere else?
>

I got a different idea. How about this? I think this will benefit
non-memcg users under OOM, too.

A concern is the VM_FAULT_RETRY case, but waiting for the lock will be
much heavier than the preallocation + free-on-retry cost.

(I'm sorry, I won't be very active until next week, so feel free to
post your own version if necessary.)

This is against -rc4 and worked well in my testing.

==
From 7b0aa038f3a1c6a479a3ce6acb38c7d2740d7a75 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 23 Jun 2011 18:50:32 +0900
Subject: [PATCH] mm: preallocate page before lock_page() at filemap COW.

Currently we keep the faulted page locked throughout the whole __do_fault
call (except for the page_mkwrite code path) after calling the file
system's fault code. If we do an early COW, we allocate a new page which
has to be charged to a memcg (mem_cgroup_newpage_charge).

This function, however, might block for an unbounded amount of time if the
memcg oom killer is disabled or a fork-bomb is running, because the only
way out of the OOM situation is either an external event or a fix of the
OOM situation itself.

In the end we keep the faulted page locked, blocking other processes from
faulting it in, which is not good at all because we are potentially
punishing an unrelated process for an OOM condition in a different group
(I have seen a stuck system because ld-2.11.1.so was locked).

We can test this easily:

% cgcreate -g memory:A
% cgset -r memory.limit_in_bytes=64M A
% cgset -r memory.memsw.limit_in_bytes=64M A
% cd kernel_dir; cgexec -g memory:A make -j

Then the whole system will be live-locked until you kill 'make -j'
by hand (or push reboot...). This is because some important
pages in a shared library are locked.

Reconsidering this, the new page does not need to be allocated
with lock_page() held. So....

This patch moves the "charge" and the memory allocation for the COW page
before lock_page(). Then we can avoid scanning the LRU while holding
a lock on a page.

With this, the above livelock disappears.

Reported-by: Lutz Vieweg <[email protected]>
Original-idea-by: Michal Hocko <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Changelog:
- change the logic from "do charge after unlock" to
"do charge before lock".
---
mm/memory.c | 56 ++++++++++++++++++++++++++++++++++----------------------
1 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 87d9353..845378c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3127,14 +3127,34 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table;
spinlock_t *ptl;
struct page *page;
+ struct page *cow_page;
pte_t entry;
int anon = 0;
- int charged = 0;
struct page *dirty_page = NULL;
struct vm_fault vmf;
int ret;
int page_mkwrite = 0;

+ /*
+ * If we do COW later, allocate the page before taking lock_page()
+ * on the file cache page. This will reduce lock holding time.
+ */
+ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!cow_page)
+ return VM_FAULT_OOM;
+
+ if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
+ page_cache_release(cow_page);
+ return VM_FAULT_OOM;
+ }
+ } else
+ cow_page = NULL;
+
vmf.virtual_address = (void __user *)(address & PAGE_MASK);
vmf.pgoff = pgoff;
vmf.flags = flags;
@@ -3143,12 +3163,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
VM_FAULT_RETRY)))
- return ret;
+ goto uncharge_out;

if (unlikely(PageHWPoison(vmf.page))) {
if (ret & VM_FAULT_LOCKED)
unlock_page(vmf.page);
- return VM_FAULT_HWPOISON;
+ ret = VM_FAULT_HWPOISON;
+ goto uncharge_out;
}

/*
@@ -3166,23 +3187,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
page = vmf.page;
if (flags & FAULT_FLAG_WRITE) {
if (!(vma->vm_flags & VM_SHARED)) {
+ page = cow_page;
anon = 1;
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
- goto out;
- }
- page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
- vma, address);
- if (!page) {
- ret = VM_FAULT_OOM;
- goto out;
- }
- if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
- ret = VM_FAULT_OOM;
- page_cache_release(page);
- goto out;
- }
- charged = 1;
copy_user_highpage(page, vmf.page, address, vma);
__SetPageUptodate(page);
} else {
@@ -3251,8 +3257,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, address, page_table);
} else {
- if (charged)
- mem_cgroup_uncharge_page(page);
+ if (cow_page)
+ mem_cgroup_uncharge_page(cow_page);
if (anon)
page_cache_release(page);
else
@@ -3261,7 +3267,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,

pte_unmap_unlock(page_table, ptl);

-out:
if (dirty_page) {
struct address_space *mapping = page->mapping;

@@ -3291,6 +3296,13 @@ out:
unwritable_page:
page_cache_release(page);
return ret;
+uncharge_out:
+ /* the fs's fault handler returned an error */
+ if (cow_page) {
+ mem_cgroup_uncharge_page(cow_page);
+ page_cache_release(cow_page);
+ }
+ return ret;
}

static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
--
1.7.4.1


2011-06-23 11:58:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: preallocate page before lock_page at filemap COW. (Was: Re: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg)

On Thu 23-06-11 19:01:57, KAMEZAWA Hiroyuki wrote:
> On Thu, 23 Jun 2011 11:02:04 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Thu 23-06-11 17:08:11, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 23 Jun 2011 09:41:33 +0200
> > > Michal Hocko <[email protected]> wrote:
> > [...]
> > > > Other than that:
> > > > Reviewed-by: Michal Hocko <[email protected]>
> > > >
> > >
> > > I found the page is added to LRU before charging. (In this case,
> > > memcg's LRU is ignored.) I'll post a new version with a fix.
> >
> > Yes, you are right. I have missed that.
> > This means that we might race with reclaim which could evict the COWed
> > page which in turn would uncharge that page even though we haven't
> > charged it yet.
> >
> > Can we postpone page_add_new_anon_rmap to the charging path or it would
> > just race somewhere else?
> >
>
> I got a different idea. How about this ?
> I think this will have benefit for non-memcg users under OOM, too.

Could you be more specific? I do not see how a preallocation, which might
turn out to be pointless, could help under OOM.

>
> A concern is the VM_FAULT_RETRY case but wait-for-lock will be much heavier
> than preallocation + free-for-retry cost.

Preallocation is rather costly when the fault handler fails (e.g. with
SIGBUS, which is the easiest one to trigger).

I am not saying this approach is bad, but I think that preallocation can
be much more costly than the unlock, charge and lock & recheck approach.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 13:01:43

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH] mm: preallocate page before lock_page at filemap COW. (Was: Re: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg)

2011/6/23 Michal Hocko <[email protected]>:
> On Thu 23-06-11 19:01:57, KAMEZAWA Hiroyuki wrote:
>> On Thu, 23 Jun 2011 11:02:04 +0200
>> Michal Hocko <[email protected]> wrote:
>>
>> > On Thu 23-06-11 17:08:11, KAMEZAWA Hiroyuki wrote:
>> > > On Thu, 23 Jun 2011 09:41:33 +0200
>> > > Michal Hocko <[email protected]> wrote:
>> > [...]
>> > > > Other than that:
>> > > > Reviewed-by: Michal Hocko <[email protected]>
>> > > >
>> > >
>> > > I found the page is added to LRU before charging. (In this case,
>> > > memcg's LRU is ignored.) I'll post a new version with a fix.
>> >
>> > Yes, you are right. I have missed that.
>> > This means that we might race with reclaim which could evict the COWed
>> > page which in turn would uncharge that page even though we haven't
>> > charged it yet.
>> >
>> > Can we postpone page_add_new_anon_rmap to the charging path or it would
>> > just race somewhere else?
>> >
>>
>> I got a different idea. How about this ?
>> I think this will have benefit for non-memcg users under OOM, too.
>
> Could you be more specific? I do not see how preallocation which might
> turn out to be pointless could help under OOM.
>

We'll have no page allocation while lock_page() is held in this path.
I think that is good.

>>
>> A concern is the VM_FAULT_RETRY case but wait-for-lock will be much heavier
>> than preallocation + free-for-retry cost.
>
> Preallocation is rather costly when fault handler fails (e.g. SIGBUS
> which is the easiest one to trigger).
>
I think the per-cpu (pcp) cache of the free page allocator does a good
enough job, and I guess we'll see no problem even with a storm of SIGBUS faults.

> I am not saying this approach is bad but I think that preallocation can
> be much more costly than unlock, charge and lock&recheck approach.

memcg_is_disabled() cannot help the ROOT cgroup. And the additional
lock/unlock method may defeat the recently added FAULT_RETRY
optimization for lock contention.


Thanks,
-Kame

2011-06-23 13:23:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: preallocate page before lock_page at filemap COW. (Was: Re: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg)

On Thu 23-06-11 22:01:40, Hiroyuki Kamezawa wrote:
> 2011/6/23 Michal Hocko <[email protected]>:
> > On Thu 23-06-11 19:01:57, KAMEZAWA Hiroyuki wrote:
> >> On Thu, 23 Jun 2011 11:02:04 +0200
> >> Michal Hocko <[email protected]> wrote:
> >>
> >> > On Thu 23-06-11 17:08:11, KAMEZAWA Hiroyuki wrote:
> >> > > On Thu, 23 Jun 2011 09:41:33 +0200
> >> > > Michal Hocko <[email protected]> wrote:
> >> > [...]
> >> > > > Other than that:
> >> > > > Reviewed-by: Michal Hocko <[email protected]>
> >> > > >
> >> > >
> >> > > I found the page is added to LRU before charging. (In this case,
> >> > > memcg's LRU is ignored.) I'll post a new version with a fix.
> >> >
> >> > Yes, you are right. I have missed that.
> >> > This means that we might race with reclaim which could evict the COWed
> >> > page which in turn would uncharge that page even though we haven't
> >> > charged it yet.
> >> >
> >> > Can we postpone page_add_new_anon_rmap to the charging path or it would
> >> > just race somewhere else?
> >> >
> >>
> >> I got a different idea. How about this ?
> >> I think this will have benefit for non-memcg users under OOM, too.
> >
> > Could you be more specific? I do not see how preallocation which might
> > turn out to be pointless could help under OOM.
> >
>
> We'll have no page allocation under lock_page() held in this path.
> I think it is good.

But it can also mean that the page we are about to fault in gets evicted
due to the allocation, so we would have to take a major fault... This is
probably not that serious, though.

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-23 13:51:30

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH] mm: preallocate page before lock_page at filemap COW. (Was: Re: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg)

2011/6/23 Michal Hocko <[email protected]>:
> On Thu 23-06-11 22:01:40, Hiroyuki Kamezawa wrote:
>> 2011/6/23 Michal Hocko <[email protected]>:
>> > On Thu 23-06-11 19:01:57, KAMEZAWA Hiroyuki wrote:
>> >> On Thu, 23 Jun 2011 11:02:04 +0200
>> >> Michal Hocko <[email protected]> wrote:
>> >>
>> >> > On Thu 23-06-11 17:08:11, KAMEZAWA Hiroyuki wrote:
>> >> > > On Thu, 23 Jun 2011 09:41:33 +0200
>> >> > > Michal Hocko <[email protected]> wrote:
>> >> > [...]
>> >> > > > Other than that:
>> >> > > > Reviewed-by: Michal Hocko <[email protected]>
>> >> > > >
>> >> > >
>> >> > > I found the page is added to LRU before charging. (In this case,
>> >> > > memcg's LRU is ignored.) I'll post a new version with a fix.
>> >> >
>> >> > Yes, you are right. I have missed that.
>> >> > This means that we might race with reclaim which could evict the COWed
>> >> > page which in turn would uncharge that page even though we haven't
>> >> > charged it yet.
>> >> >
>> >> > Can we postpone page_add_new_anon_rmap to the charging path or it would
>> >> > just race somewhere else?
>> >> >
>> >>
>> >> I got a different idea. How about this ?
>> >> I think this will have benefit for non-memcg users under OOM, too.
>> >
>> > Could you be more specific? I do not see how preallocation which might
>> > turn out to be pointless could help under OOM.
>> >
>>
>> We'll have no page allocation under lock_page() held in this path.
>> I think it is good.
>
> But it can also cause that the page, we are about to fault in, is evicted
> due to allocation so we would have to do a major fault... This is
> probably not that serious, though.

For another purpose, I have (had) another patch to prevent that (and
planned to post it).

The basic logic is...

1. add a new member variable to vm_area_struct:

vma->vm_faulting_to

2. at __do_fault(), set vm_faulting_to:

vma->vm_faulting_to = pgoff;


3. check vma->vm_faulting_to at page_referenced_file():

if (pgoff (was page->index) == vma->vm_faulting_to)
referenced++;

Then, the page which someone is waiting to fault in will be marked as
referenced and go the KEEP_LOCKED path.

(vm_faulting_to can be cleared after we get lock_page().)

In a corner case, several threads which share a vma may fault into it.
But this will help the typical case and has no overhead, I think.

Thanks,
-Kame

2011-06-24 07:57:46

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: preallocate page before lock_page at filemap COW. (Was: Re: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg)

On Thu 23-06-11 19:01:57, KAMEZAWA Hiroyuki wrote:
> From 7b0aa038f3a1c6a479a3ce6acb38c7d2740d7a75 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 23 Jun 2011 18:50:32 +0900
> Subject: [PATCH] mm: preallocate page before lock_page() at filemap COW.
>
> Currently we are keeping faulted page locked throughout whole __do_fault
> call (except for page_mkwrite code path) after calling file system's
> fault code. If we do early COW, we allocate a new page which has to be
> charged for a memcg (mem_cgroup_newpage_charge).
>
> This function, however, might block for unbounded amount of time if memcg
> oom killer is disabled or fork-bomb is running because the only way out of
> the OOM situation is either an external event or OOM-situation fix.
>
> In the end we are keeping the faulted page locked and blocking other
> processes from faulting it in which is not good at all because we are
> basically punishing potentially an unrelated process for OOM condition
> in a different group (I have seen stuck system because of ld-2.11.1.so being
> locked).
>
> We can do test easily.
>
> % cgcreate -g memory:A
> % cgset -r memory.limit_in_bytes=64M A
> % cgset -r memory.memsw.limit_in_bytes=64M A
> % cd kernel_dir; cgexec -g memory:A make -j
>
> Then, the whole system will live-locked until you kill 'make -j'
> by hands (or push reboot...) This is because some important
> page in a shared library are locked.
>
> Considering again, the new page is not necessary to be allocated
> with lock_page() held. So....
>
> This patch moves "charge" and memory allocation for COW page
> before lock_page(). Then, we can avoid scanning LRU with holding
> a lock on a page.
>
> Then, above livelock disappears.
>
> Reported-by: Lutz Vieweg <[email protected]>
> Original-idea-by: Michal Hocko <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Sorry, forgot to send my
Reviewed-by: Michal Hocko <mhocko@suse>

I still have concerns about this way of handling the issue. See the
follow-up discussion in the other thread (https://lkml.org/lkml/2011/6/23/135).

Anyway, I think that we do not have many other options to handle this.
Either we unlock, charge, lock & recheck, or we preallocate and then fault in.

Or am I missing some other way to do it? What do others think about
these approaches?

>
> Changelog:
> - change the logic from "do charge after unlock" to
> "do charge before lock".
> ---
> mm/memory.c | 56 ++++++++++++++++++++++++++++++++++----------------------
> 1 files changed, 34 insertions(+), 22 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 87d9353..845378c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3127,14 +3127,34 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> pte_t *page_table;
> spinlock_t *ptl;
> struct page *page;
> + struct page *cow_page;
> pte_t entry;
> int anon = 0;
> - int charged = 0;
> struct page *dirty_page = NULL;
> struct vm_fault vmf;
> int ret;
> int page_mkwrite = 0;
>
> + /*
> + * If we do COW later, allocate the page before taking lock_page()
> + * on the file cache page. This will reduce lock holding time.
> + */
> + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> +
> + if (unlikely(anon_vma_prepare(vma)))
> + return VM_FAULT_OOM;
> +
> + cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> + if (!cow_page)
> + return VM_FAULT_OOM;
> +
> + if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
> + page_cache_release(cow_page);
> + return VM_FAULT_OOM;
> + }
> + } else
> + cow_page = NULL;
> +
> vmf.virtual_address = (void __user *)(address & PAGE_MASK);
> vmf.pgoff = pgoff;
> vmf.flags = flags;
> @@ -3143,12 +3163,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> ret = vma->vm_ops->fault(vma, &vmf);
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
> VM_FAULT_RETRY)))
> - return ret;
> + goto uncharge_out;
>
> if (unlikely(PageHWPoison(vmf.page))) {
> if (ret & VM_FAULT_LOCKED)
> unlock_page(vmf.page);
> - return VM_FAULT_HWPOISON;
> + ret = VM_FAULT_HWPOISON;
> + goto uncharge_out;
> }
>
> /*
> @@ -3166,23 +3187,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> page = vmf.page;
> if (flags & FAULT_FLAG_WRITE) {
> if (!(vma->vm_flags & VM_SHARED)) {
> + page = cow_page;
> anon = 1;
> - if (unlikely(anon_vma_prepare(vma))) {
> - ret = VM_FAULT_OOM;
> - goto out;
> - }
> - page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> - vma, address);
> - if (!page) {
> - ret = VM_FAULT_OOM;
> - goto out;
> - }
> - if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
> - ret = VM_FAULT_OOM;
> - page_cache_release(page);
> - goto out;
> - }
> - charged = 1;
> copy_user_highpage(page, vmf.page, address, vma);
> __SetPageUptodate(page);
> } else {
> @@ -3251,8 +3257,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> /* no need to invalidate: a not-present page won't be cached */
> update_mmu_cache(vma, address, page_table);
> } else {
> - if (charged)
> - mem_cgroup_uncharge_page(page);
> + if (cow_page)
> + mem_cgroup_uncharge_page(cow_page);
> if (anon)
> page_cache_release(page);
> else
> @@ -3261,7 +3267,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>
> pte_unmap_unlock(page_table, ptl);
>
> -out:
> if (dirty_page) {
> struct address_space *mapping = page->mapping;
>
> @@ -3291,6 +3296,13 @@ out:
> unwritable_page:
> page_cache_release(page);
> return ret;
> +uncharge_out:
> + /* the fs's fault handler returned an error */
> + if (cow_page) {
> + mem_cgroup_uncharge_page(cow_page);
> + page_cache_release(cow_page);
> + }
> + return ret;
> }
>
> static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> --
> 1.7.4.1
>
>
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-06-24 11:46:32

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH] mm: preallocate page before lock_page at filemap COW. (Was: Re: [PATCH V2] mm: Do not keep page locked during page fault while charging it for memcg)

2011/6/24 Michal Hocko <[email protected]>:
> Sorry, forgot to send my
> Reviewed-by: Michal Hocko <mhocko@suse>
>

Thanks.

> I still have concerns about this way to handle the issue. See the follow
> up discussion in other thread (https://lkml.org/lkml/2011/6/23/135).
>
> Anyway I think that we do not have many other options to handle this.
> Either we unlock, charge, lock & recheck, or we preallocate and then fault in.
>
I agree.

> Or am I missing some other way to do it? What do others think about
> these approaches?
>

Yes, I'd like to hear other mm specialists' suggestions, and I'll think
about other ways again.
Anyway, memory reclaim while holding a lock_page() can cause big latency
or starvation, especially when memcg is used. It's better to avoid it.

Thanks,
-Kame