2012-11-14 09:15:24

by David Rientjes

[permalink] [raw]
Subject: [patch 1/4] mm, oom: ensure sysrq+f always passes valid zonelist

With hotpluggable and memoryless nodes, it's possible that node 0 will
not be online, so use the first online node's zonelist rather than
hardcoding node 0 to pass a zonelist with all zones to the oom killer.

Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
drivers/tty/sysrq.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -346,7 +346,8 @@ static struct sysrq_key_op sysrq_term_op = {

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL, true);
+ out_of_memory(node_zonelist(first_online_node, GFP_KERNEL), GFP_KERNEL,
+ 0, NULL, true);
}

static DECLARE_WORK(moom_work, moom_callback);


2012-11-14 09:15:29

by David Rientjes

[permalink] [raw]
Subject: [patch 2/4] mm, oom: cleanup pagefault oom handler

To lock the entire system from parallel oom killing, it's possible to
pass in a zonelist with all zones rather than using
for_each_populated_zone() for the iteration. This obsoletes
try_set_system_oom() and clear_system_oom() so that they can be removed.

Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
mm/oom_kill.c | 49 +++++++------------------------------------------
1 files changed, 7 insertions(+), 42 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -591,43 +591,6 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
spin_unlock(&zone_scan_lock);
}

-/*
- * Try to acquire the oom killer lock for all system zones. Returns zero if a
- * parallel oom killing is taking place, otherwise locks all zones and returns
- * non-zero.
- */
-static int try_set_system_oom(void)
-{
- struct zone *zone;
- int ret = 1;
-
- spin_lock(&zone_scan_lock);
- for_each_populated_zone(zone)
- if (zone_is_oom_locked(zone)) {
- ret = 0;
- goto out;
- }
- for_each_populated_zone(zone)
- zone_set_flag(zone, ZONE_OOM_LOCKED);
-out:
- spin_unlock(&zone_scan_lock);
- return ret;
-}
-
-/*
- * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
- * attempts or page faults may now recall the oom killer, if necessary.
- */
-static void clear_system_oom(void)
-{
- struct zone *zone;
-
- spin_lock(&zone_scan_lock);
- for_each_populated_zone(zone)
- zone_clear_flag(zone, ZONE_OOM_LOCKED);
- spin_unlock(&zone_scan_lock);
-}
-
/**
* out_of_memory - kill the "best" process when we run out of memory
* @zonelist: zonelist pointer
@@ -708,15 +671,17 @@ out:

/*
* The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel
- * oom killing is already in progress so do nothing. If a task is found with
- * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a
+ * parallel oom killing is already in progress so do nothing.
*/
void pagefault_out_of_memory(void)
{
- if (try_set_system_oom()) {
+ struct zonelist *zonelist = node_zonelist(first_online_node,
+ GFP_KERNEL);
+
+ if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
- clear_system_oom();
+ clear_zonelist_oom(zonelist, GFP_KERNEL);
}
schedule_timeout_killable(1);
}

2012-11-14 09:15:35

by David Rientjes

[permalink] [raw]
Subject: [patch 4/4] mm, oom: remove statically defined arch functions of same name

out_of_memory() is a globally defined function to call the oom killer.
x86, sh, and powerpc all use a function of the same name within file
scope in their respective fault.c unnecessarily. Inline the functions
into the pagefault handlers to clean the code up.

Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Mundt <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
arch/powerpc/mm/fault.c | 27 ++++++++++++---------------
arch/sh/mm/fault.c | 19 +++++++------------
arch/x86/mm/fault.c | 23 ++++++++---------------
3 files changed, 27 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -113,19 +113,6 @@ static int store_updates_sp(struct pt_regs *regs)
#define MM_FAULT_CONTINUE -1
#define MM_FAULT_ERR(sig) (sig)

-static int out_of_memory(struct pt_regs *regs)
-{
- /*
- * We ran out of memory, or some other thing happened to us that made
- * us unable to handle the page fault gracefully.
- */
- up_read(&current->mm->mmap_sem);
- if (!user_mode(regs))
- return MM_FAULT_ERR(SIGKILL);
- pagefault_out_of_memory();
- return MM_FAULT_RETURN;
-}
-
static int do_sigbus(struct pt_regs *regs, unsigned long address)
{
siginfo_t info;
@@ -169,8 +156,18 @@ static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)
return MM_FAULT_CONTINUE;

/* Out of memory */
- if (fault & VM_FAULT_OOM)
- return out_of_memory(regs);
+ if (fault & VM_FAULT_OOM) {
+ up_read(&current->mm->mmap_sem);
+
+ /*
+ * We ran out of memory, or some other thing happened to us that
+ * made us unable to handle the page fault gracefully.
+ */
+ if (!user_mode(regs))
+ return MM_FAULT_ERR(SIGKILL);
+ pagefault_out_of_memory();
+ return MM_FAULT_RETURN;
+ }

/* Bus error. x86 handles HWPOISON here, we'll add this if/when
* we support the feature in HW
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -301,17 +301,6 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
__bad_area(regs, error_code, address, SEGV_ACCERR);
}

-static void out_of_memory(void)
-{
- /*
- * We ran out of memory, call the OOM killer, and return the userspace
- * (which will retry the fault, or kill us if we got oom-killed):
- */
- up_read(&current->mm->mmap_sem);
-
- pagefault_out_of_memory();
-}
-
static void
do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
{
@@ -353,8 +342,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
no_context(regs, error_code, address);
return 1;
}
+ up_read(&current->mm->mmap_sem);

- out_of_memory();
+ /*
+ * We ran out of memory, call the OOM killer, and return the
+ * userspace (which will retry the fault, or kill us if we got
+ * oom-killed):
+ */
+ pagefault_out_of_memory();
} else {
if (fault & VM_FAULT_SIGBUS)
do_sigbus(regs, error_code, address);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -803,20 +803,6 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
__bad_area(regs, error_code, address, SEGV_ACCERR);
}

-/* TODO: fixup for "mm-invoke-oom-killer-from-page-fault.patch" */
-static void
-out_of_memory(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
-{
- /*
- * We ran out of memory, call the OOM killer, and return the userspace
- * (which will retry the fault, or kill us if we got oom-killed):
- */
- up_read(&current->mm->mmap_sem);
-
- pagefault_out_of_memory();
-}
-
static void
do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
unsigned int fault)
@@ -879,7 +865,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
return 1;
}

- out_of_memory(regs, error_code, address);
+ up_read(&current->mm->mmap_sem);
+
+ /*
+ * We ran out of memory, call the OOM killer, and return the
+ * userspace (which will retry the fault, or kill us if we got
+ * oom-killed):
+ */
+ pagefault_out_of_memory();
} else {
if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
VM_FAULT_HWPOISON_LARGE))

2012-11-14 09:16:07

by David Rientjes

[permalink] [raw]
Subject: [patch 3/4] mm, oom: remove redundant sleep in pagefault oom handler

out_of_memory() will already cause current to schedule if it has not been
killed, so doing it again in pagefault_out_of_memory() is redundant.
Remove it.

Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
mm/oom_kill.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -683,5 +683,4 @@ void pagefault_out_of_memory(void)
out_of_memory(NULL, 0, 0, NULL, false);
clear_zonelist_oom(zonelist, GFP_KERNEL);
}
- schedule_timeout_killable(1);
}

2012-11-14 10:50:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 1/4] mm, oom: ensure sysrq+f always passes valid zonelist

On Wed 14-11-12 01:15:19, David Rientjes wrote:
> With hotpluggable and memoryless nodes, it's possible that node 0 will
> not be online, so use the first online node's zonelist rather than
> hardcoding node 0 to pass a zonelist with all zones to the oom killer.

Makes sense although I haven't seen a machine with no 0 node yet.
According to 13808910 this is indeed possible.

> Cc: Greg Kroah-Hartman <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> drivers/tty/sysrq.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
> --- a/drivers/tty/sysrq.c
> +++ b/drivers/tty/sysrq.c
> @@ -346,7 +346,8 @@ static struct sysrq_key_op sysrq_term_op = {
>
> static void moom_callback(struct work_struct *ignored)
> {
> - out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL, true);
> + out_of_memory(node_zonelist(first_online_node, GFP_KERNEL), GFP_KERNEL,
> + 0, NULL, true);
> }
>
> static DECLARE_WORK(moom_work, moom_callback);
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Michal Hocko
SUSE Labs

2012-11-14 11:03:07

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 1/4] mm, oom: ensure sysrq+f always passes valid zonelist

On Wed, 14 Nov 2012, Michal Hocko wrote:

> > With hotpluggable and memoryless nodes, it's possible that node 0 will
> > not be online, so use the first online node's zonelist rather than
> > hardcoding node 0 to pass a zonelist with all zones to the oom killer.
>
> Makes sense although I haven't seen a machine with no 0 node yet.

We routinely do testing with them, actually, just by physically removing
all memory described by the SRAT that maps to node 0. You could do the
same thing by making all pxms that map to node 0 to be hotpluggable in
your memory affinity structure. I've been bit by it one too many times so
I always keep in mind that no single node id is guaranteed to be online
(although at least one node is always online); hence, first_online_node is
the solution.

> According to 13808910 this is indeed possible.
>
> > Cc: Greg Kroah-Hartman <[email protected]>
> > Signed-off-by: David Rientjes <[email protected]>
>
> Reviewed-by: Michal Hocko <[email protected]>
>

Thanks!

2012-11-14 13:31:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 1/4] mm, oom: ensure sysrq+f always passes valid zonelist

On Wed 14-11-12 03:03:02, David Rientjes wrote:
> On Wed, 14 Nov 2012, Michal Hocko wrote:
>
> > > With hotpluggable and memoryless nodes, it's possible that node 0 will
> > > not be online, so use the first online node's zonelist rather than
> > > hardcoding node 0 to pass a zonelist with all zones to the oom killer.
> >
> > Makes sense although I haven't seen a machine with no 0 node yet.
>
> We routinely do testing with them, actually, just by physically removing
> all memory described by the SRAT that maps to node 0. You could do the
> same thing by making all pxms that map to node 0 to be hotpluggable in
> your memory affinity structure. I've been bit by it one too many times so
> I always keep in mind that no single node id is guaranteed to be online
> (although at least one node is always online); hence, first_online_node is
> the solution.

I thought that a boot cpu would be bound to a node0 or something similar.
Thanks for the clarification!

> > According to 13808910 this is indeed possible.
> >
> > > Cc: Greg Kroah-Hartman <[email protected]>
> > > Signed-off-by: David Rientjes <[email protected]>
> >
> > Reviewed-by: Michal Hocko <[email protected]>
> >
>
> Thanks!

--
Michal Hocko
SUSE Labs

2012-11-14 13:32:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 2/4] mm, oom: cleanup pagefault oom handler

On Wed 14-11-12 01:15:22, David Rientjes wrote:
> To lock the entire system from parallel oom killing, it's possible to
> pass in a zonelist with all zones rather than using
> for_each_populated_zone() for the iteration. This obsoletes
> try_set_system_oom() and clear_system_oom() so that they can be removed.
>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

The only _potential_ problem I can see with this is that if we ever have
a HW which requires that a node zonelist doesn't contain others nodes'
zones then this wouldn't work. I do not think such a HW exists. Such a HW
would need more changes in the code anyway.

so
Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/oom_kill.c | 49 +++++++------------------------------------------
> 1 files changed, 7 insertions(+), 42 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -591,43 +591,6 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> spin_unlock(&zone_scan_lock);
> }
>
> -/*
> - * Try to acquire the oom killer lock for all system zones. Returns zero if a
> - * parallel oom killing is taking place, otherwise locks all zones and returns
> - * non-zero.
> - */
> -static int try_set_system_oom(void)
> -{
> - struct zone *zone;
> - int ret = 1;
> -
> - spin_lock(&zone_scan_lock);
> - for_each_populated_zone(zone)
> - if (zone_is_oom_locked(zone)) {
> - ret = 0;
> - goto out;
> - }
> - for_each_populated_zone(zone)
> - zone_set_flag(zone, ZONE_OOM_LOCKED);
> -out:
> - spin_unlock(&zone_scan_lock);
> - return ret;
> -}
> -
> -/*
> - * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
> - * attempts or page faults may now recall the oom killer, if necessary.
> - */
> -static void clear_system_oom(void)
> -{
> - struct zone *zone;
> -
> - spin_lock(&zone_scan_lock);
> - for_each_populated_zone(zone)
> - zone_clear_flag(zone, ZONE_OOM_LOCKED);
> - spin_unlock(&zone_scan_lock);
> -}
> -
> /**
> * out_of_memory - kill the "best" process when we run out of memory
> * @zonelist: zonelist pointer
> @@ -708,15 +671,17 @@ out:
>
> /*
> * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel
> - * oom killing is already in progress so do nothing. If a task is found with
> - * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> + * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a
> + * parallel oom killing is already in progress so do nothing.
> */
> void pagefault_out_of_memory(void)
> {
> - if (try_set_system_oom()) {
> + struct zonelist *zonelist = node_zonelist(first_online_node,
> + GFP_KERNEL);
> +
> + if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
> out_of_memory(NULL, 0, 0, NULL, false);
> - clear_system_oom();
> + clear_zonelist_oom(zonelist, GFP_KERNEL);
> }
> schedule_timeout_killable(1);
> }

--
Michal Hocko
SUSE Labs

2012-11-14 13:45:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 3/4] mm, oom: remove redundant sleep in pagefault oom handler

On Wed 14-11-12 01:15:25, David Rientjes wrote:
> out_of_memory() will already cause current to schedule if it has not been
> killed, so doing it again in pagefault_out_of_memory() is redundant.
> Remove it.
>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/oom_kill.c | 1 -
> 1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -683,5 +683,4 @@ void pagefault_out_of_memory(void)
> out_of_memory(NULL, 0, 0, NULL, false);
> clear_zonelist_oom(zonelist, GFP_KERNEL);
> }
> - schedule_timeout_killable(1);
> }

--
Michal Hocko
SUSE Labs

2012-11-14 13:48:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 4/4] mm, oom: remove statically defined arch functions of same name

On Wed 14-11-12 01:15:28, David Rientjes wrote:
> out_of_memory() is a globally defined function to call the oom killer.
> x86, sh, and powerpc all use a function of the same name within file
> scope in their respective fault.c unnecessarily. Inline the functions
> into the pagefault handlers to clean the code up.

Yes I like it. It is really confusing to have a local function with the
same name.

>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> arch/powerpc/mm/fault.c | 27 ++++++++++++---------------
> arch/sh/mm/fault.c | 19 +++++++------------
> arch/x86/mm/fault.c | 23 ++++++++---------------
> 3 files changed, 27 insertions(+), 42 deletions(-)
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -113,19 +113,6 @@ static int store_updates_sp(struct pt_regs *regs)
> #define MM_FAULT_CONTINUE -1
> #define MM_FAULT_ERR(sig) (sig)
>
> -static int out_of_memory(struct pt_regs *regs)
> -{
> - /*
> - * We ran out of memory, or some other thing happened to us that made
> - * us unable to handle the page fault gracefully.
> - */
> - up_read(&current->mm->mmap_sem);
> - if (!user_mode(regs))
> - return MM_FAULT_ERR(SIGKILL);
> - pagefault_out_of_memory();
> - return MM_FAULT_RETURN;
> -}
> -
> static int do_sigbus(struct pt_regs *regs, unsigned long address)
> {
> siginfo_t info;
> @@ -169,8 +156,18 @@ static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)
> return MM_FAULT_CONTINUE;
>
> /* Out of memory */
> - if (fault & VM_FAULT_OOM)
> - return out_of_memory(regs);
> + if (fault & VM_FAULT_OOM) {
> + up_read(&current->mm->mmap_sem);
> +
> + /*
> + * We ran out of memory, or some other thing happened to us that
> + * made us unable to handle the page fault gracefully.
> + */
> + if (!user_mode(regs))
> + return MM_FAULT_ERR(SIGKILL);
> + pagefault_out_of_memory();
> + return MM_FAULT_RETURN;
> + }
>
> /* Bus error. x86 handles HWPOISON here, we'll add this if/when
> * we support the feature in HW
> diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
> --- a/arch/sh/mm/fault.c
> +++ b/arch/sh/mm/fault.c
> @@ -301,17 +301,6 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
> __bad_area(regs, error_code, address, SEGV_ACCERR);
> }
>
> -static void out_of_memory(void)
> -{
> - /*
> - * We ran out of memory, call the OOM killer, and return the userspace
> - * (which will retry the fault, or kill us if we got oom-killed):
> - */
> - up_read(&current->mm->mmap_sem);
> -
> - pagefault_out_of_memory();
> -}
> -
> static void
> do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
> {
> @@ -353,8 +342,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> no_context(regs, error_code, address);
> return 1;
> }
> + up_read(&current->mm->mmap_sem);
>
> - out_of_memory();
> + /*
> + * We ran out of memory, call the OOM killer, and return the
> + * userspace (which will retry the fault, or kill us if we got
> + * oom-killed):
> + */
> + pagefault_out_of_memory();
> } else {
> if (fault & VM_FAULT_SIGBUS)
> do_sigbus(regs, error_code, address);
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -803,20 +803,6 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
> __bad_area(regs, error_code, address, SEGV_ACCERR);
> }
>
> -/* TODO: fixup for "mm-invoke-oom-killer-from-page-fault.patch" */
> -static void
> -out_of_memory(struct pt_regs *regs, unsigned long error_code,
> - unsigned long address)
> -{
> - /*
> - * We ran out of memory, call the OOM killer, and return the userspace
> - * (which will retry the fault, or kill us if we got oom-killed):
> - */
> - up_read(&current->mm->mmap_sem);
> -
> - pagefault_out_of_memory();
> -}
> -
> static void
> do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
> unsigned int fault)
> @@ -879,7 +865,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> return 1;
> }
>
> - out_of_memory(regs, error_code, address);
> + up_read(&current->mm->mmap_sem);
> +
> + /*
> + * We ran out of memory, call the OOM killer, and return the
> + * userspace (which will retry the fault, or kill us if we got
> + * oom-killed):
> + */
> + pagefault_out_of_memory();
> } else {
> if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
> VM_FAULT_HWPOISON_LARGE))

--
Michal Hocko
SUSE Labs

2012-11-15 08:42:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 1/4] mm, oom: ensure sysrq+f always passes valid zonelist

(2012/11/14 18:15), David Rientjes wrote:
> With hotpluggable and memoryless nodes, it's possible that node 0 will
> not be online, so use the first online node's zonelist rather than
> hardcoding node 0 to pass a zonelist with all zones to the oom killer.
>
> Cc: Greg Kroah-Hartman <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

Thank you.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-15 08:45:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 2/4] mm, oom: cleanup pagefault oom handler

(2012/11/14 18:15), David Rientjes wrote:
> To lock the entire system from parallel oom killing, it's possible to
> pass in a zonelist with all zones rather than using
> for_each_populated_zone() for the iteration. This obsoletes
> try_set_system_oom() and clear_system_oom() so that they can be removed.
>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

I'm sorry if I missed something...

> ---
> mm/oom_kill.c | 49 +++++++------------------------------------------
> 1 files changed, 7 insertions(+), 42 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -591,43 +591,6 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> spin_unlock(&zone_scan_lock);
> }
>
> -/*
> - * Try to acquire the oom killer lock for all system zones. Returns zero if a
> - * parallel oom killing is taking place, otherwise locks all zones and returns
> - * non-zero.
> - */
> -static int try_set_system_oom(void)
> -{
> - struct zone *zone;
> - int ret = 1;
> -
> - spin_lock(&zone_scan_lock);
> - for_each_populated_zone(zone)
> - if (zone_is_oom_locked(zone)) {
> - ret = 0;
> - goto out;
> - }
> - for_each_populated_zone(zone)
> - zone_set_flag(zone, ZONE_OOM_LOCKED);
> -out:
> - spin_unlock(&zone_scan_lock);
> - return ret;
> -}
> -
> -/*
> - * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
> - * attempts or page faults may now recall the oom killer, if necessary.
> - */
> -static void clear_system_oom(void)
> -{
> - struct zone *zone;
> -
> - spin_lock(&zone_scan_lock);
> - for_each_populated_zone(zone)
> - zone_clear_flag(zone, ZONE_OOM_LOCKED);
> - spin_unlock(&zone_scan_lock);
> -}
> -
> /**
> * out_of_memory - kill the "best" process when we run out of memory
> * @zonelist: zonelist pointer
> @@ -708,15 +671,17 @@ out:
>
> /*
> * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel
> - * oom killing is already in progress so do nothing. If a task is found with
> - * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> + * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a
> + * parallel oom killing is already in progress so do nothing.
> */
> void pagefault_out_of_memory(void)
> {
> - if (try_set_system_oom()) {
> + struct zonelist *zonelist = node_zonelist(first_online_node,
> + GFP_KERNEL);


why GFP_KERNEL ? not GFP_HIGHUSER_MOVABLE ?

Thanks,
-Kame

> +
> + if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
> out_of_memory(NULL, 0, 0, NULL, false);
> - clear_system_oom();
> + clear_zonelist_oom(zonelist, GFP_KERNEL);
> }
> schedule_timeout_killable(1);
> }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2012-11-15 08:47:04

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 3/4] mm, oom: remove redundant sleep in pagefault oom handler

(2012/11/14 18:15), David Rientjes wrote:
> out_of_memory() will already cause current to schedule if it has not been
> killed, so doing it again in pagefault_out_of_memory() is redundant.
> Remove it.
>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-15 08:49:15

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 4/4] mm, oom: remove statically defined arch functions of same name

(2012/11/14 18:15), David Rientjes wrote:
> out_of_memory() is a globally defined function to call the oom killer.
> x86, sh, and powerpc all use a function of the same name within file
> scope in their respective fault.c unnecessarily. Inline the functions
> into the pagefault handlers to clean the code up.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

I think this is good.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-15 09:02:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 2/4] mm, oom: cleanup pagefault oom handler

On Thu 15-11-12 17:45:18, KAMEZAWA Hiroyuki wrote:
> (2012/11/14 18:15), David Rientjes wrote:
[...]
> >@@ -708,15 +671,17 @@ out:
> >
> > /*
> > * The pagefault handler calls here because it is out of memory, so kill a
> >- * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel
> >- * oom killing is already in progress so do nothing. If a task is found with
> >- * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> >+ * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a
> >+ * parallel oom killing is already in progress so do nothing.
> > */
> > void pagefault_out_of_memory(void)
> > {
> >- if (try_set_system_oom()) {
> >+ struct zonelist *zonelist = node_zonelist(first_online_node,
> >+ GFP_KERNEL);
>
>
> why GFP_KERNEL ? not GFP_HIGHUSER_MOVABLE ?

I was wondering about the same but gfp_zonelist cares only about
__GFP_THISNODE so GFP_HIGHUSER_MOVABLE doesn't do any difference.
--
Michal Hocko
SUSE Labs

2012-11-15 21:01:28

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 2/4] mm, oom: cleanup pagefault oom handler

On Thu, 15 Nov 2012, Kamezawa Hiroyuki wrote:

> > @@ -708,15 +671,17 @@ out:
> >
> > /*
> > * The pagefault handler calls here because it is out of memory, so kill a
> > - * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a
> > parallel
> > - * oom killing is already in progress so do nothing. If a task is found
> > with
> > - * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> > + * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a
> > + * parallel oom killing is already in progress so do nothing.
> > */
> > void pagefault_out_of_memory(void)
> > {
> > - if (try_set_system_oom()) {
> > + struct zonelist *zonelist = node_zonelist(first_online_node,
> > + GFP_KERNEL);
>
>
> why GFP_KERNEL ? not GFP_HIGHUSER_MOVABLE ?
>

The usual way to get a zonelist consisting of all zones ordered by
preferring a node is node_zonelist(nid, GFP_KERNEL), but there's no
difference between using GFP_KERNEL, GFP_HIGHUSER_MOVABLE, or even 0. I
simply duplicated what the sysrq trigger was doing, but you could also do
&first_online_pgdat->node_zonelists[0], it's really just a matter of
preference.