2010-01-17 00:38:21

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

Hi,

I thing the snippet below is a good summary of what this is about.

On Saturday 16 January 2010, Rafael J. Wysocki wrote:
> On Saturday 16 January 2010, Maxim Levitsky wrote:
> > On Sat, 2010-01-16 at 01:57 +0100, Rafael J. Wysocki wrote:
> > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > On Fri, 2010-01-15 at 23:03 +0100, Rafael J. Wysocki wrote:
> > > > > On Friday 15 January 2010, Maxim Levitsky wrote:
> > > > > > Hi,
> > > > >
> > > > > Hi,
> > > > >
> > > > > > I know that this is very controversial, because here I want to describe
> > > > > > a problem in a proprietary driver that happens now in 2.6.33-rc3
> > > > > > I am taking about nvidia driver.
> > > > > >
> > > > > > Some time ago I did very long hibernate test and found no errors after
> > > > > > more that 200 cycles.
> > > > > >
> > > > > > Now I update to 2.6.33 and notice that system will hand when nvidia
> > > > > > driver allocates memory is their .suspend functions.
> > > > >
> > > > > They shouldn't do that, there's no guarantee that's going to work at all.
> > > > >
> > > > > > This could fail in 2.6.32 if I would run many memory hungry
> > > > > > applications, but now this happens with most of memory free.
> > > > >
> > > > > This sounds a little strange. What's the requested size of the image?
> > > > Don't know, but system has to be very tight on memory.
> > >
> > > Can you send full dmesg, please?
> >
> > I deleted it, but for this case I think that hang was somewhere else.
> > This task was hand on doing forking, which probably happened even before
> > the freezer.
> >
> > Anyway, the problem is clear. Now __get_free_pages blocks more often,
> > and can block in .suspend even if there is plenty of memory free.

This is suspicious, but I leave it to the MM people for consideration.

> > I now patched nvidia to use GFP_ATOMIC _always_, and problem disappear.
> > It isn't such great solution when memory is tight though....
> >
> > This is going to hit hard all nvidia users...
>
> Well, generally speaking, no driver should ever allocate memory using
> GFP_KERNEL in its .suspend() routine, because that's not going to work, as you
> can readily see. So this is a NVidia bug, hands down.
>
> Now having said that, we've been considering a change that will turn all
> GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> prepare a patch to do that and let's see what people think.

If I didn't confuse anything (which is likely, because it's a bit late here
now), the patch below should do the trick. I have only checked that it doesn't
break compilation, so please take it with a grain of salt.

Comments welcome.

Rafael

---
include/linux/gfp.h | 5 +++++
kernel/power/hibernate.c | 6 ++++++
kernel/power/main.c | 30 ++++++++++++++++++++++++++++++
kernel/power/power.h | 2 ++
kernel/power/suspend.c | 2 ++
5 files changed, 45 insertions(+)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -342,4 +342,9 @@ static inline void set_gfp_allowed_mask(
gfp_allowed_mask = mask;
}

+static inline gfp_t get_gfp_allowed_mask(void)
+{
+ return gfp_allowed_mask;
+}
+
#endif /* __LINUX_GFP_H */
Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -334,6 +334,7 @@ int hibernation_snapshot(int platform_mo
goto Close;

suspend_console();
+ pm_force_noio_allocations();
error = dpm_suspend_start(PMSG_FREEZE);
if (error)
goto Recover_platform;
@@ -351,6 +352,7 @@ int hibernation_snapshot(int platform_mo

dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
+ pm_allow_io_allocations();
resume_console();
Close:
platform_end(platform_mode);
@@ -448,11 +450,13 @@ int hibernation_restore(int platform_mod

pm_prepare_console();
suspend_console();
+ pm_force_noio_allocations();
error = dpm_suspend_start(PMSG_QUIESCE);
if (!error) {
error = resume_target_kernel(platform_mode);
dpm_resume_end(PMSG_RECOVER);
}
+ pm_allow_io_allocations();
resume_console();
pm_restore_console();
return error;
@@ -481,6 +485,7 @@ int hibernation_platform_enter(void)

entering_platform_hibernation = true;
suspend_console();
+ pm_force_noio_allocations();
error = dpm_suspend_start(PMSG_HIBERNATE);
if (error) {
if (hibernation_ops->recover)
@@ -518,6 +523,7 @@ int hibernation_platform_enter(void)
Resume_devices:
entering_platform_hibernation = false;
dpm_resume_end(PMSG_RESTORE);
+ pm_allow_io_allocations();
resume_console();

Close:
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -187,6 +187,8 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+extern void pm_force_noio_allocations(void);
+extern void pm_allow_io_allocations(void);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -208,6 +208,7 @@ int suspend_devices_and_enter(suspend_st
goto Close;
}
suspend_console();
+ pm_force_noio_allocations();
suspend_test_start();
error = dpm_suspend_start(PMSG_SUSPEND);
if (error) {
@@ -224,6 +225,7 @@ int suspend_devices_and_enter(suspend_st
suspend_test_start();
dpm_resume_end(PMSG_RESUME);
suspend_test_finish("resume devices");
+ pm_allow_io_allocations();
resume_console();
Close:
if (suspend_ops->end)
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -12,6 +12,7 @@
#include <linux/string.h>
#include <linux/resume-trace.h>
#include <linux/workqueue.h>
+#include <linux/gfp.h>

#include "power.h"

@@ -22,6 +23,35 @@ EXPORT_SYMBOL(pm_flags);

#ifdef CONFIG_PM_SLEEP

+static gfp_t saved_gfp_allowed_mask;
+
+/**
+ * pm_force_noio_allocations - Modify gfp_allowed_mask to disable IO allocations
+ *
+ * Change gfp_allowed_mask by unsetting __GFP_IO and __GFP_FS in it and save the
+ * old value.
+ */
+void pm_force_noio_allocations(void)
+{
+ saved_gfp_allowed_mask = get_gfp_allowed_mask();
+ set_gfp_allowed_mask(saved_gfp_allowed_mask & ~(__GFP_IO | __GFP_FS));
+}
+
+/**
+ * pm_allow_io_allocations - Modify gfp_allowed_mask to allow IO allocations
+ *
+ * If the saved value of gfp_allowed_mask has __GFP_IO set, modify the current
+ * gfp_allowed_mask by setting this bit and anlogously for __GFP_FS.
+ */
+void pm_allow_io_allocations(void)
+{
+ gfp_t gfp_mask;
+
+ gfp_mask = get_gfp_allowed_mask();
+ gfp_mask |= saved_gfp_allowed_mask & (__GFP_IO | __GFP_FS);
+ set_gfp_allowed_mask(gfp_mask);
+}
+
/* Routines for PM-transition notifications */

static BLOCKING_NOTIFIER_HEAD(pm_chain_head);


2010-01-17 01:23:54

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

Am Sonntag, 17. Januar 2010 01:38:37 schrieb Rafael J. Wysocki:
> > Now having said that, we've been considering a change that will turn all
> > GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> > prepare a patch to do that and let's see what people think.
>
> If I didn't confuse anything (which is likely, because it's a bit late here
> now), the patch below should do the trick. I have only checked that it doesn't
> break compilation, so please take it with a grain of salt.
>
> Comments welcome.

I think this is a bad idea as it makes the mm subsystem behave differently
in the runtime and in the whole system cases. What's so hard about telling
people that they need to use GFP_NOIO in suspend() and resume()?

Regards
Oliver

2010-01-17 13:27:14

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Sunday 17 January 2010, Oliver Neukum wrote:
> Am Sonntag, 17. Januar 2010 01:38:37 schrieb Rafael J. Wysocki:
> > > Now having said that, we've been considering a change that will turn all
> > > GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> > > prepare a patch to do that and let's see what people think.
> >
> > If I didn't confuse anything (which is likely, because it's a bit late here
> > now), the patch below should do the trick. I have only checked that it doesn't
> > break compilation, so please take it with a grain of salt.
> >
> > Comments welcome.
>
> I think this is a bad idea as it makes the mm subsystem behave differently
> in the runtime and in the whole system cases.

s/runtime/suspend/ ?

Yes it will, but why exactly shouldn't it? System suspend/resume _is_ a
special situation anyway.

> What's so hard about telling people that they need to use GFP_NOIO in
> suspend() and resume()?

Memory allocations are made for other purposes during suspend/resume too. For
example, new kernel threads may be created (for async suspend/resume among
other things).

Besides, the fact that you tell people to do something doesn't necessary imply
that they will listen. :-)

I have discussed that with Ben for a couple of times and we have generally
agreed that memory allocation problems during suspend/resume are not avoidable
in general unless we disable __GFP_FS and __GFP_IO at the high level.

Rafael

2010-01-17 13:35:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Sunday 17 January 2010, Rafael J. Wysocki wrote:
> On Sunday 17 January 2010, Oliver Neukum wrote:
> > Am Sonntag, 17. Januar 2010 01:38:37 schrieb Rafael J. Wysocki:
> > > > Now having said that, we've been considering a change that will turn all
> > > > GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> > > > prepare a patch to do that and let's see what people think.
> > >
> > > If I didn't confuse anything (which is likely, because it's a bit late here
> > > now), the patch below should do the trick. I have only checked that it doesn't
> > > break compilation, so please take it with a grain of salt.
> > >
> > > Comments welcome.
> >
> > I think this is a bad idea as it makes the mm subsystem behave differently
> > in the runtime and in the whole system cases.
>
> s/runtime/suspend/ ?
>
> Yes it will, but why exactly shouldn't it? System suspend/resume _is_ a
> special situation anyway.

Moreover, as the Maxim's report indicates, the mm subsystem already behaves
differently during suspend/resume and this behavior is erroneous, because it
causes the system to hang.

Rafael

2010-01-17 13:55:41

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Sunday 17 January 2010, Rafael J. Wysocki wrote:
> Hi,
>
> I thing the snippet below is a good summary of what this is about.
>
> On Saturday 16 January 2010, Rafael J. Wysocki wrote:
> > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > On Sat, 2010-01-16 at 01:57 +0100, Rafael J. Wysocki wrote:
> > > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > > On Fri, 2010-01-15 at 23:03 +0100, Rafael J. Wysocki wrote:
> > > > > > On Friday 15 January 2010, Maxim Levitsky wrote:
> > > > > > > Hi,
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > I know that this is very controversial, because here I want to describe
> > > > > > > a problem in a proprietary driver that happens now in 2.6.33-rc3
> > > > > > > I am taking about nvidia driver.
> > > > > > >
> > > > > > > Some time ago I did very long hibernate test and found no errors after
> > > > > > > more that 200 cycles.
> > > > > > >
> > > > > > > Now I update to 2.6.33 and notice that system will hand when nvidia
> > > > > > > driver allocates memory is their .suspend functions.
> > > > > >
> > > > > > They shouldn't do that, there's no guarantee that's going to work at all.
> > > > > >
> > > > > > > This could fail in 2.6.32 if I would run many memory hungry
> > > > > > > applications, but now this happens with most of memory free.
> > > > > >
> > > > > > This sounds a little strange. What's the requested size of the image?
> > > > > Don't know, but system has to be very tight on memory.
> > > >
> > > > Can you send full dmesg, please?
> > >
> > > I deleted it, but for this case I think that hang was somewhere else.
> > > This task was hand on doing forking, which probably happened even before
> > > the freezer.
> > >
> > > Anyway, the problem is clear. Now __get_free_pages blocks more often,
> > > and can block in .suspend even if there is plenty of memory free.
>
> This is suspicious, but I leave it to the MM people for consideration.
>
> > > I now patched nvidia to use GFP_ATOMIC _always_, and problem disappear.
> > > It isn't such great solution when memory is tight though....
> > >
> > > This is going to hit hard all nvidia users...
> >
> > Well, generally speaking, no driver should ever allocate memory using
> > GFP_KERNEL in its .suspend() routine, because that's not going to work, as you
> > can readily see. So this is a NVidia bug, hands down.
> >
> > Now having said that, we've been considering a change that will turn all
> > GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> > prepare a patch to do that and let's see what people think.
>
> If I didn't confuse anything (which is likely, because it's a bit late here
> now), the patch below should do the trick. I have only checked that it doesn't
> break compilation, so please take it with a grain of salt.

Appended is another version that attempts to remove some possible races.
It's been tested a little too.

Rafael

---
include/linux/gfp.h | 5 +----
kernel/power/hibernate.c | 6 ++++++
kernel/power/main.c | 1 +
kernel/power/power.h | 3 +++
kernel/power/suspend.c | 2 ++
mm/Makefile | 1 +
mm/internal.h | 3 +++
mm/page_alloc.c | 16 +++++++++++++++-
mm/pm.c | 34 ++++++++++++++++++++++++++++++++++
9 files changed, 66 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -337,9 +337,6 @@ void drain_local_pages(void *dummy);

extern gfp_t gfp_allowed_mask;

-static inline void set_gfp_allowed_mask(gfp_t mask)
-{
- gfp_allowed_mask = mask;
-}
+extern void set_gfp_allowed_mask(gfp_t mask);

#endif /* __LINUX_GFP_H */
Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -334,6 +334,7 @@ int hibernation_snapshot(int platform_mo
goto Close;

suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_FREEZE);
if (error)
goto Recover_platform;
@@ -351,6 +352,7 @@ int hibernation_snapshot(int platform_mo

dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
+ mm_allow_io_allocations();
resume_console();
Close:
platform_end(platform_mode);
@@ -448,11 +450,13 @@ int hibernation_restore(int platform_mod

pm_prepare_console();
suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_QUIESCE);
if (!error) {
error = resume_target_kernel(platform_mode);
dpm_resume_end(PMSG_RECOVER);
}
+ mm_allow_io_allocations();
resume_console();
pm_restore_console();
return error;
@@ -481,6 +485,7 @@ int hibernation_platform_enter(void)

entering_platform_hibernation = true;
suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_HIBERNATE);
if (error) {
if (hibernation_ops->recover)
@@ -518,6 +523,7 @@ int hibernation_platform_enter(void)
Resume_devices:
entering_platform_hibernation = false;
dpm_resume_end(PMSG_RESTORE);
+ mm_allow_io_allocations();
resume_console();

Close:
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -187,6 +187,9 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+/* mm/pm.c */
+extern void mm_force_noio_allocations(void);
+extern void mm_allow_io_allocations(void);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -208,6 +208,7 @@ int suspend_devices_and_enter(suspend_st
goto Close;
}
suspend_console();
+ mm_force_noio_allocations();
suspend_test_start();
error = dpm_suspend_start(PMSG_SUSPEND);
if (error) {
@@ -224,6 +225,7 @@ int suspend_devices_and_enter(suspend_st
suspend_test_start();
dpm_resume_end(PMSG_RESUME);
suspend_test_finish("resume devices");
+ mm_allow_io_allocations();
resume_console();
Close:
if (suspend_ops->end)
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -12,6 +12,7 @@
#include <linux/string.h>
#include <linux/resume-trace.h>
#include <linux/workqueue.h>
+#include <linux/gfp.h>

#include "power.h"

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -76,6 +76,17 @@ unsigned long totalreserve_pages __read_
int percpu_pagelist_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

+DECLARE_RWSEM(gfp_allowed_mask_sem);
+
+void set_gfp_allowed_mask(gfp_t mask)
+{
+ /* Wait for all slowpath allocations using the old mask to complete */
+ down_write(&gfp_allowed_mask_sem);
+ gfp_allowed_mask = mask;
+ up_write(&gfp_allowed_mask_sem);
+}
+
+
#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
int pageblock_order __read_mostly;
#endif
@@ -1963,10 +1974,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
preferred_zone, migratetype);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ down_read(&gfp_allowed_mask_sem);
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ up_read(&gfp_allowed_mask_sem);
+ }

trace_mm_page_alloc(page, order, gfp_mask, migratetype);
return page;
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_PM_SLEEP) += pm.o
Index: linux-2.6/mm/pm.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/pm.c
@@ -0,0 +1,34 @@
+#include "internal.h"
+
+static gfp_t saved_gfp_allowed_mask;
+
+/**
+ * mm_force_noio_allocations - Modify gfp_allowed_mask to disable IO allocations
+ *
+ * Change gfp_allowed_mask by unsetting __GFP_IO and __GFP_FS in it and save the
+ * old value.
+ */
+void mm_force_noio_allocations(void)
+{
+ /* Wait for all slowpath allocations using the old mask to complete */
+ down_write(&gfp_allowed_mask_sem);
+ saved_gfp_allowed_mask = gfp_allowed_mask;
+ gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
+ up_write(&gfp_allowed_mask_sem);
+}
+
+/**
+ * mm_allow_io_allocations - Modify gfp_allowed_mask to allow IO allocations
+ *
+ * If the saved value of gfp_allowed_mask has __GFP_IO set, modify the current
+ * gfp_allowed_mask by setting this bit and anlogously for __GFP_FS.
+ */
+void mm_allow_io_allocations(void)
+{
+ gfp_t gfp_mask = saved_gfp_allowed_mask & (__GFP_IO | __GFP_FS);
+
+ /* Wait for all slowpath allocations using the old mask to complete */
+ down_write(&gfp_allowed_mask_sem);
+ gfp_allowed_mask |= gfp_mask;
+ up_write(&gfp_allowed_mask_sem);
+}
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -12,6 +12,7 @@
#define __MM_INTERNAL_H

#include <linux/mm.h>
+#include <linux/rwsem.h>

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
@@ -259,3 +260,5 @@ extern u64 hwpoison_filter_flags_mask;
extern u64 hwpoison_filter_flags_value;
extern u64 hwpoison_filter_memcg;
extern u32 hwpoison_filter_enable;
+
+extern struct rw_semaphore gfp_allowed_mask_sem;

2010-01-17 16:21:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

Hi, Rafael.

On Sun, 2010-01-17 at 14:55 +0100, Rafael J. Wysocki wrote:
> On Sunday 17 January 2010, Rafael J. Wysocki wrote:
> > Hi,
> >
> > I thing the snippet below is a good summary of what this is about.
> >
> > On Saturday 16 January 2010, Rafael J. Wysocki wrote:
> > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > On Sat, 2010-01-16 at 01:57 +0100, Rafael J. Wysocki wrote:
> > > > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > > > On Fri, 2010-01-15 at 23:03 +0100, Rafael J. Wysocki wrote:
> > > > > > > On Friday 15 January 2010, Maxim Levitsky wrote:
> > > > > > > > Hi,
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > I know that this is very controversial, because here I want to describe
> > > > > > > > a problem in a proprietary driver that happens now in 2.6.33-rc3
> > > > > > > > I am taking about nvidia driver.
> > > > > > > >
> > > > > > > > Some time ago I did very long hibernate test and found no errors after
> > > > > > > > more that 200 cycles.
> > > > > > > >
> > > > > > > > Now I update to 2.6.33 and notice that system will hand when nvidia
> > > > > > > > driver allocates memory is their .suspend functions.
> > > > > > >
> > > > > > > They shouldn't do that, there's no guarantee that's going to work at all.
> > > > > > >
> > > > > > > > This could fail in 2.6.32 if I would run many memory hungry
> > > > > > > > applications, but now this happens with most of memory free.
> > > > > > >
> > > > > > > This sounds a little strange. What's the requested size of the image?
> > > > > > Don't know, but system has to be very tight on memory.
> > > > >
> > > > > Can you send full dmesg, please?
> > > >
> > > > I deleted it, but for this case I think that hang was somewhere else.
> > > > This task was hand on doing forking, which probably happened even before
> > > > the freezer.
> > > >
> > > > Anyway, the problem is clear. Now __get_free_pages blocks more often,
> > > > and can block in .suspend even if there is plenty of memory free.
> >
> > This is suspicious, but I leave it to the MM people for consideration.
> >
> > > > I now patched nvidia to use GFP_ATOMIC _always_, and problem disappear.
> > > > It isn't such great solution when memory is tight though....
> > > >
> > > > This is going to hit hard all nvidia users...
> > >
> > > Well, generally speaking, no driver should ever allocate memory using
> > > GFP_KERNEL in its .suspend() routine, because that's not going to work, as you
> > > can readily see. So this is a NVidia bug, hands down.
> > >
> > > Now having said that, we've been considering a change that will turn all
> > > GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> > > prepare a patch to do that and let's see what people think.
> >
> > If I didn't confuse anything (which is likely, because it's a bit late here
> > now), the patch below should do the trick. I have only checked that it doesn't
> > break compilation, so please take it with a grain of salt.
>
> Appended is another version that attempts to remove some possible races.
> It's been tested a little too.
>
> Rafael
>
> ---
> include/linux/gfp.h | 5 +----
> kernel/power/hibernate.c | 6 ++++++
> kernel/power/main.c | 1 +
> kernel/power/power.h | 3 +++
> kernel/power/suspend.c | 2 ++
> mm/Makefile | 1 +
> mm/internal.h | 3 +++
> mm/page_alloc.c | 16 +++++++++++++++-
> mm/pm.c | 34 ++++++++++++++++++++++++++++++++++
> 9 files changed, 66 insertions(+), 5 deletions(-)
>
> Index: linux-2.6/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/gfp.h
> +++ linux-2.6/include/linux/gfp.h
> @@ -337,9 +337,6 @@ void drain_local_pages(void *dummy);
>
> extern gfp_t gfp_allowed_mask;
>
> -static inline void set_gfp_allowed_mask(gfp_t mask)
> -{
> - gfp_allowed_mask = mask;
> -}
> +extern void set_gfp_allowed_mask(gfp_t mask);
>
> #endif /* __LINUX_GFP_H */
> Index: linux-2.6/kernel/power/hibernate.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/hibernate.c
> +++ linux-2.6/kernel/power/hibernate.c
> @@ -334,6 +334,7 @@ int hibernation_snapshot(int platform_mo
> goto Close;
>
> suspend_console();
> + mm_force_noio_allocations();
> error = dpm_suspend_start(PMSG_FREEZE);
> if (error)
> goto Recover_platform;
> @@ -351,6 +352,7 @@ int hibernation_snapshot(int platform_mo
>
> dpm_resume_end(in_suspend ?
> (error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
> + mm_allow_io_allocations();
> resume_console();
> Close:
> platform_end(platform_mode);
> @@ -448,11 +450,13 @@ int hibernation_restore(int platform_mod
>
> pm_prepare_console();
> suspend_console();
> + mm_force_noio_allocations();
> error = dpm_suspend_start(PMSG_QUIESCE);
> if (!error) {
> error = resume_target_kernel(platform_mode);
> dpm_resume_end(PMSG_RECOVER);
> }
> + mm_allow_io_allocations();
> resume_console();
> pm_restore_console();
> return error;
> @@ -481,6 +485,7 @@ int hibernation_platform_enter(void)
>
> entering_platform_hibernation = true;
> suspend_console();
> + mm_force_noio_allocations();
> error = dpm_suspend_start(PMSG_HIBERNATE);
> if (error) {
> if (hibernation_ops->recover)
> @@ -518,6 +523,7 @@ int hibernation_platform_enter(void)
> Resume_devices:
> entering_platform_hibernation = false;
> dpm_resume_end(PMSG_RESTORE);
> + mm_allow_io_allocations();
> resume_console();
>
> Close:
> Index: linux-2.6/kernel/power/power.h
> ===================================================================
> --- linux-2.6.orig/kernel/power/power.h
> +++ linux-2.6/kernel/power/power.h
> @@ -187,6 +187,9 @@ static inline void suspend_test_finish(c
> #ifdef CONFIG_PM_SLEEP
> /* kernel/power/main.c */
> extern int pm_notifier_call_chain(unsigned long val);
> +/* mm/pm.c */
> +extern void mm_force_noio_allocations(void);
> +extern void mm_allow_io_allocations(void);
> #endif
>
> #ifdef CONFIG_HIGHMEM
> Index: linux-2.6/kernel/power/suspend.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/suspend.c
> +++ linux-2.6/kernel/power/suspend.c
> @@ -208,6 +208,7 @@ int suspend_devices_and_enter(suspend_st
> goto Close;
> }
> suspend_console();
> + mm_force_noio_allocations();
> suspend_test_start();
> error = dpm_suspend_start(PMSG_SUSPEND);
> if (error) {
> @@ -224,6 +225,7 @@ int suspend_devices_and_enter(suspend_st
> suspend_test_start();
> dpm_resume_end(PMSG_RESUME);
> suspend_test_finish("resume devices");
> + mm_allow_io_allocations();
> resume_console();
> Close:
> if (suspend_ops->end)
> Index: linux-2.6/kernel/power/main.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/main.c
> +++ linux-2.6/kernel/power/main.c
> @@ -12,6 +12,7 @@
> #include <linux/string.h>
> #include <linux/resume-trace.h>
> #include <linux/workqueue.h>
> +#include <linux/gfp.h>
>
> #include "power.h"
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -76,6 +76,17 @@ unsigned long totalreserve_pages __read_
> int percpu_pagelist_fraction;
> gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>
> +DECLARE_RWSEM(gfp_allowed_mask_sem);
> +
> +void set_gfp_allowed_mask(gfp_t mask)
> +{
> + /* Wait for all slowpath allocations using the old mask to complete */

We used gfp_allowed_mask in fast path.
Why did you consider only slow path?

> + down_write(&gfp_allowed_mask_sem);
> + gfp_allowed_mask = mask;
> + up_write(&gfp_allowed_mask_sem);
> +}
> +
> +
> #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> int pageblock_order __read_mostly;
> #endif
> @@ -1963,10 +1974,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
> page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> preferred_zone, migratetype);
> - if (unlikely(!page))
> + if (unlikely(!page)) {
> + down_read(&gfp_allowed_mask_sem);
> page = __alloc_pages_slowpath(gfp_mask, order,
> zonelist, high_zoneidx, nodemask,
> preferred_zone, migratetype);
> + up_read(&gfp_allowed_mask_sem);
> + }
>
> trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> return page;
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile
> +++ linux-2.6/mm/Makefile
> @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_PM_SLEEP) += pm.o
> Index: linux-2.6/mm/pm.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/mm/pm.c
> @@ -0,0 +1,34 @@
> +#include "internal.h"
> +
> +static gfp_t saved_gfp_allowed_mask;
> +
> +/**
> + * mm_force_noio_allocations - Modify gfp_allowed_mask to disable IO allocations
> + *
> + * Change gfp_allowed_mask by unsetting __GFP_IO and __GFP_FS in it and save the
> + * old value.
> + */
> +void mm_force_noio_allocations(void)
> +{
> + /* Wait for all slowpath allocations using the old mask to complete */
> + down_write(&gfp_allowed_mask_sem);
> + saved_gfp_allowed_mask = gfp_allowed_mask;
> + gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
> + up_write(&gfp_allowed_mask_sem);

How about using set_gfp_allowed_mask?
Let's return old mask in set_gfp_allowed_mask.

> +}
> +
> +/**
> + * mm_allow_io_allocations - Modify gfp_allowed_mask to allow IO allocations
> + *
> + * If the saved value of gfp_allowed_mask has __GFP_IO set, modify the current
> + * gfp_allowed_mask by setting this bit and anlogously for __GFP_FS.
> + */
> +void mm_allow_io_allocations(void)
> +{
> + gfp_t gfp_mask = saved_gfp_allowed_mask & (__GFP_IO | __GFP_FS);
> +
> + /* Wait for all slowpath allocations using the old mask to complete */
> + down_write(&gfp_allowed_mask_sem);
> + gfp_allowed_mask |= gfp_mask;
> + up_write(&gfp_allowed_mask_sem);

Ditto

> +}
> Index: linux-2.6/mm/internal.h
> ===================================================================
> --- linux-2.6.orig/mm/internal.h
> +++ linux-2.6/mm/internal.h
> @@ -12,6 +12,7 @@
> #define __MM_INTERNAL_H
>
> #include <linux/mm.h>
> +#include <linux/rwsem.h>
>
> void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> unsigned long floor, unsigned long ceiling);
> @@ -259,3 +260,5 @@ extern u64 hwpoison_filter_flags_mask;
> extern u64 hwpoison_filter_flags_value;
> extern u64 hwpoison_filter_memcg;
> extern u32 hwpoison_filter_enable;
> +
> +extern struct rw_semaphore gfp_allowed_mask_sem;

I think we can use lockdep annotation, too. but it's overkill.
That's because suspend/resume is rare event so that I want to add
the cost in lockdep. so I like this idea. But, I have a concern.
You are adding a little bit cost in alloc path although it's slow one.

Really really do we need this?
Can't we remove the wrong usage in review or test process before merge?

I don't have many experience at suspend/resume.
I depends on your experience about this patch's value. :)

Thanks.

--
Kind regards,
Minchan Kim

2010-01-17 16:23:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

> I think we can use lockdep annotation, too. but it's overkill.
> That's because suspend/resume is rare event so that I want to add

so that I don't want to add

Sorry for the typo.
--
Kind regards,
Minchan Kim

2010-01-17 18:58:49

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:

> Yes it will, but why exactly shouldn't it? System suspend/resume _is_ a
> special situation anyway.

To some extent this is similar to the boot time allocation problem for
which it was decided to bury the logic in the allocator as well.

> Memory allocations are made for other purposes during suspend/resume too. For
> example, new kernel threads may be created (for async suspend/resume among
> other things).

Right. Well, I would add in fact that this isn't even the main issue I
see. If it was just a matter of changing a kmalloc() call in a driver
suspend() routine, I would agree with Oliver.

However, there are two categories of allocations that make this
extremely difficult:

- One is implicit allocations. IE. suspend() is a normal task context,
it's expected that any function can be called that might itself call a
function etc... that does an allocation. There is simply no way all of
these code path can be identified and the allocation "flags" pushed up
all the way to the API in every case.

- There's a more subtle issue at play here. The moment the core starts
calling driver's suspend() routines, all allocations can potentially
hang since a device with dirty pages might have been suspended and the
VM can stall trying to swap out to it. (I don't think Rafael proposed
patch handles this in a race free way btw, but that's hard, especially
for allocations already blocked waiting for a write back ...). That
means that a driver that has -not- been suspended yet (and thus doesn't
necessarily know the suspend process has been started) might be blocked
in an allocation somewhere, holding a mutex or similar, which will then
cause a deadlock when that same driver's suspend() routine is called
which tries to take the same mutex.

Overall, it's a can of worms. The only way out I can see that is
reasonably sane and doesn't impose API changes thorough the kernel and
unreasonable expectations from driver writers is to deal with it at the
allocator level.

However, it's hard to deal with the case of allocations that have
already started waiting for IOs. It might be possible to have some VM
hook to make them wakeup, re-evaluate the situation and get out of that
code path but in any case it would be tricky.

So Rafael's proposed patch is a first step toward fixing that problem
but isn't, I believe, enough.

> Besides, the fact that you tell people to do something doesn't necessary imply
> that they will listen. :-)
>
> I have discussed that with Ben for a couple of times and we have generally
> agreed that memory allocation problems during suspend/resume are not avoidable
> in general unless we disable __GFP_FS and __GFP_IO at the high level.
>

Cheers,
Ben.

2010-01-17 23:00:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Sunday 17 January 2010, Benjamin Herrenschmidt wrote:
> On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:
...
> However, it's hard to deal with the case of allocations that have
> already started waiting for IOs. It might be possible to have some VM
> hook to make them wakeup, re-evaluate the situation and get out of that
> code path but in any case it would be tricky.

In the second version of the patch I used an rwsem that made us wait for these
allocations to complete before we changed gfp_allowed_mask.

[This is kinda buggy in the version I sent, but I'm going to send an update
in a minute.]

Rafael

2010-01-18 00:26:04

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

On Sunday 17 January 2010, Minchan Kim wrote:
> Hi, Rafael.
>
> On Sun, 2010-01-17 at 14:55 +0100, Rafael J. Wysocki wrote:
> > On Sunday 17 January 2010, Rafael J. Wysocki wrote:
> > > Hi,
> > >
> > > I thing the snippet below is a good summary of what this is about.
> > >
> > > On Saturday 16 January 2010, Rafael J. Wysocki wrote:
> > > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > > On Sat, 2010-01-16 at 01:57 +0100, Rafael J. Wysocki wrote:
> > > > > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > > > > On Fri, 2010-01-15 at 23:03 +0100, Rafael J. Wysocki wrote:
> > > > > > > > On Friday 15 January 2010, Maxim Levitsky wrote:
> > > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > > I know that this is very controversial, because here I want to describe
> > > > > > > > > a problem in a proprietary driver that happens now in 2.6.33-rc3
> > > > > > > > > I am taking about nvidia driver.
> > > > > > > > >
> > > > > > > > > Some time ago I did very long hibernate test and found no errors after
> > > > > > > > > more that 200 cycles.
> > > > > > > > >
> > > > > > > > > Now I update to 2.6.33 and notice that system will hand when nvidia
> > > > > > > > > driver allocates memory is their .suspend functions.
> > > > > > > >
> > > > > > > > They shouldn't do that, there's no guarantee that's going to work at all.
> > > > > > > >
> > > > > > > > > This could fail in 2.6.32 if I would run many memory hungry
> > > > > > > > > applications, but now this happens with most of memory free.
> > > > > > > >
> > > > > > > > This sounds a little strange. What's the requested size of the image?
> > > > > > > Don't know, but system has to be very tight on memory.
> > > > > >
> > > > > > Can you send full dmesg, please?
> > > > >
> > > > > I deleted it, but for this case I think that hang was somewhere else.
> > > > > This task was hand on doing forking, which probably happened even before
> > > > > the freezer.
> > > > >
> > > > > Anyway, the problem is clear. Now __get_free_pages blocks more often,
> > > > > and can block in .suspend even if there is plenty of memory free.
> > >
> > > This is suspicious, but I leave it to the MM people for consideration.
> > >
> > > > > I now patched nvidia to use GFP_ATOMIC _always_, and problem disappear.
> > > > > It isn't such great solution when memory is tight though....
> > > > >
> > > > > This is going to hit hard all nvidia users...
> > > >
> > > > Well, generally speaking, no driver should ever allocate memory using
> > > > GFP_KERNEL in its .suspend() routine, because that's not going to work, as you
> > > > can readily see. So this is a NVidia bug, hands down.
> > > >
> > > > Now having said that, we've been considering a change that will turn all
> > > > GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> > > > prepare a patch to do that and let's see what people think.
> > >
> > > If I didn't confuse anything (which is likely, because it's a bit late here
> > > now), the patch below should do the trick. I have only checked that it doesn't
> > > break compilation, so please take it with a grain of salt.
> >
> > Appended is another version that attempts to remove some possible races.
> > It's been tested a little too.
...
> > - if (unlikely(!page))
> > + if (unlikely(!page)) {
> > + down_read(&gfp_allowed_mask_sem);

This obviously is going too far, because it would change atomic allocations
into non-atomic in general (kinda embarassing ...).

However, I think it would be sufficient to acquire the lock only in the
(__GFP_IO | __GFP_FS) case. Appended is an updated patch doing this.

> > page = __alloc_pages_slowpath(gfp_mask, order,
> > zonelist, high_zoneidx, nodemask,
> > preferred_zone, migratetype);
> > + up_read(&gfp_allowed_mask_sem);
> > + }
...
>
> I think we can use lockdep annotation, too. but it's overkill.
> That's because suspend/resume is rare event so that I want to add
> the cost in lockdep. so I like this idea. But, I have a concern.
> You are adding a little bit cost in alloc path although it's slow one.
>
> Really really do we need this?

Without it, we would have to duplicate every piece of code that normally uses
GFP_KERNEL allocations and that may be called during suspend/resume. I don't
really think that would be practical.

> Can't we remove the wrong usage in review or test process before merge?

That's not so simple. The Ben's message in this thread described the possible
issues quite well (http://lkml.org/lkml/2010/1/17/120).

I think the suspend process should wait for all the already started allocations
using I/O, because otherwise it might disturb them. So this also is a matter
of correctness in general.

> I don't have many experience at suspend/resume.
> I depends on your experience about this patch's value. :)

OK :-)

Rafael

---
kernel/power/hibernate.c | 6 +++++
kernel/power/power.h | 3 ++
kernel/power/suspend.c | 2 +
mm/Makefile | 1
mm/internal.h | 3 ++
mm/page_alloc.c | 5 +++-
mm/pm.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 75 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -334,6 +334,7 @@ int hibernation_snapshot(int platform_mo
goto Close;

suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_FREEZE);
if (error)
goto Recover_platform;
@@ -351,6 +352,7 @@ int hibernation_snapshot(int platform_mo

dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
+ mm_allow_io_allocations();
resume_console();
Close:
platform_end(platform_mode);
@@ -448,11 +450,13 @@ int hibernation_restore(int platform_mod

pm_prepare_console();
suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_QUIESCE);
if (!error) {
error = resume_target_kernel(platform_mode);
dpm_resume_end(PMSG_RECOVER);
}
+ mm_allow_io_allocations();
resume_console();
pm_restore_console();
return error;
@@ -481,6 +485,7 @@ int hibernation_platform_enter(void)

entering_platform_hibernation = true;
suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_HIBERNATE);
if (error) {
if (hibernation_ops->recover)
@@ -518,6 +523,7 @@ int hibernation_platform_enter(void)
Resume_devices:
entering_platform_hibernation = false;
dpm_resume_end(PMSG_RESTORE);
+ mm_allow_io_allocations();
resume_console();

Close:
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -187,6 +187,9 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+/* mm/pm.c */
+extern void mm_force_noio_allocations(void);
+extern void mm_allow_io_allocations(void);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -208,6 +208,7 @@ int suspend_devices_and_enter(suspend_st
goto Close;
}
suspend_console();
+ mm_force_noio_allocations();
suspend_test_start();
error = dpm_suspend_start(PMSG_SUSPEND);
if (error) {
@@ -224,6 +225,7 @@ int suspend_devices_and_enter(suspend_st
suspend_test_start();
dpm_resume_end(PMSG_RESUME);
suspend_test_finish("resume devices");
+ mm_allow_io_allocations();
resume_console();
Close:
if (suspend_ops->end)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1963,10 +1963,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
preferred_zone, migratetype);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ mm_lock_suspend(gfp_mask);
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ mm_unlock_suspend(gfp_mask);
+ }

trace_mm_page_alloc(page, order, gfp_mask, migratetype);
return page;
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_PM_SLEEP) += pm.o
Index: linux-2.6/mm/pm.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/pm.c
@@ -0,0 +1,56 @@
+#include <linux/gfp.h>
+#include <linux/rwsem.h>
+
+static DECLARE_RWSEM(gfp_suspend_sem);
+static gfp_t saved_gfp_allowed_mask;
+
+#define GFP_IOFS (__GFP_IO | __GFP_FS)
+
+/**
+ * mm_lock_suspend - Acquire GFP suspend semaphore.
+ * @gfp_mask: GFP mask used to determine whether to acquire the semaphore.
+ */
+void mm_lock_suspend(gfp_t gfp_mask)
+{
+ if (gfp_mask & GFP_IOFS)
+ down_read(&gfp_suspend_sem);
+}
+
+/**
+ * mm_unlock_suspend - Release GFP suspend semaphore.
+ * @gfp_mask: GFP mask used to determine whether to release the semaphore.
+ */
+void mm_unlock_suspend(gfp_t gfp_mask)
+{
+ if (gfp_mask & GFP_IOFS)
+ up_read(&gfp_suspend_sem);
+}
+
+/**
+ * mm_force_noio_allocations - Modify gfp_allowed_mask to disable IO allocations
+ *
+ * Change gfp_allowed_mask by unsetting __GFP_IO and __GFP_FS in it and save the
+ * old value.
+ */
+void mm_force_noio_allocations(void)
+{
+ /* Wait for all slowpath allocations using the old mask to complete */
+ down_write(&gfp_suspend_sem);
+ saved_gfp_allowed_mask = gfp_allowed_mask;
+ gfp_allowed_mask &= ~GFP_IOFS;
+ up_write(&gfp_suspend_sem);
+}
+
+/**
+ * mm_allow_io_allocations - Modify gfp_allowed_mask to allow IO allocations
+ *
+ * If the saved value of gfp_allowed_mask has __GFP_IO set, modify the current
+ * gfp_allowed_mask by setting this bit and anlogously for __GFP_FS.
+ */
+void mm_allow_io_allocations(void)
+{
+ /* Wait for all slowpath allocations using the old mask to complete */
+ down_write(&gfp_suspend_sem);
+ gfp_allowed_mask |= saved_gfp_allowed_mask & GFP_IOFS;
+ up_write(&gfp_suspend_sem);
+}
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -259,3 +259,6 @@ extern u64 hwpoison_filter_flags_mask;
extern u64 hwpoison_filter_flags_value;
extern u64 hwpoison_filter_memcg;
extern u32 hwpoison_filter_enable;
+
+extern void mm_lock_suspend(gfp_t gfp_mask);
+extern void mm_unlock_suspend(gfp_t gfp_mask);

2010-01-18 02:16:41

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

> Hi,
>
> I thing the snippet below is a good summary of what this is about.
>
> On Saturday 16 January 2010, Rafael J. Wysocki wrote:
> > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > On Sat, 2010-01-16 at 01:57 +0100, Rafael J. Wysocki wrote:
> > > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > > On Fri, 2010-01-15 at 23:03 +0100, Rafael J. Wysocki wrote:
> > > > > > On Friday 15 January 2010, Maxim Levitsky wrote:
> > > > > > > Hi,
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > I know that this is very controversial, because here I want to describe
> > > > > > > a problem in a proprietary driver that happens now in 2.6.33-rc3
> > > > > > > I am taking about nvidia driver.
> > > > > > >
> > > > > > > Some time ago I did very long hibernate test and found no errors after
> > > > > > > more that 200 cycles.
> > > > > > >
> > > > > > > Now I update to 2.6.33 and notice that system will hand when nvidia
> > > > > > > driver allocates memory is their .suspend functions.
> > > > > >
> > > > > > They shouldn't do that, there's no guarantee that's going to work at all.
> > > > > >
> > > > > > > This could fail in 2.6.32 if I would run many memory hungry
> > > > > > > applications, but now this happens with most of memory free.
> > > > > >
> > > > > > This sounds a little strange. What's the requested size of the image?
> > > > > Don't know, but system has to be very tight on memory.
> > > >
> > > > Can you send full dmesg, please?
> > >
> > > I deleted it, but for this case I think that hang was somewhere else.
> > > This task was hand on doing forking, which probably happened even before
> > > the freezer.
> > >
> > > Anyway, the problem is clear. Now __get_free_pages blocks more often,
> > > and can block in .suspend even if there is plenty of memory free.
>
> This is suspicious, but I leave it to the MM people for consideration.
>
> > > I now patched nvidia to use GFP_ATOMIC _always_, and problem disappear.
> > > It isn't such great solution when memory is tight though....
> > >
> > > This is going to hit hard all nvidia users...
> >
> > Well, generally speaking, no driver should ever allocate memory using
> > GFP_KERNEL in its .suspend() routine, because that's not going to work, as you
> > can readily see. So this is a NVidia bug, hands down.
> >
> > Now having said that, we've been considering a change that will turn all
> > GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> > prepare a patch to do that and let's see what people think.
>
> If I didn't confuse anything (which is likely, because it's a bit late here
> now), the patch below should do the trick. I have only checked that it doesn't
> break compilation, so please take it with a grain of salt.
>
> Comments welcome.

Hmm..
I don't think this is good idea.

GFP_NOIO mean "Please don't reclaim if the page is dirty". It mean the system
have lots dirty pages, this patch might makes hung up.

If suspend need lots memory, we need to make free memory before starting IO
suspending, I think.


2010-01-18 02:20:40

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1963,10 +1963,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
> page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> preferred_zone, migratetype);
> - if (unlikely(!page))
> + if (unlikely(!page)) {
> + mm_lock_suspend(gfp_mask);
> page = __alloc_pages_slowpath(gfp_mask, order,
> zonelist, high_zoneidx, nodemask,
> preferred_zone, migratetype);
> + mm_unlock_suspend(gfp_mask);
> + }
>
> trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> return page;

I think we don't need read side lock at all. generally, no lock might makes race.
But in this case, changing gfp_allowed_mask and nvidia suspend method should be
serialized higher level. Why the above two code need to run concurrently?


2010-01-18 07:52:52

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

Am Montag, 18. Januar 2010 00:00:23 schrieb Rafael J. Wysocki:
> On Sunday 17 January 2010, Benjamin Herrenschmidt wrote:
> > On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:
> ...
> > However, it's hard to deal with the case of allocations that have
> > already started waiting for IOs. It might be possible to have some VM
> > hook to make them wakeup, re-evaluate the situation and get out of that
> > code path but in any case it would be tricky.
>
> In the second version of the patch I used an rwsem that made us wait for these
> allocations to complete before we changed gfp_allowed_mask.

This will be a very, very hot semaphore. What's the impact on performance?

Regards
Oliver

2010-01-18 16:18:00

by Alan Stern

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

On Mon, 18 Jan 2010, Oliver Neukum wrote:

> Am Montag, 18. Januar 2010 00:00:23 schrieb Rafael J. Wysocki:
> > On Sunday 17 January 2010, Benjamin Herrenschmidt wrote:
> > > On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:
> > ...
> > > However, it's hard to deal with the case of allocations that have
> > > already started waiting for IOs. It might be possible to have some VM
> > > hook to make them wakeup, re-evaluate the situation and get out of that
> > > code path but in any case it would be tricky.
> >
> > In the second version of the patch I used an rwsem that made us wait for these
> > allocations to complete before we changed gfp_allowed_mask.
>
> This will be a very, very hot semaphore. What's the impact on performance?

Can it be replaced with something having lower overhead, such as SRCU?

Alan Stern

2010-01-18 16:59:58

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

Am Sonntag, 17. Januar 2010 14:55:55 schrieb Rafael J. Wysocki:
> +void mm_force_noio_allocations(void)
> +{
> + /* Wait for all slowpath allocations using the old mask to complete */
> + down_write(&gfp_allowed_mask_sem);
> + saved_gfp_allowed_mask = gfp_allowed_mask;
> + gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
> + up_write(&gfp_allowed_mask_sem);
> +}

In addition to this you probably want to exhaust all memory reserves
before you fail a memory allocation and forbid the OOM killer to run.

Regards
Oliver

2010-01-18 20:41:28

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Monday 18 January 2010, Oliver Neukum wrote:
> Am Sonntag, 17. Januar 2010 14:55:55 schrieb Rafael J. Wysocki:
> > +void mm_force_noio_allocations(void)
> > +{
> > + /* Wait for all slowpath allocations using the old mask to complete */
> > + down_write(&gfp_allowed_mask_sem);
> > + saved_gfp_allowed_mask = gfp_allowed_mask;
> > + gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
> > + up_write(&gfp_allowed_mask_sem);
> > +}
>
> In addition to this you probably want to exhaust all memory reserves
> before you fail a memory allocation

I'm not really sure what you mean.

> and forbid the OOM killer to run.

The OOM killer has already been disabled at this point, by the freezer.

Rafael

2010-01-18 20:54:44

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Monday 18 January 2010, KOSAKI Motohiro wrote:
> > Hi,
> >
> > I thing the snippet below is a good summary of what this is about.
> >
> > On Saturday 16 January 2010, Rafael J. Wysocki wrote:
> > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > On Sat, 2010-01-16 at 01:57 +0100, Rafael J. Wysocki wrote:
> > > > > On Saturday 16 January 2010, Maxim Levitsky wrote:
> > > > > > On Fri, 2010-01-15 at 23:03 +0100, Rafael J. Wysocki wrote:
> > > > > > > On Friday 15 January 2010, Maxim Levitsky wrote:
> > > > > > > > Hi,
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > I know that this is very controversial, because here I want to describe
> > > > > > > > a problem in a proprietary driver that happens now in 2.6.33-rc3
> > > > > > > > I am taking about nvidia driver.
> > > > > > > >
> > > > > > > > Some time ago I did very long hibernate test and found no errors after
> > > > > > > > more that 200 cycles.
> > > > > > > >
> > > > > > > > Now I update to 2.6.33 and notice that system will hand when nvidia
> > > > > > > > driver allocates memory is their .suspend functions.
> > > > > > >
> > > > > > > They shouldn't do that, there's no guarantee that's going to work at all.
> > > > > > >
> > > > > > > > This could fail in 2.6.32 if I would run many memory hungry
> > > > > > > > applications, but now this happens with most of memory free.
> > > > > > >
> > > > > > > This sounds a little strange. What's the requested size of the image?
> > > > > > Don't know, but system has to be very tight on memory.
> > > > >
> > > > > Can you send full dmesg, please?
> > > >
> > > > I deleted it, but for this case I think that hang was somewhere else.
> > > > This task was hand on doing forking, which probably happened even before
> > > > the freezer.
> > > >
> > > > Anyway, the problem is clear. Now __get_free_pages blocks more often,
> > > > and can block in .suspend even if there is plenty of memory free.
> >
> > This is suspicious, but I leave it to the MM people for consideration.
> >
> > > > I now patched nvidia to use GFP_ATOMIC _always_, and problem disappear.
> > > > It isn't such great solution when memory is tight though....
> > > >
> > > > This is going to hit hard all nvidia users...
> > >
> > > Well, generally speaking, no driver should ever allocate memory using
> > > GFP_KERNEL in its .suspend() routine, because that's not going to work, as you
> > > can readily see. So this is a NVidia bug, hands down.
> > >
> > > Now having said that, we've been considering a change that will turn all
> > > GFP_KERNEL allocations into GFP_NOIO during suspend/resume, so perhaps I'll
> > > prepare a patch to do that and let's see what people think.
> >
> > If I didn't confuse anything (which is likely, because it's a bit late here
> > now), the patch below should do the trick. I have only checked that it doesn't
> > break compilation, so please take it with a grain of salt.
> >
> > Comments welcome.
>
> Hmm..
> I don't think this is good idea.
>
> GFP_NOIO mean "Please don't reclaim if the page is dirty". It mean the system
> have lots dirty pages, this patch might makes hung up.

The point is to prevent the mm subsystem from using I/O. If there's a better
way, please tell me. :-)

> If suspend need lots memory, we need to make free memory before starting IO
> suspending, I think.

Suspend as such doesn't need a lot of memory, except for some drivers doing
things they shouldn't do.

However, there are a few problems that need to be addressed in general.

First, we can't really guarantee that there's a lot of free memory available
during suspend and some memory allocations are done indirectly, using
GFP_KERNEL (for example, when new kernel threads are started). If one of
these is done during suspend and it happens to cause the mm subsystem to
start I/O on a suspended devices, the kernel will lock up.

Second, there may be a memory allocation in progress when suspend is started
that causes I/O to happen and races with the suspend process. If the latter
wins the race, the I/O may be attempted on a suspended device and the kernel
will lock up.

My patch attempts to avoid these two problems as well as the problem with
drivers using GFP_KERNEL allocations during suspend which I admit might be
solved by reworking the drivers.

Thanks,
Rafael

2010-01-18 20:55:44

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Monday 18 January 2010, Oliver Neukum wrote:
> Am Montag, 18. Januar 2010 00:00:23 schrieb Rafael J. Wysocki:
> > On Sunday 17 January 2010, Benjamin Herrenschmidt wrote:
> > > On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:
> > ...
> > > However, it's hard to deal with the case of allocations that have
> > > already started waiting for IOs. It might be possible to have some VM
> > > hook to make them wakeup, re-evaluate the situation and get out of that
> > > code path but in any case it would be tricky.
> >
> > In the second version of the patch I used an rwsem that made us wait for these
> > allocations to complete before we changed gfp_allowed_mask.
>
> This will be a very, very hot semaphore. What's the impact on performance?

rwsems are highly optimized AFAICT, but this is a good question of course.

I have no data at the moment.

Rafael

2010-01-18 20:58:44

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

On Monday 18 January 2010, Alan Stern wrote:
> On Mon, 18 Jan 2010, Oliver Neukum wrote:
>
> > Am Montag, 18. Januar 2010 00:00:23 schrieb Rafael J. Wysocki:
> > > On Sunday 17 January 2010, Benjamin Herrenschmidt wrote:
> > > > On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:
> > > ...
> > > > However, it's hard to deal with the case of allocations that have
> > > > already started waiting for IOs. It might be possible to have some VM
> > > > hook to make them wakeup, re-evaluate the situation and get out of that
> > > > code path but in any case it would be tricky.
> > >
> > > In the second version of the patch I used an rwsem that made us wait for these
> > > allocations to complete before we changed gfp_allowed_mask.
> >
> > This will be a very, very hot semaphore. What's the impact on performance?
>
> Can it be replaced with something having lower overhead, such as SRCU?

I'm not sure about that. In principle SRCU shouldn't be used if the reader can
sleep unpredictably long and the memory allocation sutiation is one of these.

Rafael

2010-01-18 21:06:11

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

On Monday 18 January 2010, KOSAKI Motohiro wrote:
> > Index: linux-2.6/mm/page_alloc.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page_alloc.c
> > +++ linux-2.6/mm/page_alloc.c
> > @@ -1963,10 +1963,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
> > page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> > zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> > preferred_zone, migratetype);
> > - if (unlikely(!page))
> > + if (unlikely(!page)) {
> > + mm_lock_suspend(gfp_mask);
> > page = __alloc_pages_slowpath(gfp_mask, order,
> > zonelist, high_zoneidx, nodemask,
> > preferred_zone, migratetype);
> > + mm_unlock_suspend(gfp_mask);
> > + }
> >
> > trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> > return page;
>
> I think we don't need read side lock at all. generally, no lock might makes race.
> But in this case, changing gfp_allowed_mask and nvidia suspend method should be
> serialized higher level. Why the above two code need to run concurrently?

The changing of gfp_allowed_mask is serialized with the suspend of devices,
so there's no concurrency here.

I was concerned about another problem, though, which is what happens if the
suspend process runs in parallel with a memory allocation that started earlier
and happens to do some I/O. I that case the suspend process doesn't know
about the I/O done by the mm subsystem and may disturb it in principle.

That said, perhaps that should be a concern for the block devices subsystem to
prevent such situations from happening.

So, perhaps I'll remove the reader-side lock altogether and go back to
something like the first version of the patch.

Rafael

2010-01-18 21:57:27

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Mon, 2010-01-18 at 00:00 +0100, Rafael J. Wysocki wrote:
> On Sunday 17 January 2010, Benjamin Herrenschmidt wrote:
> > On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:
> ...
> > However, it's hard to deal with the case of allocations that have
> > already started waiting for IOs. It might be possible to have some VM
> > hook to make them wakeup, re-evaluate the situation and get out of that
> > code path but in any case it would be tricky.
>
> In the second version of the patch I used an rwsem that made us wait for these
> allocations to complete before we changed gfp_allowed_mask.
>
> [This is kinda buggy in the version I sent, but I'm going to send an update
> in a minute.]

And nobody screamed due to cache line ping pong caused by this in the
fast path ? :-)

We might want to look at something a bit smarter for that sort of
read-mostly-really-really-mostly construct, though in this case I don't
think RCU is the answer since we are happily scheduling.

I wonder if something per-cpu would do, it's thus the responsibility of
the "writer" to take them all in order for all CPUs.

Cheers,
Ben.

2010-01-18 23:33:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Monday 18 January 2010, Benjamin Herrenschmidt wrote:
> On Mon, 2010-01-18 at 00:00 +0100, Rafael J. Wysocki wrote:
> > On Sunday 17 January 2010, Benjamin Herrenschmidt wrote:
> > > On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:
> > ...
> > > However, it's hard to deal with the case of allocations that have
> > > already started waiting for IOs. It might be possible to have some VM
> > > hook to make them wakeup, re-evaluate the situation and get out of that
> > > code path but in any case it would be tricky.
> >
> > In the second version of the patch I used an rwsem that made us wait for these
> > allocations to complete before we changed gfp_allowed_mask.
> >
> > [This is kinda buggy in the version I sent, but I'm going to send an update
> > in a minute.]
>
> And nobody screamed due to cache line ping pong caused by this in the
> fast path ? :-)

Apparently not. :-)

> We might want to look at something a bit smarter for that sort of
> read-mostly-really-really-mostly construct, though in this case I don't
> think RCU is the answer since we are happily scheduling.
>
> I wonder if something per-cpu would do, it's thus the responsibility of
> the "writer" to take them all in order for all CPUs.

I think I'll get back to the first version of the patch which I think is not
going to have side effects (as long as no one will change gfp_allowed_mask
in parallel with suspend/resume), for now.

We can add more complicated things on top of it, then.

Rafael

2010-01-19 01:19:34

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

Hi

> > If suspend need lots memory, we need to make free memory before starting IO
> > suspending, I think.
>
> Suspend as such doesn't need a lot of memory, except for some drivers doing
> things they shouldn't do.
>
> However, there are a few problems that need to be addressed in general.
>
> First, we can't really guarantee that there's a lot of free memory available
> during suspend and some memory allocations are done indirectly, using
> GFP_KERNEL (for example, when new kernel threads are started). If one of
> these is done during suspend and it happens to cause the mm subsystem to
> start I/O on a suspended devices, the kernel will lock up.
>
> Second, there may be a memory allocation in progress when suspend is started
> that causes I/O to happen and races with the suspend process. If the latter
> wins the race, the I/O may be attempted on a suspended device and the kernel
> will lock up.

I think the race happen itself is bad. memory and I/O subsystem can't solve such race
elegantly. These doesn't know enough suspend state knowlege. I think the practical
solution is that higher level design prevent the race happen.


> My patch attempts to avoid these two problems as well as the problem with
> drivers using GFP_KERNEL allocations during suspend which I admit might be
> solved by reworking the drivers.

Agreed. In this case, only drivers change can solve the issue.


2010-01-19 03:20:40

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Tue, 2010-01-19 at 10:19 +0900, KOSAKI Motohiro wrote:
> I think the race happen itself is bad. memory and I/O subsystem can't solve such race
> elegantly. These doesn't know enough suspend state knowlege. I think the practical
> solution is that higher level design prevent the race happen.
>
>
> > My patch attempts to avoid these two problems as well as the problem with
> > drivers using GFP_KERNEL allocations during suspend which I admit might be
> > solved by reworking the drivers.
>
> Agreed. In this case, only drivers change can solve the issue.

As I explained earlier, this is near to impossible since the allocations
are too often burried deep down the call stack or simply because the
driver doesn't know that we started suspending -another- driver...

I don't think trying to solve those problems at the driver level is
realistic to be honest. This is one of those things where we really just
need to make allocators 'just work' from a driver perspective.

It can't be perfect of course, as mentioned earlier, there will be a
problem if too little free memory is really available due to lots of
dirty pages around, but most of this can be somewhat alleviated in
practice, for example by pushing things out a bit at suspend time,
making some more memory free etc... But yeah, nothing replaces proper
error handling in drivers for allocation failures even with
GFP_KERNEL :-)

Cheers,
Ben.





2010-01-19 09:04:58

by Bastien Roucariès

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Tue, Jan 19, 2010 at 4:19 AM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Tue, 2010-01-19 at 10:19 +0900, KOSAKI Motohiro wrote:
>> I think the race happen itself is bad. memory and I/O subsystem can't solve such race
>> elegantly. These doesn't know enough suspend state knowlege. I think the practical
>> solution is that higher level design prevent the race happen.
>>
>>
>> > My patch attempts to avoid these two problems as well as the problem with
>> > drivers using GFP_KERNEL allocations during suspend which I admit might be
>> > solved by reworking the drivers.
>>
>> Agreed. In this case, only drivers change can solve the issue.
>
> As I explained earlier, this is near to impossible since the allocations
> are too often burried deep down the call stack or simply because the
> driver doesn't know that we started suspending -another- driver...
>
> I don't think trying to solve those problems at the driver level is
> realistic to be honest. This is one of those things where we really just
> need to make allocators 'just work' from a driver perspective.

Instead of masking bit could we only check if incompatible flags are
used during suspend, and warm deeply. Call stack will be therefore
identified, and we could have some metrics about such problem.

It will be a debug option like lockdep but pretty low cost.

My 2 cents.

Bastien

2010-01-19 09:14:21

by Oliver Neukum

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

Am Montag, 18. Januar 2010 22:06:36 schrieb Rafael J. Wysocki:
> I was concerned about another problem, though, which is what happens if the
> suspend process runs in parallel with a memory allocation that started earlier
> and happens to do some I/O. I that case the suspend process doesn't know
> about the I/O done by the mm subsystem and may disturb it in principle.

How could this happen? Who would allocate that memory?
Tasks won't be frozen while they are allocating memory.

Regards
Oliver

2010-01-19 09:24:57

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

Am Montag, 18. Januar 2010 21:41:49 schrieb Rafael J. Wysocki:
> On Monday 18 January 2010, Oliver Neukum wrote:
> > Am Sonntag, 17. Januar 2010 14:55:55 schrieb Rafael J. Wysocki:
> > > +void mm_force_noio_allocations(void)
> > > +{
> > > + /* Wait for all slowpath allocations using the old mask to complete */
> > > + down_write(&gfp_allowed_mask_sem);
> > > + saved_gfp_allowed_mask = gfp_allowed_mask;
> > > + gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
> > > + up_write(&gfp_allowed_mask_sem);
> > > +}
> >
> > In addition to this you probably want to exhaust all memory reserves
> > before you fail a memory allocation
>
> I'm not really sure what you mean.

Forget it, it was foolish. Instead there's a different problem.
Suppose we are tight on memory. The problem is that we must not
exhaust all memory. If we are really out of memory we may be unable
to satisfy memory allocations in resume()

Regards
Oliver

2010-01-19 15:14:04

by Alan Stern

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

On Mon, 18 Jan 2010, Rafael J. Wysocki wrote:

> On Monday 18 January 2010, Alan Stern wrote:
> > On Mon, 18 Jan 2010, Oliver Neukum wrote:
> >
> > > Am Montag, 18. Januar 2010 00:00:23 schrieb Rafael J. Wysocki:
> > > > On Sunday 17 January 2010, Benjamin Herrenschmidt wrote:
> > > > > On Sun, 2010-01-17 at 14:27 +0100, Rafael J. Wysocki wrote:
> > > > ...
> > > > > However, it's hard to deal with the case of allocations that have
> > > > > already started waiting for IOs. It might be possible to have some VM
> > > > > hook to make them wakeup, re-evaluate the situation and get out of that
> > > > > code path but in any case it would be tricky.
> > > >
> > > > In the second version of the patch I used an rwsem that made us wait for these
> > > > allocations to complete before we changed gfp_allowed_mask.
> > >
> > > This will be a very, very hot semaphore. What's the impact on performance?
> >
> > Can it be replaced with something having lower overhead, such as SRCU?
>
> I'm not sure about that. In principle SRCU shouldn't be used if the reader can
> sleep unpredictably long and the memory allocation sutiation is one of these.

I don't think this matters. Each SRCU usage has its own domain, and
different domains don't affect one another. So the only thing that
would be blocked by a long-sleeping reader would be the suspend process
itself, and obviously you won't mind that.

Alan Stern

2010-01-19 20:33:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

On Tuesday 19 January 2010, Oliver Neukum wrote:
> Am Montag, 18. Januar 2010 22:06:36 schrieb Rafael J. Wysocki:
> > I was concerned about another problem, though, which is what happens if the
> > suspend process runs in parallel with a memory allocation that started earlier
> > and happens to do some I/O. I that case the suspend process doesn't know
> > about the I/O done by the mm subsystem and may disturb it in principle.
>
> How could this happen? Who would allocate that memory?
> Tasks won't be frozen while they are allocating memory.

The majority of kernel threads are not freezable, but I agree it's rather low
risk.

Rafael

2010-01-19 20:36:52

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Tuesday 19 January 2010, Oliver Neukum wrote:
> Am Montag, 18. Januar 2010 21:41:49 schrieb Rafael J. Wysocki:
> > On Monday 18 January 2010, Oliver Neukum wrote:
> > > Am Sonntag, 17. Januar 2010 14:55:55 schrieb Rafael J. Wysocki:
> > > > +void mm_force_noio_allocations(void)
> > > > +{
> > > > + /* Wait for all slowpath allocations using the old mask to complete */
> > > > + down_write(&gfp_allowed_mask_sem);
> > > > + saved_gfp_allowed_mask = gfp_allowed_mask;
> > > > + gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
> > > > + up_write(&gfp_allowed_mask_sem);
> > > > +}
> > >
> > > In addition to this you probably want to exhaust all memory reserves
> > > before you fail a memory allocation
> >
> > I'm not really sure what you mean.
>
> Forget it, it was foolish. Instead there's a different problem.
> Suppose we are tight on memory. The problem is that we must not
> exhaust all memory. If we are really out of memory we may be unable
> to satisfy memory allocations in resume()

That doesn't make things any worse than the are already. If we block on
I/O forever during resume, the gross result is pretty much the same.

That said, Maxim reported that in his test case the mm subsystem apparently
attempted to use I/O even if there was a plenty of free memory available and
I'd like prevent _that_ from happening.

Rafael

2010-01-19 20:47:32

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Tuesday 19 January 2010, Benjamin Herrenschmidt wrote:
> On Tue, 2010-01-19 at 10:19 +0900, KOSAKI Motohiro wrote:
> > I think the race happen itself is bad. memory and I/O subsystem can't solve such race
> > elegantly. These doesn't know enough suspend state knowlege. I think the practical
> > solution is that higher level design prevent the race happen.
> >
> >
> > > My patch attempts to avoid these two problems as well as the problem with
> > > drivers using GFP_KERNEL allocations during suspend which I admit might be
> > > solved by reworking the drivers.
> >
> > Agreed. In this case, only drivers change can solve the issue.
>
> As I explained earlier, this is near to impossible since the allocations
> are too often burried deep down the call stack or simply because the
> driver doesn't know that we started suspending -another- driver...
>
> I don't think trying to solve those problems at the driver level is
> realistic to be honest. This is one of those things where we really just
> need to make allocators 'just work' from a driver perspective.
>
> It can't be perfect of course, as mentioned earlier, there will be a
> problem if too little free memory is really available due to lots of
> dirty pages around, but most of this can be somewhat alleviated in
> practice, for example by pushing things out a bit at suspend time,
> making some more memory free etc... But yeah, nothing replaces proper
> error handling in drivers for allocation failures even with
> GFP_KERNEL :-)

Agreed.

Moreover, I didn't try to do anything about that before, because memory
allocation problems during suspend/resume just didn't happen. We kind of knew
they were possible, but since they didn't show up, it wasn't immediately
necessary to address them.

Now, however, people started to see these problems in testing and I'm quite
confident that this is a result of recent changes in the mm subsystem. Namely,
if you read the Maxim's report carefully, you'll notice that in his test case
the mm subsystem apparently attempted to use I/O even though there was free
memory available in the system. This is the case I want to prevent from
happening in the first place.

Rafael

2010-01-19 23:18:54

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Tue, 2010-01-19 at 10:04 +0100, Bastien ROUCARIES wrote:
> Instead of masking bit could we only check if incompatible flags are
> used during suspend, and warm deeply. Call stack will be therefore
> identified, and we could have some metrics about such problem.
>
> It will be a debug option like lockdep but pretty low cost.

I still believe it would just be a giant can of worms to require every
call site of memory allocators to "know" whether suspend has been
started or not.... Along the same reasons why we added that stuff for
boot time allocs.

Cheers,
Ben.

2010-01-20 00:33:17

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

> On Tuesday 19 January 2010, Benjamin Herrenschmidt wrote:
> > On Tue, 2010-01-19 at 10:19 +0900, KOSAKI Motohiro wrote:
> > > I think the race happen itself is bad. memory and I/O subsystem can't solve such race
> > > elegantly. These doesn't know enough suspend state knowlege. I think the practical
> > > solution is that higher level design prevent the race happen.
> > >
> > >
> > > > My patch attempts to avoid these two problems as well as the problem with
> > > > drivers using GFP_KERNEL allocations during suspend which I admit might be
> > > > solved by reworking the drivers.
> > >
> > > Agreed. In this case, only drivers change can solve the issue.
> >
> > As I explained earlier, this is near to impossible since the allocations
> > are too often burried deep down the call stack or simply because the
> > driver doesn't know that we started suspending -another- driver...
> >
> > I don't think trying to solve those problems at the driver level is
> > realistic to be honest. This is one of those things where we really just
> > need to make allocators 'just work' from a driver perspective.
> >
> > It can't be perfect of course, as mentioned earlier, there will be a
> > problem if too little free memory is really available due to lots of
> > dirty pages around, but most of this can be somewhat alleviated in
> > practice, for example by pushing things out a bit at suspend time,
> > making some more memory free etc... But yeah, nothing replaces proper
> > error handling in drivers for allocation failures even with
> > GFP_KERNEL :-)
>
> Agreed.
>
> Moreover, I didn't try to do anything about that before, because memory
> allocation problems during suspend/resume just didn't happen. We kind of knew
> they were possible, but since they didn't show up, it wasn't immediately
> necessary to address them.
>
> Now, however, people started to see these problems in testing and I'm quite
> confident that this is a result of recent changes in the mm subsystem. Namely,
> if you read the Maxim's report carefully, you'll notice that in his test case
> the mm subsystem apparently attempted to use I/O even though there was free
> memory available in the system. This is the case I want to prevent from
> happening in the first place.

Hi Rafael,

Do you mean this is the unrelated issue of nVidia bug? Probably I haven't
catch your point. I don't find Maxim's original bug report. Can we share
the test-case and your analysis detail?


2010-01-20 12:15:43

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

Am Mittwoch, 20. Januar 2010 00:17:51 schrieb Benjamin Herrenschmidt:
> On Tue, 2010-01-19 at 10:04 +0100, Bastien ROUCARIES wrote:
> > Instead of masking bit could we only check if incompatible flags are
> > used during suspend, and warm deeply. Call stack will be therefore
> > identified, and we could have some metrics about such problem.
> >
> > It will be a debug option like lockdep but pretty low cost.
>
> I still believe it would just be a giant can of worms to require every
> call site of memory allocators to "know" whether suspend has been
> started or not.... Along the same reasons why we added that stuff for
> boot time allocs.

But we have the freezer. So generally we don't require that knowledge.
We can expect no normal IO to happen.
The question is in the suspend paths. We never may use anything
but GFP_NOIO (and GFP_ATOMIC) in the suspend() path. We can
take care of that requirement in the allocator only if the whole system
is suspended. As soon as a driver does runtime power management,
it is on its own.

Regards
Oliver

2010-01-20 14:26:12

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

Am Dienstag, 19. Januar 2010 21:37:35 schrieb Rafael J. Wysocki:
> That said, Maxim reported that in his test case the mm subsystem apparently
> attempted to use I/O even if there was a plenty of free memory available and
> I'd like prevent that from happening.

Hi,

it seems to me that this is caused by the mm subsytem maintaing
a share of clean pages precisely so that GFP_NOIO will work.
Perhaps it is a good idea to
a) launder a number of pages if the system is about to be suspended
between the freezer and notifying drivers
b) set the ration of clean pages to dirty pages to 0 while suspending
the system.

Regards
Oliver

2010-01-20 21:11:38

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Wed, 2010-01-20 at 12:31 +0100, Oliver Neukum wrote:
>
> But we have the freezer. So generally we don't require that knowledge.
> We can expect no normal IO to happen.

That came before and it's just not a safe assumption :-) The freezer to
some extent protects drivers against ioctl's and that sort of stuff but
really that's about it. There's plenty of things in the kernel that can
kick IOs on their own for a reason or another, or do memory allocations
which in turn will try to push something out and do IOs etc... even when
"frozen".

> The question is in the suspend paths. We never may use anything
> but GFP_NOIO (and GFP_ATOMIC) in the suspend() path. We can
> take care of that requirement in the allocator only if the whole
> system
> is suspended. As soon as a driver does runtime power management,
> it is on its own.

I'm not sure I understand what you are trying to say here :-)

First of all, the problem goes beyond what a driver does in its own
suspend() path. If it was just that, we might be able to some extent to
push enough stuff up for the driver to specify the right GFP flags
(though even that could be nasty).

The problem with system suspend also happens when your driver has not
been suspended yet, but another one, which happens to be a block device
with dirty pages for example, has.

Your not-yet-suspended driver might well be blocked somewhere in an
allocation or about to make one with some kind of internal mutex held,
that sort of thing, as part of it's normal operations, and -that- can
hang, causing problems when subsequently that same driver suspend() gets
called and tries to synchronize with the driver operations, for example
by trying to acquire that same mutex.

There's more of similarily nasty scenario. The fact is that it's going
to hit rarely, probably without a bakctrace or a crash, and so will
basically cause one of those rare "my laptop didn't suspend" cases that
may not even be reported, and just contribute to the general
unreliability of suspend/resume.

Cheers,
Ben.

2010-01-20 21:12:16

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Wednesday 20 January 2010, Oliver Neukum wrote:
> Am Mittwoch, 20. Januar 2010 00:17:51 schrieb Benjamin Herrenschmidt:
> > On Tue, 2010-01-19 at 10:04 +0100, Bastien ROUCARIES wrote:
> > > Instead of masking bit could we only check if incompatible flags are
> > > used during suspend, and warm deeply. Call stack will be therefore
> > > identified, and we could have some metrics about such problem.
> > >
> > > It will be a debug option like lockdep but pretty low cost.
> >
> > I still believe it would just be a giant can of worms to require every
> > call site of memory allocators to "know" whether suspend has been
> > started or not.... Along the same reasons why we added that stuff for
> > boot time allocs.
>
> But we have the freezer. So generally we don't require that knowledge.
> We can expect no normal IO to happen.
> The question is in the suspend paths. We never may use anything
> but GFP_NOIO (and GFP_ATOMIC) in the suspend() path. We can
> take care of that requirement in the allocator only if the whole system
> is suspended. As soon as a driver does runtime power management,
> it is on its own.

If you start new kernel threads using the async framework, for example,
GFP_KERNEL allocations are going to be used.

As I said before, IMnshO , duplicating every piece of code that allocates
memory and can be run during suspend/resume as well as in other circumstances
doesn't make sense.

Rafael

2010-01-20 21:12:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Wednesday 20 January 2010, Oliver Neukum wrote:
> Am Dienstag, 19. Januar 2010 21:37:35 schrieb Rafael J. Wysocki:
> > That said, Maxim reported that in his test case the mm subsystem apparently
> > attempted to use I/O even if there was a plenty of free memory available and
> > I'd like prevent that from happening.
>
> Hi,
>
> it seems to me that this is caused by the mm subsytem maintaing
> a share of clean pages precisely so that GFP_NOIO will work.
> Perhaps it is a good idea to
> a) launder a number of pages if the system is about to be suspended
> between the freezer and notifying drivers

That was tried, didn't work.

> b) set the ration of clean pages to dirty pages to 0 while suspending
> the system.

Patch, please?

Rafael

2010-01-20 21:21:12

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Wednesday 20 January 2010, KOSAKI Motohiro wrote:
> > On Tuesday 19 January 2010, Benjamin Herrenschmidt wrote:
> > > On Tue, 2010-01-19 at 10:19 +0900, KOSAKI Motohiro wrote:
> > > > I think the race happen itself is bad. memory and I/O subsystem can't solve such race
> > > > elegantly. These doesn't know enough suspend state knowlege. I think the practical
> > > > solution is that higher level design prevent the race happen.
> > > >
> > > >
> > > > > My patch attempts to avoid these two problems as well as the problem with
> > > > > drivers using GFP_KERNEL allocations during suspend which I admit might be
> > > > > solved by reworking the drivers.
> > > >
> > > > Agreed. In this case, only drivers change can solve the issue.
> > >
> > > As I explained earlier, this is near to impossible since the allocations
> > > are too often burried deep down the call stack or simply because the
> > > driver doesn't know that we started suspending -another- driver...
> > >
> > > I don't think trying to solve those problems at the driver level is
> > > realistic to be honest. This is one of those things where we really just
> > > need to make allocators 'just work' from a driver perspective.
> > >
> > > It can't be perfect of course, as mentioned earlier, there will be a
> > > problem if too little free memory is really available due to lots of
> > > dirty pages around, but most of this can be somewhat alleviated in
> > > practice, for example by pushing things out a bit at suspend time,
> > > making some more memory free etc... But yeah, nothing replaces proper
> > > error handling in drivers for allocation failures even with
> > > GFP_KERNEL :-)
> >
> > Agreed.
> >
> > Moreover, I didn't try to do anything about that before, because memory
> > allocation problems during suspend/resume just didn't happen. We kind of knew
> > they were possible, but since they didn't show up, it wasn't immediately
> > necessary to address them.
> >
> > Now, however, people started to see these problems in testing and I'm quite
> > confident that this is a result of recent changes in the mm subsystem. Namely,
> > if you read the Maxim's report carefully, you'll notice that in his test case
> > the mm subsystem apparently attempted to use I/O even though there was free
> > memory available in the system. This is the case I want to prevent from
> > happening in the first place.
>
> Hi Rafael,
>
> Do you mean this is the unrelated issue of nVidia bug?

The nvidia driver _is_ buggy, but Maxim said he couldn't reproduce the
problem if all the allocations made by the nvidia driver during suspend
were changed to GFP_ATOMIC.

> Probably I haven't catch your point. I don't find Maxim's original bug
> report. Can we share the test-case and your analysis detail?

The Maxim's original report is here:
https://lists.linux-foundation.org/pipermail/linux-pm/2010-January/023982.html

and the message I'm referring to is at:
https://lists.linux-foundation.org/pipermail/linux-pm/2010-January/023990.html

Thanks,
Rafael

2010-01-21 00:47:26

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

> > Hi Rafael,
> >
> > Do you mean this is the unrelated issue of nVidia bug?
>
> The nvidia driver _is_ buggy, but Maxim said he couldn't reproduce the
> problem if all the allocations made by the nvidia driver during suspend
> were changed to GFP_ATOMIC.
>
> > Probably I haven't catch your point. I don't find Maxim's original bug
> > report. Can we share the test-case and your analysis detail?
>
> The Maxim's original report is here:
> https://lists.linux-foundation.org/pipermail/linux-pm/2010-January/023982.html
>
> and the message I'm referring to is at:
> https://lists.linux-foundation.org/pipermail/linux-pm/2010-January/023990.html

Hmmm...

Usually, Increasing I/O isn't caused MM change. either subsystem change
memory alloc/free pattern and another subsystem receive such effect ;)
I don't think this message indicate MM fault.

And, 2.6.33 MM change is not much. if the fault is in MM change
(note: my guess is no), The most doubtful patch is my "killing shrink_all_zones"
patch. If old shrink_all_zones reclaimed memory much rather than required.
The patch fixed it. IOW, the patch can reduce available free memory to be used
buggy .suspend of the driver. but I don't think it is MM fault.

As I said, drivers can't use memory freely as their demand in suspend method.
It's obvious. They should stop such unrealistic assumption. but How should we fix
this?
- Gurantee suspend I/O device at last?
- Make much much free memory before calling .suspend method? even though
typical drivers don't need.
- Ask all drivers how much they require memory before starting suspend and
Make enough free memory at first?
- Or, do we have an alternative way?


Probably we have multiple option. but I don't think GFP_NOIO is good
option. It assume the system have lots non-dirty cache memory and it isn't
guranteed.


2010-01-21 20:21:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Thursday 21 January 2010, KOSAKI Motohiro wrote:
> > > Hi Rafael,
> > >
> > > Do you mean this is the unrelated issue of nVidia bug?
> >
> > The nvidia driver _is_ buggy, but Maxim said he couldn't reproduce the
> > problem if all the allocations made by the nvidia driver during suspend
> > were changed to GFP_ATOMIC.
> >
> > > Probably I haven't catch your point. I don't find Maxim's original bug
> > > report. Can we share the test-case and your analysis detail?
> >
> > The Maxim's original report is here:
> > https://lists.linux-foundation.org/pipermail/linux-pm/2010-January/023982.html
> >
> > and the message I'm referring to is at:
> > https://lists.linux-foundation.org/pipermail/linux-pm/2010-January/023990.html
>
> Hmmm...
>
> Usually, Increasing I/O isn't caused MM change. either subsystem change
> memory alloc/free pattern and another subsystem receive such effect ;)
> I don't think this message indicate MM fault.
>
> And, 2.6.33 MM change is not much. if the fault is in MM change
> (note: my guess is no), The most doubtful patch is my "killing shrink_all_zones"
> patch. If old shrink_all_zones reclaimed memory much rather than required.
> The patch fixed it. IOW, the patch can reduce available free memory to be used
> buggy .suspend of the driver. but I don't think it is MM fault.
>
> As I said, drivers can't use memory freely as their demand in suspend method.
> It's obvious. They should stop such unrealistic assumption. but How should we fix
> this?
> - Gurantee suspend I/O device at last?
> - Make much much free memory before calling .suspend method? even though
> typical drivers don't need.

That doesn't help already. Maxim tried to increase SPARE_PAGES (in
kernel/power/power.h) and that had no effect.

> - Ask all drivers how much they require memory before starting suspend and
> Make enough free memory at first?

That's equivalent to reworking all drivers to allocate memory before suspend
eg. with the help of PM notifiers. Which IMHO is unrealistic.

> - Or, do we have an alternative way?

The $subject patch?

> Probably we have multiple option. but I don't think GFP_NOIO is good
> option. It assume the system have lots non-dirty cache memory and it isn't
> guranteed.

Basically nothing is guaranteed in this case. However, does it actually make
things _worse_? What _exactly_ does happen without the $subject patch if the
system doesn't have non-dirty cache memory and someone makes a GFP_KERNEL
allocation during suspend?

Rafael

2010-01-21 20:40:54

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

Hi.

Rafael J. Wysocki wrote:
> On Thursday 21 January 2010, KOSAKI Motohiro wrote:
>> - Ask all drivers how much they require memory before starting suspend and
>> Make enough free memory at first?
>
> That's equivalent to reworking all drivers to allocate memory before suspend
> eg. with the help of PM notifiers. Which IMHO is unrealistic.

What's unrealistic about it? I can see that it would be a lot of work,
but unrealistic? To me, at this stage, it sounds like the ideal solution.

Regards,

Nigel

2010-01-21 21:38:35

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: Memory allocations in .suspend became very unreliable)

On Thursday 21 January 2010, Nigel Cunningham wrote:
> Hi.
>
> Rafael J. Wysocki wrote:
> > On Thursday 21 January 2010, KOSAKI Motohiro wrote:
> >> - Ask all drivers how much they require memory before starting suspend and
> >> Make enough free memory at first?
> >
> > That's equivalent to reworking all drivers to allocate memory before suspend
> > eg. with the help of PM notifiers. Which IMHO is unrealistic.
>
> What's unrealistic about it? I can see that it would be a lot of work,
> but unrealistic? To me, at this stage, it sounds like the ideal solution.

First, we'd need to audit the drivers which is quite a task by itself.
Second, we'd need to make changes, preferably test them or find someone with
suitable hardware to do that for us and propagate them upstream.

I don't really think we have the time to do it.

Rafael

2010-01-22 01:31:37

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

> > Probably we have multiple option. but I don't think GFP_NOIO is good
> > option. It assume the system have lots non-dirty cache memory and it isn't
> > guranteed.
>
> Basically nothing is guaranteed in this case. However, does it actually make
> things _worse_?

Hmm..
Do you mean we don't need to prevent accidental suspend failure?
Perhaps, I did misunderstand your intention. If you think your patch solve
this this issue, I still disagree. but If you think your patch mitigate
the pain of this issue, I agree it. I don't have any reason to oppose your
first patch.

> What _exactly_ does happen without the $subject patch if the
> system doesn't have non-dirty cache memory and someone makes a GFP_KERNEL
> allocation during suspend?

Page allocator prefer to spent lots time for reclaimable memory searching than
returning NULL. IOW, it can spent time few second if it doesn't have
reclaimable memory.
In typical case, OOM killer forcely make enough free memory if the system
don't have any memory. But under suspending time, oom killer is disabled.
So, if the caller (probably drivers) call alloc >1000times, the system
spent lots seconds.

In this case, GFP_NOIO doesn't help. slowness behavior is caused by
freeable memory search, not slow i/o.

However, if strange i/o device makes any i/o slowness, GFP_NOIO might help.
In this case, please don't ask me about i/o thing. I don't know ;)



2010-01-22 01:42:17

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

> > > Probably we have multiple option. but I don't think GFP_NOIO is good
> > > option. It assume the system have lots non-dirty cache memory and it isn't
> > > guranteed.
> >
> > Basically nothing is guaranteed in this case. However, does it actually make
> > things _worse_?
>
> Hmm..
> Do you mean we don't need to prevent accidental suspend failure?
> Perhaps, I did misunderstand your intention. If you think your patch solve
> this this issue, I still disagree. but If you think your patch mitigate
> the pain of this issue, I agree it. I don't have any reason to oppose your
> first patch.

One question. Have anyone tested Rafael's $subject patch?
Please post test result. if the issue disapper by the patch, we can
suppose the slowness is caused by i/o layer.


2010-01-22 10:11:16

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Fri, 2010-01-22 at 10:42 +0900, KOSAKI Motohiro wrote:
> > > > Probably we have multiple option. but I don't think GFP_NOIO is good
> > > > option. It assume the system have lots non-dirty cache memory and it isn't
> > > > guranteed.
> > >
> > > Basically nothing is guaranteed in this case. However, does it actually make
> > > things _worse_?
> >
> > Hmm..
> > Do you mean we don't need to prevent accidental suspend failure?
> > Perhaps, I did misunderstand your intention. If you think your patch solve
> > this this issue, I still disagree. but If you think your patch mitigate
> > the pain of this issue, I agree it. I don't have any reason to oppose your
> > first patch.
>
> One question. Have anyone tested Rafael's $subject patch?
> Please post test result. if the issue disapper by the patch, we can
> suppose the slowness is caused by i/o layer.

I did.

As far as I could see, patch does solve the problem I described.

Does it affect speed of suspend? I can't say for sure. It seems to be
the same.

Best regards,
Maxim Levitsky

2010-01-22 20:58:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Friday 22 January 2010, KOSAKI Motohiro wrote:
> > > Probably we have multiple option. but I don't think GFP_NOIO is good
> > > option. It assume the system have lots non-dirty cache memory and it isn't
> > > guranteed.
> >
> > Basically nothing is guaranteed in this case. However, does it actually make
> > things _worse_?
>
> Hmm..
> Do you mean we don't need to prevent accidental suspend failure?
> Perhaps, I did misunderstand your intention. If you think your patch solve
> this this issue, I still disagree.

No, I don't.

> but If you think your patch mitigate the pain of this issue, I agree it.

That's what I wanted to say really.

> I don't have any reason to oppose your first patch.

Great!

> > What _exactly_ does happen without the $subject patch if the
> > system doesn't have non-dirty cache memory and someone makes a GFP_KERNEL
> > allocation during suspend?
>
> Page allocator prefer to spent lots time for reclaimable memory searching than
> returning NULL. IOW, it can spent time few second if it doesn't have
> reclaimable memory.
> In typical case, OOM killer forcely make enough free memory if the system
> don't have any memory. But under suspending time, oom killer is disabled.
> So, if the caller (probably drivers) call alloc >1000times, the system
> spent lots seconds.
>
> In this case, GFP_NOIO doesn't help. slowness behavior is caused by
> freeable memory search, not slow i/o.
>
> However, if strange i/o device makes any i/o slowness, GFP_NOIO might help.
> In this case, please don't ask me about i/o thing. I don't know ;)

OK, thanks for the explanation.

Rafael

2010-01-22 21:18:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: [Update][PATCH] MM / PM: Force GFP_NOIO during suspend/hibernation and resume

On Friday 22 January 2010, Maxim Levitsky wrote:
> On Fri, 2010-01-22 at 10:42 +0900, KOSAKI Motohiro wrote:
> > > > > Probably we have multiple option. but I don't think GFP_NOIO is good
> > > > > option. It assume the system have lots non-dirty cache memory and it isn't
> > > > > guranteed.
> > > >
> > > > Basically nothing is guaranteed in this case. However, does it actually make
> > > > things _worse_?
> > >
> > > Hmm..
> > > Do you mean we don't need to prevent accidental suspend failure?
> > > Perhaps, I did misunderstand your intention. If you think your patch solve
> > > this this issue, I still disagree. but If you think your patch mitigate
> > > the pain of this issue, I agree it. I don't have any reason to oppose your
> > > first patch.
> >
> > One question. Have anyone tested Rafael's $subject patch?
> > Please post test result. if the issue disapper by the patch, we can
> > suppose the slowness is caused by i/o layer.
>
> I did.
>
> As far as I could see, patch does solve the problem I described.
>
> Does it affect speed of suspend? I can't say for sure. It seems to be
> the same.

Thanks for testing.

Below is a slightly different version of that patch. Can you please give it a
try?

Rafael

---
From: Rafael J. Wysocki <[email protected]>
Subject: MM / PM: Force GFP_NOIO during suspend/hibernation and resume

There are quite a few GFP_KERNEL memory allocations made during
suspend/hibernation and resume that may cause the system to hang,
because the I/O operations they depend on cannot be completed due to
the underlying devices being suspended.

Mitigate this problem by clearing the __GFP_IO and __GFP_FS bits in
gfp_allowed_mask before suspend/hibernation and restoring the
original values of these bits in gfp_allowed_mask durig the
subsequent resume.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Reported-by: Maxim Levitsky <[email protected]>
---
kernel/power/hibernate.c | 6 ++++++
kernel/power/power.h | 3 +++
kernel/power/suspend.c | 2 ++
mm/Makefile | 1 +
mm/pm.c | 28 ++++++++++++++++++++++++++++
5 files changed, 40 insertions(+)

Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -334,6 +334,7 @@ int hibernation_snapshot(int platform_mo
goto Close;

suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_FREEZE);
if (error)
goto Recover_platform;
@@ -351,6 +352,7 @@ int hibernation_snapshot(int platform_mo

dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
+ mm_allow_io_allocations();
resume_console();
Close:
platform_end(platform_mode);
@@ -448,11 +450,13 @@ int hibernation_restore(int platform_mod

pm_prepare_console();
suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_QUIESCE);
if (!error) {
error = resume_target_kernel(platform_mode);
dpm_resume_end(PMSG_RECOVER);
}
+ mm_allow_io_allocations();
resume_console();
pm_restore_console();
return error;
@@ -481,6 +485,7 @@ int hibernation_platform_enter(void)

entering_platform_hibernation = true;
suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_HIBERNATE);
if (error) {
if (hibernation_ops->recover)
@@ -518,6 +523,7 @@ int hibernation_platform_enter(void)
Resume_devices:
entering_platform_hibernation = false;
dpm_resume_end(PMSG_RESTORE);
+ mm_allow_io_allocations();
resume_console();

Close:
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -187,6 +187,9 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+/* mm/pm.c */
+extern void mm_force_noio_allocations(void);
+extern void mm_allow_io_allocations(void);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -208,6 +208,7 @@ int suspend_devices_and_enter(suspend_st
goto Close;
}
suspend_console();
+ mm_force_noio_allocations();
suspend_test_start();
error = dpm_suspend_start(PMSG_SUSPEND);
if (error) {
@@ -224,6 +225,7 @@ int suspend_devices_and_enter(suspend_st
suspend_test_start();
dpm_resume_end(PMSG_RESUME);
suspend_test_finish("resume devices");
+ mm_allow_io_allocations();
resume_console();
Close:
if (suspend_ops->end)
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_PM_SLEEP) += pm.o
Index: linux-2.6/mm/pm.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/pm.c
@@ -0,0 +1,28 @@
+#include <linux/gfp.h>
+
+static gfp_t saved_gfp_allowed_mask;
+
+#define GFP_IOFS (__GFP_IO | __GFP_FS)
+
+/**
+ * mm_force_noio_allocations - Modify gfp_allowed_mask to disable IO allocations
+ *
+ * Change gfp_allowed_mask by unsetting __GFP_IO and __GFP_FS in it and save the
+ * old value.
+ */
+void mm_force_noio_allocations(void)
+{
+ saved_gfp_allowed_mask = gfp_allowed_mask;
+ gfp_allowed_mask &= ~GFP_IOFS;
+}
+
+/**
+ * mm_allow_io_allocations - Modify gfp_allowed_mask to allow IO allocations
+ *
+ * If the saved value of gfp_allowed_mask has __GFP_IO set, modify the current
+ * gfp_allowed_mask by setting this bit and anlogously for __GFP_FS.
+ */
+void mm_allow_io_allocations(void)
+{
+ gfp_allowed_mask |= saved_gfp_allowed_mask & GFP_IOFS;
+}

2010-01-23 09:29:28

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [Update][PATCH] MM / PM: Force GFP_NOIO during suspend/hibernation and resume

On Fri, 2010-01-22 at 22:19 +0100, Rafael J. Wysocki wrote:
> On Friday 22 January 2010, Maxim Levitsky wrote:
> > On Fri, 2010-01-22 at 10:42 +0900, KOSAKI Motohiro wrote:
> > > > > > Probably we have multiple option. but I don't think GFP_NOIO is good
> > > > > > option. It assume the system have lots non-dirty cache memory and it isn't
> > > > > > guranteed.
> > > > >
> > > > > Basically nothing is guaranteed in this case. However, does it actually make
> > > > > things _worse_?
> > > >
> > > > Hmm..
> > > > Do you mean we don't need to prevent accidental suspend failure?
> > > > Perhaps, I did misunderstand your intention. If you think your patch solve
> > > > this this issue, I still disagree. but If you think your patch mitigate
> > > > the pain of this issue, I agree it. I don't have any reason to oppose your
> > > > first patch.
> > >
> > > One question. Have anyone tested Rafael's $subject patch?
> > > Please post test result. if the issue disapper by the patch, we can
> > > suppose the slowness is caused by i/o layer.
> >
> > I did.
> >
> > As far as I could see, patch does solve the problem I described.
> >
> > Does it affect speed of suspend? I can't say for sure. It seems to be
> > the same.
>
> Thanks for testing.

I'll test that too, soon.
Just to note that I left my hibernate loop run overnight, and now I am
posting from my notebook after it did 590 hibernate cycles.


Offtopic, but Note that to achieve that I had to stop using global acpi
hardware lock. I tried all kinds of things, but for now it just hands
from time to time.
See http://bugzilla.kernel.org/show_bug.cgi?id=14668

Best regards,
Maxim Levitsky



2010-01-25 21:48:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Update][PATCH] MM / PM: Force GFP_NOIO during suspend/hibernation and resume

On Saturday 23 January 2010, Maxim Levitsky wrote:
> On Fri, 2010-01-22 at 22:19 +0100, Rafael J. Wysocki wrote:
> > On Friday 22 January 2010, Maxim Levitsky wrote:
> > > On Fri, 2010-01-22 at 10:42 +0900, KOSAKI Motohiro wrote:
> > > > > > > Probably we have multiple option. but I don't think GFP_NOIO is good
> > > > > > > option. It assume the system have lots non-dirty cache memory and it isn't
> > > > > > > guranteed.
> > > > > >
> > > > > > Basically nothing is guaranteed in this case. However, does it actually make
> > > > > > things _worse_?
> > > > >
> > > > > Hmm..
> > > > > Do you mean we don't need to prevent accidental suspend failure?
> > > > > Perhaps, I did misunderstand your intention. If you think your patch solve
> > > > > this this issue, I still disagree. but If you think your patch mitigate
> > > > > the pain of this issue, I agree it. I don't have any reason to oppose your
> > > > > first patch.
> > > >
> > > > One question. Have anyone tested Rafael's $subject patch?
> > > > Please post test result. if the issue disapper by the patch, we can
> > > > suppose the slowness is caused by i/o layer.
> > >
> > > I did.
> > >
> > > As far as I could see, patch does solve the problem I described.
> > >
> > > Does it affect speed of suspend? I can't say for sure. It seems to be
> > > the same.
> >
> > Thanks for testing.
>
> I'll test that too, soon.
> Just to note that I left my hibernate loop run overnight, and now I am
> posting from my notebook after it did 590 hibernate cycles.

Did you have a chance to test it?

> Offtopic, but Note that to achieve that I had to stop using global acpi
> hardware lock. I tried all kinds of things, but for now it just hands
> from time to time.
> See http://bugzilla.kernel.org/show_bug.cgi?id=14668

I'm going to look at that later this week, although I'm not sure I can do more
than Alex about that.

Rafael

2010-01-25 21:52:14

by Alexey Starikovskiy

[permalink] [raw]
Subject: Re: [Update][PATCH] MM / PM: Force GFP_NOIO during suspend/hibernation and resume

Rafael J. Wysocki пишет:
> On Saturday 23 January 2010, Maxim Levitsky wrote:
>> On Fri, 2010-01-22 at 22:19 +0100, Rafael J. Wysocki wrote:
>>> On Friday 22 January 2010, Maxim Levitsky wrote:
>>>> On Fri, 2010-01-22 at 10:42 +0900, KOSAKI Motohiro wrote:
>>>>>>>> Probably we have multiple option. but I don't think GFP_NOIO is good
>>>>>>>> option. It assume the system have lots non-dirty cache memory and it isn't
>>>>>>>> guranteed.
>>>>>>> Basically nothing is guaranteed in this case. However, does it actually make
>>>>>>> things _worse_?
>>>>>> Hmm..
>>>>>> Do you mean we don't need to prevent accidental suspend failure?
>>>>>> Perhaps, I did misunderstand your intention. If you think your patch solve
>>>>>> this this issue, I still disagree. but If you think your patch mitigate
>>>>>> the pain of this issue, I agree it. I don't have any reason to oppose your
>>>>>> first patch.
>>>>> One question. Have anyone tested Rafael's $subject patch?
>>>>> Please post test result. if the issue disapper by the patch, we can
>>>>> suppose the slowness is caused by i/o layer.
>>>> I did.
>>>>
>>>> As far as I could see, patch does solve the problem I described.
>>>>
>>>> Does it affect speed of suspend? I can't say for sure. It seems to be
>>>> the same.
>>> Thanks for testing.
>> I'll test that too, soon.
>> Just to note that I left my hibernate loop run overnight, and now I am
>> posting from my notebook after it did 590 hibernate cycles.
>
> Did you have a chance to test it?
>
>> Offtopic, but Note that to achieve that I had to stop using global acpi
>> hardware lock. I tried all kinds of things, but for now it just hands
>> from time to time.
>> See http://bugzilla.kernel.org/show_bug.cgi?id=14668
>
> I'm going to look at that later this week, although I'm not sure I can do more
> than Alex about that.
>
> Rafael
Rafael,
If you can point to where one may insert callback to be called just before handing control to resume kernel,
it may help...

Regards,
Alex.


2010-01-30 15:47:10

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Sun, 2010-01-17 at 01:38 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> I thing the snippet below is a good summary of what this is about.

Any progress on that?

Best regards,
Maxim Levitsky

2010-01-30 18:46:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Saturday 30 January 2010, Maxim Levitsky wrote:
> On Sun, 2010-01-17 at 01:38 +0100, Rafael J. Wysocki wrote:
> > Hi,
> >
> > I thing the snippet below is a good summary of what this is about.
>
> Any progress on that?

Well, I'm waiting for you to report back:
http://patchwork.kernel.org/patch/74740/

The patch is appended once again for convenience.

Rafael

---
From: Rafael J. Wysocki <[email protected]>
Subject: MM / PM: Force GFP_NOIO during suspend/hibernation and resume

There are quite a few GFP_KERNEL memory allocations made during
suspend/hibernation and resume that may cause the system to hang,
because the I/O operations they depend on cannot be completed due to
the underlying devices being suspended.

Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
gfp_allowed_mask before suspend/hibernation and restoring the
original values of these bits in gfp_allowed_mask durig the
subsequent resume.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Reported-by: Maxim Levitsky <[email protected]>
---
kernel/power/hibernate.c | 6 ++++++
kernel/power/power.h | 3 +++
kernel/power/suspend.c | 2 ++
mm/Makefile | 1 +
mm/pm.c | 28 ++++++++++++++++++++++++++++
5 files changed, 40 insertions(+)

Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -334,6 +334,7 @@ int hibernation_snapshot(int platform_mo
goto Close;

suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_FREEZE);
if (error)
goto Recover_platform;
@@ -351,6 +352,7 @@ int hibernation_snapshot(int platform_mo

dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
+ mm_allow_io_allocations();
resume_console();
Close:
platform_end(platform_mode);
@@ -448,11 +450,13 @@ int hibernation_restore(int platform_mod

pm_prepare_console();
suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_QUIESCE);
if (!error) {
error = resume_target_kernel(platform_mode);
dpm_resume_end(PMSG_RECOVER);
}
+ mm_allow_io_allocations();
resume_console();
pm_restore_console();
return error;
@@ -481,6 +485,7 @@ int hibernation_platform_enter(void)

entering_platform_hibernation = true;
suspend_console();
+ mm_force_noio_allocations();
error = dpm_suspend_start(PMSG_HIBERNATE);
if (error) {
if (hibernation_ops->recover)
@@ -518,6 +523,7 @@ int hibernation_platform_enter(void)
Resume_devices:
entering_platform_hibernation = false;
dpm_resume_end(PMSG_RESTORE);
+ mm_allow_io_allocations();
resume_console();

Close:
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -187,6 +187,9 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+/* mm/pm.c */
+extern void mm_force_noio_allocations(void);
+extern void mm_allow_io_allocations(void);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -208,6 +208,7 @@ int suspend_devices_and_enter(suspend_st
goto Close;
}
suspend_console();
+ mm_force_noio_allocations();
suspend_test_start();
error = dpm_suspend_start(PMSG_SUSPEND);
if (error) {
@@ -224,6 +225,7 @@ int suspend_devices_and_enter(suspend_st
suspend_test_start();
dpm_resume_end(PMSG_RESUME);
suspend_test_finish("resume devices");
+ mm_allow_io_allocations();
resume_console();
Close:
if (suspend_ops->end)
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_PM_SLEEP) += pm.o
Index: linux-2.6/mm/pm.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/pm.c
@@ -0,0 +1,28 @@
+#include <linux/gfp.h>
+
+static gfp_t saved_gfp_allowed_mask;
+
+#define GFP_IOFS (__GFP_IO | __GFP_FS)
+
+/**
+ * mm_force_noio_allocations - Modify gfp_allowed_mask to disable IO allocations
+ *
+ * Change gfp_allowed_mask by unsetting __GFP_IO and __GFP_FS in it and save the
+ * old value.
+ */
+void mm_force_noio_allocations(void)
+{
+ saved_gfp_allowed_mask = gfp_allowed_mask;
+ gfp_allowed_mask &= ~GFP_IOFS;
+}
+
+/**
+ * mm_allow_io_allocations - Modify gfp_allowed_mask to allow IO allocations
+ *
+ * If the saved value of gfp_allowed_mask has __GFP_IO set, modify the current
+ * gfp_allowed_mask by setting this bit and anlogously for __GFP_FS.
+ */
+void mm_allow_io_allocations(void)
+{
+ gfp_allowed_mask |= saved_gfp_allowed_mask & GFP_IOFS;
+}

2010-01-30 18:56:14

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Update][PATCH] MM / PM: Force GFP_NOIO during suspend/hibernation and resume

On Monday 25 January 2010, Alexey Starikovskiy wrote:
> Rafael J. Wysocki пишет:
> > On Saturday 23 January 2010, Maxim Levitsky wrote:
> >> On Fri, 2010-01-22 at 22:19 +0100, Rafael J. Wysocki wrote:
> >>> On Friday 22 January 2010, Maxim Levitsky wrote:
> >>>> On Fri, 2010-01-22 at 10:42 +0900, KOSAKI Motohiro wrote:
> >>>>>>>> Probably we have multiple option. but I don't think GFP_NOIO is good
> >>>>>>>> option. It assume the system have lots non-dirty cache memory and it isn't
> >>>>>>>> guranteed.
> >>>>>>> Basically nothing is guaranteed in this case. However, does it actually make
> >>>>>>> things _worse_?
> >>>>>> Hmm..
> >>>>>> Do you mean we don't need to prevent accidental suspend failure?
> >>>>>> Perhaps, I did misunderstand your intention. If you think your patch solve
> >>>>>> this this issue, I still disagree. but If you think your patch mitigate
> >>>>>> the pain of this issue, I agree it. I don't have any reason to oppose your
> >>>>>> first patch.
> >>>>> One question. Have anyone tested Rafael's $subject patch?
> >>>>> Please post test result. if the issue disapper by the patch, we can
> >>>>> suppose the slowness is caused by i/o layer.
> >>>> I did.
> >>>>
> >>>> As far as I could see, patch does solve the problem I described.
> >>>>
> >>>> Does it affect speed of suspend? I can't say for sure. It seems to be
> >>>> the same.
> >>> Thanks for testing.
> >> I'll test that too, soon.
> >> Just to note that I left my hibernate loop run overnight, and now I am
> >> posting from my notebook after it did 590 hibernate cycles.
> >
> > Did you have a chance to test it?
> >
> >> Offtopic, but Note that to achieve that I had to stop using global acpi
> >> hardware lock. I tried all kinds of things, but for now it just hands
> >> from time to time.
> >> See http://bugzilla.kernel.org/show_bug.cgi?id=14668
> >
> > I'm going to look at that later this week, although I'm not sure I can do more
> > than Alex about that.
> >
> > Rafael
> Rafael,
> If you can point to where one may insert callback to be called just before handing control to resume kernel,
> it may help...

Generally speaking, I'd do that in a .suspend() callback of one of devices.

If that's inconvenient, you can also place it in the .pre_restore() platform
hibernate callback (drivers/acpi/sleep.c). It only disables GPEs right now,
it might release the global lock as well.

The .pre_restore() callback is executed after all devices have been suspended,
so there's no danger any driver would re-acquire the global lock after that.

Rafael

2010-01-30 20:37:52

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Sat, 2010-01-30 at 19:47 +0100, Rafael J. Wysocki wrote:
> On Saturday 30 January 2010, Maxim Levitsky wrote:
> > On Sun, 2010-01-17 at 01:38 +0100, Rafael J. Wysocki wrote:
> > > Hi,
> > >
> > > I thing the snippet below is a good summary of what this is about.
> >
> > Any progress on that?
>
> Well, I'm waiting for you to report back:
> http://patchwork.kernel.org/patch/74740/
>
> The patch is appended once again for convenience.

Ah, sorry!

I used the second version (with the locks) and it works for sure (~500
cycles)

However, as I discovered today, it takes the lock also for GFP_ATOMIC,
and thats why I see several backtraces in the kernel log. Anyway this
isn't important.

I forgot all about this patch, and I am compiling the kernel right away.
Will put the kernel through the hibernate loop tonight.

Best regards,
Maxim Levitsky

2010-01-30 20:42:28

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [Update][PATCH] MM / PM: Force GFP_NOIO during suspend/hibernation and resume

On Sat, 2010-01-30 at 19:56 +0100, Rafael J. Wysocki wrote:
> On Monday 25 January 2010, Alexey Starikovskiy wrote:
> > Rafael J. Wysocki пишет:
> > > On Saturday 23 January 2010, Maxim Levitsky wrote:
> > >> On Fri, 2010-01-22 at 22:19 +0100, Rafael J. Wysocki wrote:
> > >>> On Friday 22 January 2010, Maxim Levitsky wrote:
> > >>>> On Fri, 2010-01-22 at 10:42 +0900, KOSAKI Motohiro wrote:
> > >>>>>>>> Probably we have multiple option. but I don't think GFP_NOIO is good
> > >>>>>>>> option. It assume the system have lots non-dirty cache memory and it isn't
> > >>>>>>>> guranteed.
> > >>>>>>> Basically nothing is guaranteed in this case. However, does it actually make
> > >>>>>>> things _worse_?
> > >>>>>> Hmm..
> > >>>>>> Do you mean we don't need to prevent accidental suspend failure?
> > >>>>>> Perhaps, I did misunderstand your intention. If you think your patch solve
> > >>>>>> this this issue, I still disagree. but If you think your patch mitigate
> > >>>>>> the pain of this issue, I agree it. I don't have any reason to oppose your
> > >>>>>> first patch.
> > >>>>> One question. Have anyone tested Rafael's $subject patch?
> > >>>>> Please post test result. if the issue disapper by the patch, we can
> > >>>>> suppose the slowness is caused by i/o layer.
> > >>>> I did.
> > >>>>
> > >>>> As far as I could see, patch does solve the problem I described.
> > >>>>
> > >>>> Does it affect speed of suspend? I can't say for sure. It seems to be
> > >>>> the same.
> > >>> Thanks for testing.
> > >> I'll test that too, soon.
> > >> Just to note that I left my hibernate loop run overnight, and now I am
> > >> posting from my notebook after it did 590 hibernate cycles.
> > >
> > > Did you have a chance to test it?
> > >
> > >> Offtopic, but Note that to achieve that I had to stop using global acpi
> > >> hardware lock. I tried all kinds of things, but for now it just hands
> > >> from time to time.
> > >> See http://bugzilla.kernel.org/show_bug.cgi?id=14668
> > >
> > > I'm going to look at that later this week, although I'm not sure I can do more
> > > than Alex about that.
> > >
> > > Rafael
> > Rafael,
> > If you can point to where one may insert callback to be called just before handing control to resume kernel,
> > it may help...
>
> Generally speaking, I'd do that in a .suspend() callback of one of devices.
>
> If that's inconvenient, you can also place it in the .pre_restore() platform
> hibernate callback (drivers/acpi/sleep.c). It only disables GPEs right now,
> it might release the global lock as well.
>
> The .pre_restore() callback is executed after all devices have been suspended,
> so there's no danger any driver would re-acquire the global lock after that.


Well, I did that very late, very close to image restore.
Still, it didn't work (It hung after the resume, in the kernel that was
just restored, on access to the hardware lock, or in other words in same
way)

Here is what I did:

commit 71e0be39531ac01b99020ea139ef3c23aa6de415
Author: Maxim Levitsky <[email protected]>
Date: Wed Jan 20 21:52:21 2010 +0200

Kernel that does the resume from disk can take global hardware
acpi lock, but not release it (for example if it is in middle of access to locked field)
Always release it before passing control to resumed kernel

diff --git a/drivers/acpi/acpica/acglobal.h b/drivers/acpi/acpica/acglobal.h
index 29ba66d..29e1be0 100644
--- a/drivers/acpi/acpica/acglobal.h
+++ b/drivers/acpi/acpica/acglobal.h
@@ -195,6 +195,7 @@ ACPI_EXTERN acpi_semaphore acpi_gbl_global_lock_semaphore;
ACPI_EXTERN u16 acpi_gbl_global_lock_handle;
ACPI_EXTERN u8 acpi_gbl_global_lock_acquired;
ACPI_EXTERN u8 acpi_gbl_global_lock_present;
+ACPI_EXTERN u8 acpi_gbl_global_lock_suspended;

/*
* Spinlocks are used for interfaces that can be possibly called at
diff --git a/drivers/acpi/acpica/evmisc.c b/drivers/acpi/acpica/evmisc.c
index ce224e1..f3569db 100644
--- a/drivers/acpi/acpica/evmisc.c
+++ b/drivers/acpi/acpica/evmisc.c
@@ -456,7 +456,7 @@ acpi_status acpi_ev_acquire_global_lock(u16 timeout)
* Make sure that a global lock actually exists. If not, just treat the
* lock as a standard mutex.
*/
- if (!acpi_gbl_global_lock_present) {
+ if (!acpi_gbl_global_lock_present || acpi_gbl_global_lock_suspended) {
acpi_gbl_global_lock_acquired = TRUE;
return_ACPI_STATUS(AE_OK);
}
diff --git a/drivers/acpi/acpica/evxface.c b/drivers/acpi/acpica/evxface.c
index 2fe0809..b27904e 100644
--- a/drivers/acpi/acpica/evxface.c
+++ b/drivers/acpi/acpica/evxface.c
@@ -820,3 +820,29 @@ acpi_status acpi_release_global_lock(u32 handle)
}

ACPI_EXPORT_SYMBOL(acpi_release_global_lock)
+
+
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_bust_global_lock
+ *
+ * PARAMETERS: None
+ *
+ * RETURN: Status
+ *
+ * DESCRIPTION: Release global lock and ignore it from now on
+ *
+ ******************************************************************************/
+
+acpi_status acpi_bust_global_lock(void)
+{
+ acpi_status status;
+
+ status = acpi_ex_acquire_mutex_object(ACPI_WAIT_FOREVER,
+ acpi_gbl_global_lock_mutex,
+ acpi_os_get_thread_id());
+
+ acpi_gbl_global_lock_suspended = TRUE;
+
+ status = acpi_ex_release_mutex_object(acpi_gbl_global_lock_mutex);
+}
diff --git a/drivers/acpi/acpica/utglobal.c b/drivers/acpi/acpica/utglobal.c
index 3f2c68f..e4cafea 100644
--- a/drivers/acpi/acpica/utglobal.c
+++ b/drivers/acpi/acpica/utglobal.c
@@ -782,6 +782,7 @@ acpi_status acpi_ut_init_globals(void)
acpi_gbl_global_lock_acquired = FALSE;
acpi_gbl_global_lock_handle = 0;
acpi_gbl_global_lock_present = FALSE;
+ acpi_gbl_global_lock_suspended = FALSE;

/* Miscellaneous variables */

diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 79d33d9..6e41954 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -119,6 +119,23 @@ static int acpi_pm_disable_gpes(void)
}

/**
+ * acpi_pm_pre_restore - called just before the restore
+ */
+
+static int acpi_pm_pre_restore(void)
+{
+ u32 glk;
+
+ acpi_pm_disable_gpes();
+
+ /* We can't let the resuming kernel see the global
+ lock locked, so bust it */
+
+ acpi_bust_global_lock();
+ return 0;
+}
+
+/**
* __acpi_pm_prepare - Prepare the platform to enter the target state.
*
* If necessary, set the firmware waking vector and do arch-specific
@@ -565,7 +582,7 @@ static struct platform_hibernation_ops acpi_hibernation_ops = {
.prepare = acpi_pm_prepare,
.enter = acpi_hibernation_enter,
.leave = acpi_hibernation_leave,
- .pre_restore = acpi_pm_disable_gpes,
+ .pre_restore = acpi_pm_pre_restore,
.restore_cleanup = acpi_pm_enable_gpes,
};

diff --git a/include/acpi/acpixf.h b/include/acpi/acpixf.h
index 86e9735..63132b0 100644
--- a/include/acpi/acpixf.h
+++ b/include/acpi/acpixf.h
@@ -270,6 +270,8 @@ acpi_status acpi_acquire_global_lock(u16 timeout, u32 * handle);

acpi_status acpi_release_global_lock(u32 handle);

+acpi_status acpi_bust_global_lock(void);
+
acpi_status acpi_enable_event(u32 event, u32 flags);

acpi_status acpi_disable_event(u32 event, u32 flags);

2010-01-30 20:53:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Update][PATCH] MM / PM: Force GFP_NOIO during suspend/hibernation and resume

On Saturday 30 January 2010, Maxim Levitsky wrote:
> On Sat, 2010-01-30 at 19:56 +0100, Rafael J. Wysocki wrote:
> > On Monday 25 January 2010, Alexey Starikovskiy wrote:
> > > Rafael J. Wysocki пишет:
> > > > On Saturday 23 January 2010, Maxim Levitsky wrote:
> > > >> On Fri, 2010-01-22 at 22:19 +0100, Rafael J. Wysocki wrote:
> > > >>> On Friday 22 January 2010, Maxim Levitsky wrote:
> > > >>>> On Fri, 2010-01-22 at 10:42 +0900, KOSAKI Motohiro wrote:
> > > >>>>>>>> Probably we have multiple option. but I don't think GFP_NOIO is good
> > > >>>>>>>> option. It assume the system have lots non-dirty cache memory and it isn't
> > > >>>>>>>> guranteed.
> > > >>>>>>> Basically nothing is guaranteed in this case. However, does it actually make
> > > >>>>>>> things _worse_?
> > > >>>>>> Hmm..
> > > >>>>>> Do you mean we don't need to prevent accidental suspend failure?
> > > >>>>>> Perhaps, I did misunderstand your intention. If you think your patch solve
> > > >>>>>> this this issue, I still disagree. but If you think your patch mitigate
> > > >>>>>> the pain of this issue, I agree it. I don't have any reason to oppose your
> > > >>>>>> first patch.
> > > >>>>> One question. Have anyone tested Rafael's $subject patch?
> > > >>>>> Please post test result. if the issue disapper by the patch, we can
> > > >>>>> suppose the slowness is caused by i/o layer.
> > > >>>> I did.
> > > >>>>
> > > >>>> As far as I could see, patch does solve the problem I described.
> > > >>>>
> > > >>>> Does it affect speed of suspend? I can't say for sure. It seems to be
> > > >>>> the same.
> > > >>> Thanks for testing.
> > > >> I'll test that too, soon.
> > > >> Just to note that I left my hibernate loop run overnight, and now I am
> > > >> posting from my notebook after it did 590 hibernate cycles.
> > > >
> > > > Did you have a chance to test it?
> > > >
> > > >> Offtopic, but Note that to achieve that I had to stop using global acpi
> > > >> hardware lock. I tried all kinds of things, but for now it just hands
> > > >> from time to time.
> > > >> See http://bugzilla.kernel.org/show_bug.cgi?id=14668
> > > >
> > > > I'm going to look at that later this week, although I'm not sure I can do more
> > > > than Alex about that.
> > > >
> > > > Rafael
> > > Rafael,
> > > If you can point to where one may insert callback to be called just before handing control to resume kernel,
> > > it may help...
> >
> > Generally speaking, I'd do that in a .suspend() callback of one of devices.
> >
> > If that's inconvenient, you can also place it in the .pre_restore() platform
> > hibernate callback (drivers/acpi/sleep.c). It only disables GPEs right now,
> > it might release the global lock as well.
> >
> > The .pre_restore() callback is executed after all devices have been suspended,
> > so there's no danger any driver would re-acquire the global lock after that.
>
>
> Well, I did that very late, very close to image restore.
> Still, it didn't work (It hung after the resume, in the kernel that was
> just restored, on access to the hardware lock, or in other words in same
> way)
>
> Here is what I did:

I saw the patch in the bug entry
(http://bugzilla.kernel.org/show_bug.cgi?id=14668).
Please see the comments in there.

Please also test the patch I attached and let's use the bug entry for the
tracking of this issue from now on.

Rafael

2010-02-01 19:52:05

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Force GFP_NOIO during suspend/resume (was: Re: [linux-pm] Memory allocations in .suspend became very unreliable)

On Sat, 2010-01-30 at 22:37 +0200, Maxim Levitsky wrote:
> On Sat, 2010-01-30 at 19:47 +0100, Rafael J. Wysocki wrote:
> > On Saturday 30 January 2010, Maxim Levitsky wrote:
> > > On Sun, 2010-01-17 at 01:38 +0100, Rafael J. Wysocki wrote:
> > > > Hi,
> > > >
> > > > I thing the snippet below is a good summary of what this is about.
> > >
> > > Any progress on that?
> >
> > Well, I'm waiting for you to report back:
> > http://patchwork.kernel.org/patch/74740/
> >
> > The patch is appended once again for convenience.
>
> Ah, sorry!
>
> I used the second version (with the locks) and it works for sure (~500
> cycles)
>
> However, as I discovered today, it takes the lock also for GFP_ATOMIC,
> and thats why I see several backtraces in the kernel log. Anyway this
> isn't important.
>
> I forgot all about this patch, and I am compiling the kernel right away.
> Will put the kernel through the hibernate loop tonight.

I did 123 hibernate cycles on my notebook. Everything is fine.
This patch very very likely is working.

Best regards,
Maxim Levitsky