Hi,
When memory pressure is high, readahead could cause OOM killing.
IMHO we should stop readahead under such circumstances. If that's true,
how can we fix it?
--
Regards
dave
On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
> Hi,
>
> When memory pressure is high, readahead could cause oom killing.
> IMHO we should stop readaheading under such circumstances。If it's true
> how to fix it?
Good question. Before OOM there will be readahead thrashing, which
can be addressed by this patch:
http://lkml.org/lkml/2010/2/2/229
However, there does not seem to be much interest in that feature. I can
separate it out and resubmit it standalone if necessary.
Thanks,
Fengguang
On Tue, Apr 26, 2011 at 1:55 PM, Wu Fengguang <[email protected]> wrote:
> On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
>> Hi,
>>
>> When memory pressure is high, readahead could cause oom killing.
>> IMHO we should stop readaheading under such circumstances。If it's true
>> how to fix it?
>
> Good question. Before OOM there will be readahead thrashings, which
> can be addressed by this patch:
>
> http://lkml.org/lkml/2010/2/2/229
Hi, I'm not clear about the patch. Does it cover the cases below?
1) readahead allocation failure due to low memory, e.g. caused by another large allocation
2) readahead thrashing caused by readahead itself
>
> However there seems no much interest on that feature.. I can separate
> that out and resubmit it standalone if necessary.
>
> Thanks,
> Fengguang
>
--
Regards
dave
On Tue, Apr 26, 2011 at 2:05 PM, Dave Young <[email protected]> wrote:
> On Tue, Apr 26, 2011 at 1:55 PM, Wu Fengguang <[email protected]> wrote:
>> On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
>>> Hi,
>>>
>>> When memory pressure is high, readahead could cause oom killing.
>>> IMHO we should stop readaheading under such circumstances。If it's true
>>> how to fix it?
>>
>> Good question. Before OOM there will be readahead thrashings, which
>> can be addressed by this patch:
>>
>> http://lkml.org/lkml/2010/2/2/229
>
> Hi, I'm not clear about the patch, could be regard as below cases?
> 1) readahead alloc fail due to low memory such as other large allocation
For example, if the vm balloon allocates lots of memory, readahead could
fail immediately and then trigger OOM.
> 2) readahead thrashing caused by itself
>
>>
>> However there seems no much interest on that feature.. I can separate
>> that out and resubmit it standalone if necessary.
>>
>> Thanks,
>> Fengguang
>>
>
>
>
> --
> Regards
> dave
>
--
Regards
dave
On Tue, Apr 26, 2011 at 02:05:12PM +0800, Dave Young wrote:
> On Tue, Apr 26, 2011 at 1:55 PM, Wu Fengguang <[email protected]> wrote:
> > On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
> >> Hi,
> >>
> >> When memory pressure is high, readahead could cause oom killing.
> >> IMHO we should stop readaheading under such circumstances。If it's true
> >> how to fix it?
> >
> > Good question. Before OOM there will be readahead thrashings, which
> > can be addressed by this patch:
> >
> > http://lkml.org/lkml/2010/2/2/229
>
> Hi, I'm not clear about the patch, could be regard as below cases?
> 1) readahead alloc fail due to low memory such as other large allocation
> 2) readahead thrashing caused by itself
When memory pressure goes up (though not yet to the point of allocation
failures and OOM), the readahead pages may be reclaimed before user space
accesses them via read(). When read() then asks for such a page, it has to
be read from disk _again_. This is called readahead thrashing.
What the patch does is automatically detect readahead thrashing and shrink
the readahead size adaptively, which reduces the memory consumed by
readahead buffers.
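To illustrate the direction, here is a conceptual sketch (made-up names,
not the actual code in that patch): back off the readahead window quickly
when a thrashed page is detected, and grow it back slowly while readahead
pages are actually being consumed.

/*
 * Conceptual sketch only -- not the code from the patch linked above.
 */
struct ra_window {
	unsigned long size;		/* current window size, in pages */
	unsigned long min_size;		/* e.g. 4 pages */
	unsigned long max_size;		/* e.g. 128 pages */
};

/* A readahead page was reclaimed before read() reached it. */
static void ra_thrashed(struct ra_window *ra)
{
	ra->size /= 2;			/* back off quickly */
	if (ra->size < ra->min_size)
		ra->size = ra->min_size;
}

/* The whole previous window was consumed by read() without thrashing. */
static void ra_window_consumed(struct ra_window *ra)
{
	ra->size += ra->size / 4;	/* grow back slowly */
	if (ra->size > ra->max_size)
		ra->size = ra->max_size;
}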
Thanks,
Fengguang
> >
> > However there seems no much interest on that feature.. I can separate
> > that out and resubmit it standalone if necessary.
> >
> > Thanks,
> > Fengguang
> >
>
>
>
> --
> Regards
> dave
On Tue, Apr 26, 2011 at 2:13 PM, Wu Fengguang <[email protected]> wrote:
> On Tue, Apr 26, 2011 at 02:05:12PM +0800, Dave Young wrote:
>> On Tue, Apr 26, 2011 at 1:55 PM, Wu Fengguang <[email protected]> wrote:
>> > On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
>> >> Hi,
>> >>
>> >> When memory pressure is high, readahead could cause oom killing.
>> >> IMHO we should stop readaheading under such circumstances。If it's true
>> >> how to fix it?
>> >
>> > Good question. Before OOM there will be readahead thrashings, which
>> > can be addressed by this patch:
>> >
>> > http://lkml.org/lkml/2010/2/2/229
>>
>> Hi, I'm not clear about the patch, could be regard as below cases?
>> 1) readahead alloc fail due to low memory such as other large allocation
>> 2) readahead thrashing caused by itself
>
> When memory pressure goes up (not as much as allocation failures and OOM),
> the readahead pages may be reclaimed before they are read() accessed
> by the user space. At the time read() asks for the page, it will have
> to be read from disk _again_. This is called readahead thrashing.
>
> What the patch does is to automatically detect readahead thrashing and
> shrink the readahead size adaptively, which will the reduce memory
> consumption by readahead buffers.
Thanks for the explanation.
But there's still a question: if an allocation storm occurs at system
startup, the allocations may come so quickly that thrashing is detected
too late to throttle readahead. Is this possible?
>
> Thanks,
> Fengguang
>
>> >
>> > However there seems no much interest on that feature.. I can separate
>> > that out and resubmit it standalone if necessary.
I would like to test your new patch.
>> >
>> > Thanks,
>> > Fengguang
>> >
>>
>>
>>
>> --
>> Regards
>> dave
>
--
Regards
dave
On Tue, Apr 26, 2011 at 02:07:17PM +0800, Dave Young wrote:
> On Tue, Apr 26, 2011 at 2:05 PM, Dave Young <[email protected]> wrote:
> > On Tue, Apr 26, 2011 at 1:55 PM, Wu Fengguang <[email protected]> wrote:
> >> On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
> >>> Hi,
> >>>
> >>> When memory pressure is high, readahead could cause oom killing.
> >>> IMHO we should stop readaheading under such circumstances。If it's true
> >>> how to fix it?
> >>
> >> Good question. Before OOM there will be readahead thrashings, which
> >> can be addressed by this patch:
> >>
> >> http://lkml.org/lkml/2010/2/2/229
> >
> > Hi, I'm not clear about the patch, could be regard as below cases?
> > 1) readahead alloc fail due to low memory such as other large allocation
>
> For example vm balloon allocate lots of memory, then readahead could
> fail immediately and then oom
If true, that would be a problem with the vm balloon. It's not good to
consume lots of memory all of a sudden, which will likely impact lots
of kernel subsystems.
Btw, readahead page allocations are completely optional. They are OK to
fail and in theory shall not trigger OOM on their own. We may consider
passing __GFP_NORETRY for readahead page allocations.
Thanks,
Fengguang
On Tue, Apr 26, 2011 at 2:25 PM, Wu Fengguang <[email protected]> wrote:
> On Tue, Apr 26, 2011 at 02:07:17PM +0800, Dave Young wrote:
>> On Tue, Apr 26, 2011 at 2:05 PM, Dave Young <[email protected]> wrote:
>> > On Tue, Apr 26, 2011 at 1:55 PM, Wu Fengguang <[email protected]> wrote:
>> >> On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
>> >>> Hi,
>> >>>
>> >>> When memory pressure is high, readahead could cause oom killing.
>> >>> IMHO we should stop readaheading under such circumstances。If it's true
>> >>> how to fix it?
>> >>
>> >> Good question. Before OOM there will be readahead thrashings, which
>> >> can be addressed by this patch:
>> >>
>> >> http://lkml.org/lkml/2010/2/2/229
>> >
>> > Hi, I'm not clear about the patch, could be regard as below cases?
>> > 1) readahead alloc fail due to low memory such as other large allocation
>>
>> For example vm balloon allocate lots of memory, then readahead could
>> fail immediately and then oom
>
> If true, that would be the problem of vm balloon. It's not good to
> consume lots of memory all of a sudden, which will likely impact lots
> of kernel subsystems.
>
> btw readahead page allocations are completely optional. They are OK to
> fail and in theory shall not trigger OOM on themselves. We may
> consider passing __GFP_NORETRY for readahead page allocations.
Good idea, care to submit a patch?
--
Regards
dave
On Tue, Apr 26, 2011 at 02:29:15PM +0800, Dave Young wrote:
> On Tue, Apr 26, 2011 at 2:25 PM, Wu Fengguang <[email protected]> wrote:
> > On Tue, Apr 26, 2011 at 02:07:17PM +0800, Dave Young wrote:
> >> On Tue, Apr 26, 2011 at 2:05 PM, Dave Young <[email protected]> wrote:
> >> > On Tue, Apr 26, 2011 at 1:55 PM, Wu Fengguang <[email protected]> wrote:
> >> >> On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
> >> >>> Hi,
> >> >>>
> >> >>> When memory pressure is high, readahead could cause oom killing.
> >> >>> IMHO we should stop readaheading under such circumstances。If it's true
> >> >>> how to fix it?
> >> >>
> >> >> Good question. Before OOM there will be readahead thrashings, which
> >> >> can be addressed by this patch:
> >> >>
> >> >> http://lkml.org/lkml/2010/2/2/229
> >> >
> >> > Hi, I'm not clear about the patch, could be regard as below cases?
> >> > 1) readahead alloc fail due to low memory such as other large allocation
> >>
> >> For example vm balloon allocate lots of memory, then readahead could
> >> fail immediately and then oom
> >
> > If true, that would be the problem of vm balloon. It's not good to
> > consume lots of memory all of a sudden, which will likely impact lots
> > of kernel subsystems.
> >
> > btw readahead page allocations are completely optional. They are OK to
> > fail and in theory shall not trigger OOM on themselves. We may
> > consider passing __GFP_NORETRY for readahead page allocations.
>
> Good idea, care to submit a patch?
Here it is :)
Thanks,
Fengguang
---
readahead: readahead page allocations are OK to fail
Pass __GFP_NORETRY for readahead page allocations.
readahead page allocations are completely optional. They are OK to
fail and in particular shall not trigger OOM on themselves.
Reported-by: Dave Young <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/pagemap.h | 5 +++++
mm/readahead.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
--- linux-next.orig/include/linux/pagemap.h 2011-04-26 14:27:46.000000000 +0800
+++ linux-next/include/linux/pagemap.h 2011-04-26 14:29:31.000000000 +0800
@@ -219,6 +219,11 @@ static inline struct page *page_cache_al
return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
}
+static inline struct page *page_cache_alloc_cold_noretry(struct address_space *x)
+{
+ return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD|__GFP_NORETRY);
+}
+
typedef int filler_t(void *, struct page *);
extern struct page * find_get_page(struct address_space *mapping,
--- linux-next.orig/mm/readahead.c 2011-04-26 14:27:02.000000000 +0800
+++ linux-next/mm/readahead.c 2011-04-26 14:27:24.000000000 +0800
@@ -180,7 +180,7 @@ __do_page_cache_readahead(struct address
if (page)
continue;
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_cold_noretry(mapping);
if (!page)
break;
page->index = page_offset;
> > > btw readahead page allocations are completely optional. They are OK to
> > > fail and in theory shall not trigger OOM on themselves. We may
> > > consider passing __GFP_NORETRY for readahead page allocations.
> >
> > Good idea, care to submit a patch?
>
> Here it is :)
>
> Thanks,
> Fengguang
> ---
> readahead: readahead page allocations is OK to fail
>
> Pass __GFP_NORETRY for readahead page allocations.
>
> readahead page allocations are completely optional. They are OK to
> fail and in particular shall not trigger OOM on themselves.
>
> Reported-by: Dave Young <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> include/linux/pagemap.h | 5 +++++
> mm/readahead.c | 2 +-
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> --- linux-next.orig/include/linux/pagemap.h 2011-04-26 14:27:46.000000000 +0800
> +++ linux-next/include/linux/pagemap.h 2011-04-26 14:29:31.000000000 +0800
> @@ -219,6 +219,11 @@ static inline struct page *page_cache_al
> return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
> }
>
> +static inline struct page *page_cache_alloc_cold_noretry(struct address_space *x)
> +{
> + return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD|__GFP_NORETRY);
> +}
> +
> typedef int filler_t(void *, struct page *);
>
> extern struct page * find_get_page(struct address_space *mapping,
> --- linux-next.orig/mm/readahead.c 2011-04-26 14:27:02.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-04-26 14:27:24.000000000 +0800
> @@ -180,7 +180,7 @@ __do_page_cache_readahead(struct address
> if (page)
> continue;
>
> - page = page_cache_alloc_cold(mapping);
> + page = page_cache_alloc_cold_noretry(mapping);
> if (!page)
> break;
> page->index = page_offset;
I like this patch.
Reviewed-by: KOSAKI Motohiro <[email protected]>
Hi Wu,
On Tue, Apr 26, 2011 at 3:34 PM, Wu Fengguang <[email protected]> wrote:
> On Tue, Apr 26, 2011 at 02:29:15PM +0800, Dave Young wrote:
>> On Tue, Apr 26, 2011 at 2:25 PM, Wu Fengguang <[email protected]> wrote:
>> > On Tue, Apr 26, 2011 at 02:07:17PM +0800, Dave Young wrote:
>> >> On Tue, Apr 26, 2011 at 2:05 PM, Dave Young <[email protected]> wrote:
>> >> > On Tue, Apr 26, 2011 at 1:55 PM, Wu Fengguang <[email protected]> wrote:
>> >> >> On Tue, Apr 26, 2011 at 01:49:25PM +0800, Dave Young wrote:
>> >> >>> Hi,
>> >> >>>
>> >> >>> When memory pressure is high, readahead could cause oom killing.
>> >> >>> IMHO we should stop readaheading under such circumstances。If it's true
>> >> >>> how to fix it?
>> >> >>
>> >> >> Good question. Before OOM there will be readahead thrashings, which
>> >> >> can be addressed by this patch:
>> >> >>
>> >> >> http://lkml.org/lkml/2010/2/2/229
>> >> >
>> >> > Hi, I'm not clear about the patch, could be regard as below cases?
>> >> > 1) readahead alloc fail due to low memory such as other large allocation
>> >>
>> >> For example vm balloon allocate lots of memory, then readahead could
>> >> fail immediately and then oom
>> >
>> > If true, that would be the problem of vm balloon. It's not good to
>> > consume lots of memory all of a sudden, which will likely impact lots
>> > of kernel subsystems.
>> >
>> > btw readahead page allocations are completely optional. They are OK to
>> > fail and in theory shall not trigger OOM on themselves. We may
>> > consider passing __GFP_NORETRY for readahead page allocations.
>>
>> Good idea, care to submit a patch?
>
> Here it is :)
>
> Thanks,
> Fengguang
> ---
> readahead: readahead page allocations is OK to fail
>
> Pass __GFP_NORETRY for readahead page allocations.
>
> readahead page allocations are completely optional. They are OK to
> fail and in particular shall not trigger OOM on themselves.
>
> Reported-by: Dave Young <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> include/linux/pagemap.h | 5 +++++
> mm/readahead.c | 2 +-
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> --- linux-next.orig/include/linux/pagemap.h 2011-04-26 14:27:46.000000000 +0800
> +++ linux-next/include/linux/pagemap.h 2011-04-26 14:29:31.000000000 +0800
> @@ -219,6 +219,11 @@ static inline struct page *page_cache_al
> return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
> }
>
> +static inline struct page *page_cache_alloc_cold_noretry(struct address_space *x)
> +{
> + return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD|__GFP_NORETRY);
It makes sense to me, but it could make noise about page allocation
failures, which is not desirable.
How about adding __GFP_NOWARN?
--
Kind regards,
Minchan Kim
Minchan,
> > +static inline struct page *page_cache_alloc_cold_noretry(struct address_space *x)
> > +{
> > + return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD|__GFP_NORETRY);
>
> It makes sense to me but it could make a noise about page allocation
> failure. I think it's not desirable.
> How about adding __GFP_NOWARAN?
Yeah it makes sense. Here is the new version.
Thanks,
Fengguang
---
Subject: readahead: readahead page allocations are OK to fail
Date: Tue Apr 26 14:29:40 CST 2011
Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
readahead page allocations are completely optional. They are OK to
fail and in particular shall not trigger OOM on themselves.
Reported-by: Dave Young <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/pagemap.h | 6 ++++++
mm/readahead.c | 2 +-
2 files changed, 7 insertions(+), 1 deletion(-)
--- linux-next.orig/include/linux/pagemap.h 2011-04-26 14:27:46.000000000 +0800
+++ linux-next/include/linux/pagemap.h 2011-04-26 17:17:13.000000000 +0800
@@ -219,6 +219,12 @@ static inline struct page *page_cache_al
return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
}
+static inline struct page *page_cache_alloc_readahead(struct address_space *x)
+{
+ return __page_cache_alloc(mapping_gfp_mask(x) |
+ __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
+}
+
typedef int filler_t(void *, struct page *);
extern struct page * find_get_page(struct address_space *mapping,
--- linux-next.orig/mm/readahead.c 2011-04-26 14:27:02.000000000 +0800
+++ linux-next/mm/readahead.c 2011-04-26 17:17:25.000000000 +0800
@@ -180,7 +180,7 @@ __do_page_cache_readahead(struct address
if (page)
continue;
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_readahead(mapping);
if (!page)
break;
page->index = page_offset;
On Tue, Apr 26, 2011 at 6:20 PM, Wu Fengguang <[email protected]> wrote:
> Minchan,
>
>> > +static inline struct page *page_cache_alloc_cold_noretry(struct address_space *x)
>> > +{
>> > + return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD|__GFP_NORETRY);
>>
>> It makes sense to me but it could make a noise about page allocation
>> failure. I think it's not desirable.
>> How about adding __GFP_NOWARAN?
>
> Yeah it makes sense. Here is the new version.
>
> Thanks,
> Fengguang
> ---
> Subject: readahead: readahead page allocations is OK to fail
> Date: Tue Apr 26 14:29:40 CST 2011
>
> Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
>
> readahead page allocations are completely optional. They are OK to
> fail and in particular shall not trigger OOM on themselves.
>
> Reported-by: Dave Young <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim
Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
readahead page allocations are completely optional. They are OK to
fail and in particular shall not trigger OOM on themselves.
Reported-by: Dave Young <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/pagemap.h | 6 ++++++
mm/readahead.c | 2 +-
2 files changed, 7 insertions(+), 1 deletion(-)
--- linux-next.orig/include/linux/pagemap.h 2011-04-26 14:27:46.000000000 +0800
+++ linux-next/include/linux/pagemap.h 2011-04-26 17:17:13.000000000 +0800
@@ -219,6 +219,12 @@ static inline struct page *page_cache_al
return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
}
+static inline struct page *page_cache_alloc_readahead(struct address_space *x)
+{
+ return __page_cache_alloc(mapping_gfp_mask(x) |
+ __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
+}
+
typedef int filler_t(void *, struct page *);
extern struct page * find_get_page(struct address_space *mapping,
--- linux-next.orig/mm/readahead.c 2011-04-26 14:27:02.000000000 +0800
+++ linux-next/mm/readahead.c 2011-04-26 17:17:25.000000000 +0800
@@ -180,7 +180,7 @@ __do_page_cache_readahead(struct address
if (page)
continue;
- page = page_cache_alloc_cold(mapping);
+ page = page_cache_alloc_readahead(mapping);
if (!page)
break;
page->index = page_offset;
On Tue, Apr 26, 2011 at 12:28 PM, Minchan Kim <[email protected]> wrote:
> On Tue, Apr 26, 2011 at 6:20 PM, Wu Fengguang <[email protected]> wrote:
>> Minchan,
>>
>>> > +static inline struct page *page_cache_alloc_cold_noretry(struct address_space *x)
>>> > +{
>>> > +	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD|__GFP_NORETRY);
>>>
>>> It makes sense to me but it could make a noise about page allocation
>>> failure. I think it's not desirable.
>>> How about adding __GFP_NOWARAN?
>>
>> Yeah it makes sense. Here is the new version.
>>
>> Thanks,
>> Fengguang
>> ---
>> Subject: readahead: readahead page allocations is OK to fail
>> Date: Tue Apr 26 14:29:40 CST 2011
>>
>> Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
>>
>> readahead page allocations are completely optional. They are OK to
>> fail and in particular shall not trigger OOM on themselves.
>>
>> Reported-by: Dave Young <[email protected]>
>> Reviewed-by: KOSAKI Motohiro <[email protected]>
>> Signed-off-by: Wu Fengguang <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
On Tue, 26 Apr 2011 17:20:29 +0800
Wu Fengguang <[email protected]> wrote:
> Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
>
> readahead page allocations are completely optional. They are OK to
> fail and in particular shall not trigger OOM on themselves.
I have distinct recollections of trying this many years ago, finding
that it caused problems then deciding not to do it. But I can't find
an email trail and I don't remember the reasons :(
If the system is so stressed for memory that the oom-killer might get
involved then the readahead pages may well be getting reclaimed before
the application actually gets to use them. But that's just an aside.
Ho hum. The patch *seems* good (as it did 5-10 years ago ;)) but there
may be surprising side-effects which could be exposed under heavy
testing. Testing which I'm sure hasn't been performed...
On Wed, Apr 27, 2011 at 03:47:43AM +0800, Andrew Morton wrote:
> On Tue, 26 Apr 2011 17:20:29 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
> >
> > readahead page allocations are completely optional. They are OK to
> > fail and in particular shall not trigger OOM on themselves.
>
> I have distinct recollections of trying this many years ago, finding
> that it caused problems then deciding not to do it. But I can't find
> an email trail and I don't remember the reasons :(
The most likely reason would be page allocation failures even when there
are plenty of _global_ reclaimable pages.
> If the system is so stressed for memory that the oom-killer might get
> involved then the readahead pages may well be getting reclaimed before
> the application actually gets to use them. But that's just an aside.
Yes, when direct reclaim is working as expected, readahead thrashing
should happen long before NORETRY page allocation failures and OOM.
With that assumption I think it's OK to do this patch. As for
readahead, sporadic allocation failures are acceptable. But there is a
problem, see below.
> Ho hum. The patch *seems* good (as it did 5-10 years ago ;)) but there
> may be surprising side-effects which could be exposed under heavy
> testing. Testing which I'm sure hasn't been performed...
The NORETRY direct reclaim does tend to fail a lot more on concurrent
reclaims, where one task's reclaimed pages can be stolen by others
before it is able to grab them.
__alloc_pages_direct_reclaim()
{
did_some_progress = try_to_free_pages();
// pages stolen by others
page = get_page_from_freelist();
}
Here are the tests that demonstrate this problem.
Out of 1000GB of reads and page allocations:
test-ra-thrash.sh: reads 1000 1GB files interleaved in a single task:
nr_alloc_fail 733
test-dd-sparse.sh: reads 1000 1GB files concurrently in 1000 tasks:
nr_alloc_fail 11799
Thanks,
Fengguang
---
--- linux-next.orig/include/linux/mmzone.h 2011-04-27 21:58:27.000000000 +0800
+++ linux-next/include/linux/mmzone.h 2011-04-27 21:58:39.000000000 +0800
@@ -106,6 +106,7 @@ enum zone_stat_item {
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
+ NR_ALLOC_FAIL,
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
--- linux-next.orig/mm/page_alloc.c 2011-04-27 21:58:27.000000000 +0800
+++ linux-next/mm/page_alloc.c 2011-04-27 21:58:39.000000000 +0800
@@ -2176,6 +2176,8 @@ rebalance:
}
nopage:
+ inc_zone_state(preferred_zone, NR_ALLOC_FAIL);
+ /* count_zone_vm_events(PGALLOCFAIL, preferred_zone, 1 << order); */
if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
unsigned int filter = SHOW_MEM_FILTER_NODES;
--- linux-next.orig/mm/vmstat.c 2011-04-27 21:58:27.000000000 +0800
+++ linux-next/mm/vmstat.c 2011-04-27 21:58:53.000000000 +0800
@@ -879,6 +879,7 @@ static const char * const vmstat_text[]
"nr_shmem",
"nr_dirtied",
"nr_written",
+ "nr_alloc_fail",
#ifdef CONFIG_NUMA
"numa_hit",
Concurrent page allocations are suffering from high failure rates.
On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
the page allocation failures are
nr_alloc_fail 733 # interleaved reads by 1 single task
nr_alloc_fail 11799 # concurrent reads by 1000 tasks
The concurrent read test script is:
for i in `seq 1000`
do
truncate -s 1G /fs/sparse-$i
dd if=/fs/sparse-$i of=/dev/null &
done
In order for get_page_from_freelist() to get a free page,
(1) try_to_free_pages() should use much higher .nr_to_reclaim than the
current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
possible low watermark state as well as fill the pcp with enough free
pages to overflow its high watermark.
(2) the get_page_from_freelist() _after_ direct reclaim should use lower
watermark than its normal invocations, so that it can reasonably
"reserve" some free pages for itself and prevent other concurrent
page allocators stealing all its reclaimed pages.
Some notes:
- commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
reclaim allocation fails") has the same target, however it is obviously
costly and less effective. It seems cleaner to just remove the retry and
drain code than to retain it.
- it's a bit hacky to reclaim more than the requested pages inside
do_try_to_free_page(), and it won't help cgroup for now
- it only aims to reduce failures when there are plenty of reclaimable
pages, so it stops the opportunistic reclaim once it has scanned twice the
requested number of pages
Test results:
- the failure rate is pretty sensitive to the page reclaim size,
from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
- the IPIs are reduced by over 100 times
base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
-------------------------------------------------------------------------------
nr_alloc_fail 10496
allocstall 1576602
slabs_scanned 21632
kswapd_steal 4393382
kswapd_inodesteal 124
kswapd_low_wmark_hit_quickly 885
kswapd_high_wmark_hit_quickly 2321
kswapd_skip_congestion_wait 0
pageoutrun 29426
CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
patched (WMARK_MIN)
-------------------
nr_alloc_fail 704
allocstall 105551
slabs_scanned 33280
kswapd_steal 4525537
kswapd_inodesteal 187
kswapd_low_wmark_hit_quickly 4980
kswapd_high_wmark_hit_quickly 2573
kswapd_skip_congestion_wait 0
pageoutrun 35429
CAL: 93 286 396 754 272 297 275 281 Function call interrupts
LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
patched (WMARK_HIGH)
--------------------
nr_alloc_fail 282
allocstall 53860
slabs_scanned 23936
kswapd_steal 4561178
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 2760
kswapd_high_wmark_hit_quickly 1748
kswapd_skip_congestion_wait 0
pageoutrun 32639
CAL: 93 463 410 540 298 282 272 306 Function call interrupts
LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
this patch (WMARK_HIGH, limited scan)
-------------------------------------
nr_alloc_fail 276
allocstall 54034
slabs_scanned 24320
kswapd_steal 4507482
kswapd_inodesteal 262
kswapd_low_wmark_hit_quickly 2638
kswapd_high_wmark_hit_quickly 1710
kswapd_skip_congestion_wait 0
pageoutrun 32182
CAL: 69 443 421 567 273 279 269 334 Function call interrupts
LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
CC: Mel Gorman <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/page_alloc.c | 17 +++--------------
mm/vmscan.c | 6 ++++++
2 files changed, 9 insertions(+), 14 deletions(-)
--- linux-next.orig/mm/vmscan.c 2011-04-28 21:16:16.000000000 +0800
+++ linux-next/mm/vmscan.c 2011-04-28 21:28:57.000000000 +0800
@@ -1978,6 +1978,8 @@ static void shrink_zones(int priority, s
continue;
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
+ sc->nr_to_reclaim = max(sc->nr_to_reclaim,
+ zone->watermark[WMARK_HIGH]);
}
shrink_zone(priority, zone, sc);
@@ -2034,6 +2036,7 @@ static unsigned long do_try_to_free_page
struct zoneref *z;
struct zone *zone;
unsigned long writeback_threshold;
+ unsigned long min_reclaim = sc->nr_to_reclaim;
get_mems_allowed();
delayacct_freepages_start();
@@ -2067,6 +2070,9 @@ static unsigned long do_try_to_free_page
}
}
total_scanned += sc->nr_scanned;
+ if (sc->nr_reclaimed >= min_reclaim &&
+ total_scanned > 2 * sc->nr_to_reclaim)
+ goto out;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
goto out;
--- linux-next.orig/mm/page_alloc.c 2011-04-28 21:16:16.000000000 +0800
+++ linux-next/mm/page_alloc.c 2011-04-28 21:16:18.000000000 +0800
@@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, unsigned long *did_some_progress)
{
- struct page *page = NULL;
+ struct page *page;
struct reclaim_state reclaim_state;
- bool drained = false;
cond_resched();
@@ -1912,22 +1911,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
if (unlikely(!(*did_some_progress)))
return NULL;
-retry:
+ alloc_flags |= ALLOC_HARDER;
+
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
migratetype);
-
- /*
- * If an allocation failed after direct reclaim, it could be because
- * pages are pinned on the per-cpu lists. Drain them and try again
- */
- if (!page && !drained) {
- drain_all_pages();
- drained = true;
- goto retry;
- }
-
return page;
}
Just for your reference.
It seems unnecessary given that the page allocation failure rate is no longer high.
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 2 ++
mm/vmstat.c | 1 +
3 files changed, 4 insertions(+)
--- linux-next.orig/include/linux/mmzone.h 2011-04-28 21:34:30.000000000 +0800
+++ linux-next/include/linux/mmzone.h 2011-04-28 21:34:35.000000000 +0800
@@ -106,6 +106,7 @@ enum zone_stat_item {
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
+ NR_ALLOC_FAIL,
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
--- linux-next.orig/mm/page_alloc.c 2011-04-28 21:34:34.000000000 +0800
+++ linux-next/mm/page_alloc.c 2011-04-28 21:34:35.000000000 +0800
@@ -2165,6 +2165,8 @@ rebalance:
}
nopage:
+ inc_zone_state(preferred_zone, NR_ALLOC_FAIL);
+ /* count_zone_vm_events(PGALLOCFAIL, preferred_zone, 1 << order); */
if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
unsigned int filter = SHOW_MEM_FILTER_NODES;
--- linux-next.orig/mm/vmstat.c 2011-04-28 21:34:30.000000000 +0800
+++ linux-next/mm/vmstat.c 2011-04-28 21:34:35.000000000 +0800
@@ -879,6 +879,7 @@ static const char * const vmstat_text[]
"nr_shmem",
"nr_dirtied",
"nr_written",
+ "nr_alloc_fail",
#ifdef CONFIG_NUMA
"numa_hit",
> nopage:
> + inc_zone_state(preferred_zone, NR_ALLOC_FAIL);
> + /* count_zone_vm_events(PGALLOCFAIL, preferred_zone, 1 << order); */
> if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
> unsigned int filter = SHOW_MEM_FILTER_NODES;
>
> --- linux-next.orig/mm/vmstat.c 2011-04-28 21:34:30.000000000 +0800
> +++ linux-next/mm/vmstat.c 2011-04-28 21:34:35.000000000 +0800
> @@ -879,6 +879,7 @@ static const char * const vmstat_text[]
> "nr_shmem",
> "nr_dirtied",
> "nr_written",
> + "nr_alloc_fail",
I'm using a very similar patch for debugging. However, this is useless for
admins because a typical Linux workload has plenty of GFP_ATOMIC allocation
failures, so a typical user has no way to tell whether the failure rate is
high or not.
> Test results:
>
> - the failure rate is pretty sensible to the page reclaim size,
> from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>
> - the IPIs are reduced by over 100 times
It's reduced by 500 times indeed.
CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
CAL: 93 463 410 540 298 282 272 306 Function call interrupts
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602
> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551
> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860
> this patch (WMARK_HIGH, limited scan)
> -------------------------------------
> nr_alloc_fail 276
> allocstall 54034
There is a bad side effect though: the much reduced "allocstall" means
each direct reclaim will take much more time to complete. A simple solution
is to terminate direct reclaim after 10ms. I noticed that a 100ms
time threshold can reduce the reclaim latency from 621ms to 358ms.
Further lowering the time threshold to 20ms does not help reduce the
real latencies though.
However, the (admittedly subjective) perception is that in such a heavy
1000-dd workload, the reduced reclaim latency hardly improves the overall
responsiveness.
base kernel
-----------
start time: 243
total time: 529
wfg@fat ~% getdelays -dip 3971
print delayacct stats ON
printing IO accounting
PID 3971
CPU count real total virtual total delay total
961 3176517096 3158468847 313952766099
IO count delay total delay average
2 181251847 60ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1205 38120615476 31ms
dd: read=16384, write=0, cancelled_write=0
wfg@fat ~% getdelays -dip 3383
print delayacct stats ON
printing IO accounting
PID 3383
CPU count real total virtual total delay total
1270 4206360536 4181445838 358641985177
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1606 39897314399 24ms
dd: read=0, write=0, cancelled_write=0
no time limit
-------------
wfg@fat ~% getdelays -dip `pidof dd`
print delayacct stats ON
printing IO accounting
PID 9609
CPU count real total virtual total delay total
865 2792575464 2779071029 235345541230
IO count delay total delay average
4 300247552 60ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
32 20504634169 621ms
dd: read=106496, write=0, cancelled_write=0
100ms limit
-----------
start time: 288
total time: 514
nr_alloc_fail 1269
allocstall 128915
wfg@fat ~% getdelays -dip `pidof dd`
print delayacct stats ON
printing IO accounting
PID 5077
CPU count real total virtual total delay total
937 2949551600 2935087806 207877301298
IO count delay total delay average
1 151891691 151ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
71 25475514278 358ms
dd: read=507904, write=0, cancelled_write=0
PID 5101
CPU count real total virtual total delay total
1201 3827418144 3805399187 221075772599
IO count delay total delay average
4 300331997 60ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
94 31996779648 336ms
dd: read=618496, write=0, cancelled_write=0
nr_alloc_fail 937
allocstall 128684
slabs_scanned 63616
kswapd_steal 4616011
kswapd_inodesteal 5
kswapd_low_wmark_hit_quickly 5394
kswapd_high_wmark_hit_quickly 2826
kswapd_skip_congestion_wait 0
pageoutrun 36679
20ms limit
----------
start time: 294
total time: 516
nr_alloc_fail 1662
allocstall 132101
CPU count real total virtual total delay total
839 2750581848 2734464704 198489159459
IO count delay total delay average
1 43566814 43ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
95 35234061367 370ms
dd: read=20480, write=0, cancelled_write=0
test script
-----------
tic=$(date +'%s')
for i in `seq 1000`
do
truncate -s 1G /fs/sparse-$i
dd if=/fs/sparse-$i of=/dev/null &>/dev/null &
done
tac=$(date +'%s')
echo start time: $((tac-tic))
wait
tac=$(date +'%s')
echo total time: $((tac-tic))
egrep '(nr_alloc_fail|allocstall)' /proc/vmstat
Thanks,
Fengguang
---
Subject: mm: limit direct reclaim delays
Date: Fri Apr 29 09:04:11 CST 2011
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/vmscan.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
--- linux-next.orig/mm/vmscan.c 2011-04-29 09:02:42.000000000 +0800
+++ linux-next/mm/vmscan.c 2011-04-29 09:04:10.000000000 +0800
@@ -2037,6 +2037,7 @@ static unsigned long do_try_to_free_page
struct zone *zone;
unsigned long writeback_threshold;
unsigned long min_reclaim = sc->nr_to_reclaim;
+ unsigned long start_time = jiffies;
get_mems_allowed();
delayacct_freepages_start();
@@ -2070,11 +2071,14 @@ static unsigned long do_try_to_free_page
}
}
total_scanned += sc->nr_scanned;
- if (sc->nr_reclaimed >= min_reclaim &&
- total_scanned > 2 * sc->nr_to_reclaim)
- goto out;
- if (sc->nr_reclaimed >= sc->nr_to_reclaim)
- goto out;
+ if (sc->nr_reclaimed >= min_reclaim) {
+ if (sc->nr_reclaimed >= sc->nr_to_reclaim)
+ goto out;
+ if (total_scanned > 2 * sc->nr_to_reclaim)
+ goto out;
+ if (jiffies - start_time > HZ / 100)
+ goto out;
+ }
/*
* Try to write back as many pages as we just scanned. This
Andrew,
I tested the more realistic 100 dd case. The results are
- nr_alloc_fail: 892 => 146
- reclaim delay: 4ms => 68ms
Thanks,
Fengguang
---
base kernel, 100 dd
-------------------
start time: 3
total time: 52
nr_alloc_fail 892
allocstall 131341
2nd run (no reboot):
start time: 3
total time: 53
nr_alloc_fail 1555
allocstall 265718
CPU count real total virtual total delay total
962 3125524848 3113269116 37972729582
IO count delay total delay average
3 25204838 8ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1032 5130797747 4ms
(IPIs accumulated in two runs)
CAL: 34898 35428 35182 35553 35320 35291 35298 35102 Function call interrupts
10ms limit, 100 dd
------------------
start time: 2
total time: 50
nr_alloc_fail 146
allocstall 10598
CPU count real total virtual total delay total
1038 3349490800 3331087137 40156395960
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
84 5795410854 68ms
dd: read=0, write=0, cancelled_write=0
Thanks,
Fengguang
On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
> > Test results:
> >
> > - the failure rate is pretty sensible to the page reclaim size,
> > from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> >
> > - the IPIs are reduced by over 100 times
>
> It's reduced by 500 times indeed.
>
> CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
> CAL: 93 463 410 540 298 282 272 306 Function call interrupts
>
> > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > -------------------------------------------------------------------------------
> > nr_alloc_fail 10496
> > allocstall 1576602
>
> > patched (WMARK_MIN)
> > -------------------
> > nr_alloc_fail 704
> > allocstall 105551
>
> > patched (WMARK_HIGH)
> > --------------------
> > nr_alloc_fail 282
> > allocstall 53860
>
> > this patch (WMARK_HIGH, limited scan)
> > -------------------------------------
> > nr_alloc_fail 276
> > allocstall 54034
>
> There is a bad side effect though: the much reduced "allocstall" means
> each direct reclaim will take much more time to complete. A simple solution
> is to terminate direct reclaim after 10ms. I noticed that an 100ms
> time threshold can reduce the reclaim latency from 621ms to 358ms.
> Further lowering the time threshold to 20ms does not help reducing the
> real latencies though.
Experiments are still going on...
I tried a more reasonable termination condition: stop direct reclaim
when the preferred zone is above its high watermark (see the chunk below).
This helps reduce the average reclaim latency to under 100ms in the
1000-dd case.
However, nr_alloc_fail is around 5000, which is not ideal. The interesting
thing is that even if the zone watermark is high, the task may still fail
to get a free page.
@@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
}
}
total_scanned += sc->nr_scanned;
- if (sc->nr_reclaimed >= sc->nr_to_reclaim)
- goto out;
+ if (sc->nr_reclaimed >= min_reclaim) {
+ if (sc->nr_reclaimed >= sc->nr_to_reclaim)
+ goto out;
+ if (total_scanned > 2 * sc->nr_to_reclaim)
+ goto out;
+ if (preferred_zone &&
+ zone_watermark_ok_safe(preferred_zone, sc->order,
+ high_wmark_pages(preferred_zone),
+ zone_idx(preferred_zone), 0))
+ goto out;
+ }
/*
* Try to write back as many pages as we just scanned. This
Thanks,
Fengguang
---
Subject: mm: cut down __GFP_NORETRY page allocation failures
Date: Thu Apr 28 13:46:39 CST 2011
Concurrent page allocations are suffering from high failure rates.
On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
the page allocation failures are
nr_alloc_fail 733 # interleaved reads by 1 single task
nr_alloc_fail 11799 # concurrent reads by 1000 tasks
The concurrent read test script is:
for i in `seq 1000`
do
truncate -s 1G /fs/sparse-$i
dd if=/fs/sparse-$i of=/dev/null &
done
In order for get_page_from_freelist() to get a free page,
(1) try_to_free_pages() should use much higher .nr_to_reclaim than the
current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
possible low watermark state as well as fill the pcp with enough free
pages to overflow its high watermark.
(2) the get_page_from_freelist() _after_ direct reclaim should use lower
watermark than its normal invocations, so that it can reasonably
"reserve" some free pages for itself and prevent other concurrent
page allocators stealing all its reclaimed pages.
Some notes:
- commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
reclaim allocation fails") has the same target, however it is obviously
costly and less effective. It seems cleaner to just remove the retry and
drain code than to retain it.
- it's a bit hacky to reclaim more than the requested pages inside
do_try_to_free_page(), and it won't help cgroup for now
- it only aims to reduce failures when there are plenty of reclaimable
pages, so it stops the opportunistic reclaim once it has scanned twice the
requested number of pages
Test results:
- the failure rate is pretty sensitive to the page reclaim size,
from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
- the IPIs are reduced by over 100 times
base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
-------------------------------------------------------------------------------
nr_alloc_fail 10496
allocstall 1576602
slabs_scanned 21632
kswapd_steal 4393382
kswapd_inodesteal 124
kswapd_low_wmark_hit_quickly 885
kswapd_high_wmark_hit_quickly 2321
kswapd_skip_congestion_wait 0
pageoutrun 29426
CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
patched (WMARK_MIN)
-------------------
nr_alloc_fail 704
allocstall 105551
slabs_scanned 33280
kswapd_steal 4525537
kswapd_inodesteal 187
kswapd_low_wmark_hit_quickly 4980
kswapd_high_wmark_hit_quickly 2573
kswapd_skip_congestion_wait 0
pageoutrun 35429
CAL: 93 286 396 754 272 297 275 281 Function call interrupts
LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
patched (WMARK_HIGH)
--------------------
nr_alloc_fail 282
allocstall 53860
slabs_scanned 23936
kswapd_steal 4561178
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 2760
kswapd_high_wmark_hit_quickly 1748
kswapd_skip_congestion_wait 0
pageoutrun 32639
CAL: 93 463 410 540 298 282 272 306 Function call interrupts
LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
patched (WMARK_HIGH, limited scan)
----------------------------------
nr_alloc_fail 276
allocstall 54034
slabs_scanned 24320
kswapd_steal 4507482
kswapd_inodesteal 262
kswapd_low_wmark_hit_quickly 2638
kswapd_high_wmark_hit_quickly 1710
kswapd_skip_congestion_wait 0
pageoutrun 32182
CAL: 69 443 421 567 273 279 269 334 Function call interrupts
LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
----------------------------------------------------------------
start time: 3
total time: 50
nr_alloc_fail 162
allocstall 45523
CPU count real total virtual total delay total
921 3024540200 3009244668 37123129525
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
357 4891766796 13ms
dd: read=0, write=0, cancelled_write=0
patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
-----------------------------------------------------------------
start time: 272
total time: 509
nr_alloc_fail 3913
allocstall 541789
CPU count real total virtual total delay total
1044 3445476208 3437200482 229919915202
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
452 34691441605 76ms
dd: read=0, write=0, cancelled_write=0
patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
--------------------------------------------------------------------------------
start time: 278
total time: 513
nr_alloc_fail 4737
allocstall 436392
CPU count real total virtual total delay total
1024 3371487456 3359441487 225088210977
IO count delay total delay average
1 160631171 160ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
367 30809994722 83ms
dd: read=20480, write=0, cancelled_write=0
no cond_resched():
start time: 263
total time: 516
nr_alloc_fail 5144
allocstall 436787
CPU count real total virtual total delay total
1018 3305497488 3283831119 241982934044
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
328 31398481378 95ms
dd: read=0, write=0, cancelled_write=0
zone_watermark_ok_safe():
start time: 266
total time: 513
nr_alloc_fail 4526
allocstall 440246
CPU count real total virtual total delay total
1119 3640446568 3619184439 240945024724
IO count delay total delay average
3 303620082 101ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
372 27320731898 73ms
dd: read=77824, write=0, cancelled_write=0
start time: 275
total time: 517
nr_alloc_fail 4694
allocstall 431021
CPU count real total virtual total delay total
1073 3534462680 3512544928 234056498221
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
386 34751778363 89ms
dd: read=0, write=0, cancelled_write=0
CC: Mel Gorman <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/buffer.c | 4 ++--
include/linux/swap.h | 3 ++-
mm/page_alloc.c | 20 +++++---------------
mm/vmscan.c | 31 +++++++++++++++++++++++--------
4 files changed, 32 insertions(+), 26 deletions(-)
--- linux-next.orig/mm/vmscan.c 2011-04-29 10:42:14.000000000 +0800
+++ linux-next/mm/vmscan.c 2011-04-30 21:59:33.000000000 +0800
@@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
* returns: 0, if no pages reclaimed
* else, the number of pages reclaimed
*/
-static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
- struct scan_control *sc)
+static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist,
+ struct scan_control *sc)
{
int priority;
unsigned long total_scanned = 0;
@@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
struct zoneref *z;
struct zone *zone;
unsigned long writeback_threshold;
+ unsigned long min_reclaim = sc->nr_to_reclaim;
get_mems_allowed();
delayacct_freepages_start();
@@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
if (scanning_global_lru(sc))
count_vm_event(ALLOCSTALL);
+ if (preferred_zone)
+ sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
+
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
sc->nr_scanned = 0;
if (!priority)
@@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
}
}
total_scanned += sc->nr_scanned;
- if (sc->nr_reclaimed >= sc->nr_to_reclaim)
- goto out;
+ if (sc->nr_reclaimed >= min_reclaim) {
+ if (sc->nr_reclaimed >= sc->nr_to_reclaim)
+ goto out;
+ if (total_scanned > 2 * sc->nr_to_reclaim)
+ goto out;
+ if (preferred_zone &&
+ zone_watermark_ok_safe(preferred_zone, sc->order,
+ high_wmark_pages(preferred_zone),
+ zone_idx(preferred_zone), 0))
+ goto out;
+ }
/*
* Try to write back as many pages as we just scanned. This
@@ -2117,7 +2131,8 @@ out:
return 0;
}
-unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+unsigned long try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
unsigned long nr_reclaimed;
@@ -2137,7 +2152,7 @@ unsigned long try_to_free_pages(struct z
sc.may_writepage,
gfp_mask);
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(preferred_zone, zonelist, &sc);
trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
@@ -2207,7 +2222,7 @@ unsigned long try_to_free_mem_cgroup_pag
sc.may_writepage,
sc.gfp_mask);
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(NULL, zonelist, &sc);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -2796,7 +2811,7 @@ unsigned long shrink_all_memory(unsigned
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(NULL, zonelist, &sc);
p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
--- linux-next.orig/mm/page_alloc.c 2011-04-29 10:42:15.000000000 +0800
+++ linux-next/mm/page_alloc.c 2011-04-30 21:29:40.000000000 +0800
@@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, unsigned long *did_some_progress)
{
- struct page *page = NULL;
+ struct page *page;
struct reclaim_state reclaim_state;
- bool drained = false;
cond_resched();
@@ -1901,7 +1900,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
reclaim_state.reclaimed_slab = 0;
current->reclaim_state = &reclaim_state;
- *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
+ *did_some_progress = try_to_free_pages(preferred_zone, zonelist, order,
+ gfp_mask, nodemask);
current->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
@@ -1912,22 +1912,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
if (unlikely(!(*did_some_progress)))
return NULL;
-retry:
+ alloc_flags |= ALLOC_HARDER;
+
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
migratetype);
-
- /*
- * If an allocation failed after direct reclaim, it could be because
- * pages are pinned on the per-cpu lists. Drain them and try again
- */
- if (!page && !drained) {
- drain_all_pages();
- drained = true;
- goto retry;
- }
-
return page;
}
--- linux-next.orig/fs/buffer.c 2011-04-30 13:26:57.000000000 +0800
+++ linux-next/fs/buffer.c 2011-04-30 13:29:08.000000000 +0800
@@ -288,8 +288,8 @@ static void free_more_memory(void)
gfp_zone(GFP_NOFS), NULL,
&zone);
if (zone)
- try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
- GFP_NOFS, NULL);
+ try_to_free_pages(zone, node_zonelist(nid, GFP_NOFS),
+ 0, GFP_NOFS, NULL);
}
}
--- linux-next.orig/include/linux/swap.h 2011-04-30 13:30:36.000000000 +0800
+++ linux-next/include/linux/swap.h 2011-04-30 13:31:03.000000000 +0800
@@ -249,7 +249,8 @@ static inline void lru_cache_add_file(st
#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
/* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+extern unsigned long try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
Hi Wu,
On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
> On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
> > > Test results:
> > >
> > > - the failure rate is pretty sensible to the page reclaim size,
> > > from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> > >
> > > - the IPIs are reduced by over 100 times
> >
> > It's reduced by 500 times indeed.
> >
> > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
> > CAL: 93 463 410 540 298 282 272 306 Function call interrupts
> >
> > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > > -------------------------------------------------------------------------------
> > > nr_alloc_fail 10496
> > > allocstall 1576602
> >
> > > patched (WMARK_MIN)
> > > -------------------
> > > nr_alloc_fail 704
> > > allocstall 105551
> >
> > > patched (WMARK_HIGH)
> > > --------------------
> > > nr_alloc_fail 282
> > > allocstall 53860
> >
> > > this patch (WMARK_HIGH, limited scan)
> > > -------------------------------------
> > > nr_alloc_fail 276
> > > allocstall 54034
> >
> > There is a bad side effect though: the much reduced "allocstall" means
> > each direct reclaim will take much more time to complete. A simple solution
> > is to terminate direct reclaim after 10ms. I noticed that an 100ms
> > time threshold can reduce the reclaim latency from 621ms to 358ms.
> > Further lowering the time threshold to 20ms does not help reducing the
> > real latencies though.
>
> Experiments going on...
>
> I tried the more reasonable terminate condition: stop direct reclaim
> when the preferred zone is above high watermark (see the below chunk).
>
> This helps reduce the average reclaim latency to under 100ms in the
> 1000-dd case.
>
> However nr_alloc_fail is around 5000 and not ideal. The interesting
> thing is, even if zone watermark is high, the task still may fail to
> get a free page..
>
> @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
> }
> }
> total_scanned += sc->nr_scanned;
> - if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> - goto out;
> + if (sc->nr_reclaimed >= min_reclaim) {
> + if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> + goto out;
> + if (total_scanned > 2 * sc->nr_to_reclaim)
> + goto out;
> + if (preferred_zone &&
> + zone_watermark_ok_safe(preferred_zone, sc->order,
> + high_wmark_pages(preferred_zone),
> + zone_idx(preferred_zone), 0))
> + goto out;
> + }
>
> /*
> * Try to write back as many pages as we just scanned. This
>
> Thanks,
> Fengguang
> ---
> Subject: mm: cut down __GFP_NORETRY page allocation failures
> Date: Thu Apr 28 13:46:39 CST 2011
>
> Concurrent page allocations are suffering from high failure rates.
>
> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> the page allocation failures are
>
> nr_alloc_fail 733 # interleaved reads by 1 single task
> nr_alloc_fail 11799 # concurrent reads by 1000 tasks
>
> The concurrent read test script is:
>
> for i in `seq 1000`
> do
> truncate -s 1G /fs/sparse-$i
> dd if=/fs/sparse-$i of=/dev/null &
> done
>
> In order for get_page_from_freelist() to get free page,
>
> (1) try_to_free_pages() should use much higher .nr_to_reclaim than the
> current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
> possible low watermark state as well as fill the pcp with enough free
> pages to overflow its high watermark.
>
> (2) the get_page_from_freelist() _after_ direct reclaim should use lower
> watermark than its normal invocations, so that it can reasonably
> "reserve" some free pages for itself and prevent other concurrent
> page allocators stealing all its reclaimed pages.
Do you see my old patch? The patch want't incomplet but it's not bad for showing an idea.
http://marc.info/?l=linux-mm&m=129187231129887&w=4
The idea is to keep at least a page for the direct-reclaiming process.
Could it mitigate your problem or could you enhance the idea?
I think it's a very simple and fair solution.
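For anyone who doesn't want to follow the link, a minimal sketch of the core
of that idea inside __alloc_pages_direct_reclaim() (my condensation of the
rebased patch that gets attached later in this thread; argument lists as in
2.6.39-rc3):

	LIST_HEAD(freed_pages);

	/* reclaim, but keep the freed pages on a private list instead of
	 * handing them straight back to the buddy allocator */
	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask,
					       nodemask, &freed_pages);

	if (!list_empty(&freed_pages)) {
		/* release the kept pages and immediately try to grab one,
		 * before concurrent allocators can steal them */
		free_page_list(&freed_pages);
		page = get_page_from_freelist(gfp_mask, nodemask, order,
					      zonelist, high_zoneidx,
					      alloc_flags, preferred_zone,
					      migratetype);
	}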
>
> Some notes:
>
> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
> reclaim allocation fails") has the same target, however is obviously
> costly and less effective. It seems more clean to just remove the
> retry and drain code than to retain it.
Tend to agree.
My old patch can solve it, I think.
>
> - it's a bit hacky to reclaim more than requested pages inside
> do_try_to_free_page(), and it won't help cgroup for now
>
> - it only aims to reduce failures when there are plenty of reclaimable
> pages, so it stops the opportunistic reclaim when scanned 2 times pages
>
> Test results:
>
> - the failure rate is pretty sensible to the page reclaim size,
> from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>
> - the IPIs are reduced by over 100 times
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602
>
> slabs_scanned 21632
> kswapd_steal 4393382
> kswapd_inodesteal 124
> kswapd_low_wmark_hit_quickly 885
> kswapd_high_wmark_hit_quickly 2321
> kswapd_skip_congestion_wait 0
> pageoutrun 29426
>
> CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
>
> LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
> RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
> TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
>
> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551
>
> slabs_scanned 33280
> kswapd_steal 4525537
> kswapd_inodesteal 187
> kswapd_low_wmark_hit_quickly 4980
> kswapd_high_wmark_hit_quickly 2573
> kswapd_skip_congestion_wait 0
> pageoutrun 35429
>
> CAL: 93 286 396 754 272 297 275 281 Function call interrupts
>
> LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
> RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
> TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
>
> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860
>
> slabs_scanned 23936
> kswapd_steal 4561178
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2760
> kswapd_high_wmark_hit_quickly 1748
> kswapd_skip_congestion_wait 0
> pageoutrun 32639
>
> CAL: 93 463 410 540 298 282 272 306 Function call interrupts
>
> LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
> RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
> TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
>
> patched (WMARK_HIGH, limited scan)
> ----------------------------------
> nr_alloc_fail 276
> allocstall 54034
>
> slabs_scanned 24320
> kswapd_steal 4507482
> kswapd_inodesteal 262
> kswapd_low_wmark_hit_quickly 2638
> kswapd_high_wmark_hit_quickly 1710
> kswapd_skip_congestion_wait 0
> pageoutrun 32182
>
> CAL: 69 443 421 567 273 279 269 334 Function call interrupts
Looks amazing.
>
> LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
> RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
> TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
> ----------------------------------------------------------------
>
> start time: 3
> total time: 50
> nr_alloc_fail 162
> allocstall 45523
>
> CPU count real total virtual total delay total
> 921 3024540200 3009244668 37123129525
> IO count delay total delay average
> 0 0 0ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 357 4891766796 13ms
> dd: read=0, write=0, cancelled_write=0
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
> -----------------------------------------------------------------
>
> start time: 272
> total time: 509
> nr_alloc_fail 3913
> allocstall 541789
>
> CPU count real total virtual total delay total
> 1044 3445476208 3437200482 229919915202
> IO count delay total delay average
> 0 0 0ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 452 34691441605 76ms
> dd: read=0, write=0, cancelled_write=0
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
> --------------------------------------------------------------------------------
>
> start time: 278
> total time: 513
> nr_alloc_fail 4737
> allocstall 436392
>
>
> CPU count real total virtual total delay total
> 1024 3371487456 3359441487 225088210977
> IO count delay total delay average
> 1 160631171 160ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 367 30809994722 83ms
> dd: read=20480, write=0, cancelled_write=0
>
>
> no cond_resched():
What's this?
>
> start time: 263
> total time: 516
> nr_alloc_fail 5144
> allocstall 436787
>
> CPU count real total virtual total delay total
> 1018 3305497488 3283831119 241982934044
> IO count delay total delay average
> 0 0 0ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 328 31398481378 95ms
> dd: read=0, write=0, cancelled_write=0
>
> zone_watermark_ok_safe():
>
> start time: 266
> total time: 513
> nr_alloc_fail 4526
> allocstall 440246
>
> CPU count real total virtual total delay total
> 1119 3640446568 3619184439 240945024724
> IO count delay total delay average
> 3 303620082 101ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 372 27320731898 73ms
> dd: read=77824, write=0, cancelled_write=0
>
>
> start time: 275
What's the meaning of start time?
> total time: 517
Is the total time the elapsed time of your experiment?
> nr_alloc_fail 4694
> allocstall 431021
>
>
> CPU count real total virtual total delay total
> 1073 3534462680 3512544928 234056498221
What's the meaning of the CPU fields?
> IO count delay total delay average
> 0 0 0ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 386 34751778363 89ms
> dd: read=0, write=0, cancelled_write=0
>
Where is the vanilla data for comparing latency?
Personally, it's hard to parse your data.
> CC: Mel Gorman <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> fs/buffer.c | 4 ++--
> include/linux/swap.h | 3 ++-
> mm/page_alloc.c | 20 +++++---------------
> mm/vmscan.c | 31 +++++++++++++++++++++++--------
> 4 files changed, 32 insertions(+), 26 deletions(-)
> --- linux-next.orig/mm/vmscan.c 2011-04-29 10:42:14.000000000 +0800
> +++ linux-next/mm/vmscan.c 2011-04-30 21:59:33.000000000 +0800
> @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
> * returns: 0, if no pages reclaimed
> * else, the number of pages reclaimed
> */
> -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> - struct scan_control *sc)
> +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
> + struct zonelist *zonelist,
> + struct scan_control *sc)
> {
> int priority;
> unsigned long total_scanned = 0;
> @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
> struct zoneref *z;
> struct zone *zone;
> unsigned long writeback_threshold;
> + unsigned long min_reclaim = sc->nr_to_reclaim;
Hmm,
>
> get_mems_allowed();
> delayacct_freepages_start();
> @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
> if (scanning_global_lru(sc))
> count_vm_event(ALLOCSTALL);
>
> + if (preferred_zone)
> + sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
> +
Hmm, I don't like this idea.
The goal of the direct reclaim path is to reclaim pages asap, I believe.
Many things should be achieved by background kswapd.
If the admin changes min_free_kbytes, it can affect the latency of direct reclaim.
It doesn't make sense to me.
> for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> sc->nr_scanned = 0;
> if (!priority)
> @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
> }
> }
> total_scanned += sc->nr_scanned;
> - if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> - goto out;
> + if (sc->nr_reclaimed >= min_reclaim) {
> + if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> + goto out;
I can't understand the logic.
If nr_reclaimed is bigger than min_reclaim, it's always greater than
nr_to_reclaim. What's the meaning of min_reclaim?
> + if (total_scanned > 2 * sc->nr_to_reclaim)
> + goto out;
What if there are lots of dirty pages in the LRU?
What if there are lots of unevictable pages in the LRU?
What if there are lots of mapped pages in the LRU but may_unmap = 0?
I mean it's a rather risky early conclusion.
> + if (preferred_zone &&
> + zone_watermark_ok_safe(preferred_zone, sc->order,
> + high_wmark_pages(preferred_zone),
> + zone_idx(preferred_zone), 0))
> + goto out;
> + }
As I said, I think the direct reclaim path should be fast if possible and
it should not be a function of min_free_kbytes.
Of course, there are lots of obstacles to keeping the direct reclaim path's
latency consistent, but at least I don't want to add another source.
--
Kind regards,
Minchan Kim
On Mon, May 02, 2011 at 01:35:42AM +0900, Minchan Kim wrote:
> Do you see my old patch? The patch want't incomplet but it's not bad for showing an idea.
^^^^^^^^^^^^^^^^
typo : wasn't complete
--
Kind regards,
Minchan Kim
> On Mon, May 02, 2011 at 01:35:42AM +0900, Minchan Kim wrote:
>
> > Do you see my old patch? The patch want't incomplet but it's not bad for showing an idea.
> ^^^^^^^^^^^^^^^^
> typo : wasn't complete
I think your idea is reasonable. Wu's approach may increase throughput but
may hurt latency. So, do you have a plan to finish the work?
Hi Minchan,
On Mon, May 02, 2011 at 12:35:42AM +0800, Minchan Kim wrote:
> Hi Wu,
>
> On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
> > On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
> > > > Test results:
> > > >
> > > > - the failure rate is pretty sensible to the page reclaim size,
> > > > from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> > > >
> > > > - the IPIs are reduced by over 100 times
> > >
> > > It's reduced by 500 times indeed.
> > >
> > > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
> > > CAL: 93 463 410 540 298 282 272 306 Function call interrupts
> > >
> > > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > > > -------------------------------------------------------------------------------
> > > > nr_alloc_fail 10496
> > > > allocstall 1576602
> > >
> > > > patched (WMARK_MIN)
> > > > -------------------
> > > > nr_alloc_fail 704
> > > > allocstall 105551
> > >
> > > > patched (WMARK_HIGH)
> > > > --------------------
> > > > nr_alloc_fail 282
> > > > allocstall 53860
> > >
> > > > this patch (WMARK_HIGH, limited scan)
> > > > -------------------------------------
> > > > nr_alloc_fail 276
> > > > allocstall 54034
> > >
> > > There is a bad side effect though: the much reduced "allocstall" means
> > > each direct reclaim will take much more time to complete. A simple solution
> > > is to terminate direct reclaim after 10ms. I noticed that an 100ms
> > > time threshold can reduce the reclaim latency from 621ms to 358ms.
> > > Further lowering the time threshold to 20ms does not help reducing the
> > > real latencies though.
> >
> > Experiments going on...
> >
> > I tried the more reasonable terminate condition: stop direct reclaim
> > when the preferred zone is above high watermark (see the below chunk).
> >
> > This helps reduce the average reclaim latency to under 100ms in the
> > 1000-dd case.
> >
> > However nr_alloc_fail is around 5000 and not ideal. The interesting
> > thing is, even if zone watermark is high, the task still may fail to
> > get a free page..
> >
> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
> > }
> > }
> > total_scanned += sc->nr_scanned;
> > - if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > - goto out;
> > + if (sc->nr_reclaimed >= min_reclaim) {
> > + if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > + goto out;
> > + if (total_scanned > 2 * sc->nr_to_reclaim)
> > + goto out;
> > + if (preferred_zone &&
> > + zone_watermark_ok_safe(preferred_zone, sc->order,
> > + high_wmark_pages(preferred_zone),
> > + zone_idx(preferred_zone), 0))
> > + goto out;
> > + }
> >
> > /*
> > * Try to write back as many pages as we just scanned. This
> >
> > Thanks,
> > Fengguang
> > ---
> > Subject: mm: cut down __GFP_NORETRY page allocation failures
> > Date: Thu Apr 28 13:46:39 CST 2011
> >
> > Concurrent page allocations are suffering from high failure rates.
> >
> > On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> > the page allocation failures are
> >
> > nr_alloc_fail 733 # interleaved reads by 1 single task
> > nr_alloc_fail 11799 # concurrent reads by 1000 tasks
> >
> > The concurrent read test script is:
> >
> > for i in `seq 1000`
> > do
> > truncate -s 1G /fs/sparse-$i
> > dd if=/fs/sparse-$i of=/dev/null &
> > done
> >
> > In order for get_page_from_freelist() to get free page,
> >
> > (1) try_to_free_pages() should use much higher .nr_to_reclaim than the
> > current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
> > possible low watermark state as well as fill the pcp with enough free
> > pages to overflow its high watermark.
> >
> > (2) the get_page_from_freelist() _after_ direct reclaim should use lower
> > watermark than its normal invocations, so that it can reasonably
> > "reserve" some free pages for itself and prevent other concurrent
> > page allocators stealing all its reclaimed pages.
>
> Do you see my old patch? The patch want't incomplet but it's not bad for showing an idea.
> http://marc.info/?l=linux-mm&m=129187231129887&w=4
> The idea is to keep a page at leat for direct reclaimed process.
> Could it mitigate your problem or could you enhacne the idea?
> I think it's very simple and fair solution.
No it's not helping my problem, nr_alloc_fail and CAL are still high:
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 246
total time: 531
nr_alloc_fail 14097
allocstall 1578332
LOC: 542698 538947 536986 567118 552114 539605 541201 537623 Local timer interrupts
RES: 3368 1908 1474 1476 2809 1602 1500 1509 Rescheduling interrupts
CAL: 223844 224198 224268 224436 223952 224056 223700 223743 Function call interrupts
TLB: 381 27 22 19 96 404 111 67 TLB shootdowns
root@fat /home/wfg# getdelays -dip `pidof dd`
print delayacct stats ON
printing IO accounting
PID 5202
CPU count real total virtual total delay total
1132 3635447328 3627947550 276722091605
IO count delay total delay average
2 187809974 62ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1334 35304580824 26ms
dd: read=278528, write=0, cancelled_write=0
I guess your patch is mainly fixing the high order allocations while
my workload is mainly order 0 readahead page allocations. There are
1000 forks, however the "start time: 246" seems to indicate that the
order-1 reclaim latency is not improved.
I'll try modifying your patch and see how it works out. The obvious
change is to apply it to the order-0 case. Hope this won't create much
more isolated pages.
Attached is your patch rebased to 2.6.39-rc3, after resolving some
merge conflicts and fixing a trivial NULL pointer bug.
> >
> > Some notes:
> >
> > - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
> > reclaim allocation fails") has the same target, however is obviously
> > costly and less effective. It seems more clean to just remove the
> > retry and drain code than to retain it.
>
> Tend to agree.
> My old patch can solve it, I think.
Sadly nope. See above.
> >
> > - it's a bit hacky to reclaim more than requested pages inside
> > do_try_to_free_page(), and it won't help cgroup for now
> >
> > - it only aims to reduce failures when there are plenty of reclaimable
> > pages, so it stops the opportunistic reclaim when scanned 2 times pages
> >
> > Test results:
> >
> > - the failure rate is pretty sensible to the page reclaim size,
> > from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> >
> > - the IPIs are reduced by over 100 times
> >
> > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > -------------------------------------------------------------------------------
> > nr_alloc_fail 10496
> > allocstall 1576602
> >
> > slabs_scanned 21632
> > kswapd_steal 4393382
> > kswapd_inodesteal 124
> > kswapd_low_wmark_hit_quickly 885
> > kswapd_high_wmark_hit_quickly 2321
> > kswapd_skip_congestion_wait 0
> > pageoutrun 29426
> >
> > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
> >
> > LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
> > RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
> > TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
> >
> > patched (WMARK_MIN)
> > -------------------
> > nr_alloc_fail 704
> > allocstall 105551
> >
> > slabs_scanned 33280
> > kswapd_steal 4525537
> > kswapd_inodesteal 187
> > kswapd_low_wmark_hit_quickly 4980
> > kswapd_high_wmark_hit_quickly 2573
> > kswapd_skip_congestion_wait 0
> > pageoutrun 35429
> >
> > CAL: 93 286 396 754 272 297 275 281 Function call interrupts
> >
> > LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
> > RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
> > TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
> >
> > patched (WMARK_HIGH)
> > --------------------
> > nr_alloc_fail 282
> > allocstall 53860
> >
> > slabs_scanned 23936
> > kswapd_steal 4561178
> > kswapd_inodesteal 0
> > kswapd_low_wmark_hit_quickly 2760
> > kswapd_high_wmark_hit_quickly 1748
> > kswapd_skip_congestion_wait 0
> > pageoutrun 32639
> >
> > CAL: 93 463 410 540 298 282 272 306 Function call interrupts
> >
> > LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
> > RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
> > TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
> >
> > patched (WMARK_HIGH, limited scan)
> > ----------------------------------
> > nr_alloc_fail 276
> > allocstall 54034
> >
> > slabs_scanned 24320
> > kswapd_steal 4507482
> > kswapd_inodesteal 262
> > kswapd_low_wmark_hit_quickly 2638
> > kswapd_high_wmark_hit_quickly 1710
> > kswapd_skip_congestion_wait 0
> > pageoutrun 32182
> >
> > CAL: 69 443 421 567 273 279 269 334 Function call interrupts
>
> Looks amazing.
Yeah, I have strong feelings against drain_all_pages() in the direct
reclaim path. The intuition is, once drain_all_pages() is called, the
later direct reclaims will have less chance to refill the drained per-cpu
lists and are therefore forced into drain_all_pages() again and again.
drain_all_pages() is probably an overkill for preventing OOM.
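For context on where the CAL numbers come from: in this kernel,
drain_all_pages() is basically a broadcast IPI, so every drain costs one
function call interrupt per online CPU. Roughly (from memory of 2.6.39
mm/page_alloc.c, so treat it as illustrative rather than exact):

	void drain_all_pages(void)
	{
		/* run drain_local_pages() on every online CPU via IPI;
		 * with ~1.5 million allocstalls in the base kernel this
		 * is where the huge CAL counts above come from */
		on_each_cpu(drain_local_pages, NULL, 1);
	}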
Generally speaking, it's questionable to "squeeze the last page before
OOM".
A typical desktop enters thrashing storms before OOM; as Hugh pointed
out, this may well not be what end users want. I agree with him and
personally prefer some applications to be OOM killed rather than having the
whole system go unusable, thrashing like mad.
> > LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
> > RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
> > TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
> >
> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
> > ----------------------------------------------------------------
> >
> > start time: 3
> > total time: 50
> > nr_alloc_fail 162
> > allocstall 45523
> >
> > CPU count real total virtual total delay total
> > 921 3024540200 3009244668 37123129525
> > IO count delay total delay average
> > 0 0 0ms
> > SWAP count delay total delay average
> > 0 0 0ms
> > RECLAIM count delay total delay average
> > 357 4891766796 13ms
> > dd: read=0, write=0, cancelled_write=0
> >
> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
> > -----------------------------------------------------------------
> >
> > start time: 272
> > total time: 509
> > nr_alloc_fail 3913
> > allocstall 541789
> >
> > CPU count real total virtual total delay total
> > 1044 3445476208 3437200482 229919915202
> > IO count delay total delay average
> > 0 0 0ms
> > SWAP count delay total delay average
> > 0 0 0ms
> > RECLAIM count delay total delay average
> > 452 34691441605 76ms
> > dd: read=0, write=0, cancelled_write=0
> >
> > patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
> > --------------------------------------------------------------------------------
> >
> > start time: 278
> > total time: 513
> > nr_alloc_fail 4737
> > allocstall 436392
> >
> >
> > CPU count real total virtual total delay total
> > 1024 3371487456 3359441487 225088210977
> > IO count delay total delay average
> > 1 160631171 160ms
> > SWAP count delay total delay average
> > 0 0 0ms
> > RECLAIM count delay total delay average
> > 367 30809994722 83ms
> > dd: read=20480, write=0, cancelled_write=0
> >
> >
> > no cond_resched():
>
> What's this?
I tried a modified patch that also removes the cond_resched() call in
__alloc_pages_direct_reclaim(), between try_to_free_pages() and
get_page_from_freelist(). It doesn't seem to help noticeably.
It looks safe to remove that cond_resched() as we already have such
calls in shrink_page_list().
> >
> > start time: 263
> > total time: 516
> > nr_alloc_fail 5144
> > allocstall 436787
> >
> > CPU count real total virtual total delay total
> > 1018 3305497488 3283831119 241982934044
> > IO count delay total delay average
> > 0 0 0ms
> > SWAP count delay total delay average
> > 0 0 0ms
> > RECLAIM count delay total delay average
> > 328 31398481378 95ms
> > dd: read=0, write=0, cancelled_write=0
> >
> > zone_watermark_ok_safe():
> >
> > start time: 266
> > total time: 513
> > nr_alloc_fail 4526
> > allocstall 440246
> >
> > CPU count real total virtual total delay total
> > 1119 3640446568 3619184439 240945024724
> > IO count delay total delay average
> > 3 303620082 101ms
> > SWAP count delay total delay average
> > 0 0 0ms
> > RECLAIM count delay total delay average
> > 372 27320731898 73ms
> > dd: read=77824, write=0, cancelled_write=0
> >
> > start time: 275
>
> What's meaing of start time?
It's the time taken to start 1000 dd's.
> > total time: 517
>
> Total time is elapsed time on your experiment?
Yeah. They are generated with this script.
$ cat ~/bin/test-dd-sparse.sh
#!/bin/sh
mount /dev/sda7 /fs
tic=$(date +'%s')
for i in `seq 1000`
do
truncate -s 1G /fs/sparse-$i
dd if=/fs/sparse-$i of=/dev/null &>/dev/null &
done
tac=$(date +'%s')
echo start time: $((tac-tic))
wait
tac=$(date +'%s')
echo total time: $((tac-tic))
egrep '(nr_alloc_fail|allocstall)' /proc/vmstat
egrep '(CAL|RES|LOC|TLB)' /proc/interrupts
> > nr_alloc_fail 4694
> > allocstall 431021
> >
> >
> > CPU count real total virtual total delay total
> > 1073 3534462680 3512544928 234056498221
>
> What's meaning of CPU fields?
It's "waiting for a CPU (while being runnable)" as described in
Documentation/accounting/delay-accounting.txt.
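For reference, the getdelays rows map onto struct taskstats fields roughly
like this (going from memory of include/linux/taskstats.h and getdelays.c,
so double-check the exact names):

	/* getdelays row -> taskstats fields (delay totals in nanoseconds)
	 *   CPU     -> cpu_count, cpu_run_real_total, cpu_run_virtual_total,
	 *              cpu_delay_total
	 *   IO      -> blkio_count, blkio_delay_total
	 *   SWAP    -> swapin_count, swapin_delay_total
	 *   RECLAIM -> freepages_count, freepages_delay_total
	 */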
> > IO count delay total delay average
> > 0 0 0ms
> > SWAP count delay total delay average
> > 0 0 0ms
> > RECLAIM count delay total delay average
> > 386 34751778363 89ms
> > dd: read=0, write=0, cancelled_write=0
> >
>
> Where is vanilla data for comparing latency?
> Personally, It's hard to parse your data.
Sorry, it's somewhat too much data and too many kernel revisions. The base
kernel's average reclaim latency is 29ms:
base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
-------------------------------------------------------------------------------
CPU count real total virtual total delay total
1122 3676441096 3656793547 274182127286
IO count delay total delay average
3 291765493 97ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1350 39229752193 29ms
dd: read=45056, write=0, cancelled_write=0
start time: 245
total time: 526
nr_alloc_fail 14586
allocstall 1578343
LOC: 533981 529210 528283 532346 533392 531314 531705 528983 Local timer interrupts
RES: 3123 2177 1676 1580 2157 1974 1606 1696 Rescheduling interrupts
CAL: 218392 218631 219167 219217 218840 218985 218429 218440 Function call interrupts
TLB: 175 13 21 18 62 309 119 42 TLB shootdowns
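A quick standalone check that the "delay average" column is just
delay total / count, using the base kernel's RECLAIM line above (the tool's
own rounding may differ by a millisecond on some rows):

	#include <stdio.h>

	int main(void)
	{
		/* RECLAIM count and delay total (ns) from the base kernel run */
		unsigned long long count = 1350;
		unsigned long long delay_total_ns = 39229752193ULL;

		printf("average: %llums\n", delay_total_ns / count / 1000000ULL);
		return 0;	/* prints "average: 29ms" */
	}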
>
> > CC: Mel Gorman <[email protected]>
> > Signed-off-by: Wu Fengguang <[email protected]>
> > ---
> > fs/buffer.c | 4 ++--
> > include/linux/swap.h | 3 ++-
> > mm/page_alloc.c | 20 +++++---------------
> > mm/vmscan.c | 31 +++++++++++++++++++++++--------
> > 4 files changed, 32 insertions(+), 26 deletions(-)
> > --- linux-next.orig/mm/vmscan.c 2011-04-29 10:42:14.000000000 +0800
> > +++ linux-next/mm/vmscan.c 2011-04-30 21:59:33.000000000 +0800
> > @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
> > * returns: 0, if no pages reclaimed
> > * else, the number of pages reclaimed
> > */
> > -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> > - struct scan_control *sc)
> > +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
> > + struct zonelist *zonelist,
> > + struct scan_control *sc)
> > {
> > int priority;
> > unsigned long total_scanned = 0;
> > @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
> > struct zoneref *z;
> > struct zone *zone;
> > unsigned long writeback_threshold;
> > + unsigned long min_reclaim = sc->nr_to_reclaim;
>
> Hmm,
>
> >
> > get_mems_allowed();
> > delayacct_freepages_start();
> > @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
> > if (scanning_global_lru(sc))
> > count_vm_event(ALLOCSTALL);
> >
> > + if (preferred_zone)
> > + sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
> > +
>
> Hmm, I don't like this idea.
> The goal of direct reclaim path is to reclaim pages asap, I beleive.
> Many thing should be achieve of background kswapd.
> If admin changes min_free_kbytes, it can affect latency of direct reclaim.
> It doesn't make sense to me.
Yeah, it does increase delays: in the 1000 dd case, roughly from 30ms
to 90ms. This is a major drawback.
> > for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > sc->nr_scanned = 0;
> > if (!priority)
> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
> > }
> > }
> > total_scanned += sc->nr_scanned;
> > - if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > - goto out;
> > + if (sc->nr_reclaimed >= min_reclaim) {
> > + if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > + goto out;
>
> I can't understand the logic.
> if nr_reclaimed is bigger than min_reclaim, it's always greater than
> nr_to_reclaim. What's meaning of min_reclaim?
In direct reclaim, min_reclaim keeps the legacy SWAP_CLUSTER_MAX value,
while sc->nr_to_reclaim is increased by the zone's high watermark and
acts as a kind of "max to reclaim".
>
> > + if (total_scanned > 2 * sc->nr_to_reclaim)
> > + goto out;
>
> If there are lots of dirty pages in LRU?
> If there are lots of unevictable pages in LRU?
> If there are lots of mapped page in LRU but may_unmap = 0 cases?
> I means it's rather risky early conclusion.
That test is meant to avoid scanning too much on __GFP_NORETRY direct
reclaims. My assumption for __GFP_NORETRY is that it should fail fast when
the LRU pages seem hard to reclaim. And the problem in the 1000 dd
case is that the LRU pages are all easy to reclaim, yet __GFP_NORETRY still
fails from time to time, with lots of IPIs that may hurt large
machines a lot.
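To put the two answers in one place, my reading of the patch's exit logic
(a commented restatement of the chunk above, not new code; min_reclaim stays
at SWAP_CLUSTER_MAX while nr_to_reclaim has been bumped by the preferred
zone's high watermark):

	if (sc->nr_reclaimed >= min_reclaim) {
		/* reached the opportunistic "max to reclaim" target */
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			goto out;
		/* pages look hard to reclaim: fail fast for __GFP_NORETRY */
		if (total_scanned > 2 * sc->nr_to_reclaim)
			goto out;
		/* the preferred zone has been refilled (by us or kswapd) */
		if (preferred_zone &&
		    zone_watermark_ok_safe(preferred_zone, sc->order,
					   high_wmark_pages(preferred_zone),
					   zone_idx(preferred_zone), 0))
			goto out;
	}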
>
> > + if (preferred_zone &&
> > + zone_watermark_ok_safe(preferred_zone, sc->order,
> > + high_wmark_pages(preferred_zone),
> > + zone_idx(preferred_zone), 0))
> > + goto out;
> > + }
>
> As I said, I think direct reclaim path sould be fast if possbile and
> it should not a function of min_free_kbytes.
Right.
> Of course, there are lots of tackle for keep direct reclaim path's consistent
> latency but at least, I don't want to add another source.
OK.
Thanks,
Fengguang
> > Do you see my old patch? The patch want't incomplet but it's not bad for showing an idea.
> > http://marc.info/?l=linux-mm&m=129187231129887&w=4
> > The idea is to keep a page at leat for direct reclaimed process.
> > Could it mitigate your problem or could you enhacne the idea?
> > I think it's very simple and fair solution.
>
> No it's not helping my problem, nr_alloc_fail and CAL are still high:
>
> root@fat /home/wfg# ./test-dd-sparse.sh
> start time: 246
> total time: 531
> nr_alloc_fail 14097
> allocstall 1578332
> LOC: 542698 538947 536986 567118 552114 539605 541201 537623 Local timer interrupts
> RES: 3368 1908 1474 1476 2809 1602 1500 1509 Rescheduling interrupts
> CAL: 223844 224198 224268 224436 223952 224056 223700 223743 Function call interrupts
> TLB: 381 27 22 19 96 404 111 67 TLB shootdowns
>
> root@fat /home/wfg# getdelays -dip `pidof dd`
> print delayacct stats ON
> printing IO accounting
> PID 5202
>
>
> CPU count real total virtual total delay total
> 1132 3635447328 3627947550 276722091605
> IO count delay total delay average
> 2 187809974 62ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 1334 35304580824 26ms
> dd: read=278528, write=0, cancelled_write=0
>
> I guess your patch is mainly fixing the high order allocations while
> my workload is mainly order 0 readahead page allocations. There are
> 1000 forks, however the "start time: 246" seems to indicate that the
> order-1 reclaim latency is not improved.
>
> I'll try modifying your patch and see how it works out. The obvious
> change is to apply it to the order-0 case. Hope this won't create much
> more isolated pages.
I tried the below modified patch, removing the high order test and the
drain_all_pages() call. The results are not ideal either:
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 246
total time: 526
nr_alloc_fail 15582
allocstall 1583727
LOC: 532518 528880 528184 533426 532765 530526 531177 528757 Local timer interrupts
RES: 2350 1929 1538 1430 3359 1547 1422 1502 Rescheduling interrupts
CAL: 200017 200384 200336 199763 200369 199776 199504 199407 Function call interrupts
TLB: 285 19 24 10 121 306 113 69 TLB shootdowns
CPU count real total virtual total delay total
1154 3767427264 3742671454 273770720370
IO count delay total delay average
1 279795961 279ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1385 27228068276 19ms
dd: read=12288, write=0, cancelled_write=0
Thanks,
Fengguang
---
Subject: Keep freed pages in direct reclaim
Date: Thu, 9 Dec 2010 14:01:32 +0900
From: Minchan Kim <[email protected]>
A direct-reclaiming process often sleeps and races with other processes.
Although the direct-reclaiming process requires high order pages (order > 0)
and reclaims pages successfully, other processes which only require order-0
pages can steal the high order pages from the direct-reclaiming process.
After all, the direct-reclaiming process has to try again, and the above
scenario can repeat. This can have the following bad effects:
1. the direct-reclaiming process sees a large latency
2. working set pages are evicted due to lumpy reclaim
3. kswapd keeps being woken up
This patch solves it.
Fengguang:
fix
[ 1514.892933] BUG: unable to handle kernel
[ 1514.892958] ---[ end trace be7cb17861e1d25b ]---
[ 1514.893589] NULL pointer dereference at (null)
[ 1514.893968] IP: [<ffffffff81101b2e>] shrink_page_list+0x3dc/0x501
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/buffer.c | 2 +-
include/linux/swap.h | 4 +++-
mm/page_alloc.c | 25 +++++++++++++++++++++----
mm/vmscan.c | 23 +++++++++++++++++++----
4 files changed, 44 insertions(+), 10 deletions(-)
--- linux-next.orig/fs/buffer.c 2011-05-02 17:18:01.000000000 +0800
+++ linux-next/fs/buffer.c 2011-05-02 18:30:17.000000000 +0800
@@ -289,7 +289,7 @@ static void free_more_memory(void)
&zone);
if (zone)
try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
- GFP_NOFS, NULL);
+ GFP_NOFS, NULL, NULL);
}
}
--- linux-next.orig/include/linux/swap.h 2011-05-02 17:18:01.000000000 +0800
+++ linux-next/include/linux/swap.h 2011-05-02 18:30:17.000000000 +0800
@@ -249,8 +249,10 @@ static inline void lru_cache_add_file(st
#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
/* linux/mm/vmscan.c */
+extern noinline_for_stack void free_page_list(struct list_head *free_pages);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
- gfp_t gfp_mask, nodemask_t *mask);
+ gfp_t gfp_mask, nodemask_t *mask,
+ struct list_head *freed_pages);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
unsigned int swappiness);
--- linux-next.orig/mm/page_alloc.c 2011-05-02 17:18:01.000000000 +0800
+++ linux-next/mm/page_alloc.c 2011-05-02 18:31:30.000000000 +0800
@@ -1891,6 +1891,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
struct page *page = NULL;
struct reclaim_state reclaim_state;
bool drained = false;
+ LIST_HEAD(freed_pages);
cond_resched();
@@ -1901,16 +1902,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
reclaim_state.reclaimed_slab = 0;
current->reclaim_state = &reclaim_state;
- *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
-
+ /*
+ * If request is high order, keep the pages which are reclaimed
+ * in own list for preventing the lose by other processes.
+ */
+ *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask,
+ nodemask, &freed_pages);
current->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
current->flags &= ~PF_MEMALLOC;
+ if (!list_empty(&freed_pages)) {
+ free_page_list(&freed_pages);
+ /* drain_all_pages(); */
+ /* drained = true; */
+ page = get_page_from_freelist(gfp_mask, nodemask, order,
+ zonelist, high_zoneidx,
+ alloc_flags, preferred_zone,
+ migratetype);
+ if (page)
+ goto out;
+ }
cond_resched();
if (unlikely(!(*did_some_progress)))
- return NULL;
+ goto out;
retry:
page = get_page_from_freelist(gfp_mask, nodemask, order,
@@ -1927,7 +1943,8 @@ retry:
drained = true;
goto retry;
}
-
+out:
+ VM_BUG_ON(!list_empty(&freed_pages));
return page;
}
--- linux-next.orig/mm/vmscan.c 2011-05-02 17:18:01.000000000 +0800
+++ linux-next/mm/vmscan.c 2011-05-02 18:30:17.000000000 +0800
@@ -112,6 +112,9 @@ struct scan_control {
* are scanned.
*/
nodemask_t *nodemask;
+
+ /* keep freed pages */
+ struct list_head *freed_pages;
};
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -681,7 +684,7 @@ static enum page_references page_check_r
return PAGEREF_RECLAIM;
}
-static noinline_for_stack void free_page_list(struct list_head *free_pages)
+noinline_for_stack void free_page_list(struct list_head *free_pages)
{
struct pagevec freed_pvec;
struct page *page, *tmp;
@@ -712,6 +715,10 @@ static unsigned long shrink_page_list(st
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
+ struct list_head *free_list = &free_pages;
+
+ if (sc->freed_pages)
+ free_list = sc->freed_pages;
cond_resched();
@@ -904,7 +911,7 @@ free_it:
* Is there need to periodically free_page_list? It would
* appear not as the counts should be low
*/
- list_add(&page->lru, &free_pages);
+ list_add(&page->lru, free_list);
continue;
cull_mlocked:
@@ -940,7 +947,13 @@ keep_lumpy:
if (nr_dirty == nr_congested && nr_dirty != 0)
zone_set_flag(zone, ZONE_CONGESTED);
- free_page_list(&free_pages);
+ /*
+ * If reclaim is direct path and high order, caller should
+ * free reclaimed pages. It is for preventing reclaimed pages
+ * lose by other processes.
+ */
+ if (!sc->freed_pages)
+ free_page_list(&free_pages);
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);
@@ -2118,7 +2131,8 @@ out:
}
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
- gfp_t gfp_mask, nodemask_t *nodemask)
+ gfp_t gfp_mask, nodemask_t *nodemask,
+ struct list_head *freed_pages)
{
unsigned long nr_reclaimed;
struct scan_control sc = {
@@ -2131,6 +2145,7 @@ unsigned long try_to_free_pages(struct z
.order = order,
.mem_cgroup = NULL,
.nodemask = nodemask,
+ .freed_pages = freed_pages,
};
trace_mm_vmscan_direct_reclaim_begin(order,
> > + if (preferred_zone &&
> > + zone_watermark_ok_safe(preferred_zone, sc->order,
> > + high_wmark_pages(preferred_zone),
> > + zone_idx(preferred_zone), 0))
> > + goto out;
> > + }
>
> As I said, I think direct reclaim path sould be fast if possbile and
> it should not a function of min_free_kbytes.
It can be made not a function of min_free_kbytes by simply changing
high_wmark_pages() to low_wmark_pages() in the above chunk, since
direct reclaim is triggered when ALLOC_WMARK_LOW cannot be satisfied,
ie. it just dropped below low_wmark_pages().
But still, it costs 62ms reclaim latency (base kernel is 29ms).
Thanks,
Fengguang
---
Subject: mm: cut down __GFP_NORETRY page allocation failures
Date: Thu Apr 28 13:46:39 CST 2011
Concurrent page allocations are suffering from high failure rates.
On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
the page allocation failures are
nr_alloc_fail 733 # interleaved reads by 1 single task
nr_alloc_fail 11799 # concurrent reads by 1000 tasks
The concurrent read test script is:
for i in `seq 1000`
do
truncate -s 1G /fs/sparse-$i
dd if=/fs/sparse-$i of=/dev/null &
done
In order for get_page_from_freelist() to get free page,
(1) try_to_free_pages() should use much higher .nr_to_reclaim than the
current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
possible low watermark state
(2) the get_page_from_freelist() _after_ direct reclaim should use lower
watermark than its normal invocations, so that it can reasonably
"reserve" some free pages for itself and prevent other concurrent
page allocators stealing all its reclaimed pages.
Some notes:
- commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
reclaim allocation fails") has the same target, however is obviously
costly and less effective. It seems more clean to just remove the
retry and drain code than to retain it.
- it's a bit hacky to reclaim more than requested pages inside
do_try_to_free_page(), and it won't help cgroup for now
- it only aims to reduce failures when there are plenty of reclaimable
pages, so it stops the opportunistic reclaim when scanned 2 times pages
Test results (1000 dd case):
- the failure rate is pretty sensible to the page reclaim size,
from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 5004 (WMARK_HIGH, stop on low
watermark ok) to 10496 (SWAP_CLUSTER_MAX)
- the IPIs are reduced by over 500 times
- the reclaim delay is doubled, from 29ms to 62ms
Base kernel is vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocations.
base kernel, 1000 dd
--------------------
start time: 245
total time: 526
nr_alloc_fail 14586
allocstall 1578343
LOC: 533981 529210 528283 532346 533392 531314 531705 528983 Local timer interrupts
RES: 3123 2177 1676 1580 2157 1974 1606 1696 Rescheduling interrupts
CAL: 218392 218631 219167 219217 218840 218985 218429 218440 Function call interrupts
TLB: 175 13 21 18 62 309 119 42 TLB shootdowns
CPU count real total virtual total delay total
1122 3676441096 3656793547 274182127286
IO count delay total delay average
3 291765493 97ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1350 39229752193 29ms
dd: read=45056, write=0, cancelled_write=0
patched, 1000 dd
----------------
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 260
total time: 519
nr_alloc_fail 5004
allocstall 551429
LOC: 524861 521832 520945 524632 524666 523334 523797 521562 Local timer interrupts
RES: 1323 1976 2505 1610 1544 1848 3310 1644 Rescheduling interrupts
CAL: 67 335 353 614 289 287 293 325 Function call interrupts
TLB: 288 29 26 34 103 321 123 70 TLB shootdowns
CPU count real total virtual total delay total
1177 3797422704 3775174301 253228435955
IO count delay total delay average
1 198528820 198ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
508 31660219699 62ms
base kernel, 100 dd
-------------------
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 3
total time: 53
nr_alloc_fail 849
allocstall 131330
LOC: 59843 56506 55838 65283 61774 57929 58880 56246 Local timer interrupts
RES: 376 308 372 239 374 307 491 239 Rescheduling interrupts
CAL: 17737 18083 17948 18192 17929 17845 17893 17906 Function call interrupts
TLB: 307 26 25 21 80 324 137 79 TLB shootdowns
CPU count real total virtual total delay total
974 3197513904 3180727460 38504429363
IO count delay total delay average
1 18156696 18ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1036 3439387298 3ms
dd: read=12288, write=0, cancelled_write=0
patched, 100 dd
---------------
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 3
total time: 52
nr_alloc_fail 307
allocstall 48178
LOC: 56486 53514 52792 55879 56317 55383 55311 53168 Local timer interrupts
RES: 604 345 257 250 775 371 272 252 Rescheduling interrupts
CAL: 75 373 369 543 272 278 295 296 Function call interrupts
TLB: 259 24 19 24 82 306 139 53 TLB shootdowns
CPU count real total virtual total delay total
974 3177516944 3161771347 38508053977
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
393 5389030889 13ms
dd: read=0, write=0, cancelled_write=0
CC: Mel Gorman <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/buffer.c | 4 ++--
include/linux/swap.h | 3 ++-
mm/page_alloc.c | 22 +++++-----------------
mm/vmscan.c | 38 ++++++++++++++++++++++++++++++--------
4 files changed, 39 insertions(+), 28 deletions(-)
--- linux-next.orig/mm/vmscan.c 2011-05-02 19:15:21.000000000 +0800
+++ linux-next/mm/vmscan.c 2011-05-02 19:47:05.000000000 +0800
@@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
* returns: 0, if no pages reclaimed
* else, the number of pages reclaimed
*/
-static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
- struct scan_control *sc)
+static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist,
+ struct scan_control *sc)
{
int priority;
unsigned long total_scanned = 0;
@@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
struct zoneref *z;
struct zone *zone;
unsigned long writeback_threshold;
+ unsigned long min_reclaim = sc->nr_to_reclaim;
get_mems_allowed();
delayacct_freepages_start();
@@ -2041,6 +2043,16 @@ static unsigned long do_try_to_free_page
if (scanning_global_lru(sc))
count_vm_event(ALLOCSTALL);
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ gfp_zone(sc->gfp_mask), sc->nodemask) {
+ if (!populated_zone(zone))
+ continue;
+ preferred_zone = zone;
+ break;
+ }
+ if (preferred_zone)
+ sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
+
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
sc->nr_scanned = 0;
if (!priority)
@@ -2067,8 +2079,17 @@ static unsigned long do_try_to_free_page
}
}
total_scanned += sc->nr_scanned;
- if (sc->nr_reclaimed >= sc->nr_to_reclaim)
- goto out;
+ if (sc->nr_reclaimed >= min_reclaim) {
+ if (sc->nr_reclaimed >= sc->nr_to_reclaim)
+ goto out;
+ if (total_scanned > 2 * sc->nr_to_reclaim)
+ goto out;
+ if (preferred_zone &&
+ zone_watermark_ok(preferred_zone, sc->order,
+ low_wmark_pages(preferred_zone),
+ zone_idx(preferred_zone), 0))
+ goto out;
+ }
/*
* Try to write back as many pages as we just scanned. This
@@ -2117,7 +2138,8 @@ out:
return 0;
}
-unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+unsigned long try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
unsigned long nr_reclaimed;
@@ -2137,7 +2159,7 @@ unsigned long try_to_free_pages(struct z
sc.may_writepage,
gfp_mask);
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(preferred_zone, zonelist, &sc);
trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
@@ -2207,7 +2229,7 @@ unsigned long try_to_free_mem_cgroup_pag
sc.may_writepage,
sc.gfp_mask);
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(NULL, zonelist, &sc);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -2796,7 +2818,7 @@ unsigned long shrink_all_memory(unsigned
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(NULL, zonelist, &sc);
p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
--- linux-next.orig/mm/page_alloc.c 2011-05-02 19:15:21.000000000 +0800
+++ linux-next/mm/page_alloc.c 2011-05-02 19:39:51.000000000 +0800
@@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, unsigned long *did_some_progress)
{
- struct page *page = NULL;
+ struct page *page;
struct reclaim_state reclaim_state;
- bool drained = false;
cond_resched();
@@ -1901,33 +1900,22 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
reclaim_state.reclaimed_slab = 0;
current->reclaim_state = &reclaim_state;
- *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
+ *did_some_progress = try_to_free_pages(preferred_zone, zonelist, order,
+ gfp_mask, nodemask);
current->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
current->flags &= ~PF_MEMALLOC;
- cond_resched();
-
if (unlikely(!(*did_some_progress)))
return NULL;
-retry:
+ alloc_flags |= ALLOC_HARDER;
+
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
migratetype);
-
- /*
- * If an allocation failed after direct reclaim, it could be because
- * pages are pinned on the per-cpu lists. Drain them and try again
- */
- if (!page && !drained) {
- drain_all_pages();
- drained = true;
- goto retry;
- }
-
return page;
}
--- linux-next.orig/fs/buffer.c 2011-05-02 19:15:21.000000000 +0800
+++ linux-next/fs/buffer.c 2011-05-02 19:15:33.000000000 +0800
@@ -288,8 +288,8 @@ static void free_more_memory(void)
gfp_zone(GFP_NOFS), NULL,
&zone);
if (zone)
- try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
- GFP_NOFS, NULL);
+ try_to_free_pages(zone, node_zonelist(nid, GFP_NOFS),
+ 0, GFP_NOFS, NULL);
}
}
--- linux-next.orig/include/linux/swap.h 2011-05-02 19:15:21.000000000 +0800
+++ linux-next/include/linux/swap.h 2011-05-02 19:15:33.000000000 +0800
@@ -249,7 +249,8 @@ static inline void lru_cache_add_file(st
#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
/* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+extern unsigned long try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
On Mon, May 02, 2011 at 09:29:58PM +0800, Wu Fengguang wrote:
> > > + if (preferred_zone &&
> > > + zone_watermark_ok_safe(preferred_zone, sc->order,
> > > + high_wmark_pages(preferred_zone),
> > > + zone_idx(preferred_zone), 0))
> > > + goto out;
> > > + }
> >
> > As I said, I think direct reclaim path sould be fast if possbile and
> > it should not a function of min_free_kbytes.
>
> It can be made not a function of min_free_kbytes by simply changing
> high_wmark_pages() to low_wmark_pages() in the above chunk, since
> direct reclaim is triggered when ALLOC_WMARK_LOW cannot be satisfied,
> ie. it just dropped below low_wmark_pages().
>
> But still, it costs 62ms reclaim latency (base kernel is 29ms).
I got new findings: the CPU scheduling delays are much larger than the
reclaim delays. It does make the "direct reclaim until low watermark
OK" latency less of a problem :)
1000 dd test case:
RECLAIM delay CPU delay nr_alloc_fail CAL (last CPU)
base kernel 29ms 244ms 14586 218440
patched 62ms 215ms 5004 325
Thanks,
Fengguang
Hi Wu,
> On Mon, May 02, 2011 at 09:29:58PM +0800, Wu Fengguang wrote:
> > > > + if (preferred_zone &&
> > > > + zone_watermark_ok_safe(preferred_zone, sc->order,
> > > > + high_wmark_pages(preferred_zone),
> > > > + zone_idx(preferred_zone), 0))
> > > > + goto out;
> > > > + }
> > >
> > > As I said, I think direct reclaim path sould be fast if possbile and
> > > it should not a function of min_free_kbytes.
> >
> > It can be made not a function of min_free_kbytes by simply changing
> > high_wmark_pages() to low_wmark_pages() in the above chunk, since
> > direct reclaim is triggered when ALLOC_WMARK_LOW cannot be satisfied,
> > ie. it just dropped below low_wmark_pages().
> >
> > But still, it costs 62ms reclaim latency (base kernel is 29ms).
>
> I got new findings: the CPU schedule delays are much larger than
> reclaim delays. It does make the "direct reclaim until low watermark
> OK" latency less a problem :)
>
> 1000 dd test case:
> RECLAIM delay CPU delay nr_alloc_fail CAL (last CPU)
> base kernel 29ms 244ms 14586 218440
> patched 62ms 215ms 5004 325
Hmm, in your system the latency of direct reclaim may be less of a problem.
But generally speaking, in latency-sensitive systems in the enterprise area
there are two kinds of processes: one is latency sensitive (A), the other
is not latency sensitive (B). We usually set CPU affinity for both kinds of
processes to avoid scheduling issues in (A). In this situation, the CPU delay
tends to be lower than the above and less of a problem, but the reclaim delay
is more critical.
Regards,
Satoru
>
> Thanks,
> Fengguang
>
Hi Wu, sorry for the slow response.
I guess you know why I am slow. :)
On Mon, May 2, 2011 at 7:29 PM, Wu Fengguang <[email protected]> wrote:
> Hi Minchan,
>
> On Mon, May 02, 2011 at 12:35:42AM +0800, Minchan Kim wrote:
>> Hi Wu,
>>
>> On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
>> > On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
>> > > > Test results:
>> > > >
>> > > > - the failure rate is pretty sensible to the page reclaim size,
>> > > > from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>> > > >
>> > > > - the IPIs are reduced by over 100 times
>> > >
>> > > It's reduced by 500 times indeed.
>> > >
>> > > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
>> > > CAL: 93 463 410 540 298 282 272 306 Function call interrupts
>> > >
>> > > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> > > > -------------------------------------------------------------------------------
>> > > > nr_alloc_fail 10496
>> > > > allocstall 1576602
>> > >
>> > > > patched (WMARK_MIN)
>> > > > -------------------
>> > > > nr_alloc_fail 704
>> > > > allocstall 105551
>> > >
>> > > > patched (WMARK_HIGH)
>> > > > --------------------
>> > > > nr_alloc_fail 282
>> > > > allocstall 53860
>> > >
>> > > > this patch (WMARK_HIGH, limited scan)
>> > > > -------------------------------------
>> > > > nr_alloc_fail 276
>> > > > allocstall 54034
>> > >
>> > > There is a bad side effect though: the much reduced "allocstall" means
>> > > each direct reclaim will take much more time to complete. A simple solution
>> > > is to terminate direct reclaim after 10ms. I noticed that an 100ms
>> > > time threshold can reduce the reclaim latency from 621ms to 358ms.
>> > > Further lowering the time threshold to 20ms does not help reducing the
>> > > real latencies though.
>> >
>> > Experiments going on...
>> >
>> > I tried the more reasonable terminate condition: stop direct reclaim
>> > when the preferred zone is above high watermark (see the below chunk).
>> >
>> > This helps reduce the average reclaim latency to under 100ms in the
>> > 1000-dd case.
>> >
>> > However nr_alloc_fail is around 5000 and not ideal. The interesting
>> > thing is, even if zone watermark is high, the task still may fail to
>> > get a free page..
>> >
>> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>> > }
>> > }
>> > total_scanned += sc->nr_scanned;
>> > - if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > - goto out;
>> > + if (sc->nr_reclaimed >= min_reclaim) {
>> > + if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > + goto out;
>> > + if (total_scanned > 2 * sc->nr_to_reclaim)
>> > + goto out;
>> > + if (preferred_zone &&
>> > + zone_watermark_ok_safe(preferred_zone, sc->order,
>> > + high_wmark_pages(preferred_zone),
>> > + zone_idx(preferred_zone), 0))
>> > + goto out;
>> > + }
>> >
>> > /*
>> > * Try to write back as many pages as we just scanned. This
>> >
>> > Thanks,
>> > Fengguang
>> > ---
>> > Subject: mm: cut down __GFP_NORETRY page allocation failures
>> > Date: Thu Apr 28 13:46:39 CST 2011
>> >
>> > Concurrent page allocations are suffering from high failure rates.
>> >
>> > On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
>> > the page allocation failures are
>> >
>> > nr_alloc_fail 733 # interleaved reads by 1 single task
>> > nr_alloc_fail 11799 # concurrent reads by 1000 tasks
>> >
>> > The concurrent read test script is:
>> >
>> > for i in `seq 1000`
>> > do
>> > truncate -s 1G /fs/sparse-$i
>> > dd if=/fs/sparse-$i of=/dev/null &
>> > done
>> >
>> > In order for get_page_from_freelist() to get free page,
>> >
>> > (1) try_to_free_pages() should use much higher .nr_to_reclaim than the
>> > current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>> > possible low watermark state as well as fill the pcp with enough free
>> > pages to overflow its high watermark.
>> >
>> > (2) the get_page_from_freelist() _after_ direct reclaim should use lower
>> > watermark than its normal invocations, so that it can reasonably
>> > "reserve" some free pages for itself and prevent other concurrent
>> > page allocators stealing all its reclaimed pages.
>>
>> Do you see my old patch? The patch want't incomplet but it's not bad for showing an idea.
>> http://marc.info/?l=linux-mm&m=129187231129887&w=4
>> The idea is to keep a page at leat for direct reclaimed process.
>> Could it mitigate your problem or could you enhacne the idea?
>> I think it's very simple and fair solution.
>
> No it's not helping my problem, nr_alloc_fail and CAL are still high:
Unfortunately, my patch doesn't consider order-0 pages, as you mentioned below.
I read your mail which states it doesn't help even when extended to
order-0 pages and drain.
Actually, I tried to look into that, but on my poor system (Core 2 Duo, 2G
RAM) nr_alloc_fail never happens. :(
I will try it on another desktop but I am not sure I can reproduce it.
>
> root@fat /home/wfg# ./test-dd-sparse.sh
> start time: 246
> total time: 531
> nr_alloc_fail 14097
> allocstall 1578332
> LOC: 542698 538947 536986 567118 552114 539605 541201 537623 Local timer interrupts
> RES: 3368 1908 1474 1476 2809 1602 1500 1509 Rescheduling interrupts
> CAL: 223844 224198 224268 224436 223952 224056 223700 223743 Function call interrupts
> TLB: 381 27 22 19 96 404 111 67 TLB shootdowns
>
> root@fat /home/wfg# getdelays -dip `pidof dd`
> print delayacct stats ON
> printing IO accounting
> PID 5202
>
>
> CPU count real total virtual total delay total
> 1132 3635447328 3627947550 276722091605
> IO count delay total delay average
> 2 187809974 62ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 1334 35304580824 26ms
> dd: read=278528, write=0, cancelled_write=0
>
> I guess your patch is mainly fixing the high order allocations while
> my workload is mainly order 0 readahead page allocations. There are
> 1000 forks, however the "start time: 246" seems to indicate that the
> order-1 reclaim latency is not improved.
Maybe; 8K * 1000 isn't a big footprint, so I think reclaim doesn't happen.
>
> I'll try modifying your patch and see how it works out. The obvious
> change is to apply it to the order-0 case. Hope this won't create much
> more isolated pages.
>
> Attached is your patch rebased to 2.6.39-rc3, after resolving some
> merge conflicts and fixing a trivial NULL pointer bug.
Thanks!
I would like to look at the details on my system if I can reproduce it.
>
>> >
>> > Some notes:
>> >
>> > - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>> > reclaim allocation fails") has the same target, however is obviously
>> > costly and less effective. It seems more clean to just remove the
>> > retry and drain code than to retain it.
>>
>> Tend to agree.
>> My old patch can solve it, I think.
>
> Sadly nope. See above.
>
>> >
>> > - it's a bit hacky to reclaim more than requested pages inside
>> > do_try_to_free_page(), and it won't help cgroup for now
>> >
>> > - it only aims to reduce failures when there are plenty of reclaimable
>> > pages, so it stops the opportunistic reclaim when scanned 2 times pages
>> >
>> > Test results:
>> >
>> > - the failure rate is pretty sensible to the page reclaim size,
>> > from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>> >
>> > - the IPIs are reduced by over 100 times
>> >
>> > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> > -------------------------------------------------------------------------------
>> > nr_alloc_fail 10496
>> > allocstall 1576602
>> >
>> > slabs_scanned 21632
>> > kswapd_steal 4393382
>> > kswapd_inodesteal 124
>> > kswapd_low_wmark_hit_quickly 885
>> > kswapd_high_wmark_hit_quickly 2321
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 29426
>> >
>> > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
>> >
>> > LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
>> > RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
>> > TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
>> >
>> > patched (WMARK_MIN)
>> > -------------------
>> > nr_alloc_fail 704
>> > allocstall 105551
>> >
>> > slabs_scanned 33280
>> > kswapd_steal 4525537
>> > kswapd_inodesteal 187
>> > kswapd_low_wmark_hit_quickly 4980
>> > kswapd_high_wmark_hit_quickly 2573
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 35429
>> >
>> > CAL: 93 286 396 754 272 297 275 281 Function call interrupts
>> >
>> > LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
>> > RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
>> > TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
>> >
>> > patched (WMARK_HIGH)
>> > --------------------
>> > nr_alloc_fail 282
>> > allocstall 53860
>> >
>> > slabs_scanned 23936
>> > kswapd_steal 4561178
>> > kswapd_inodesteal 0
>> > kswapd_low_wmark_hit_quickly 2760
>> > kswapd_high_wmark_hit_quickly 1748
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 32639
>> >
>> > CAL: 93 463 410 540 298 282 272 306 Function call interrupts
>> >
>> > LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
>> > RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
>> > TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
>> >
>> > patched (WMARK_HIGH, limited scan)
>> > ----------------------------------
>> > nr_alloc_fail 276
>> > allocstall 54034
>> >
>> > slabs_scanned 24320
>> > kswapd_steal 4507482
>> > kswapd_inodesteal 262
>> > kswapd_low_wmark_hit_quickly 2638
>> > kswapd_high_wmark_hit_quickly 1710
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 32182
>> >
>> > CAL: 69 443 421 567 273 279 269 334 Function call interrupts
>>
>> Looks amazing.
>
> Yeah, I have strong feelings against drain_all_pages() in the direct
> reclaim path. The intuition is, once drain_all_pages() is called, the
> later on direct reclaims will have less chance to fill the drained
> buffers and therefore forced into drain_all_pages() again and again.
>
> drain_all_pages() is probably an overkill for preventing OOM.
> Generally speaking, it's questionable to "squeeze the last page before
> OOM".
>
> A typical desktop enters thrashing storms before OOM, as Hugh pointed
> out, this may well not the end users wanted. I agree with him and
> personally prefer some applications to be OOM killed rather than the
> whole system goes unusable thrashing like mad.
Tend to agree. The same rule applies to embedded systems, too.
Couldn't we mitigate the draining by doing it only for high-order pages?
>
>> > LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
>> > RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
>> > TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
>> > ----------------------------------------------------------------
>> >
>> > start time: 3
>> > total time: 50
>> > nr_alloc_fail 162
>> > allocstall 45523
>> >
>> > CPU count real total virtual total delay total
>> > 921 3024540200 3009244668 37123129525
>> > IO count delay total delay average
>> > 0 0 0ms
>> > SWAP count delay total delay average
>> > 0 0 0ms
>> > RECLAIM count delay total delay average
>> > 357 4891766796 13ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
>> > -----------------------------------------------------------------
>> >
>> > start time: 272
>> > total time: 509
>> > nr_alloc_fail 3913
>> > allocstall 541789
>> >
>> > CPU count real total virtual total delay total
>> > 1044 3445476208 3437200482 229919915202
>> > IO count delay total delay average
>> > 0 0 0ms
>> > SWAP count delay total delay average
>> > 0 0 0ms
>> > RECLAIM count delay total delay average
>> > 452 34691441605 76ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
>> > --------------------------------------------------------------------------------
>> >
>> > start time: 278
>> > total time: 513
>> > nr_alloc_fail 4737
>> > allocstall 436392
>> >
>> >
>> > CPU count real total virtual total delay total
>> > 1024 3371487456 3359441487 225088210977
>> > IO count delay total delay average
>> > 1 160631171 160ms
>> > SWAP count delay total delay average
>> > 0 0 0ms
>> > RECLAIM count delay total delay average
>> > 367 30809994722 83ms
>> > dd: read=20480, write=0, cancelled_write=0
>> >
>> >
>> > no cond_resched():
>>
>> What's this?
>
> I tried a modified patch that also removes the cond_resched() call in
> __alloc_pages_direct_reclaim(), between try_to_free_pages() and
> get_page_from_freelist(). It seems not helping noticeably.
>
> It looks safe to remove that cond_resched() as we already have such
> calls in shrink_page_list().
I tried a similar thing, but Andrew had a concern about it.
https://lkml.org/lkml/2011/3/24/138
>
>> >
>> > start time: 263
>> > total time: 516
>> > nr_alloc_fail 5144
>> > allocstall 436787
>> >
>> > CPU count real total virtual total delay total
>> > 1018 3305497488 3283831119 241982934044
>> > IO count delay total delay average
>> > 0 0 0ms
>> > SWAP count delay total delay average
>> > 0 0 0ms
>> > RECLAIM count delay total delay average
>> > 328 31398481378 95ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > zone_watermark_ok_safe():
>> >
>> > start time: 266
>> > total time: 513
>> > nr_alloc_fail 4526
>> > allocstall 440246
>> >
>> > CPU count real total virtual total delay total
>> > 1119 3640446568 3619184439 240945024724
>> > IO count delay total delay average
>> > 3 303620082 101ms
>> > SWAP count delay total delay average
>> > 0 0 0ms
>> > RECLAIM count delay total delay average
>> > 372 27320731898 73ms
>> > dd: read=77824, write=0, cancelled_write=0
>> >
>
>> > start time: 275
>>
>> What's meaing of start time?
>
> It's the time taken to start 1000 dd's.
>
>> > total time: 517
>>
>> Total time is elapsed time on your experiment?
>
> Yeah. They are generated with this script.
>
> $ cat ~/bin/test-dd-sparse.sh
>
> #!/bin/sh
>
> mount /dev/sda7 /fs
>
> tic=$(date +'%s')
>
> for i in `seq 1000`
> do
> truncate -s 1G /fs/sparse-$i
> dd if=/fs/sparse-$i of=/dev/null &>/dev/null &
> done
>
> tac=$(date +'%s')
> echo start time: $((tac-tic))
>
> wait
>
> tac=$(date +'%s')
> echo total time: $((tac-tic))
>
> egrep '(nr_alloc_fail|allocstall)' /proc/vmstat
> egrep '(CAL|RES|LOC|TLB)' /proc/interrupts
>
>> > nr_alloc_fail 4694
>> > allocstall 431021
>> >
>> >
>> > CPU count real total virtual total delay total
>> > 1073 3534462680 3512544928 234056498221
>>
>> What's meaning of CPU fields?
>
> It's "waiting for a CPU (while being runnable)" as described in
> Documentation/accounting/delay-accounting.txt.
Thanks
>
>> > IO count delay total delay average
>> > 0 0 0ms
>> > SWAP count delay total delay average
>> > 0 0 0ms
>> > RECLAIM count delay total delay average
>> > 386 34751778363 89ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>>
>> Where is vanilla data for comparing latency?
>> Personally, It's hard to parse your data.
>
> Sorry it's somehow too much data and kernel revisions.. The base kernel's
> average latency is 29ms:
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
>
> CPU count real total virtual total delay total
> 1122 3676441096 3656793547 274182127286
> IO count delay total delay average
> 3 291765493 97ms
> SWAP count delay total delay average
> 0 0 0ms
> RECLAIM count delay total delay average
> 1350 39229752193 29ms
> dd: read=45056, write=0, cancelled_write=0
>
> start time: 245
> total time: 526
> nr_alloc_fail 14586
> allocstall 1578343
> LOC: 533981 529210 528283 532346 533392 531314 531705 528983 Local timer interrupts
> RES: 3123 2177 1676 1580 2157 1974 1606 1696 Rescheduling interrupts
> CAL: 218392 218631 219167 219217 218840 218985 218429 218440 Function call interrupts
> TLB: 175 13 21 18 62 309 119 42 TLB shootdowns
>
>>
>> > CC: Mel Gorman <[email protected]>
>> > Signed-off-by: Wu Fengguang <[email protected]>
>> > ---
>> > fs/buffer.c | 4 ++--
>> > include/linux/swap.h | 3 ++-
>> > mm/page_alloc.c | 20 +++++---------------
>> > mm/vmscan.c | 31 +++++++++++++++++++++++--------
>> > 4 files changed, 32 insertions(+), 26 deletions(-)
>> > --- linux-next.orig/mm/vmscan.c 2011-04-29 10:42:14.000000000 +0800
>> > +++ linux-next/mm/vmscan.c 2011-04-30 21:59:33.000000000 +0800
>> > @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
>> > * returns: 0, if no pages reclaimed
>> > * else, the number of pages reclaimed
>> > */
>> > -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>> > - struct scan_control *sc)
>> > +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
>> > + struct zonelist *zonelist,
>> > + struct scan_control *sc)
>> > {
>> > int priority;
>> > unsigned long total_scanned = 0;
>> > @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
>> > struct zoneref *z;
>> > struct zone *zone;
>> > unsigned long writeback_threshold;
>> > + unsigned long min_reclaim = sc->nr_to_reclaim;
>>
>> Hmm,
>>
>> >
>> > get_mems_allowed();
>> > delayacct_freepages_start();
>> > @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
>> > if (scanning_global_lru(sc))
>> > count_vm_event(ALLOCSTALL);
>> >
>> > + if (preferred_zone)
>> > + sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
>> > +
>>
>> Hmm, I don't like this idea.
>> The goal of direct reclaim path is to reclaim pages asap, I beleive.
>> Many thing should be achieve of background kswapd.
>> If admin changes min_free_kbytes, it can affect latency of direct reclaim.
>> It doesn't make sense to me.
>
> Yeah, it does increase delays.. in the 1000 dd case, roughly from 30ms
> to 90ms. This is a major drawback.
Yes.
>
>> > for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>> > sc->nr_scanned = 0;
>> > if (!priority)
>> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>> > }
>> > }
>> > total_scanned += sc->nr_scanned;
>> > - if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > - goto out;
>> > + if (sc->nr_reclaimed >= min_reclaim) {
>> > + if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > + goto out;
>>
>> I can't understand the logic.
>> if nr_reclaimed is bigger than min_reclaim, it's always greater than
>> nr_to_reclaim. What's meaning of min_reclaim?
>
> In direct reclaim, min_reclaim will be the legacy SWAP_CLUSTER_MAX and
> sc->nr_to_reclaim will be increased to the zone's high watermark and
> is kind of "max to reclaim".
>
>>
>> > + if (total_scanned > 2 * sc->nr_to_reclaim)
>> > + goto out;
>>
>> If there are lots of dirty pages in LRU?
>> If there are lots of unevictable pages in LRU?
>> If there are lots of mapped page in LRU but may_unmap = 0 cases?
>> I means it's rather risky early conclusion.
>
> That test means to avoid scanning too much on __GFP_NORETRY direct
> reclaims. My assumption for __GFP_NORETRY is, it should fail fast when
> the LRU pages seem hard to reclaim. And the problem in the 1000 dd
> case is, it's all easy to reclaim LRU pages but __GFP_NORETRY still
> fails from time to time, with lots of IPIs that may hurt large
> machines a lot.
I don't have enough time or an environment to test it,
so I can't be sure of it, but my concern is latency.
If you solve the latency problem with CPU scaling in mind, I won't oppose it. :)
--
Kind regards,
Minchan Kim
On Mon, May 2, 2011 at 7:14 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Mon, May 02, 2011 at 01:35:42AM +0900, Minchan Kim wrote:
>>
>> > Do you see my old patch? The patch want't incomplet but it's not bad for showing an idea.
>> ^^^^^^^^^^^^^^^^
>> typo : wasn't complete
>
> I think your idea is eligible. Wu's approach may increase throughput but
Yes. It doesn't change many subtle things and makes things fairer, but
Wu's concern is order-0 pages with __GFP_NORETRY. According to his
experiment, my patch doesn't help much with that.
The problem is that I don't have any infrastructure for reproducing
his experiment. :(
> may decrease latency. So, do you have a plan to finish the work?
I want to, but that would be after finishing the inorder-putback series. :)
Maybe you have an environment (8-core system). If you want to, go ahead. :)
Thanks, KOSAKI.
--
Kind regards,
Minchan Kim
2011/5/3 Minchan Kim <[email protected]>:
> On Mon, May 2, 2011 at 7:14 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> On Mon, May 02, 2011 at 01:35:42AM +0900, Minchan Kim wrote:
>>>
>>> > Do you see my old patch? The patch want't incomplet but it's not bad for showing an idea.
>>> ^^^^^^^^^^^^^^^^
>>> typo : wasn't complete
>>
>> I think your idea is eligible. Wu's approach may increase throughput but
>
> Yes. it doesn't change many subtle things and make much fair but the
> Wu's concern is order-0 pages with __GFP_NORETRY. By his experiment,
> my patch doesn't help much his concern.
> The problem I have is I don't have any infrastructure for reproducing
> his experiment. :(
>
>> may decrease latency. So, do you have a plan to finish the work?
>
> I want it but the day would be after finishing inorder-putback series. :)
> Maybe you have a environment(8core system). If you want it, go ahead. :)
Hahaha, no. I lost that ia64 box to a physical crash. ;-)
I haven't reproduced his issue yet either.
Hi Satoru,
On Tue, May 03, 2011 at 08:27:43AM +0800, Satoru Moriya wrote:
> Hi Wu,
>
> > On Mon, May 02, 2011 at 09:29:58PM +0800, Wu Fengguang wrote:
> > > > > + if (preferred_zone &&
> > > > > + zone_watermark_ok_safe(preferred_zone, sc->order,
> > > > > + high_wmark_pages(preferred_zone),
> > > > > + zone_idx(preferred_zone), 0))
> > > > > + goto out;
> > > > > + }
> > > >
> > > > As I said, I think direct reclaim path sould be fast if possbile and
> > > > it should not a function of min_free_kbytes.
> > >
> > > It can be made not a function of min_free_kbytes by simply changing
> > > high_wmark_pages() to low_wmark_pages() in the above chunk, since
> > > direct reclaim is triggered when ALLOC_WMARK_LOW cannot be satisfied,
> > > ie. it just dropped below low_wmark_pages().
> > >
> > > But still, it costs 62ms reclaim latency (base kernel is 29ms).
> >
> > I got new findings: the CPU schedule delays are much larger than
> > reclaim delays. It does make the "direct reclaim until low watermark
> > OK" latency less a problem :)
> >
> > 1000 dd test case:
> > RECLAIM delay CPU delay nr_alloc_fail CAL (last CPU)
> > base kernel 29ms 244ms 14586 218440
> > patched 62ms 215ms 5004 325
>
> Hmm, in your system, the latency of direct reclaim may be a less problem.
>
> But, generally speaking, in a latency sensitive system in enterprise area
> there are two kind of processes. One is latency sensitive -(A) the other
> is not-latency sensitive -(B). And usually we set cpu affinity for both processes
> to avoid scheduling issue in (A). In this situation, CPU delay tends to be lower
> than the above and a less problem but reclaim delay is more critical.
Good point, thanks!
I also tried increasing min_free_kbytes as indicated by Minchan and
found 1-second long reclaim delays... Even with explicit time limits added,
it's still over 100ms with very high nr_alloc_fail.
I'm listing the code and results here as a record. But in general I'll
stop experiments in this direction. We need a more targeted approach
that can guarantee to satisfy the page allocation request after
small-sized direct reclaims.
Thanks,
Fengguang
---
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 250
total time: 518
nr_alloc_fail 18551
allocstall 234468
LOC: 525770 523124 520782 529151 526192 525004 524166 521527 Local timer interrupts
RES: 2174 1674 1301 1420 3329 1563 1314 1563 Rescheduling interrupts
CAL: 67 402 602 267 240 270 291 274 Function call interrupts
TLB: 197 25 23 17 80 321 121 58 TLB shootdowns
CPU count real total virtual total delay total delay average
1078 3408481832 3400786094 256971188317 238.378ms
IO count delay total delay average
5 414363739 82ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
187 28564728545 152ms
Subject: mm: cut down __GFP_NORETRY page allocation failures
Date: Thu Apr 28 13:46:39 CST 2011
Concurrent page allocations are suffering from high failure rates.
On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
the page allocation failures are
nr_alloc_fail 733 # interleaved reads by 1 single task
nr_alloc_fail 11799 # concurrent reads by 1000 tasks
The concurrent read test script is:
for i in `seq 1000`
do
truncate -s 1G /fs/sparse-$i
dd if=/fs/sparse-$i of=/dev/null &
done
In order for get_page_from_freelist() to get free page,
(1) try_to_free_pages() should use much higher .nr_to_reclaim than the
current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
possible low watermark state
(2) the get_page_from_freelist() _after_ direct reclaim should use lower
watermark than its normal invocations, so that it can reasonably
"reserve" some free pages for itself and prevent other concurrent
page allocators stealing all its reclaimed pages.
Some notes:
- commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
reclaim allocation fails") has the same target, however is obviously
costly and less effective. It seems more clean to just remove the
retry and drain code than to retain it.
- it's a bit hacky to reclaim more than requested pages inside
do_try_to_free_page(), and it won't help cgroup for now
- it only aims to reduce failures when there are plenty of reclaimable
pages, so it stops the opportunistic reclaim when scanned 2 times pages
Test results (1000 dd case):
- the failure rate is pretty sensible to the page reclaim size,
from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 5004 (WMARK_HIGH, stop on low
watermark ok) to 10496 (SWAP_CLUSTER_MAX)
- the IPIs are reduced by over 500 times
- the reclaim delay is doubled, from 29ms to 62ms
Base kernel is vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocations.
base kernel, 1000 dd
--------------------
start time: 245
total time: 526
nr_alloc_fail 14586
allocstall 1578343
LOC: 533981 529210 528283 532346 533392 531314 531705 528983 Local timer interrupts
RES: 3123 2177 1676 1580 2157 1974 1606 1696 Rescheduling interrupts
CAL: 218392 218631 219167 219217 218840 218985 218429 218440 Function call interrupts
TLB: 175 13 21 18 62 309 119 42 TLB shootdowns
CPU count real total virtual total delay total
1122 3676441096 3656793547 274182127286
IO count delay total delay average
3 291765493 97ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1350 39229752193 29ms
dd: read=45056, write=0, cancelled_write=0
patched, 1000 dd
----------------
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 260
total time: 519
nr_alloc_fail 5004
allocstall 551429
LOC: 524861 521832 520945 524632 524666 523334 523797 521562 Local timer interrupts
RES: 1323 1976 2505 1610 1544 1848 3310 1644 Rescheduling interrupts
CAL: 67 335 353 614 289 287 293 325 Function call interrupts
TLB: 288 29 26 34 103 321 123 70 TLB shootdowns
CPU count real total virtual total delay total
1177 3797422704 3775174301 253228435955
IO count delay total delay average
1 198528820 198ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
508 31660219699 62ms
base kernel, 100 dd
-------------------
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 3
total time: 53
nr_alloc_fail 849
allocstall 131330
LOC: 59843 56506 55838 65283 61774 57929 58880 56246 Local timer interrupts
RES: 376 308 372 239 374 307 491 239 Rescheduling interrupts
CAL: 17737 18083 17948 18192 17929 17845 17893 17906 Function call interrupts
TLB: 307 26 25 21 80 324 137 79 TLB shootdowns
CPU count real total virtual total delay total
974 3197513904 3180727460 38504429363
IO count delay total delay average
1 18156696 18ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1036 3439387298 3ms
dd: read=12288, write=0, cancelled_write=0
patched, 100 dd
---------------
root@fat /home/wfg# ./test-dd-sparse.sh
start time: 3
total time: 52
nr_alloc_fail 307
allocstall 48178
LOC: 56486 53514 52792 55879 56317 55383 55311 53168 Local timer interrupts
RES: 604 345 257 250 775 371 272 252 Rescheduling interrupts
CAL: 75 373 369 543 272 278 295 296 Function call interrupts
TLB: 259 24 19 24 82 306 139 53 TLB shootdowns
CPU count real total virtual total delay total
974 3177516944 3161771347 38508053977
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
393 5389030889 13ms
dd: read=0, write=0, cancelled_write=0
CC: Mel Gorman <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/buffer.c | 4 ++--
include/linux/swap.h | 3 ++-
mm/page_alloc.c | 22 +++++-----------------
mm/vmscan.c | 34 ++++++++++++++++++++++++++--------
4 files changed, 35 insertions(+), 28 deletions(-)
--- linux-next.orig/mm/vmscan.c 2011-05-02 22:14:14.000000000 +0800
+++ linux-next/mm/vmscan.c 2011-05-03 10:07:14.000000000 +0800
@@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
* returns: 0, if no pages reclaimed
* else, the number of pages reclaimed
*/
-static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
- struct scan_control *sc)
+static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist,
+ struct scan_control *sc)
{
int priority;
unsigned long total_scanned = 0;
@@ -2034,6 +2035,8 @@ static unsigned long do_try_to_free_page
struct zoneref *z;
struct zone *zone;
unsigned long writeback_threshold;
+ unsigned long min_reclaim = sc->nr_to_reclaim;
+ unsigned long start_time = jiffies;
get_mems_allowed();
delayacct_freepages_start();
@@ -2041,6 +2044,9 @@ static unsigned long do_try_to_free_page
if (scanning_global_lru(sc))
count_vm_event(ALLOCSTALL);
+ if (preferred_zone)
+ sc->nr_to_reclaim += preferred_zone->watermark[WMARK_LOW];
+
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
sc->nr_scanned = 0;
if (!priority)
@@ -2067,8 +2073,19 @@ static unsigned long do_try_to_free_page
}
}
total_scanned += sc->nr_scanned;
- if (sc->nr_reclaimed >= sc->nr_to_reclaim)
- goto out;
+ if (sc->nr_reclaimed >= min_reclaim) {
+ if (sc->nr_reclaimed >= sc->nr_to_reclaim)
+ goto out;
+ if (total_scanned > 2 * sc->nr_to_reclaim)
+ goto out;
+ if (preferred_zone &&
+ zone_watermark_ok(preferred_zone, sc->order,
+ low_wmark_pages(preferred_zone),
+ zone_idx(preferred_zone), 0))
+ goto out;
+ if (jiffies - start_time > HZ / 100)
+ goto out;
+ }
/*
* Try to write back as many pages as we just scanned. This
@@ -2117,7 +2134,8 @@ out:
return 0;
}
-unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+unsigned long try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
unsigned long nr_reclaimed;
@@ -2137,7 +2155,7 @@ unsigned long try_to_free_pages(struct z
sc.may_writepage,
gfp_mask);
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(preferred_zone, zonelist, &sc);
trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
@@ -2207,7 +2225,7 @@ unsigned long try_to_free_mem_cgroup_pag
sc.may_writepage,
sc.gfp_mask);
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(NULL, zonelist, &sc);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -2796,7 +2814,7 @@ unsigned long shrink_all_memory(unsigned
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+ nr_reclaimed = do_try_to_free_pages(NULL, zonelist, &sc);
p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
--- linux-next.orig/mm/page_alloc.c 2011-05-02 22:14:14.000000000 +0800
+++ linux-next/mm/page_alloc.c 2011-05-02 22:14:21.000000000 +0800
@@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, unsigned long *did_some_progress)
{
- struct page *page = NULL;
+ struct page *page;
struct reclaim_state reclaim_state;
- bool drained = false;
cond_resched();
@@ -1901,33 +1900,22 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
reclaim_state.reclaimed_slab = 0;
current->reclaim_state = &reclaim_state;
- *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
+ *did_some_progress = try_to_free_pages(preferred_zone, zonelist, order,
+ gfp_mask, nodemask);
current->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
current->flags &= ~PF_MEMALLOC;
- cond_resched();
-
if (unlikely(!(*did_some_progress)))
return NULL;
-retry:
+ alloc_flags |= ALLOC_HARDER;
+
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
migratetype);
-
- /*
- * If an allocation failed after direct reclaim, it could be because
- * pages are pinned on the per-cpu lists. Drain them and try again
- */
- if (!page && !drained) {
- drain_all_pages();
- drained = true;
- goto retry;
- }
-
return page;
}
--- linux-next.orig/fs/buffer.c 2011-05-02 22:14:14.000000000 +0800
+++ linux-next/fs/buffer.c 2011-05-02 22:14:21.000000000 +0800
@@ -288,8 +288,8 @@ static void free_more_memory(void)
gfp_zone(GFP_NOFS), NULL,
&zone);
if (zone)
- try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
- GFP_NOFS, NULL);
+ try_to_free_pages(zone, node_zonelist(nid, GFP_NOFS),
+ 0, GFP_NOFS, NULL);
}
}
--- linux-next.orig/include/linux/swap.h 2011-05-02 22:14:14.000000000 +0800
+++ linux-next/include/linux/swap.h 2011-05-02 22:14:21.000000000 +0800
@@ -249,7 +249,8 @@ static inline void lru_cache_add_file(st
#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
/* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+extern unsigned long try_to_free_pages(struct zone *preferred_zone,
+ struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
Hi Minchan,
On Tue, May 03, 2011 at 08:49:20AM +0800, Minchan Kim wrote:
> Hi Wu, Sorry for slow response.
> I guess you know why I am slow. :)
Yeah, never mind :)
> Unfortunately, my patch doesn't consider order-0 pages, as you mentioned below.
> I read your mail which states it doesn't help although it considers
> order-0 pages and drain.
> Actually, I tried to look into that but in my poor system(core2duo, 2G
> ram), nr_alloc_fail never happens. :(
I'm running a 4-core 8-thread CPU with 3G ram.
Did you run with this patch?
[PATCH] mm: readahead page allocations are OK to fail
https://lkml.org/lkml/2011/4/26/129
It's very good at generating lots of __GFP_NORETRY order-0 page
allocation requests.
> I will try it in other desktop but I am not sure I can reproduce it.
>
> >
> > root@fat /home/wfg# ./test-dd-sparse.sh
> > start time: 246
> > total time: 531
> > nr_alloc_fail 14097
> > allocstall 1578332
> > LOC: 542698 538947 536986 567118 552114 539605 541201 537623 Local timer interrupts
> > RES: 3368 1908 1474 1476 2809 1602 1500 1509 Rescheduling interrupts
> > CAL: 223844 224198 224268 224436 223952 224056 223700 223743 Function call interrupts
> > TLB: 381 27 22 19 96 404 111 67 TLB shootdowns
> >
> > root@fat /home/wfg# getdelays -dip `pidof dd`
> > print delayacct stats ON
> > printing IO accounting
> > PID 5202
> >
> >
> > CPU count real total virtual total delay total
> > 1132 3635447328 3627947550 276722091605
> > IO count delay total delay average
> > 2 187809974 62ms
> > SWAP count delay total delay average
> > 0 0 0ms
> > RECLAIM count delay total delay average
> > 1334 35304580824 26ms
> > dd: read=278528, write=0, cancelled_write=0
> >
> > I guess your patch is mainly fixing the high order allocations while
> > my workload is mainly order 0 readahead page allocations. There are
> > 1000 forks, however the "start time: 246" seems to indicate that the
> > order-1 reclaim latency is not improved.
>
> Maybe, 8K * 1000 isn't big footprint so I think reclaim doesn't happen.
It's mainly a guess. In an earlier experiment of simply increasing
nr_to_reclaim to high_wmark_pages() without any other constraints, it
did manage to reduce the start time to about 25 seconds.
> > I'll try modifying your patch and see how it works out. The obvious
> > change is to apply it to the order-0 case. Hope this won't create much
> > more isolated pages.
> >
> > Attached is your patch rebased to 2.6.39-rc3, after resolving some
> > merge conflicts and fixing a trivial NULL pointer bug.
>
> Thanks!
> I would like to see detail with it in my system if I can reproduce it.
OK.
> >> > no cond_resched():
> >>
> >> What's this?
> >
> > I tried a modified patch that also removes the cond_resched() call in
> > __alloc_pages_direct_reclaim(), between try_to_free_pages() and
> > get_page_from_freelist(). It seems not helping noticeably.
> >
> > It looks safe to remove that cond_resched() as we already have such
> > calls in shrink_page_list().
>
> I tried similar thing but Andrew have a concern about it.
> https://lkml.org/lkml/2011/3/24/138
Yeah, cond_resched() is at least not the root cause of our problems..
> >> > + if (total_scanned > 2 * sc->nr_to_reclaim)
> >> > + goto out;
> >>
> >> If there are lots of dirty pages in LRU?
> >> If there are lots of unevictable pages in LRU?
> >> If there are lots of mapped page in LRU but may_unmap = 0 cases?
> >> I means it's rather risky early conclusion.
> >
> > That test means to avoid scanning too much on __GFP_NORETRY direct
> > reclaims. My assumption for __GFP_NORETRY is, it should fail fast when
> > the LRU pages seem hard to reclaim. And the problem in the 1000 dd
> > case is, it's all easy to reclaim LRU pages but __GFP_NORETRY still
> > fails from time to time, with lots of IPIs that may hurt large
> > machines a lot.
>
> I don't have enough time and a environment to test it.
> So I can't make sure of it but my concern is a latency.
> If you solve latency problem considering CPU scaling, I won't oppose it. :)
OK, let's head for that direction :)
Thanks,
Fengguang
On Tue, May 3, 2011 at 12:51 PM, Wu Fengguang <[email protected]> wrote:
> Hi Minchan,
>
> On Tue, May 03, 2011 at 08:49:20AM +0800, Minchan Kim wrote:
>> Hi Wu, Sorry for slow response.
>> I guess you know why I am slow. :)
>
> Yeah, never mind :)
>
>> Unfortunately, my patch doesn't consider order-0 pages, as you mentioned below.
>> I read your mail which states it doesn't help although it considers
>> order-0 pages and drain.
>> Actually, I tried to look into that but in my poor system(core2duo, 2G
>> ram), nr_alloc_fail never happens. :(
>
> I'm running a 4-core 8-thread CPU with 3G ram.
>
> Did you run with this patch?
>
> [PATCH] mm: readahead page allocations are OK to fail
> https://lkml.org/lkml/2011/4/26/129
>
Of course.
I will try it on my better machine (i5, 4 cores, 3G RAM).
> It's very good at generating lots of __GFP_NORETRY order-0 page
> allocation requests.
>
>> I will try it in other desktop but I am not sure I can reproduce it.
>>
>> >
>> > root@fat /home/wfg# ./test-dd-sparse.sh
>> > start time: 246
>> > total time: 531
>> > nr_alloc_fail 14097
>> > allocstall 1578332
>> > LOC: 542698 538947 536986 567118 552114 539605 541201 537623 Local timer interrupts
>> > RES: 3368 1908 1474 1476 2809 1602 1500 1509 Rescheduling interrupts
>> > CAL: 223844 224198 224268 224436 223952 224056 223700 223743 Function call interrupts
>> > TLB: 381 27 22 19 96 404 111 67 TLB shootdowns
>> >
>> > root@fat /home/wfg# getdelays -dip `pidof dd`
>> > print delayacct stats ON
>> > printing IO accounting
>> > PID 5202
>> >
>> >
>> > CPU count real total virtual total delay total
>> > 1132 3635447328 3627947550 276722091605
>> > IO count delay total delay average
>> > 2 187809974 62ms
>> > SWAP count delay total delay average
>> > 0 0 0ms
>> > RECLAIM count delay total delay average
>> > 1334 35304580824 26ms
>> > dd: read=278528, write=0, cancelled_write=0
>> >
>> > I guess your patch is mainly fixing the high order allocations while
>> > my workload is mainly order 0 readahead page allocations. There are
>> > 1000 forks, however the "start time: 246" seems to indicate that the
>> > order-1 reclaim latency is not improved.
>>
>> Maybe, 8K * 1000 isn't big footprint so I think reclaim doesn't happen.
>
> It's mainly a guess. In an earlier experiment of simply increasing
> nr_to_reclaim to high_wmark_pages() without any other constraints, it
> does manage to reduce start time to about 25 seconds.
If so, I guess the workload might depend on order-0 pages, not stack allocations.
>
>> > I'll try modifying your patch and see how it works out. The obvious
>> > change is to apply it to the order-0 case. Hope this won't create much
>> > more isolated pages.
>> >
>> > Attached is your patch rebased to 2.6.39-rc3, after resolving some
>> > merge conflicts and fixing a trivial NULL pointer bug.
>>
>> Thanks!
>> I would like to see detail with it in my system if I can reproduce it.
>
> OK.
>
>> >> > no cond_resched():
>> >>
>> >> What's this?
>> >
>> > I tried a modified patch that also removes the cond_resched() call in
>> > __alloc_pages_direct_reclaim(), between try_to_free_pages() and
>> > get_page_from_freelist(). It seems not helping noticeably.
>> >
>> > It looks safe to remove that cond_resched() as we already have such
>> > calls in shrink_page_list().
>>
>> I tried similar thing but Andrew have a concern about it.
>> https://lkml.org/lkml/2011/3/24/138
>
> Yeah cond_resched() is at least not the root cause of our problems..
>
>> >> > + if (total_scanned > 2 * sc->nr_to_reclaim)
>> >> > + goto out;
>> >>
>> >> If there are lots of dirty pages in LRU?
>> >> If there are lots of unevictable pages in LRU?
>> >> If there are lots of mapped page in LRU but may_unmap = 0 cases?
>> >> I means it's rather risky early conclusion.
>> >
>> > That test means to avoid scanning too much on __GFP_NORETRY direct
>> > reclaims. My assumption for __GFP_NORETRY is, it should fail fast when
>> > the LRU pages seem hard to reclaim. And the problem in the 1000 dd
>> > case is, it's all easy to reclaim LRU pages but __GFP_NORETRY still
>> > fails from time to time, with lots of IPIs that may hurt large
>> > machines a lot.
>>
>> I don't have enough time and a environment to test it.
>> So I can't make sure of it but my concern is a latency.
>> If you solve latency problem considering CPU scaling, I won't oppose it. :)
>
> OK, let's head for that direction :)
Anyway, the problem of draining overhead with __GFP_NORETRY is worth
addressing, I think. We should handle it.
>
> Thanks,
> Fengguang
>
Thanks for the good experiments and numbers.
--
Kind regards,
Minchan Kim
On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang <[email protected]> wrote:
> Concurrent page allocations are suffering from high failure rates.
>
> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> the page allocation failures are
>
> nr_alloc_fail 733 # interleaved reads by 1 single task
> nr_alloc_fail 11799 # concurrent reads by 1000 tasks
>
> The concurrent read test script is:
>
> for i in `seq 1000`
> do
> truncate -s 1G /fs/sparse-$i
> dd if=/fs/sparse-$i of=/dev/null &
> done
>
With a Core2 Duo, 3G RAM, and no swap partition, I cannot reproduce the alloc failures.
> In order for get_page_from_freelist() to get free page,
>
> (1) try_to_free_pages() should use much higher .nr_to_reclaim than the
> current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
> possible low watermark state as well as fill the pcp with enough free
> pages to overflow its high watermark.
>
> (2) the get_page_from_freelist() _after_ direct reclaim should use lower
> watermark than its normal invocations, so that it can reasonably
> "reserve" some free pages for itself and prevent other concurrent
> page allocators stealing all its reclaimed pages.
>
> Some notes:
>
> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
> reclaim allocation fails") has the same target, however is obviously
> costly and less effective. It seems more clean to just remove the
> retry and drain code than to retain it.
>
> - it's a bit hacky to reclaim more than requested pages inside
> do_try_to_free_page(), and it won't help cgroup for now
>
> - it only aims to reduce failures when there are plenty of reclaimable
> pages, so it stops the opportunistic reclaim when scanned 2 times pages
>
> Test results:
>
> - the failure rate is pretty sensible to the page reclaim size,
> from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>
> - the IPIs are reduced by over 100 times
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602
>
> slabs_scanned 21632
> kswapd_steal 4393382
> kswapd_inodesteal 124
> kswapd_low_wmark_hit_quickly 885
> kswapd_high_wmark_hit_quickly 2321
> kswapd_skip_congestion_wait 0
> pageoutrun 29426
>
> CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
>
> LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
> RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
> TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
Could you tell me how to get the above info?
>
> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551
>
> slabs_scanned 33280
> kswapd_steal 4525537
> kswapd_inodesteal 187
> kswapd_low_wmark_hit_quickly 4980
> kswapd_high_wmark_hit_quickly 2573
> kswapd_skip_congestion_wait 0
> pageoutrun 35429
>
> CAL: 93 286 396 754 272 297 275 281 Function call interrupts
>
> LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
> RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
> TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
>
> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860
>
> slabs_scanned 23936
> kswapd_steal 4561178
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2760
> kswapd_high_wmark_hit_quickly 1748
> kswapd_skip_congestion_wait 0
> pageoutrun 32639
>
> CAL: 93 463 410 540 298 282 272 306 Function call interrupts
>
> LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
> RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
> TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
>
> this patch (WMARK_HIGH, limited scan)
> -------------------------------------
> nr_alloc_fail 276
> allocstall 54034
>
> slabs_scanned 24320
> kswapd_steal 4507482
> kswapd_inodesteal 262
> kswapd_low_wmark_hit_quickly 2638
> kswapd_high_wmark_hit_quickly 1710
> kswapd_skip_congestion_wait 0
> pageoutrun 32182
>
> CAL: 69 443 421 567 273 279 269 334 Function call interrupts
>
> LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
> RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
> TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
>
> CC: Mel Gorman <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> mm/page_alloc.c | 17 +++--------------
> mm/vmscan.c | 6 ++++++
> 2 files changed, 9 insertions(+), 14 deletions(-)
> --- linux-next.orig/mm/vmscan.c 2011-04-28 21:16:16.000000000 +0800
> +++ linux-next/mm/vmscan.c 2011-04-28 21:28:57.000000000 +0800
> @@ -1978,6 +1978,8 @@ static void shrink_zones(int priority, s
> continue;
> if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> continue; /* Let kswapd poll it */
> + sc->nr_to_reclaim = max(sc->nr_to_reclaim,
> + zone->watermark[WMARK_HIGH]);
> }
>
> shrink_zone(priority, zone, sc);
> @@ -2034,6 +2036,7 @@ static unsigned long do_try_to_free_page
> struct zoneref *z;
> struct zone *zone;
> unsigned long writeback_threshold;
> + unsigned long min_reclaim = sc->nr_to_reclaim;
>
> get_mems_allowed();
> delayacct_freepages_start();
> @@ -2067,6 +2070,9 @@ static unsigned long do_try_to_free_page
> }
> }
> total_scanned += sc->nr_scanned;
> + if (sc->nr_reclaimed >= min_reclaim &&
> + total_scanned > 2 * sc->nr_to_reclaim)
> + goto out;
> if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> goto out;
>
> --- linux-next.orig/mm/page_alloc.c 2011-04-28 21:16:16.000000000 +0800
> +++ linux-next/mm/page_alloc.c 2011-04-28 21:16:18.000000000 +0800
> @@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
> nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
> int migratetype, unsigned long *did_some_progress)
> {
> - struct page *page = NULL;
> + struct page *page;
> struct reclaim_state reclaim_state;
> - bool drained = false;
>
> cond_resched();
>
> @@ -1912,22 +1911,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
> if (unlikely(!(*did_some_progress)))
> return NULL;
>
> -retry:
> + alloc_flags |= ALLOC_HARDER;
> +
> page = get_page_from_freelist(gfp_mask, nodemask, order,
> zonelist, high_zoneidx,
> alloc_flags, preferred_zone,
> migratetype);
> -
> - /*
> - * If an allocation failed after direct reclaim, it could be because
> - * pages are pinned on the per-cpu lists. Drain them and try again
> - */
> - if (!page && !drained) {
> - drain_all_pages();
> - drained = true;
> - goto retry;
> - }
> -
> return page;
> }
>
>
--
Regards
dave
On Wed, May 4, 2011 at 9:56 AM, Dave Young <[email protected]> wrote:
> On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang <[email protected]> wrote:
>> Concurrent page allocations are suffering from high failure rates.
>>
>> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
>> the page allocation failures are
>>
>> nr_alloc_fail 733 # interleaved reads by 1 single task
>> nr_alloc_fail 11799 # concurrent reads by 1000 tasks
>>
>> The concurrent read test script is:
>>
>> for i in `seq 1000`
>> do
>> truncate -s 1G /fs/sparse-$i
>> dd if=/fs/sparse-$i of=/dev/null &
>> done
>>
>
> With Core2 Duo, 3G ram, No swap partition I can not produce the alloc fail
Unsetting CONFIG_SCHED_AUTOGROUP and CONFIG_CGROUP_SCHED seems to affect
the test results; now I see several nr_alloc_fail events (dd is not
finished yet):
dave@darkstar-32:$ grep fail /proc/vmstat:
nr_alloc_fail 4
compact_pagemigrate_failed 0
compact_fail 3
htlb_buddy_alloc_fail 0
thp_collapse_alloc_fail 4
So the result is related to the CPU scheduler.
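For reference, a quick way to check whether a given kernel was built with
these options (the config locations below are assumptions; they depend on
the distro, or on CONFIG_IKCONFIG_PROC being enabled):

grep -E 'CONFIG_SCHED_AUTOGROUP|CONFIG_CGROUP_SCHED' /boot/config-$(uname -r)
# or, if the running kernel exports its config:
zcat /proc/config.gz | grep -E 'CONFIG_SCHED_AUTOGROUP|CONFIG_CGROUP_SCHED'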
>
>> In order for get_page_from_freelist() to get free page,
>>
>> (1) try_to_free_pages() should use much higher .nr_to_reclaim than the
>> current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>> possible low watermark state as well as fill the pcp with enough free
>> pages to overflow its high watermark.
>>
>> (2) the get_page_from_freelist() _after_ direct reclaim should use lower
>> watermark than its normal invocations, so that it can reasonably
>> "reserve" some free pages for itself and prevent other concurrent
>> page allocators stealing all its reclaimed pages.
>>
>> Some notes:
>>
>> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>> reclaim allocation fails") has the same target, however is obviously
>> costly and less effective. It seems more clean to just remove the
>> retry and drain code than to retain it.
>>
>> - it's a bit hacky to reclaim more than requested pages inside
>> do_try_to_free_page(), and it won't help cgroup for now
>>
>> - it only aims to reduce failures when there are plenty of reclaimable
>> pages, so it stops the opportunistic reclaim when scanned 2 times pages
>>
>> Test results:
>>
>> - the failure rate is pretty sensible to the page reclaim size,
>> from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>>
>> - the IPIs are reduced by over 100 times
>>
>> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> -------------------------------------------------------------------------------
>> nr_alloc_fail 10496
>> allocstall 1576602
>>
>> slabs_scanned 21632
>> kswapd_steal 4393382
>> kswapd_inodesteal 124
>> kswapd_low_wmark_hit_quickly 885
>> kswapd_high_wmark_hit_quickly 2321
>> kswapd_skip_congestion_wait 0
>> pageoutrun 29426
>>
>> CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
>>
>> LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
>> RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
>> TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
>
> Could you tell how to get above info?
>
>>
>> patched (WMARK_MIN)
>> -------------------
>> nr_alloc_fail 704
>> allocstall 105551
>>
>> slabs_scanned 33280
>> kswapd_steal 4525537
>> kswapd_inodesteal 187
>> kswapd_low_wmark_hit_quickly 4980
>> kswapd_high_wmark_hit_quickly 2573
>> kswapd_skip_congestion_wait 0
>> pageoutrun 35429
>>
>> CAL: 93 286 396 754 272 297 275 281 Function call interrupts
>>
>> LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
>> RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
>> TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
>>
>> patched (WMARK_HIGH)
>> --------------------
>> nr_alloc_fail 282
>> allocstall 53860
>>
>> slabs_scanned 23936
>> kswapd_steal 4561178
>> kswapd_inodesteal 0
>> kswapd_low_wmark_hit_quickly 2760
>> kswapd_high_wmark_hit_quickly 1748
>> kswapd_skip_congestion_wait 0
>> pageoutrun 32639
>>
>> CAL: 93 463 410 540 298 282 272 306 Function call interrupts
>>
>> LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
>> RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
>> TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
>>
>> this patch (WMARK_HIGH, limited scan)
>> -------------------------------------
>> nr_alloc_fail 276
>> allocstall 54034
>>
>> slabs_scanned 24320
>> kswapd_steal 4507482
>> kswapd_inodesteal 262
>> kswapd_low_wmark_hit_quickly 2638
>> kswapd_high_wmark_hit_quickly 1710
>> kswapd_skip_congestion_wait 0
>> pageoutrun 32182
>>
>> CAL: 69 443 421 567 273 279 269 334 Function call interrupts
>>
>> LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
>> RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
>> TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
>>
>> CC: Mel Gorman <[email protected]>
>> Signed-off-by: Wu Fengguang <[email protected]>
>> ---
>> mm/page_alloc.c | 17 +++--------------
>> mm/vmscan.c | 6 ++++++
>> 2 files changed, 9 insertions(+), 14 deletions(-)
>> --- linux-next.orig/mm/vmscan.c 2011-04-28 21:16:16.000000000 +0800
>> +++ linux-next/mm/vmscan.c 2011-04-28 21:28:57.000000000 +0800
>> @@ -1978,6 +1978,8 @@ static void shrink_zones(int priority, s
>> continue;
>> if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>> continue; /* Let kswapd poll it */
>> + sc->nr_to_reclaim = max(sc->nr_to_reclaim,
>> + zone->watermark[WMARK_HIGH]);
>> }
>>
>> shrink_zone(priority, zone, sc);
>> @@ -2034,6 +2036,7 @@ static unsigned long do_try_to_free_page
>> struct zoneref *z;
>> struct zone *zone;
>> unsigned long writeback_threshold;
>> + unsigned long min_reclaim = sc->nr_to_reclaim;
>>
>> get_mems_allowed();
>> delayacct_freepages_start();
>> @@ -2067,6 +2070,9 @@ static unsigned long do_try_to_free_page
>> }
>> }
>> total_scanned += sc->nr_scanned;
>> + if (sc->nr_reclaimed >= min_reclaim &&
>> + total_scanned > 2 * sc->nr_to_reclaim)
>> + goto out;
>> if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> goto out;
>>
>> --- linux-next.orig/mm/page_alloc.c 2011-04-28 21:16:16.000000000 +0800
>> +++ linux-next/mm/page_alloc.c 2011-04-28 21:16:18.000000000 +0800
>> @@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>> nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
>> int migratetype, unsigned long *did_some_progress)
>> {
>> - struct page *page = NULL;
>> + struct page *page;
>> struct reclaim_state reclaim_state;
>> - bool drained = false;
>>
>> cond_resched();
>>
>> @@ -1912,22 +1911,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>> if (unlikely(!(*did_some_progress)))
>> return NULL;
>>
>> -retry:
>> + alloc_flags |= ALLOC_HARDER;
>> +
>> page = get_page_from_freelist(gfp_mask, nodemask, order,
>> zonelist, high_zoneidx,
>> alloc_flags, preferred_zone,
>> migratetype);
>> -
>> - /*
>> - * If an allocation failed after direct reclaim, it could be because
>> - * pages are pinned on the per-cpu lists. Drain them and try again
>> - */
>> - if (!page && !drained) {
>> - drain_all_pages();
>> - drained = true;
>> - goto retry;
>> - }
>> -
>> return page;
>> }
>>
>>
>
>
>
> --
> Regards
> dave
>
--
Regards
dave
On Wed, May 04, 2011 at 10:32:01AM +0800, Dave Young wrote:
> On Wed, May 4, 2011 at 9:56 AM, Dave Young <[email protected]> wrote:
> > On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang <[email protected]> wrote:
> >> Concurrent page allocations are suffering from high failure rates.
> >>
> >> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> >> the page allocation failures are
> >>
> >> nr_alloc_fail 733 # interleaved reads by 1 single task
> >> nr_alloc_fail 11799 # concurrent reads by 1000 tasks
> >>
> >> The concurrent read test script is:
> >>
> >> for i in `seq 1000`
> >> do
> >> truncate -s 1G /fs/sparse-$i
> >> dd if=/fs/sparse-$i of=/dev/null &
> >> done
> >>
> >
> > With Core2 Duo, 3G ram, No swap partition I can not produce the alloc fail
>
> unset CONFIG_SCHED_AUTOGROUP and CONFIG_CGROUP_SCHED seems affects the
> test results, now I see several nr_alloc_fail (dd is not finished
> yet):
>
> dave@darkstar-32:$ grep fail /proc/vmstat:
> nr_alloc_fail 4
> compact_pagemigrate_failed 0
> compact_fail 3
> htlb_buddy_alloc_fail 0
> thp_collapse_alloc_fail 4
>
> So the result is related to cpu scheduler.
Good catch! My kernel also disabled CONFIG_CGROUP_SCHED and
CONFIG_SCHED_AUTOGROUP.
Thanks,
Fengguang
> > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
> >
> > LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
> > RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
> > TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
>
> Could you tell how to get above info?
It's /proc/interrupts.
I have two lines at the end of the attached script to collect the
information, and another script that calls getdelays every 10s. The
posted reclaim delays are the last successful getdelays output.
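A minimal sketch of that getdelays loop, just for illustration (it is not
the exact script, which isn't attached here):

#!/bin/sh
# sample delay accounting for the running dd's every 10 seconds;
# the last successful sample is what gets reported
while pidof dd > /dev/null
do
        getdelays -dip `pidof dd`
        sleep 10
done

# counters collected once the test finishes
egrep '(nr_alloc_fail|allocstall)' /proc/vmstat
egrep '(CAL|RES|LOC|TLB)' /proc/interrupts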
I've automated the test process, so that with one single command line
a new kernel will be built and the test box will rerun tests on the
new kernel :)
Thanks,
Fengguang
On Wed, May 04, 2011 at 10:56:09AM +0800, Wu Fengguang wrote:
> On Wed, May 04, 2011 at 10:32:01AM +0800, Dave Young wrote:
> > On Wed, May 4, 2011 at 9:56 AM, Dave Young <[email protected]> wrote:
> > > On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang <[email protected]> wrote:
> > >> Concurrent page allocations are suffering from high failure rates.
> > >>
> > >> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> > >> the page allocation failures are
> > >>
> > >> nr_alloc_fail 733 # interleaved reads by 1 single task
> > >> nr_alloc_fail 11799 # concurrent reads by 1000 tasks
> > >>
> > >> The concurrent read test script is:
> > >>
> > >> for i in `seq 1000`
> > >> do
> > >> truncate -s 1G /fs/sparse-$i
> > >> dd if=/fs/sparse-$i of=/dev/null &
> > >> done
> > >>
> > >
> > > With Core2 Duo, 3G ram, No swap partition I can not produce the alloc fail
> >
> > unset CONFIG_SCHED_AUTOGROUP and CONFIG_CGROUP_SCHED seems affects the
> > test results, now I see several nr_alloc_fail (dd is not finished
> > yet):
> >
> > dave@darkstar-32:$ grep fail /proc/vmstat:
> > nr_alloc_fail 4
> > compact_pagemigrate_failed 0
> > compact_fail 3
> > htlb_buddy_alloc_fail 0
> > thp_collapse_alloc_fail 4
> >
> > So the result is related to cpu scheduler.
>
> Good catch! My kernel also disabled CONFIG_CGROUP_SCHED and
> CONFIG_SCHED_AUTOGROUP.
I tried enabling the two options and found that "ps ax" runs much faster
when the 1000 dd's are running. The test results on the base kernel are:
start time: 287
total time: 499
nr_alloc_fail 5075
allocstall 20658
LOC: 502393 501303 500813 503814 501972 501775 501949 501143 Local timer interrupts
RES: 5716 8584 7603 2699 7972 15383 8921 4345 Rescheduling interrupts
CAL: 1543 1731 1733 1809 1692 1715 1765 1753 Function call interrupts
TLB: 132 27 31 21 70 175 68 46 TLB shootdowns
CPU count real total virtual total delay total delay average
916 2803573792 2785739581 200248952651 218.612ms
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
15 234623427 15ms
dd: read=0, write=0, cancelled_write=0
Compared to the results with cgroup scheduling disabled (cited below),
allocstall is reduced to 1.3% and the CALs are mostly eliminated.
nr_alloc_fail is cut down by almost 2/3, and the RECLAIM delay is reduced
from 29ms to 15ms. Virtually everything improved considerably!
Thanks,
Fengguang
---
start time: 245
total time: 526
nr_alloc_fail 14586
allocstall 1578343
LOC: 533981 529210 528283 532346 533392 531314 531705 528983 Local timer interrupts
RES: 3123 2177 1676 1580 2157 1974 1606 1696 Rescheduling interrupts
CAL: 218392 218631 219167 219217 218840 218985 218429 218440 Function call interrupts
TLB: 175 13 21 18 62 309 119 42 TLB shootdowns
CPU count real total virtual total delay total
1122 3676441096 3656793547 274182127286
IO count delay total delay average
3 291765493 97ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1350 39229752193 29ms
dd: read=45056, write=0, cancelled_write=0
Thanks,
Fengguang
On Wed, May 4, 2011 at 12:00 PM, Wu Fengguang <[email protected]> wrote:
>> > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
>> >
>> > LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
>> > RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
>> > TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
>>
>> Could you tell how to get above info?
>
> It's /proc/interrupts.
>
> I have two lines at the end of the attached script to collect the
> information, and another script to call getdelays on every 10s. The
> posted reclaim delays are the last successful getdelays output.
>
> I've automated the test process, so that with one single command line
> a new kernel will be built and the test box will rerun tests on the
> new kernel :)
Thank you for that effort!
>
> Thanks,
> Fengguang
>
--
Regards
dave