2013-03-01 09:22:42

by Simon Jeons

Subject: Re: [PATCH] add extra free kbytes tunable

Hi Johannes,

On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
>>> The problem is that adding this tunable will constrain future VM
>>> implementations. We will forever need to at least retain the
>>> pseudo-file. We will also need to make some effort to retain its
>>> behaviour.
>>>
>>> It would of course be better to fix things so you don't need to tweak
>>> VM internals to get acceptable behaviour.
>> I sympathize with this. It's presently all that keeps us afloat though.
>> I'll whine about it again later if nothing else pans out.
>>
>>> You said:
>>>
>>> : We have a server workload wherein machines with 100G+ of "free" memory
>>> : (used by page cache), scattered but frequent random io reads from 12+
>>> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
>>> : in a few different ways.
>>> :
>>> : 1) It'll run into small amounts of reclaim randomly (a few hundred
>>> : thousand).
>>> :
>>> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
>>> : occasionally responds to by freeing up 40g+ of the pagecache all at once
>>> : (!) while pausing the system (Argh).
>>> :
>>> : 3) A blip in an upstream provider or failover from a peer causes the
>>> : kernel to allocate massive amounts of memory for retransmission
>>> : queues/etc, potentially along with buffered IO reads and (some, but not
>>> : often a ton) of new allocations from an application. This paired with 2)
>>> : can cause the box to stall for 15+ seconds.
>>>
>>> Can we prioritise these? 2) looks just awful - kswapd shouldn't just
>>> go off and free 40G of pagecache. Do you know what's actually in that
>>> pagecache? Large number of small files or small number of (very) large
>>> files?
>> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
>> accessed via address. occasionally madvise (WILLNEED) applied to the
>> address ranges before attempting to use them. There're a mix of other
>> files but nothing significant. The mmap's are READONLY and writes are done
>> via pwrite-ish functions.
>>
>> I could use some guidance on inspecting/tracing the problem. I've been
>> trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
>>
>> - The amount of memory freed back up is either a percentage of total
>> memory or a percentage of free memory. (a machine with 48G of ram will
>> "only" free up an extra 4-7g)
>>
>> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
>> is applied with the application down. As it fills it seems to get itself
>> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
>> still apply to a stable instance.
>>
>> - Protecting the DMA32 zone with something like "1 1 32" into
>> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>>
>> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
>> hundred thousand pages before finding anything it actually wants to
>> reclaim (low vmeff). I've only been able to reproduce this from a clean
>> start. It can take up to 3 seconds before kswapd starts actually
>> reclaiming pages.
>>
>> - So far as I can tell we're almost exclusively using 0 order allocations.
>> THP is disabled.
>>
>> There's not much dirty memory involved. It's not flushing out writes while
>> reclaiming, it just kills off massive amount of cached memory.
> Mapped file pages have to get scanned twice before they are reclaimed
> because we don't have enough usage information after the first scan.

It seems that only VM_EXEC mapped file pages are protected.
A possible issue in the page reclaim subsystem:

static inline int page_is_file_cache(struct page *page)
{
	return !PageSwapBacked(page);
}

AFAIK, PG_swapbacked is set when an anonymous page is added to the swap
cache, and cleared when it is removed from the swap cache. So anonymous
pages which are reclaimed and added to the swap cache won't have this
flag, and will then be treated as file-backed pages? Is that a bug? Also,
why does __add_to_swap_cache increase NR_FILE_PAGES when the page is
added to the radix tree successfully?
>
> In your case, when you start this workload after a fresh boot or
> dropping the caches, there will be 48G of mapped file pages that have
> never been scanned before and that need to be looked at twice.
>
> Unfortunately, if kswapd does not make progress (and it won't for some
> time at first), it will scan more and more aggressively with

Why doesn't kswapd make progress for some time at first?

> increasing scan priority. And when the 48G of pages are finally
> cycled, kswapd's scan window is a large percentage of your machine's
> memory, and it will free every single page in it.
>
> I think we should think about capping kswapd zone reclaim cycles just
> as we do for direct reclaim. It's a little ridiculous that it can run
> unbounded and reclaim every page in a zone without ever checking back
> against the watermark. We still increase the scan window evenly when
> we don't make forward progress, but we are more carefully inching zone
> levels back toward the watermarks.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c4883eb..8a4c446 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> 		.may_unmap = 1,
> 		.may_swap = 1,
> 		/*
> -		 * kswapd doesn't want to be bailed out while reclaim. because
> -		 * we want to put equal scanning pressure on each zone.
> +		 * Even kswapd zone scans want to be bailed out after
> +		 * reclaiming a good chunk of pages. It will just
> +		 * come back if the watermarks are still not met.
> 		 */
> -		.nr_to_reclaim = ULONG_MAX,
> +		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> 		.order = order,
> 		.target_mem_cgroup = NULL,
> 	};
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]


2013-03-01 09:31:58

by Simon Jeons

Subject: Re: [PATCH] add extra free kbytes tunable

On 03/01/2013 05:22 PM, Simon Jeons wrote:
> Hi Johannes,
>
> On 02/23/2013 01:56 AM, Johannes Weiner wrote:
>> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
>>>> The problem is that adding this tunable will constrain future VM
>>>> implementations. We will forever need to at least retain the
>>>> pseudo-file. We will also need to make some effort to retain its
>>>> behaviour.
>>>>
>>>> It would of course be better to fix things so you don't need to tweak
>>>> VM internals to get acceptable behaviour.
>>> I sympathize with this. It's presently all that keeps us afloat though.
>>> I'll whine about it again later if nothing else pans out.
>>>
>>>> You said:
>>>>
>>>> : We have a server workload wherein machines with 100G+ of "free" memory
>>>> : (used by page cache), scattered but frequent random io reads from 12+
>>>> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
>>>> : in a few different ways.
>>>> :
>>>> : 1) It'll run into small amounts of reclaim randomly (a few hundred
>>>> : thousand).
>>>> :
>>>> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
>>>> : occasionally responds to by freeing up 40g+ of the pagecache all at once
>>>> : (!) while pausing the system (Argh).
>>>> :
>>>> : 3) A blip in an upstream provider or failover from a peer causes the
>>>> : kernel to allocate massive amounts of memory for retransmission
>>>> : queues/etc, potentially along with buffered IO reads and (some, but not
>>>> : often a ton) of new allocations from an application. This paired with 2)
>>>> : can cause the box to stall for 15+ seconds.
>>>>
>>>> Can we prioritise these? 2) looks just awful - kswapd shouldn't just
>>>> go off and free 40G of pagecache. Do you know what's actually in that
>>>> pagecache? Large number of small files or small number of (very) large
>>>> files?
>>> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
>>> accessed via address. occasionally madvise (WILLNEED) applied to the
>>> address ranges before attempting to use them. There're a mix of other
>>> files but nothing significant. The mmap's are READONLY and writes are done
>>> via pwrite-ish functions.
>>>
>>> I could use some guidance on inspecting/tracing the problem. I've been
>>> trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
>>>
>>> - The amount of memory freed back up is either a percentage of total
>>> memory or a percentage of free memory. (a machine with 48G of ram will
>>> "only" free up an extra 4-7g)
>>>
>>> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
>>> is applied with the application down. As it fills it seems to get itself
>>> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
>>> still apply to a stable instance.
>>>
>>> - Protecting the DMA32 zone with something like "1 1 32" into
>>> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>>>
>>> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
>>> hundred thousand pages before finding anything it actually wants to
>>> reclaim (low vmeff). I've only been able to reproduce this from a clean
>>> start. It can take up to 3 seconds before kswapd starts actually
>>> reclaiming pages.
>>>
>>> - So far as I can tell we're almost exclusively using 0 order allocations.
>>> THP is disabled.
>>>
>>> There's not much dirty memory involved. It's not flushing out writes while
>>> reclaiming, it just kills off massive amount of cached memory.
>> Mapped file pages have to get scanned twice before they are reclaimed
>> because we don't have enough usage information after the first scan.
>
> It seems that just VM_EXEC mapped file pages are protected.
> Issue in page reclaim subsystem:
> static inline int page_is_file_cache(struct page *page)
> {
> return !PageSwapBacked(page);
> }
> AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and
> be cleaned if removed from swap cache. So anonymous pages which are
> reclaimed and add to swap cache won't have this flag, then they will
> be treated as

s/are/aren't

> file backed pages? Is it buggy? In function __add_to_swap_cache if
> add to radix tree successfully will result in increase NR_FILE_PAGES,
> why?
>>
>> In your case, when you start this workload after a fresh boot or
>> dropping the caches, there will be 48G of mapped file pages that have
>> never been scanned before and that need to be looked at twice.
>>
>> Unfortunately, if kswapd does not make progress (and it won't for some
>> time at first), it will scan more and more aggressively with
>
> Why kswapd does not make progress for some time at first?
>
>> increasing scan priority. And when the 48G of pages are finally
>> cycled, kswapd's scan window is a large percentage of your machine's
>> memory, and it will free every single page in it.
>>
>> I think we should think about capping kswapd zone reclaim cycles just
>> as we do for direct reclaim. It's a little ridiculous that it can run
>> unbounded and reclaim every page in a zone without ever checking back
>> against the watermark. We still increase the scan window evenly when
>> we don't make forward progress, but we are more carefully inching zone
>> levels back toward the watermarks.
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c4883eb..8a4c446 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> 		.may_unmap = 1,
>> 		.may_swap = 1,
>> 		/*
>> -		 * kswapd doesn't want to be bailed out while reclaim. because
>> -		 * we want to put equal scanning pressure on each zone.
>> +		 * Even kswapd zone scans want to be bailed out after
>> +		 * reclaiming a good chunk of pages. It will just
>> +		 * come back if the watermarks are still not met.
>> 		 */
>> -		.nr_to_reclaim = ULONG_MAX,
>> +		.nr_to_reclaim = SWAP_CLUSTER_MAX,
>> 		.order = order,
>> 		.target_mem_cgroup = NULL,
>> 	};
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: [email protected]
>

2013-03-01 22:34:23

by Hugh Dickins

Subject: Re: [PATCH] add extra free kbytes tunable

On Fri, 1 Mar 2013, Simon Jeons wrote:
> On 03/01/2013 05:22 PM, Simon Jeons wrote:
> > On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> > > Mapped file pages have to get scanned twice before they are reclaimed
> > > because we don't have enough usage information after the first scan.
> >
> > It seems that just VM_EXEC mapped file pages are protected.
> > Issue in page reclaim subsystem:
> > static inline int page_is_file_cache(struct page *page)
> > {
> > return !PageSwapBacked(page);
> > }
> > AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and be
> > cleaned if removed from swap cache. So anonymous pages which are reclaimed
> > and add to swap cache won't have this flag, then they will be treated as
>
> s/are/aren't

PG_swapbacked != PG_swapcache

2013-03-02 00:11:00

by Simon Jeons

Subject: Re: [PATCH] add extra free kbytes tunable

On 03/02/2013 06:33 AM, Hugh Dickins wrote:
> On Fri, 1 Mar 2013, Simon Jeons wrote:
>> On 03/01/2013 05:22 PM, Simon Jeons wrote:
>>> On 02/23/2013 01:56 AM, Johannes Weiner wrote:
>>>> Mapped file pages have to get scanned twice before they are reclaimed
>>>> because we don't have enough usage information after the first scan.
>>> It seems that just VM_EXEC mapped file pages are protected.
>>> Issue in page reclaim subsystem:
>>> static inline int page_is_file_cache(struct page *page)
>>> {
>>> return !PageSwapBacked(page);
>>> }
>>> AFAIK, PG_swapbacked is set if anonymous page added to swap cache, and be
>>> cleaned if removed from swap cache. So anonymous pages which are reclaimed
>>> and add to swap cache won't have this flag, then they will be treated as
>> s/are/aren't
> PG_swapbacked != PG_swapcache

Oh, I see. Thanks, Hugh, thanks for your patience. :)

Why does __add_to_swap_cache increase NR_FILE_PAGES when the page is
added to the radix tree successfully? This is an anonymous page, not a
file-backed page.

2013-03-02 01:43:01

by Hugh Dickins

Subject: Re: [PATCH] add extra free kbytes tunable

On Sat, 2 Mar 2013, Simon Jeons wrote:
>
> Why does __add_to_swap_cache increase NR_FILE_PAGES when the page is
> added to the radix tree successfully? This is an anonymous page, not a
> file-backed page.

Right, that's hard to understand without historical background.

I think the quick answer would be that we used to (and still do) think
of file-cache and swap-cache as two halves of page-cache. And then when
someone changed the way stats were gathered, they couldn't very well
name the stat for page-cache pages NR_PAGE_PAGES, so they called it
NR_FILE_PAGES - but it still included swap.

We have tried down the years to keep the info shown in /proc/meminfo
(for example, but it is the prime example) consistent across releases,
while adding new lines and new distinctions.

But it has often been hard to find good enough short enough names for
those new distinctions: when 2.6.28 split the LRUs between file-backed
and swap-backed, it used "anon" for swap-backed in /proc/meminfo.

So you'll find that shmem and swap are counted as file in some places
and anon in others, and it's hard to grasp which is where and why,
without remembering the history.

I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts
total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo:
so it's undoing what you observe __add_to_swap_cache() to be doing.

It's quite possible that if you went through all the users of
NR_FILE_PAGES, you'd find it makes much more sense to leave out
the swap-cache pages, and just add those on where needed.

But you might find a few places where it's hard to decide whether
the swap-cache pages were ever intended to be included or not, and
hard to decide if it's safe to change those numbers now or not.

Hugh

2013-03-02 02:48:29

by Simon Jeons

Subject: Re: [PATCH] add extra free kbytes tunable

On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
>> Why does __add_to_swap_cache increase NR_FILE_PAGES when the page is
>> added to the radix tree successfully? This is an anonymous page, not a
>> file-backed page.
> Right, that's hard to understand without historical background.
>
> I think the quick answer would be that we used to (and still do) think
> of file-cache and swap-cache as two halves of page-cache. And then when

Should a shmem page be treated as file-cache or swap-cache? It seems
strange, since it consists of anonymous pages, yet these pages back files.

> someone changed the way stats were gathered, they couldn't very well
> name the stat for page-cache pages NR_PAGE_PAGES, so they called it
> NR_FILE_PAGES - but it still included swap.
>
> We have tried down the years to keep the info shown in /proc/meminfo
> (for example, but it is the prime example) consistent across releases,
> while adding new lines and new distinctions.
>
> But it has often been hard to find good enough short enough names for
> those new distinctions: when 2.6.28 split the LRUs between file-backed
> and swap-backed, it used "anon" for swap-backed in /proc/meminfo.
>
> So you'll find that shmem and swap are counted as file in some places
> and anon in others, and it's hard to grasp which is where and why,
> without remembering the history.
>
> I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts
> total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo:
> so it's undoing what you observe __add_to_swap_cache() to be doing.
>
> It's quite possible that if you went through all the users of
> NR_FILE_PAGES, you'd find it makes much more sense to leave out
> the swap-cache pages, and just add those on where needed.
>
> But you might find a few places where it's hard to decide whether
> the swap-cache pages were ever intended to be included or not, and
> hard to decide if it's safe to change those numbers now or not.
>
> Hugh

2013-03-02 03:09:12

by Hugh Dickins

Subject: Re: [PATCH] add extra free kbytes tunable

On Sat, 2 Mar 2013, Simon Jeons wrote:
> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> > On Sat, 2 Mar 2013, Simon Jeons wrote:
> > > Why does __add_to_swap_cache increase NR_FILE_PAGES when the page is
> > > added to the radix tree successfully? This is an anonymous page, not
> > > a file-backed page.
> > Right, that's hard to understand without historical background.
> >
> > I think the quick answer would be that we used to (and still do) think
> > of file-cache and swap-cache as two halves of page-cache. And then when
>
> Should a shmem page be treated as file-cache or swap-cache? It seems
> strange, since it consists of anonymous pages, yet these pages back files.

A shmem page is swap-backed file-cache, and it may get transferred to or
from swap-cache: yes, it's a difficult and confusing case, as I said below.

I would never call it "anonymous", but it is counted in /proc/meminfo's
Active(anon) or Inactive(anon) rather than in (file), because "anon"
there is shorthand for "swap-backed".

> > So you'll find that shmem and swap are counted as file in some places
> > and anon in others, and it's hard to grasp which is where and why,
> > without remembering the history.

Hugh

2013-03-02 04:06:55

by Simon Jeons

Subject: Re: [PATCH] add extra free kbytes tunable

On 03/02/2013 11:08 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
>> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
>>> On Sat, 2 Mar 2013, Simon Jeons wrote:
>>>> Why does __add_to_swap_cache increase NR_FILE_PAGES when the page is
>>>> added to the radix tree successfully? This is an anonymous page, not
>>>> a file-backed page.
>>> Right, that's hard to understand without historical background.
>>>
>>> I think the quick answer would be that we used to (and still do) think
>>> of file-cache and swap-cache as two halves of page-cache. And then when
>> Should a shmem page be treated as file-cache or swap-cache? It seems
>> strange, since it consists of anonymous pages, yet these pages back files.
> A shmem page is swap-backed file-cache, and it may get transferred to or
> from swap-cache: yes, it's a difficult and confusing case, as I said below.
>
> I would never call it "anonymous", but it is counted in /proc/meminfo's
> Active(anon) or Inactive(anon) rather than in (file), because "anon"
> there is shorthand for "swap-backed".

Oh, I see. Thanks. :)

>
>>> So you'll find that shmem and swap are counted as file in some places
>>> and anon in others, and it's hard to grasp which is where and why,
>>> without remembering the history.
> Hugh

2013-03-09 01:08:16

by Simon Jeons

Subject: Re: [PATCH] add extra free kbytes tunable

Hi Hugh,
On 03/02/2013 11:08 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
>> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
>>> On Sat, 2 Mar 2013, Simon Jeons wrote:
>>>> Why does __add_to_swap_cache increase NR_FILE_PAGES when the page is
>>>> added to the radix tree successfully? This is an anonymous page, not
>>>> a file-backed page.
>>> Right, that's hard to understand without historical background.
>>>
>>> I think the quick answer would be that we used to (and still do) think
>>> of file-cache and swap-cache as two halves of page-cache. And then when
>> Should a shmem page be treated as file-cache or swap-cache? It seems
>> strange, since it consists of anonymous pages, yet these pages back files.
> A shmem page is swap-backed file-cache, and it may get transferred to or
> from swap-cache: yes, it's a difficult and confusing case, as I said below.
>
> I would never call it "anonymous", but it is counted in /proc/meminfo's
> Active(anon) or Inactive(anon) rather than in (file), because "anon"
> there is shorthand for "swap-backed".

In read_swap_cache_async:

SetPageSwapBacked(new_page);
__add_to_swap_cache();
swap_readpage();
ClearPageSwapBacked(new_page);

Why clear PG_swapbacked flag?

>
>>> So you'll find that shmem and swap are counted as file in some places
>>> and anon in others, and it's hard to grasp which is where and why,
>>> without remembering the history.
> Hugh