2010-04-02 06:51:04

by Shaohua Li

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Wed, Mar 31, 2010 at 01:53:27PM +0800, KOSAKI Motohiro wrote:
> > > On Tue, Mar 30, 2010 at 02:08:53PM +0800, KOSAKI Motohiro wrote:
> > > > Hi
> > > >
> > > > > Commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26 introduces a regression.
> > > > > With it, our tmpfs test always oom. The test has a lot of rotated anon
> > > > > pages and cause percent[0] zero. Actually the percent[0] is a very small
> > > > > value, but our calculation round it to zero. The commit makes vmscan
> > > > > completely skip anon pages and cause oops.
> > > > > An option is if percent[x] is zero in get_scan_ratio(), forces it
> > > > > to 1. See below patch.
> > > > > But the offending commit still changes behavior. Without the commit, we scan
> > > > > all pages if priority is zero, below patch doesn't fix this. Don't know if
> > > > > It's required to fix this too.
> > > >
> > > > Can you please post your /proc/meminfo and reproduce program? I'll digg it.
> > > >
> > > > Very unfortunately, this patch isn't acceptable. In past time, vmscan
> > > > had similar logic, but 1% swap-out made lots bug reports.
> > > if 1% is still big, how about below patch?
> >
> > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > <1% seems no good reclaim rate.
>
> Oops, the above mention is wrong. sorry. only 1 page is still too big.
> because under streaming io workload, the number of scanning anon pages should
> be zero. this is very strong requirement. if not, backup operation will makes
> a lot of swapping out.
Sounds like there is no big impact from the patch on the workload you mentioned;
please see the description below.
I updated the description of the patch as Fengguang suggested.



Commit 84b18490d introduces a regression. With it, our tmpfs test always OOMs.
The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there are
6 copies of the kernel source and the test does a kbuild for each copy. My
investigation shows the test has a lot of rotated anon pages and very few
file pages, so get_scan_ratio calculates percent[0] to be zero. Actually
percent[0] should be a very small value, but our calculation rounds it
to zero. The commit makes vmscan completely skip anon pages and causes the oops.

To avoid the underflow, we don't use a percentage; instead we directly calculate
how many pages should be scanned. In this way, we should get several pages to scan
even for a < 1% ratio. With this fix, my test doesn't OOM any more.

Note, this patch doesn't really change the logic, it just increases precision. For
a system with a lot of memory, this might slightly change behavior. For example,
in a sequential file read workload, without the patch, we don't swap any anon
pages. With it, if anon memory size is bigger than 16G, we will say one anon page
swapped. The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap
is assumed to be 1024, which is common in this workload. So the impact sounds like
no big deal.
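
As a quick sanity check of the 16G figure, here is a standalone sketch
(assuming 4K pages, a priority factor of 4096, and fp/ap = 1024 as stated
above; an illustration only, not part of the patch):

#include <stdio.h>

int main(void)
{
	unsigned long priority_factor = 4096;	/* 2^12, the DEF_PRIORITY shift */
	unsigned long fp_over_ap = 1024;	/* assumed ratio for this workload */
	unsigned long page_size = 4096;

	/* smallest anon size (in pages) whose scan target reaches one page */
	unsigned long pages = priority_factor * fp_over_ap;

	printf("%lu pages = %llu GB\n", pages,
	       (unsigned long long)pages * page_size >> 30);	/* 16 GB */
	return 0;
}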

Signed-off-by: Shaohua Li <[email protected]>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..80a7ed5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1519,27 +1519,50 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
}

/*
+ * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
+ * until we collected @swap_cluster_max pages to scan.
+ */
+static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
+ unsigned long *nr_saved_scan)
+{
+ unsigned long nr;
+
+ *nr_saved_scan += nr_to_scan;
+ nr = *nr_saved_scan;
+
+ if (nr >= SWAP_CLUSTER_MAX)
+ *nr_saved_scan = 0;
+ else
+ nr = 0;
+
+ return nr;
+}
+
+/*
* Determine how aggressively the anon and file LRU lists should be
* scanned. The relative value of each set of LRU lists is determined
* by looking at the fraction of the pages scanned we did rotate back
* onto the active list instead of evict.
*
- * percent[0] specifies how much pressure to put on ram/swap backed
- * memory, while percent[1] determines pressure on the file LRUs.
+ * nr[x] specifies how many pages should be scanned
*/
-static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
- unsigned long *percent)
+static void get_scan_count(struct zone *zone, struct scan_control *sc,
+ unsigned long *nr, int priority)
{
unsigned long anon, file, free;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ unsigned long fraction[2], denominator[2];
+ enum lru_list l;

/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
- percent[0] = 0;
- percent[1] = 100;
- return;
+ fraction[0] = 0;
+ denominator[0] = 1;
+ fraction[1] = 1;
+ denominator[1] = 1;
+ goto out;
}

anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
@@ -1552,9 +1575,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
/* If we have very few page cache pages,
force-scan anon pages. */
if (unlikely(file + free <= high_wmark_pages(zone))) {
- percent[0] = 100;
- percent[1] = 0;
- return;
+ fraction[0] = 1;
+ denominator[0] = 1;
+ fraction[1] = 0;
+ denominator[1] = 1;
+ goto out;
}
}

@@ -1601,29 +1626,29 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;

- /* Normalize to percentages */
- percent[0] = 100 * ap / (ap + fp + 1);
- percent[1] = 100 - percent[0];
-}
-
-/*
- * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
- * until we collected @swap_cluster_max pages to scan.
- */
-static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
- unsigned long *nr_saved_scan)
-{
- unsigned long nr;
+ fraction[0] = ap;
+ denominator[0] = ap + fp + 1;
+ fraction[1] = fp;
+ denominator[1] = ap + fp + 1;

- *nr_saved_scan += nr_to_scan;
- nr = *nr_saved_scan;
+out:
+ for_each_evictable_lru(l) {
+ int file = is_file_lru(l);
+ unsigned long scan;

- if (nr >= SWAP_CLUSTER_MAX)
- *nr_saved_scan = 0;
- else
- nr = 0;
+ if (fraction[file] == 0) {
+ nr[l] = 0;
+ continue;
+ }

- return nr;
+ scan = zone_nr_lru_pages(zone, sc, l);
+ if (priority) {
+ scan >>= priority;
+ scan = (scan * fraction[file] / denominator[file]);
+ }
+ nr[l] = nr_scan_try_batch(scan,
+ &reclaim_stat->nr_saved_scan[l]);
+ }
}

/*
@@ -1634,31 +1659,11 @@ static void shrink_zone(int priority, struct zone *zone,
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
- unsigned long percent[2]; /* anon @ 0; file @ 1 */
enum lru_list l;
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
- struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-
- get_scan_ratio(zone, sc, percent);

- for_each_evictable_lru(l) {
- int file = is_file_lru(l);
- unsigned long scan;
-
- if (percent[file] == 0) {
- nr[l] = 0;
- continue;
- }
-
- scan = zone_nr_lru_pages(zone, sc, l);
- if (priority) {
- scan >>= priority;
- scan = (scan * percent[file]) / 100;
- }
- nr[l] = nr_scan_try_batch(scan,
- &reclaim_stat->nr_saved_scan[l]);
- }
+ get_scan_count(zone, sc, nr, priority);

while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
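
As a standalone illustration of the batching in nr_scan_try_batch() above
(a sketch only; SWAP_CLUSTER_MAX is assumed to be 32, its usual value in
kernels of this era):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32

static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
				       unsigned long *nr_saved_scan)
{
	unsigned long nr;

	*nr_saved_scan += nr_to_scan;
	nr = *nr_saved_scan;

	if (nr >= SWAP_CLUSTER_MAX)
		*nr_saved_scan = 0;
	else
		nr = 0;

	return nr;
}

int main(void)
{
	unsigned long saved = 0;
	int i;

	/* a tiny per-round target of 5 pages: rounds 0-5 scan nothing,
	 * then round 6 releases the accumulated batch of 35 pages */
	for (i = 0; i < 10; i++)
		printf("round %d: scan %lu\n", i,
		       nr_scan_try_batch(5, &saved));
	return 0;
}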


2010-04-02 09:14:48

by KOSAKI Motohiro

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

> > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > <1% seems no good reclaim rate.
> >
> > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > because under streaming io workload, the number of scanning anon pages should
> > be zero. this is very strong requirement. if not, backup operation will makes
> > a lot of swapping out.
> Sounds there is no big impact for the workload which you mentioned with the patch.
> please see below descriptions.
> I updated the description of the patch as fengguang suggested.

Umm.. sorry, no.

"One fix that introduces another bug" is not a good deal. Instead,
I'll revert the guilty commit first, as akpm mentioned.

thanks.

2010-04-02 09:24:49

by Shaohua Li

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > <1% seems no good reclaim rate.
> > >
> > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > because under streaming io workload, the number of scanning anon pages should
> > > be zero. this is very strong requirement. if not, backup operation will makes
> > > a lot of swapping out.
> > Sounds there is no big impact for the workload which you mentioned with the patch.
> > please see below descriptions.
> > I updated the description of the patch as fengguang suggested.
>
> Umm.. sorry, no.
>
> "one fix but introduce another one bug" is not good deal. instead,
> I'll revert the guilty commit at first as akpm mentioned.
Even if we revert the commit, the patch still has its benefit, as it increases
calculation precision, right?

Thanks,
Shaohua

2010-04-04 14:19:29

by KOSAKI Motohiro

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

> On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > > <1% seems no good reclaim rate.
> > > >
> > > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > > because under streaming io workload, the number of scanning anon pages should
> > > > be zero. this is very strong requirement. if not, backup operation will makes
> > > > a lot of swapping out.
> > > Sounds there is no big impact for the workload which you mentioned with the patch.
> > > please see below descriptions.
> > > I updated the description of the patch as fengguang suggested.
> >
> > Umm.. sorry, no.
> >
> > "one fix but introduce another one bug" is not good deal. instead,
> > I'll revert the guilty commit at first as akpm mentioned.
> Even we revert the commit, the patch still has its benefit, as it increases
> calculation precision, right?

No, you shouldn't ignore the regression case.

If we can remove the streaming IO corner case with another patch, this patch
can be considered for merging.

thanks.

2010-04-06 00:50:44

by Fengguang Wu

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Fri, Apr 02, 2010 at 02:50:52PM +0800, Li, Shaohua wrote:
> On Wed, Mar 31, 2010 at 01:53:27PM +0800, KOSAKI Motohiro wrote:
> > > > On Tue, Mar 30, 2010 at 02:08:53PM +0800, KOSAKI Motohiro wrote:
> > > > > Hi
> > > > >
> > > > > > Commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26 introduces a regression.
> > > > > > With it, our tmpfs test always oom. The test has a lot of rotated anon
> > > > > > pages and cause percent[0] zero. Actually the percent[0] is a very small
> > > > > > value, but our calculation round it to zero. The commit makes vmscan
> > > > > > completely skip anon pages and cause oops.
> > > > > > An option is if percent[x] is zero in get_scan_ratio(), forces it
> > > > > > to 1. See below patch.
> > > > > > But the offending commit still changes behavior. Without the commit, we scan
> > > > > > all pages if priority is zero, below patch doesn't fix this. Don't know if
> > > > > > It's required to fix this too.
> > > > >
> > > > > Can you please post your /proc/meminfo and reproduce program? I'll digg it.
> > > > >
> > > > > Very unfortunately, this patch isn't acceptable. In past time, vmscan
> > > > > had similar logic, but 1% swap-out made lots bug reports.
> > > > if 1% is still big, how about below patch?
> > >
> > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > <1% seems no good reclaim rate.
> >
> > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > because under streaming io workload, the number of scanning anon pages should
> > be zero. this is very strong requirement. if not, backup operation will makes
> > a lot of swapping out.
> Sounds there is no big impact for the workload which you mentioned with the patch.
> please see below descriptions.
> I updated the description of the patch as fengguang suggested.
>
>
>
> Commit 84b18490d introduces a regression. With it, our tmpfs test always oom.
> The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there are
> 6 copies of kernel source and the test does kbuild for each copy. My
> investigation shows the test has a lot of rotated anon pages and quite few
> file pages, so get_scan_ratio calculates percent[0] to be zero. Actually
> the percent[0] shoule be a very small value, but our calculation round it
> to zero. The commit makes vmscan completely skip anon pages and cause oops.
>
> To avoid underflow, we don't use percentage, instead we directly calculate
> how many pages should be scaned. In this way, we should get several scan pages
> for < 1% percent. With this fix, my test doesn't oom any more.
>
> Note, this patch doesn't really change logics, but just increase precise. For
> system with a lot of memory, this might slightly changes behavior. For example,
> in a sequential file read workload, without the patch, we don't swap any anon
> pages. With it, if anon memory size is bigger than 16G, we will say one anon page

see?

> swapped. The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap
> is assumed to be 1024 which is common in this workload. So the impact sounds not
> a big deal.
>
> Signed-off-by: Shaohua Li <[email protected]>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 79c8098..80a7ed5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1519,27 +1519,50 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> }
>
> /*
> + * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
> + * until we collected @swap_cluster_max pages to scan.
> + */
> +static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
> + unsigned long *nr_saved_scan)
> +{
> + unsigned long nr;
> +
> + *nr_saved_scan += nr_to_scan;
> + nr = *nr_saved_scan;
> +
> + if (nr >= SWAP_CLUSTER_MAX)
> + *nr_saved_scan = 0;
> + else
> + nr = 0;
> +
> + return nr;
> +}
> +
> +/*
> * Determine how aggressively the anon and file LRU lists should be
> * scanned. The relative value of each set of LRU lists is determined
> * by looking at the fraction of the pages scanned we did rotate back
> * onto the active list instead of evict.
> *
> - * percent[0] specifies how much pressure to put on ram/swap backed
> - * memory, while percent[1] determines pressure on the file LRUs.
> + * nr[x] specifies how many pages should be scaned

The new comment loses information...

> */
> -static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
> - unsigned long *percent)
> +static void get_scan_count(struct zone *zone, struct scan_control *sc,
> + unsigned long *nr, int priority)
> {
> unsigned long anon, file, free;
> unsigned long anon_prio, file_prio;
> unsigned long ap, fp;
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> + unsigned long fraction[2], denominator[2];

denominator[2] can be reduced to a single denominator,
because denominator[0] == denominator[1] always holds.
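
A minimal sketch of that simplification (illustration only; the updated patch
later in the thread adopts this shape, with u64 to avoid overflow):

	unsigned long fraction[2], denominator;
	...
	fraction[0] = ap;
	fraction[1] = fp;
	denominator = ap + fp + 1;
	...
	scan = scan * fraction[file] / denominator;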

> + enum lru_list l;
>
> /* If we have no swap space, do not bother scanning anon pages. */
> if (!sc->may_swap || (nr_swap_pages <= 0)) {
> - percent[0] = 0;
> - percent[1] = 100;
> - return;
> + fraction[0] = 0;
> + denominator[0] = 1;
> + fraction[1] = 1;
> + denominator[1] = 1;
> + goto out;
> }
>
> anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
> @@ -1552,9 +1575,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
> /* If we have very few page cache pages,
> force-scan anon pages. */
> if (unlikely(file + free <= high_wmark_pages(zone))) {
> - percent[0] = 100;
> - percent[1] = 0;
> - return;
> + fraction[0] = 1;
> + denominator[0] = 1;
> + fraction[1] = 0;
> + denominator[1] = 1;
> + goto out;
> }
> }
>
> @@ -1601,29 +1626,29 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
> fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
> fp /= reclaim_stat->recent_rotated[1] + 1;
>
> - /* Normalize to percentages */
> - percent[0] = 100 * ap / (ap + fp + 1);
> - percent[1] = 100 - percent[0];
> -}
> -
> -/*
> - * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
> - * until we collected @swap_cluster_max pages to scan.
> - */
> -static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
> - unsigned long *nr_saved_scan)
> -{
> - unsigned long nr;
> + fraction[0] = ap;
> + denominator[0] = ap + fp + 1;
> + fraction[1] = fp;
> + denominator[1] = ap + fp + 1;
>
> - *nr_saved_scan += nr_to_scan;
> - nr = *nr_saved_scan;
> +out:
> + for_each_evictable_lru(l) {
> + int file = is_file_lru(l);
> + unsigned long scan;
>
> - if (nr >= SWAP_CLUSTER_MAX)
> - *nr_saved_scan = 0;
> - else
> - nr = 0;
> + if (fraction[file] == 0) {
> + nr[l] = 0;
> + continue;
> + }
>
> - return nr;
> + scan = zone_nr_lru_pages(zone, sc, l);
> + if (priority) {
> + scan >>= priority;
> + scan = (scan * fraction[file] / denominator[file]);

The "()" is not necessary here; or better, end the line here^

Thanks,
Fengguang

> + }
> + nr[l] = nr_scan_try_batch(scan,
> + &reclaim_stat->nr_saved_scan[l]);
> + }
> }
>
> /*
> @@ -1634,31 +1659,11 @@ static void shrink_zone(int priority, struct zone *zone,
> {
> unsigned long nr[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> - unsigned long percent[2]; /* anon @ 0; file @ 1 */
> enum lru_list l;
> unsigned long nr_reclaimed = sc->nr_reclaimed;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> - struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> -
> - get_scan_ratio(zone, sc, percent);
>
> - for_each_evictable_lru(l) {
> - int file = is_file_lru(l);
> - unsigned long scan;
> -
> - if (percent[file] == 0) {
> - nr[l] = 0;
> - continue;
> - }
> -
> - scan = zone_nr_lru_pages(zone, sc, l);
> - if (priority) {
> - scan >>= priority;
> - scan = (scan * percent[file]) / 100;
> - }
> - nr[l] = nr_scan_try_batch(scan,
> - &reclaim_stat->nr_saved_scan[l]);
> - }
> + get_scan_count(zone, sc, nr, priority);
>
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {

2010-04-06 01:25:42

by Shaohua Li

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Sun, Apr 04, 2010 at 10:19:06PM +0800, KOSAKI Motohiro wrote:
> > On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > > > <1% seems no good reclaim rate.
> > > > >
> > > > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > > > because under streaming io workload, the number of scanning anon pages should
> > > > > be zero. this is very strong requirement. if not, backup operation will makes
> > > > > a lot of swapping out.
> > > > Sounds there is no big impact for the workload which you mentioned with the patch.
> > > > please see below descriptions.
> > > > I updated the description of the patch as fengguang suggested.
> > >
> > > Umm.. sorry, no.
> > >
> > > "one fix but introduce another one bug" is not good deal. instead,
> > > I'll revert the guilty commit at first as akpm mentioned.
> > Even we revert the commit, the patch still has its benefit, as it increases
> > calculation precision, right?
>
> no, you shouldn't ignore the regression case.
I don't think this is serious. By my calculation, there is only 1 page swapped out
for 6G of anonymous memory. One page shouldn't have any performance impact.

Thanks,
Shaohua

2010-04-06 01:27:50

by Shaohua Li

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Sun, Apr 04, 2010 at 08:48:38AM +0800, Wu, Fengguang wrote:
> On Fri, Apr 02, 2010 at 02:50:52PM +0800, Li, Shaohua wrote:
> > On Wed, Mar 31, 2010 at 01:53:27PM +0800, KOSAKI Motohiro wrote:
> > > > > On Tue, Mar 30, 2010 at 02:08:53PM +0800, KOSAKI Motohiro wrote:
> > > > > > Hi
> > > > > >
> > > > > > > Commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26 introduces a regression.
> > > > > > > With it, our tmpfs test always oom. The test has a lot of rotated anon
> > > > > > > pages and cause percent[0] zero. Actually the percent[0] is a very small
> > > > > > > value, but our calculation round it to zero. The commit makes vmscan
> > > > > > > completely skip anon pages and cause oops.
> > > > > > > An option is if percent[x] is zero in get_scan_ratio(), forces it
> > > > > > > to 1. See below patch.
> > > > > > > But the offending commit still changes behavior. Without the commit, we scan
> > > > > > > all pages if priority is zero, below patch doesn't fix this. Don't know if
> > > > > > > It's required to fix this too.
> > > > > >
> > > > > > Can you please post your /proc/meminfo and reproduce program? I'll digg it.
> > > > > >
> > > > > > Very unfortunately, this patch isn't acceptable. In past time, vmscan
> > > > > > had similar logic, but 1% swap-out made lots bug reports.
> > > > > if 1% is still big, how about below patch?
> > > >
> > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > <1% seems no good reclaim rate.
> > >
> > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > because under streaming io workload, the number of scanning anon pages should
> > > be zero. this is very strong requirement. if not, backup operation will makes
> > > a lot of swapping out.
> > Sounds there is no big impact for the workload which you mentioned with the patch.
> > please see below descriptions.
> > I updated the description of the patch as fengguang suggested.
> >
> >
> >
> > Commit 84b18490d introduces a regression. With it, our tmpfs test always oom.
> > The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there are
> > 6 copies of kernel source and the test does kbuild for each copy. My
> > investigation shows the test has a lot of rotated anon pages and quite few
> > file pages, so get_scan_ratio calculates percent[0] to be zero. Actually
> > the percent[0] shoule be a very small value, but our calculation round it
> > to zero. The commit makes vmscan completely skip anon pages and cause oops.
> >
> > To avoid underflow, we don't use percentage, instead we directly calculate
> > how many pages should be scaned. In this way, we should get several scan pages
> > for < 1% percent. With this fix, my test doesn't oom any more.
> >
> > Note, this patch doesn't really change logics, but just increase precise. For
> > system with a lot of memory, this might slightly changes behavior. For example,
> > in a sequential file read workload, without the patch, we don't swap any anon
> > pages. With it, if anon memory size is bigger than 16G, we will say one anon page
>
> see?
Thanks, I will send an updated version against -mm since we reverted the offending patch.

Thanks,
Shaohua

2010-04-06 01:36:40

by KOSAKI Motohiro

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

> On Sun, Apr 04, 2010 at 10:19:06PM +0800, KOSAKI Motohiro wrote:
> > > On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > > > > <1% seems no good reclaim rate.
> > > > > >
> > > > > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > > > > because under streaming io workload, the number of scanning anon pages should
> > > > > > be zero. this is very strong requirement. if not, backup operation will makes
> > > > > > a lot of swapping out.
> > > > > Sounds there is no big impact for the workload which you mentioned with the patch.
> > > > > please see below descriptions.
> > > > > I updated the description of the patch as fengguang suggested.
> > > >
> > > > Umm.. sorry, no.
> > > >
> > > > "one fix but introduce another one bug" is not good deal. instead,
> > > > I'll revert the guilty commit at first as akpm mentioned.
> > > Even we revert the commit, the patch still has its benefit, as it increases
> > > calculation precision, right?
> >
> > no, you shouldn't ignore the regression case.
> I don't think this is serious. In my calculation, there is only 1 page swapped out
> for 6G anonmous memory. 1 page should haven't any performance impact.

There is. I have received exactly the opposite claim, because shrink_zone()
is not called only once; it is called many times.



2010-04-06 01:50:55

by Fengguang Wu

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Tue, Apr 06, 2010 at 09:25:36AM +0800, Li, Shaohua wrote:
> On Sun, Apr 04, 2010 at 10:19:06PM +0800, KOSAKI Motohiro wrote:
> > > On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > > > > <1% seems no good reclaim rate.
> > > > > >
> > > > > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > > > > because under streaming io workload, the number of scanning anon pages should
> > > > > > be zero. this is very strong requirement. if not, backup operation will makes
> > > > > > a lot of swapping out.
> > > > > Sounds there is no big impact for the workload which you mentioned with the patch.
> > > > > please see below descriptions.
> > > > > I updated the description of the patch as fengguang suggested.
> > > >
> > > > Umm.. sorry, no.
> > > >
> > > > "one fix but introduce another one bug" is not good deal. instead,
> > > > I'll revert the guilty commit at first as akpm mentioned.
> > > Even we revert the commit, the patch still has its benefit, as it increases
> > > calculation precision, right?
> >
> > no, you shouldn't ignore the regression case.

> I don't think this is serious. In my calculation, there is only 1 page swapped out
> for 6G anonmous memory. 1 page should haven't any performance impact.

1 anon page scanned for every N file pages scanned?

Is N a _huge_ enough ratio so that the anon list will be very lightly scanned?

Rik: here is a little background.

Under streaming IO, the current get_scan_ratio() will compute a percent[0]
that is (much) less than 1, which underflows to 0.
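
A tiny demonstration of that rounding, plugging made-up ap/fp values into the
percent[0] formula from get_scan_ratio():

#include <stdio.h>

int main(void)
{
	unsigned long ap = 1, fp = 1000;	/* streaming IO: file scans dominate */

	/* percent[0] = 100 * ap / (ap + fp + 1), as in get_scan_ratio() */
	unsigned long percent0 = 100 * ap / (ap + fp + 1);

	printf("percent[0] = %lu\n", percent0);	/* prints 0 */
	return 0;
}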

It has the bad effect of completely disabling scanning of the anon list,
which leads to OOM in Shaohua's test case. OTOH, it also has the good
side effect of keeping anon pages in memory and totally preventing swap
IO.

Shaohua's patch improves the computation precision by computing nr[]
directly in get_scan_ratio(). This is good in general; however, it will
enable light scanning of the anon list under streaming IO.

Thanks,
Fengguang

2010-04-06 02:06:28

by KOSAKI Motohiro

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

> On Tue, Apr 06, 2010 at 09:25:36AM +0800, Li, Shaohua wrote:
> > On Sun, Apr 04, 2010 at 10:19:06PM +0800, KOSAKI Motohiro wrote:
> > > > On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > > > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > > > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > > > > > <1% seems no good reclaim rate.
> > > > > > >
> > > > > > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > > > > > because under streaming io workload, the number of scanning anon pages should
> > > > > > > be zero. this is very strong requirement. if not, backup operation will makes
> > > > > > > a lot of swapping out.
> > > > > > Sounds there is no big impact for the workload which you mentioned with the patch.
> > > > > > please see below descriptions.
> > > > > > I updated the description of the patch as fengguang suggested.
> > > > >
> > > > > Umm.. sorry, no.
> > > > >
> > > > > "one fix but introduce another one bug" is not good deal. instead,
> > > > > I'll revert the guilty commit at first as akpm mentioned.
> > > > Even we revert the commit, the patch still has its benefit, as it increases
> > > > calculation precision, right?
> > >
> > > no, you shouldn't ignore the regression case.
>
> > I don't think this is serious. In my calculation, there is only 1 page swapped out
> > for 6G anonmous memory. 1 page should haven't any performance impact.
>
> 1 anon page scanned for every N file pages scanned?
>
> Is N a _huge_ enough ratio so that the anon list will be very light scanned?
>
> Rik: here is a little background.

The problem is, the VM is continuously discarding no-longer-used file
cache. If we scan one extra anon page, we will observe tons of swap usage
after a few days.

Please don't think only about benchmarks.


> Under streaming IO, the current get_scan_ratio() will get a percent[0]
> that is (much) less than 1, so underflows to 0.
>
> It has the bad effect of completely disabling the scan of anon list,
> which leads to OOM in Shaohua's test case. OTOH, it also has the good
> side effect of keeping anon pages in memory and totally prevent swap
> IO.
>
> Shaohua's patch improves the computation precision by computing nr[]
> directly in get_scan_ratio(). This is good in general, however will
> enable light scan of the anon list on streaming IO.

In such a case, percent[0] should be big; I think underflow is not the point.
His test case is merely a streaming IO copy, so why can't we drop the tmpfs cached
pages? His /proc/meminfo shows his machine didn't have droppable file cache,
so a big percent[1] value seems to make no sense. No?

I'm not sure we need either of the detections below. I need to investigate more.
1) detect no discardable file cache
2) detect streaming IO on tmpfs (as a regular file)



2010-04-06 02:30:51

by Fengguang Wu

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Tue, Apr 06, 2010 at 10:06:19AM +0800, KOSAKI Motohiro wrote:
> > On Tue, Apr 06, 2010 at 09:25:36AM +0800, Li, Shaohua wrote:
> > > On Sun, Apr 04, 2010 at 10:19:06PM +0800, KOSAKI Motohiro wrote:
> > > > > On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > > > > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > > > > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > > > > > > <1% seems no good reclaim rate.
> > > > > > > >
> > > > > > > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > > > > > > because under streaming io workload, the number of scanning anon pages should
> > > > > > > > be zero. this is very strong requirement. if not, backup operation will makes
> > > > > > > > a lot of swapping out.
> > > > > > > Sounds there is no big impact for the workload which you mentioned with the patch.
> > > > > > > please see below descriptions.
> > > > > > > I updated the description of the patch as fengguang suggested.
> > > > > >
> > > > > > Umm.. sorry, no.
> > > > > >
> > > > > > "one fix but introduce another one bug" is not good deal. instead,
> > > > > > I'll revert the guilty commit at first as akpm mentioned.
> > > > > Even we revert the commit, the patch still has its benefit, as it increases
> > > > > calculation precision, right?
> > > >
> > > > no, you shouldn't ignore the regression case.
> >
> > > I don't think this is serious. In my calculation, there is only 1 page swapped out
> > > for 6G anonmous memory. 1 page should haven't any performance impact.
> >
> > 1 anon page scanned for every N file pages scanned?
> >
> > Is N a _huge_ enough ratio so that the anon list will be very light scanned?
> >
> > Rik: here is a little background.
>
> The problem is, the VM are couteniously discarding no longer used file
> cache. if we are scan extra anon 1 page, we will observe tons swap usage
> after few days.
>
> please don't only think benchmark.

OK, the days-of-streaming-IO scenario typically happens on file servers. Suppose
a file server with 16GB memory, 1GB of which is consumed by anonymous
pages; the rest is page cache.

Assume that the exact file:anon ratio computed by the get_scan_ratio()
algorithm is 1000:1. In that case percent[0]=0.1 and is rounded down
to 0, which keeps the anon pages in memory for those few days.

Now with Shaohua's patch, nr[0] = (262144/4096)/1000 = 0.06 will also
be rounded down to 0. It only becomes >= 1 when
- reclaim runs into trouble and priority goes low
- the anon list grows huge

So I guess Shaohua's patch still has a reasonable "underflow" threshold :)

Thanks,
Fengguang

>
> > Under streaming IO, the current get_scan_ratio() will get a percent[0]
> > that is (much) less than 1, so underflows to 0.
> >
> > It has the bad effect of completely disabling the scan of anon list,
> > which leads to OOM in Shaohua's test case. OTOH, it also has the good
> > side effect of keeping anon pages in memory and totally prevent swap
> > IO.
> >
> > Shaohua's patch improves the computation precision by computing nr[]
> > directly in get_scan_ratio(). This is good in general, however will
> > enable light scan of the anon list on streaming IO.
>
> In such case, percent[0] should be big. I think underflowing is not point.
> His test case is merely streaming io copy, why can't we drop tmpfs cached
> page? his /proc/meminfo describe his machine didn't have droppable file cache.
> so, big percent[1] value seems makes no sense. no?
>
> I'm not sure we need either below detection. I need more investigate.
> 1) detect no discardable file cache
> 2) detect streaming io on tmpfs (as regular file)
>
>
>

2010-04-06 02:58:54

by KOSAKI Motohiro

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

> On Tue, Apr 06, 2010 at 10:06:19AM +0800, KOSAKI Motohiro wrote:
> > > On Tue, Apr 06, 2010 at 09:25:36AM +0800, Li, Shaohua wrote:
> > > > On Sun, Apr 04, 2010 at 10:19:06PM +0800, KOSAKI Motohiro wrote:
> > > > > > On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > > > > > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > > > > > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > > > > > > > <1% seems no good reclaim rate.
> > > > > > > > >
> > > > > > > > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > > > > > > > because under streaming io workload, the number of scanning anon pages should
> > > > > > > > > be zero. this is very strong requirement. if not, backup operation will makes
> > > > > > > > > a lot of swapping out.
> > > > > > > > Sounds there is no big impact for the workload which you mentioned with the patch.
> > > > > > > > please see below descriptions.
> > > > > > > > I updated the description of the patch as fengguang suggested.
> > > > > > >
> > > > > > > Umm.. sorry, no.
> > > > > > >
> > > > > > > "one fix but introduce another one bug" is not good deal. instead,
> > > > > > > I'll revert the guilty commit at first as akpm mentioned.
> > > > > > Even we revert the commit, the patch still has its benefit, as it increases
> > > > > > calculation precision, right?
> > > > >
> > > > > no, you shouldn't ignore the regression case.
> > >
> > > > I don't think this is serious. In my calculation, there is only 1 page swapped out
> > > > for 6G anonmous memory. 1 page should haven't any performance impact.
> > >
> > > 1 anon page scanned for every N file pages scanned?
> > >
> > > Is N a _huge_ enough ratio so that the anon list will be very light scanned?
> > >
> > > Rik: here is a little background.
> >
> > The problem is, the VM are couteniously discarding no longer used file
> > cache. if we are scan extra anon 1 page, we will observe tons swap usage
> > after few days.
> >
> > please don't only think benchmark.
>
> OK the days-of-streaming-io typically happen in file servers. Suppose
> a file server with 16GB memory, 1GB of which is consumed by anonymous
> pages, others are for page cache.
>
> Assume that the exact file:anon ratio computed by the get_scan_ratio()
> algorithm is 1000:1. In that case percent[0]=0.1 and is rounded down
> to 0, which keeps the anon pages in memory for the few days.
>
> Now with Shaohua's patch, nr[0] = (262144/4096)/1000 = 0.06 will also
> be rounded down to 0. It only becomes >=1 when
> - reclaim runs into trouble and priority goes low
> - anon list goes huge
>
> So I guess Shaohua's patch still has reasonable "underflow" threshold :)

Again, I didn't say his patch has no worth. I only said we shouldn't
ignore the downside.

2010-04-06 03:31:27

by Fengguang Wu

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Tue, Apr 06, 2010 at 10:58:43AM +0800, KOSAKI Motohiro wrote:
> > On Tue, Apr 06, 2010 at 10:06:19AM +0800, KOSAKI Motohiro wrote:
> > > > On Tue, Apr 06, 2010 at 09:25:36AM +0800, Li, Shaohua wrote:
> > > > > On Sun, Apr 04, 2010 at 10:19:06PM +0800, KOSAKI Motohiro wrote:
> > > > > > > On Fri, Apr 02, 2010 at 05:14:38PM +0800, KOSAKI Motohiro wrote:
> > > > > > > > > > > This patch makes a lot of sense than previous. however I think <1% anon ratio
> > > > > > > > > > > shouldn't happen anyway because file lru doesn't have reclaimable pages.
> > > > > > > > > > > <1% seems no good reclaim rate.
> > > > > > > > > >
> > > > > > > > > > Oops, the above mention is wrong. sorry. only 1 page is still too big.
> > > > > > > > > > because under streaming io workload, the number of scanning anon pages should
> > > > > > > > > > be zero. this is very strong requirement. if not, backup operation will makes
> > > > > > > > > > a lot of swapping out.
> > > > > > > > > Sounds there is no big impact for the workload which you mentioned with the patch.
> > > > > > > > > please see below descriptions.
> > > > > > > > > I updated the description of the patch as fengguang suggested.
> > > > > > > >
> > > > > > > > Umm.. sorry, no.
> > > > > > > >
> > > > > > > > "one fix but introduce another one bug" is not good deal. instead,
> > > > > > > > I'll revert the guilty commit at first as akpm mentioned.
> > > > > > > Even we revert the commit, the patch still has its benefit, as it increases
> > > > > > > calculation precision, right?
> > > > > >
> > > > > > no, you shouldn't ignore the regression case.
> > > >
> > > > > I don't think this is serious. In my calculation, there is only 1 page swapped out
> > > > > for 6G anonmous memory. 1 page should haven't any performance impact.
> > > >
> > > > 1 anon page scanned for every N file pages scanned?
> > > >
> > > > Is N a _huge_ enough ratio so that the anon list will be very light scanned?
> > > >
> > > > Rik: here is a little background.
> > >
> > > The problem is, the VM are couteniously discarding no longer used file
> > > cache. if we are scan extra anon 1 page, we will observe tons swap usage
> > > after few days.
> > >
> > > please don't only think benchmark.
> >
> > OK the days-of-streaming-io typically happen in file servers. Suppose
> > a file server with 16GB memory, 1GB of which is consumed by anonymous
> > pages, others are for page cache.
> >
> > Assume that the exact file:anon ratio computed by the get_scan_ratio()
> > algorithm is 1000:1. In that case percent[0]=0.1 and is rounded down
> > to 0, which keeps the anon pages in memory for the few days.
> >
> > Now with Shaohua's patch, nr[0] = (262144/4096)/1000 = 0.06 will also
> > be rounded down to 0. It only becomes >=1 when
> > - reclaim runs into trouble and priority goes low
> > - anon list goes huge
> >
> > So I guess Shaohua's patch still has reasonable "underflow" threshold :)
>
> Again, I didn't said his patch is no worth. I only said we don't have to
> ignore the downside.

Right, we should document both the upside and downside.

The main difference happens when file:anon scan ratio > 100:1.

For the current percent[]-based computing, percent[0]=0, hence nr[0]=0,
which disables anon list scanning unconditionally, for good or for bad.

For the direct nr[] computing,
- nr[0] will be 0 for typical file servers, because with priority=12
and anon lru size < 1.6GB, nr[0] = (anon_size/4096)/100 < 1
- nr[0] will be non-zero when priority=1 and anon_size > 100 pages;
this stops the OOM in Shaohua's test case, however it may not be enough to
guarantee safety (your previous reverting patch can provide that
guarantee).
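
Working the two thresholds out explicitly (a sketch assuming 4K pages and a
100:1 file:anon scan ratio; note the exact priority=1 cutoff comes to 200
pages rather than 100 because of the priority shift):

#include <stdio.h>

int main(void)
{
	unsigned long ratio = 100;	/* file:anon scan ratio */

	/* priority=12: nr[0] >= 1 needs (anon_pages >> 12) / ratio >= 1 */
	unsigned long pages12 = (1UL << 12) * ratio;
	/* priority=1: nr[0] >= 1 needs (anon_pages >> 1) / ratio >= 1 */
	unsigned long pages1 = (1UL << 1) * ratio;

	printf("priority 12: %lu pages (~%lu MB)\n",
	       pages12, pages12 * 4096UL >> 20);	/* ~1.6 GB */
	printf("priority  1: %lu pages (%lu KB)\n",
	       pages1, pages1 * 4096UL >> 10);		/* 800 KB */
	return 0;
}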

I liked Shaohua's patch a lot -- it adapts well to both the
file-server case and the mostly-anon-pages case :)

Thanks,
Fengguang

2010-04-06 03:41:10

by Rik van Riel

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On 04/05/2010 11:31 PM, Wu Fengguang wrote:
> On Tue, Apr 06, 2010 at 10:58:43AM +0800, KOSAKI Motohiro wrote:
>> Again, I didn't said his patch is no worth. I only said we don't have to
>> ignore the downside.
>
> Right, we should document both the upside and downside.

The downside is obvious: streaming IO (used once data
that does not fit in the cache) can push out data that
is used more often - requiring that it be swapped in
at a later point in time.

I understand what Shaohua's patch does, but I do not
understand the upside. What good does it do to increase
the size of the cache for streaming IO data, which is
generally touched only once?

What kind of performance benefits can we get by doing
that?

> The main difference happens when file:anon scan ratio> 100:1.
>
> For the current percent[] based computing, percent[0]=0 hence nr[0]=0
> which disables anon list scan unconditionally, for good or for bad.
>
> For the direct nr[] computing,
> - nr[0] will be 0 for typical file servers, because with priority=12
> and anon lru size< 1.6GB, nr[0] = (anon_size/4096)/100< 0
> - nr[0] will be non-zero when priority=1 and anon_size> 100 pages,
> this stops OOM for Shaohua's test case, however may not be enough to
> guarantee safety (your previous reverting patch can provide this
> guarantee).
>
> I liked Shaohua's patch a lot -- it adapts well to both the
> file-server case and the mostly-anon-pages case :)

2010-04-06 04:49:23

by Fengguang Wu

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Tue, Apr 06, 2010 at 11:40:47AM +0800, Rik van Riel wrote:
> On 04/05/2010 11:31 PM, Wu Fengguang wrote:
> > On Tue, Apr 06, 2010 at 10:58:43AM +0800, KOSAKI Motohiro wrote:
> >> Again, I didn't said his patch is no worth. I only said we don't have to
> >> ignore the downside.
> >
> > Right, we should document both the upside and downside.
>
> The downside is obvious: streaming IO (used once data
> that does not fit in the cache) can push out data that
> is used more often - requiring that it be swapped in
> at a later point in time.
>
> I understand what Shaohua's patch does, but I do not
> understand the upside. What good does it do to increase
> the size of the cache for streaming IO data, which is
> generally touched only once?

Not that bad :) With Shaohua's patch the anon list will typically
_never_ get scanned, just like before.

If it's mostly use-once IO, file:anon will be 1000 or even 10000, and
priority=12. Then only anon lists larger than 16GB or 160GB will get
nr[0] >= 1.

> What kind of performance benefits can we get by doing
> that?

So vmscan behavior and performance remain the same as before.

For a really large anon list, such a workload is beyond our imagination,
so we cannot assert that "don't scan the anon list" will be a benefit.

On the other hand, in the test case of "do streaming IO while most memory is
occupied by tmpfs pages", it is very bad behavior to refuse to scan the anon
list in normal conditions and then suddenly scan _the whole_ anon list when
priority hits 0. Shaohua's patch helps by gradually increasing the
scan count of the anon list as memory pressure increases.
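
To see the ramp-up, here is the per-round anon scan target across priorities
(a sketch mirroring the scan computation in the patch, assuming 3GB of anon
memory in 4K pages and an anon fraction of 1/1000):

#include <stdio.h>

int main(void)
{
	unsigned long anon = (3UL << 30) >> 12;	/* 3GB of anon in 4K pages */
	int priority;

	for (priority = 12; priority >= 0; priority--) {
		unsigned long scan = anon;

		if (priority) {
			scan >>= priority;
			scan = scan * 1 / 1000;	/* fraction[0] / denominator */
		}
		printf("priority %2d: scan %lu\n", priority, scan);
	}
	return 0;
}

With these numbers the target stays 0 at priorities 12..10, grows gradually
(1, 3, 6, ... pages) as priority drops, and only becomes the whole list at
priority 0.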

Thanks,
Fengguang

> > The main difference happens when file:anon scan ratio> 100:1.
> >
> > For the current percent[] based computing, percent[0]=0 hence nr[0]=0
> > which disables anon list scan unconditionally, for good or for bad.
> >
> > For the direct nr[] computing,
> > - nr[0] will be 0 for typical file servers, because with priority=12
> > and anon lru size< 1.6GB, nr[0] = (anon_size/4096)/100< 0
> > - nr[0] will be non-zero when priority=1 and anon_size> 100 pages,
> > this stops OOM for Shaohua's test case, however may not be enough to
> > guarantee safety (your previous reverting patch can provide this
> > guarantee).
> >
> > I liked Shaohua's patch a lot -- it adapts well to both the
> > file-server case and the mostly-anon-pages case :)
>
>

2010-04-06 05:03:38

by Fengguang Wu

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

Shaohua,

> + scan = zone_nr_lru_pages(zone, sc, l);
> + if (priority) {
> + scan >>= priority;
> + scan = (scan * fraction[file] / denominator[file]);

Ah, the (scan * fraction[file]) may overflow on a 32-bit kernel!

Thanks,
Fengguang

2010-04-06 05:09:55

by Shaohua Li

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Tue, Apr 06, 2010 at 12:49:10PM +0800, Wu, Fengguang wrote:
> On Tue, Apr 06, 2010 at 11:40:47AM +0800, Rik van Riel wrote:
> > On 04/05/2010 11:31 PM, Wu Fengguang wrote:
> > > On Tue, Apr 06, 2010 at 10:58:43AM +0800, KOSAKI Motohiro wrote:
> > >> Again, I didn't said his patch is no worth. I only said we don't have to
> > >> ignore the downside.
> > >
> > > Right, we should document both the upside and downside.
> >
> > The downside is obvious: streaming IO (used once data
> > that does not fit in the cache) can push out data that
> > is used more often - requiring that it be swapped in
> > at a later point in time.
> >
> > I understand what Shaohua's patch does, but I do not
> > understand the upside. What good does it do to increase
> > the size of the cache for streaming IO data, which is
> > generally touched only once?
>
> Not that bad :) With Shaohua's patch the anon list will typically
> _never_ get scanned, just like before.
>
> If it's mostly use-once IO, file:anon will be 1000 or even 10000, and
> priority=12. Then only anon lists larger than 16GB or 160GB will get
> nr[0] >= 1.
>
> > What kind of performance benefits can we get by doing
> > that?
>
> So vmscan behavior and performance remain the same as before.
>
> For really large anon list, such workload is beyond our imagination.
> So we cannot assert "don't scan anon list" will be a benefit.
>
> On the other hand, in the test case of "do stream IO when most memory
> occupied by tmpfs pages", it is very bad behavior refuse to scan anon
> list in normal and suddenly start scanning _the whole_ anon list when
> priority hits 0. Shaohua's patch helps it by gradually increasing the
> scan nr of anon list as memory pressure increases.
Yep, the gradually increasing scan nr is the main advantage in my mind.

Thanks,
Shaohua
> > > The main difference happens when file:anon scan ratio> 100:1.
> > >
> > > For the current percent[] based computing, percent[0]=0 hence nr[0]=0
> > > which disables anon list scan unconditionally, for good or for bad.
> > >
> > > For the direct nr[] computing,
> > > - nr[0] will be 0 for typical file servers, because with priority=12
> > > and anon lru size< 1.6GB, nr[0] = (anon_size/4096)/100< 0
> > > - nr[0] will be non-zero when priority=1 and anon_size> 100 pages,
> > > this stops OOM for Shaohua's test case, however may not be enough to
> > > guarantee safety (your previous reverting patch can provide this
> > > guarantee).
> > >
> > > I liked Shaohua's patch a lot -- it adapts well to both the
> > > file-server case and the mostly-anon-pages case :)
> >
> >

2010-04-06 05:37:07

by Shaohua Li

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Tue, Apr 06, 2010 at 01:03:25PM +0800, Wu, Fengguang wrote:
> Shaohua,
>
> > + scan = zone_nr_lru_pages(zone, sc, l);
> > + if (priority) {
> > + scan >>= priority;
> > + scan = (scan * fraction[file] / denominator[file]);
>
> Ah, the (scan * fraction[file]) may overflow in 32bit kernel!
Good catch, I will change it to u64.
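
The overflow is easy to reproduce in isolation (a sketch with made-up values;
on 32-bit, unsigned long is 32 bits, so the product wraps):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* e.g. a large LRU scan target at low priority times a large ap */
	uint32_t scan = 1 << 20, fraction = 1 << 20;

	uint32_t wrapped = scan * fraction;		/* 2^40 wraps to 0 */
	uint64_t correct = (uint64_t)scan * fraction;	/* widen first */

	printf("32-bit: %u, 64-bit: %llu\n", wrapped,
	       (unsigned long long)correct);
	return 0;
}

This is why the updated patch below switches fraction[] to u64 and divides
with div64_u64().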

2010-04-09 06:51:12

by Shaohua Li

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Tue, Apr 06, 2010 at 01:03:25PM +0800, Wu, Fengguang wrote:
> Shaohua,
>
> > + scan = zone_nr_lru_pages(zone, sc, l);
> > + if (priority) {
> > + scan >>= priority;
> > + scan = (scan * fraction[file] / denominator[file]);
>
> Ah, the (scan * fraction[file]) may overflow in 32bit kernel!
I updated the patch to address the previous issues.
Is it possible to put this into the -mm tree to see if anything weird happens?



get_scan_ratio() calculates a percentage, and if the percentage is < 1%, it
rounds the percentage down to 0% and causes us to completely ignore scanning
anon/file pages to reclaim memory, even when the total anon/file pages are very big.

To avoid the underflow, we don't use a percentage; instead we directly calculate
how many pages should be scanned. In this way, we should get several pages to scan
even for a < 1% ratio.

This has some benefits:
1. it increases our calculation precision
2. it makes our scanning smoother. Without this, if percent[x] underflows,
shrink_zone() doesn't scan any pages and then suddenly scans all pages when priority
is zero. With this, even when priority isn't zero, shrink_zone() gets a chance to scan
some pages.

Note, this patch doesn't really change the logic, it just increases precision. For
a system with a lot of memory, this might slightly change behavior. For example,
in a sequential file read workload, without the patch, we don't swap any anon
pages. With it, if anon memory size is bigger than 16G, we will see one anon page
swapped. The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap
is assumed to be 1024, which is common in this workload. So the impact sounds like
no big deal.

Signed-off-by: Shaohua Li <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Wu Fengguang <[email protected]>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..1070f83 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1519,21 +1519,52 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
}

/*
+ * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
+ * until we collected @swap_cluster_max pages to scan.
+ */
+static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
+ unsigned long *nr_saved_scan)
+{
+ unsigned long nr;
+
+ *nr_saved_scan += nr_to_scan;
+ nr = *nr_saved_scan;
+
+ if (nr >= SWAP_CLUSTER_MAX)
+ *nr_saved_scan = 0;
+ else
+ nr = 0;
+
+ return nr;
+}
+
+/*
* Determine how aggressively the anon and file LRU lists should be
* scanned. The relative value of each set of LRU lists is determined
* by looking at the fraction of the pages scanned we did rotate back
* onto the active list instead of evict.
*
- * percent[0] specifies how much pressure to put on ram/swap backed
- * memory, while percent[1] determines pressure on the file LRUs.
+ * nr[0] = anon pages to scan; nr[1] = file pages to scan
*/
-static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
- unsigned long *percent)
+static void get_scan_count(struct zone *zone, struct scan_control *sc,
+ unsigned long *nr, int priority)
{
unsigned long anon, file, free;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ u64 fraction[2], denominator;
+ enum lru_list l;
+ int noswap = 0;
+
+ /* If we have no swap space, do not bother scanning anon pages. */
+ if (!sc->may_swap || (nr_swap_pages <= 0)) {
+ noswap = 1;
+ fraction[0] = 0;
+ fraction[1] = 1;
+ denominator = 1;
+ goto out;
+ }

anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
@@ -1545,9 +1576,10 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
/* If we have very few page cache pages,
force-scan anon pages. */
if (unlikely(file + free <= high_wmark_pages(zone))) {
- percent[0] = 100;
- percent[1] = 0;
- return;
+ fraction[0] = 1;
+ fraction[1] = 0;
+ denominator = 1;
+ goto out;
}
}

@@ -1594,29 +1626,22 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;

- /* Normalize to percentages */
- percent[0] = 100 * ap / (ap + fp + 1);
- percent[1] = 100 - percent[0];
-}
-
-/*
- * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
- * until we collected @swap_cluster_max pages to scan.
- */
-static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
- unsigned long *nr_saved_scan)
-{
- unsigned long nr;
-
- *nr_saved_scan += nr_to_scan;
- nr = *nr_saved_scan;
-
- if (nr >= SWAP_CLUSTER_MAX)
- *nr_saved_scan = 0;
- else
- nr = 0;
+ fraction[0] = ap;
+ fraction[1] = fp;
+ denominator = ap + fp + 1;
+out:
+ for_each_evictable_lru(l) {
+ int file = is_file_lru(l);
+ unsigned long scan;

- return nr;
+ scan = zone_nr_lru_pages(zone, sc, l);
+ if (priority || noswap) {
+ scan >>= priority;
+ scan = div64_u64(scan * fraction[file], denominator);
+ }
+ nr[l] = nr_scan_try_batch(scan,
+ &reclaim_stat->nr_saved_scan[l]);
+ }
}

/*
@@ -1627,33 +1652,11 @@ static void shrink_zone(int priority, struct zone *zone,
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
- unsigned long percent[2]; /* anon @ 0; file @ 1 */
enum lru_list l;
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
- struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
- int noswap = 0;
-
- /* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || (nr_swap_pages <= 0)) {
- noswap = 1;
- percent[0] = 0;
- percent[1] = 100;
- } else
- get_scan_ratio(zone, sc, percent);

- for_each_evictable_lru(l) {
- int file = is_file_lru(l);
- unsigned long scan;
-
- scan = zone_nr_lru_pages(zone, sc, l);
- if (priority || noswap) {
- scan >>= priority;
- scan = (scan * percent[file]) / 100;
- }
- nr[l] = nr_scan_try_batch(scan,
- &reclaim_stat->nr_saved_scan[l]);
- }
+ get_scan_count(zone, sc, nr, priority);

while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {

2010-04-09 21:21:47

by Andrew Morton

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Fri, 9 Apr 2010 14:51:04 +0800
Shaohua Li <[email protected]> wrote:

> get_scan_ratio() calculates percentage and if the percentage is < 1%, it will
> round percentage down to 0% and cause we completely ignore scanning anon/file
> pages to reclaim memory even the total anon/file pages are very big.
>
> To avoid underflow, we don't use percentage, instead we directly calculate
> how many pages should be scaned. In this way, we should get several scanned pages
> for < 1% percent.
>
> This has some benefits:
> 1. increase our calculation precision
> 2. making our scan more smoothly. Without this, if percent[x] is underflow,
> shrink_zone() doesn't scan any pages and suddenly it scans all pages when priority
> is zero. With this, even priority isn't zero, shrink_zone() gets chance to scan
> some pages.
>
> Note, this patch doesn't really change logics, but just increase precision. For
> system with a lot of memory, this might slightly changes behavior. For example,
> in a sequential file read workload, without the patch, we don't swap any anon
> pages. With it, if anon memory size is bigger than 16G, we will see one anon page
> swapped. The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap
> is assumed to be 1024 which is common in this workload. So the impact sounds not
> a big deal.

I grabbed this.

Did we decide that this needed to be backported into 2.6.33.x? If so,
some words explaining the reasoning would be needed.

Come to that, it's not obvious that we need this in 2.6.34 either. What
is the user-visible impact here?

2010-04-09 21:25:58

by Rik van Riel

Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On 04/09/2010 05:20 PM, Andrew Morton wrote:

> Come to that, it's not obvious that we need this in 2.6.34 either. What
> is the user-visible impact here?

I suspect very little impact, especially during workloads
where we can just reclaim clean page cache at DEF_PRIORITY.
FWIW, the patch looks good to me, so:

Acked-by: Rik van Riel <[email protected]>

2010-04-12 01:57:59

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Sat, Apr 10, 2010 at 05:20:57AM +0800, Andrew Morton wrote:
> On Fri, 9 Apr 2010 14:51:04 +0800
> Shaohua Li <[email protected]> wrote:
>
> > get_scan_ratio() calculates a percentage, and if that percentage is < 1%, it
> > rounds the percentage down to 0%, which makes us completely skip scanning
> > anon/file pages for reclaim even when the total anon/file pages are very big.
> >
> > To avoid the underflow, we don't use a percentage; instead we directly
> > calculate how many pages should be scanned. In this way, we still get several
> > pages to scan even for ratios below 1%.
> >
> > This has some benefits:
> > 1. it increases our calculation precision
> > 2. it makes our scanning smoother. Without this, if percent[x] underflows,
> > shrink_zone() doesn't scan any pages and then suddenly scans all pages once
> > priority reaches zero. With this, shrink_zone() gets a chance to scan some
> > pages even when priority isn't zero.
> >
> > Note, this patch doesn't really change the logic, it just increases precision.
> > For a system with a lot of memory, this might slightly change behavior. For
> > example, in a sequential file read workload, without the patch we don't swap
> > any anon pages; with it, if the anon memory size is bigger than 16G, we will
> > see one anon page swapped. The 16G is calculated as PAGE_SIZE * priority(4096)
> > * (fp/ap), where fp/ap is assumed to be 1024, which is common in this
> > workload. So the impact doesn't sound like a big deal.
>
> I grabbed this.
>
> Did we decide that this needed to be backported into 2.6.33.x? If so,
> some words explaining the reasoning would be needed.
>
> Come to that, it's not obvious that we need this in 2.6.34 either.
Not needed.

> is the user-visible impact here?
Should be very small I think.

Thanks,
Shaohua

2010-04-13 01:30:38

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

> On 04/09/2010 05:20 PM, Andrew Morton wrote:
>
> > Come to that, it's not obvious that we need this in 2.6.34 either. What
> > is the user-visible impact here?
>
> I suspect very little impact, especially during workloads
> where we can just reclaim clean page cache at DEF_PRIORITY.
> FWIW, the patch looks good to me, so:
>
> Acked-by: Rik van Riel <[email protected]>
>

This ack surprises me a bit. Rik, do you have any plan to improve the
streaming io detection logic?
I think the patch has a rather marginal benefit: it helps the <1% scan
ratio case. But it has a big regression: it makes streaming io (e.g. a backup
operation) cause tons of swap.

So, I thought we should do one of the following:
1) drop this one
2) merge an improvement to the streaming io detection logic first, and
merge this one second.

Am I missing something?

2010-04-13 02:43:16

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On 04/12/2010 09:30 PM, KOSAKI Motohiro wrote:
>> On 04/09/2010 05:20 PM, Andrew Morton wrote:
>>
>>> Come to that, it's not obvious that we need this in 2.6.34 either. What
>>> is the user-visible impact here?
>>
>> I suspect very little impact, especially during workloads
>> where we can just reclaim clean page cache at DEF_PRIORITY.
>> FWIW, the patch looks good to me, so:
>>
>> Acked-by: Rik van Riel<[email protected]>
>>
>
> This ack surprises me a bit. Rik, do you have any plan to improve the
> streaming io detection logic?
> I think the patch has a rather marginal benefit: it helps the <1% scan
> ratio case. But it has a big regression: it makes streaming io (e.g. a backup
> operation) cause tons of swap.

How? From the description, I believe it takes 16GB in
a zone before we start scanning anon pages when
reclaiming at DEF_PRIORITY?

Would that cause a problem?

> So, I thought we should do one of the following:
> 1) drop this one
> 2) merge an improvement to the streaming io detection logic first, and
> merge this one second.

We may need better streaming IO detection, anyway.

I have noticed that while heavy sequential reads are fine,
the virtual machines on my desktop system do a lot of whole
block writes. Presumably, a lot of those writes are to the
same blocks, over and over again.

This causes the blocks to be promoted to the active file
list, which ends up growing the active file list to the
point where things from the working set get evicted.

All for file pages that may only get WRITTEN to by the
guests, because the guests cache their own copy whenever
they need to read them!

I'll have to check the page cache code to see if it
keeps frequently written pages as accessed. We may be
better off evicting frequently written pages, and
keeping our cache space for data that is read...

2010-04-13 07:55:22

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

> On 04/12/2010 09:30 PM, KOSAKI Motohiro wrote:
> >> On 04/09/2010 05:20 PM, Andrew Morton wrote:
> >>
> >>> Come to that, it's not obvious that we need this in 2.6.34 either. What
> >>> is the user-visible impact here?
> >>
> >> I suspect very little impact, especially during workloads
> >> where we can just reclaim clean page cache at DEF_PRIORITY.
> >> FWIW, the patch looks good to me, so:
> >>
> >> Acked-by: Rik van Riel<[email protected]>
> >>
> >
> > This ack surprises me a bit. Rik, do you have any plan to improve the
> > streaming io detection logic?
> > I think the patch has a rather marginal benefit: it helps the <1% scan
> > ratio case. But it has a big regression: it makes streaming io (e.g. a backup
> > operation) cause tons of swap.
>
> How? From the description, I believe it takes 16GB in
> a zone before we start scanning anon pages when
> reclaiming at DEF_PRIORITY?
>
> Would that cause a problem?

Please remember, 2.6.27 had the following +1 scanning modifier:

zone->nr_scan_active += (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
^^^^

and the early (and not yet merged) SplitLRU VM had a similar +1, like:

scan = zone_nr_lru_pages(zone, sc, l);
scan >>= priority;
scan = (scan * percent[file]) / 100 + 1;
^^^

We thought scanning only one page wouldn't be a big matter, but that was not
correct: we got streaming io bug reports. The above +1 caused annoying swap
io, because some servers run backup operations much, much bigger than
physical memory size.

For example, if the vm is dropping 1TB of use-once pages, a 0.1% anon scan
ratio makes for a 1GB scan. And most servers only have a few gigabytes of
swap even though they have >1TB of memory.
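(Checking that arithmetic: 0.1% of 1TB is 1TB / 1000 ~= 1GB of anon pages
scanned, which is already comparable to the total swap such machines
typically configure.)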

If my memory is wrong, please correct me.

My point is, whether it's greater or smaller than 16GB isn't the essence: any
patch's benefit should outweigh its downside. The description said "the impact
sounds not a big deal", and nobody disagrees; but the benefit is even smaller.
I can't see this patch improving anything.


>
> > So, I thought we sould do either,
> > 1) drop this one
> > 2) merge to change stream io detection logic improvement at first, and
> > merge this one at second.
>
> We may need better streaming IO detection, anyway.

Agreed, there's no doubt about that.


> I have noticed that while heavy sequential reads are fine,
> the virtual machines on my desktop system do a lot of whole
> block writes. Presumably, a lot of those writes are to the
> same blocks, over and over again.
>
> This causes the blocks to be promoted to the active file
> list, which ends up growing the active file list to the
> point where things from the working set get evicted.
>
> All for file pages that may only get WRITTEN to by the
> guests, because the guests cache their own copy whenever
> they need to read them!
>
> I'll have to check the page cache code to see if it
> keeps frequently written pages as accessed. We may be
> better off evicting frequently written pages, and
> keeping our cache space for data that is read...

One question: in such a case, doesn't your guest use DirectIO?
Or are you talking about guest VM behavior?

I guess inactive_file_is_low_global() could be improved a lot,
but I'm not sure.



2010-04-13 08:55:58

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

> > > This ack surprises me a bit. Rik, do you have any plan to improve the
> > > streaming io detection logic?
> > > I think the patch has a rather marginal benefit: it helps the <1% scan
> > > ratio case. But it has a big regression: it makes streaming io (e.g. a backup
> > > operation) cause tons of swap.
> >
> > How? From the description, I believe it takes 16GB in
> > a zone before we start scanning anon pages when
> > reclaiming at DEF_PRIORITY?
> >
> > Would that cause a problem?
>
> Please remember, 2.6.27 had the following +1 scanning modifier:
>
> zone->nr_scan_active += (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
> ^^^^
>
> and the early (and not yet merged) SplitLRU VM had a similar +1, like:
>
> scan = zone_nr_lru_pages(zone, sc, l);
> scan >>= priority;
> scan = (scan * percent[file]) / 100 + 1;
> ^^^
>
> We thought scanning only one page wouldn't be a big matter, but that was not
> correct: we got streaming io bug reports. The above +1 caused annoying swap
> io, because some servers run backup operations much, much bigger than
> physical memory size.
>
> For example, if the vm is dropping 1TB of use-once pages, a 0.1% anon scan
> ratio makes for a 1GB scan. And most servers only have a few gigabytes of
> swap even though they have >1TB of memory.
>
> If my memory is wrong, please correct me.
>
> My point is, whether it's greater or smaller than 16GB isn't the essence: any
> patch's benefit should outweigh its downside. The description said "the impact
> sounds not a big deal", and nobody disagrees; but the benefit is even smaller.
> I can't see this patch improving anything.

And now I've merged this patch into my local vmscan patch queue.
After the streaming io issue is solved, I'll push it to mainline.

Thanks.


2010-04-14 01:27:17

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

On Tue, Apr 13, 2010 at 04:55:52PM +0800, KOSAKI Motohiro wrote:
> > > > This ack surprises me a bit. Rik, do you have any plan to improve the
> > > > streaming io detection logic?
> > > > I think the patch has a rather marginal benefit: it helps the <1% scan
> > > > ratio case. But it has a big regression: it makes streaming io (e.g. a backup
> > > > operation) cause tons of swap.
> > >
> > > How? From the description, I believe it takes 16GB in
> > > a zone before we start scanning anon pages when
> > > reclaiming at DEF_PRIORITY?
> > >
> > > Would that cause a problem?
> >
> > Please remember, 2.6.27 had the following +1 scanning modifier:
> >
> > zone->nr_scan_active += (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
> > ^^^^
> >
> > and the early (and not yet merged) SplitLRU VM had a similar +1, like:
> >
> > scan = zone_nr_lru_pages(zone, sc, l);
> > scan >>= priority;
> > scan = (scan * percent[file]) / 100 + 1;
> > ^^^
> >
> > We thought scanning only one page wouldn't be a big matter, but that was not
> > correct: we got streaming io bug reports. The above +1 caused annoying swap
> > io, because some servers run backup operations much, much bigger than
> > physical memory size.
> >
> > For example, if the vm is dropping 1TB of use-once pages, a 0.1% anon scan
> > ratio makes for a 1GB scan. And most servers only have a few gigabytes of
> > swap even though they have >1TB of memory.
> >
> > If my memory is wrong, please correct me.
> >
> > My point is, whether it's greater or smaller than 16GB isn't the essence: any
> > patch's benefit should outweigh its downside. The description said "the impact
> > sounds not a big deal", and nobody disagrees; but the benefit is even smaller.
> > I can't see this patch improving anything.
>
> And now I've merged this patch into my local vmscan patch queue.
> After the streaming io issue is solved, I'll push it to mainline.
If the streaming io issue is the main concern, how about the patch below, on
top of my last one? It treats priority == DEF_PRIORITY as an exception.

Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c 2010-04-14 09:03:28.000000000 +0800
+++ linux/mm/vmscan.c 2010-04-14 09:19:56.000000000 +0800
@@ -1629,6 +1629,22 @@ static void get_scan_count(struct zone *
fraction[0] = ap;
fraction[1] = fp;
denominator = ap + fp + 1;
+
+ /*
+ * Memory pressure isn't high, so we allow the ratio to underflow.
+ * This avoids swapping in the streaming io case.
+ */
+ if (priority == DEF_PRIORITY) {
+ if (fraction[0] * 99 < fraction[1]) {
+ fraction[0] = 0;
+ fraction[1] = 1;
+ denominator = 1;
+ } else if (fraction[1] * 99 < fraction[0]) {
+ fraction[0] = 1;
+ fraction[1] = 0;
+ denominator = 1;
+ }
+ }
out:
for_each_evictable_lru(l) {
int file = is_file_lru(l);
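
To make the cutoff concrete with made-up numbers: if ap = 5 and fp = 1000,
then fraction[0] * 99 = 495 < 1000 = fraction[1], so at DEF_PRIORITY anon gets
a zero fraction and is not scanned at all; once memory pressure rises and
priority drops below DEF_PRIORITY, the unmodified ap:fp fractions apply again.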

2010-04-15 03:25:47

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH]vmscan: handle underflow for get_scan_ratio

Hi

Sorry for the delay.

> > After the streaming io issue is solved, I'll push it to mainline.
> If the streaming io issue is the main concern, how about the patch below, on
> top of my last one? It treats priority == DEF_PRIORITY as an exception.

Your patch seems to work, but it is obviously an ugly band-aid patch.
So I'd rather take your previous patch alone than combine it with this one,
though dropping both makes more sense than merging both.

Please consider attacking the root cause.



> Index: linux/mm/vmscan.c
> ===================================================================
> --- linux.orig/mm/vmscan.c 2010-04-14 09:03:28.000000000 +0800
> +++ linux/mm/vmscan.c 2010-04-14 09:19:56.000000000 +0800
> @@ -1629,6 +1629,22 @@ static void get_scan_count(struct zone *
> fraction[0] = ap;
> fraction[1] = fp;
> denominator = ap + fp + 1;
> +
> + /*
> + * Memory pressure isn't high, so we allow the ratio to underflow.
> + * This avoids swapping in the streaming io case.
> + */
> + if (priority == DEF_PRIORITY) {
> + if (fraction[0] * 99 < fraction[1]) {
> + fraction[0] = 0;
> + fraction[1] = 1;
> + denominator = 1;
> + } else if (fraction[1] * 99 < fraction[0]) {
> + fraction[0] = 1;
> + fraction[1] = 0;
> + denominator = 1;
> + }
> + }
> out:
> for_each_evictable_lru(l) {
> int file = is_file_lru(l);
>