When file refaults are detected and there are many inactive file pages,
the system never reclaims anonymous pages: the file pages are dropped
aggressively while there are still a lot of cold anonymous pages, and
the system thrashes. This issue impacts the performance of applications
with large executables, e.g. chrome.
When file refaults are detected, inactive_list_is_low() may return
different values depending on the actual_reclaim parameter, so the
following 2 conditions can be satisfied at the same time:
1) inactive_list_is_low() returns false in get_scan_count() to trigger
scanning file lists only.
2) inactive_list_is_low() returns true in shrink_list() to allow
scanning active file list.
In that case vmscan would only scan file lists, and as the active file
list is also scanned, inactive_list_is_low() may keep returning false in
get_scan_count() until the file cache is very low.
Before commit 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in
cache workingset transition"), inactive_list_is_low() never returned
different values in get_scan_count() and shrink_list() within one
shrink_node_memcg() run. The original design is that when
inactive_list_is_low() returns false for file lists, vmscan scans only
the inactive file list. As only the inactive file list is scanned,
inactive_list_is_low() would soon return true.
This patch makes the return value of inactive_list_is_low() independent
of actual_reclaim.
The problem can be reproduced by the following test program.
---8<---
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

void fallocate_file(const char *filename, off_t size)
{
struct stat st;
int fd;
if (!stat(filename, &st) && st.st_size >= size)
return;
fd = open(filename, O_WRONLY | O_CREAT, 0600);
if (fd < 0) {
perror("create file");
exit(1);
}
if (posix_fallocate(fd, 0, size)) {
perror("fallocate");
exit(1);
}
close(fd);
}
long *alloc_anon(long size)
{
long *start = malloc(size);
memset(start, 1, size);
return start;
}
long access_file(const char *filename, long size, long rounds)
{
int fd, i;
volatile char *start1, *end1, *start2;
const int page_size = getpagesize();
long sum = 0;
fd = open(filename, O_RDONLY);
if (fd == -1) {
perror("open");
exit(1);
}
/*
* Some applications, e.g. chrome, use a lot of executable file
* pages, map some of the pages with PROT_EXEC flag to simulate
* the behavior.
*/
start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
fd, 0);
if (start1 == MAP_FAILED) {
perror("mmap");
exit(1);
}
end1 = start1 + size / 2;
start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
if (start2 == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (i = 0; i < rounds; ++i) {
struct timeval before, after;
volatile char *ptr1 = start1, *ptr2 = start2;
gettimeofday(&before, NULL);
for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
sum += *ptr1 + *ptr2;
gettimeofday(&after, NULL);
printf("File access time, round %d: %f (sec)\n", i,
(after.tv_sec - before.tv_sec) +
(after.tv_usec - before.tv_usec) / 1000000.0);
}
return sum;
}
int main(int argc, char *argv[])
{
const long MB = 1024 * 1024;
long anon_mb, file_mb, file_rounds;
const char filename[] = "large";
long *ret1;
long ret2;
if (argc != 4) {
printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
exit(0);
}
anon_mb = atoi(argv[1]);
file_mb = atoi(argv[2]);
file_rounds = atoi(argv[3]);
fallocate_file(filename, file_mb * MB);
printf("Allocate %ld MB anonymous pages\n", anon_mb);
ret1 = alloc_anon(anon_mb * MB);
printf("Access %ld MB file pages\n", file_mb);
ret2 = access_file(filename, file_mb * MB, file_rounds);
printf("Print result to prevent optimization: %ld\n",
*ret1 + ret2);
return 0;
}
---8<---
Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the
program fills RAM with 2048 MB of anonymous memory and accesses a 200 MB
file 10 times. Without this patch, the file cache is dropped aggressively
and every access to the file is served from disk.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.489316 (sec)
File access time, round 1: 2.581277 (sec)
File access time, round 2: 2.487624 (sec)
File access time, round 3: 2.449100 (sec)
File access time, round 4: 2.420423 (sec)
File access time, round 5: 2.343411 (sec)
File access time, round 6: 2.454833 (sec)
File access time, round 7: 2.483398 (sec)
File access time, round 8: 2.572701 (sec)
File access time, round 9: 2.493014 (sec)
With this patch, these file pages can be cached.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.475189 (sec)
File access time, round 1: 2.440777 (sec)
File access time, round 2: 2.411671 (sec)
File access time, round 3: 1.955267 (sec)
File access time, round 4: 0.029924 (sec)
File access time, round 5: 0.000808 (sec)
File access time, round 6: 0.000771 (sec)
File access time, round 7: 0.000746 (sec)
File access time, round 8: 0.000738 (sec)
File access time, round 9: 0.000747 (sec)
Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
Signed-off-by: Kuo-Hsin Yang <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7889f583ced9f..b95d05fe828d1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2151,7 +2151,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
* rid of the stale workingset quickly.
*/
refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
- if (file && actual_reclaim && lruvec->refaults != refaults) {
+ if (file && lruvec->refaults != refaults) {
inactive_ratio = 0;
} else {
gb = (inactive + active) >> (30 - PAGE_SHIFT);
--
2.22.0.410.gd8fdbe21b5-goog
Could we please get some review of this one? Johannes, it supposedly
fixes your patch?
I added cc:stable to this. Agreeable?
From: Kuo-Hsin Yang <[email protected]>
Subject: mm: vmscan: fix not scanning anonymous pages when detecting file refaults
When file refaults are detected and there are many inactive file pages,
the system never reclaims anonymous pages: the file pages are dropped
aggressively while there are still a lot of cold anonymous pages and the
system thrashes. This issue impacts the performance of applications with
large executables, e.g. chrome.
When file refaults are detected, inactive_list_is_low() may return
different values depending on the actual_reclaim parameter, so the
following 2 conditions can be satisfied at the same time:
1) inactive_list_is_low() returns false in get_scan_count() to trigger
scanning file lists only.
2) inactive_list_is_low() returns true in shrink_list() to allow
scanning active file list.
In that case vmscan would only scan file lists, and as the active file
list is also scanned, inactive_list_is_low() may keep returning false in
get_scan_count() until the file cache is very low.
Before 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache
workingset transition"), inactive_list_is_low() never returned different
values in get_scan_count() and shrink_list() within one
shrink_node_memcg() run. The original design is that when
inactive_list_is_low() returns false for file lists, vmscan scans only
the inactive file list. As only the inactive file list is scanned,
inactive_list_is_low() would soon return true.
This patch makes the return value of inactive_list_is_low() independent of
actual_reclaim.
The problem can be reproduced by the following test program.
---8<---
void fallocate_file(const char *filename, off_t size)
{
struct stat st;
int fd;
if (!stat(filename, &st) && st.st_size >= size)
return;
fd = open(filename, O_WRONLY | O_CREAT, 0600);
if (fd < 0) {
perror("create file");
exit(1);
}
if (posix_fallocate(fd, 0, size)) {
perror("fallocate");
exit(1);
}
close(fd);
}
long *alloc_anon(long size)
{
long *start = malloc(size);
memset(start, 1, size);
return start;
}
long access_file(const char *filename, long size, long rounds)
{
int fd, i;
volatile char *start1, *end1, *start2;
const int page_size = getpagesize();
long sum = 0;
fd = open(filename, O_RDONLY);
if (fd == -1) {
perror("open");
exit(1);
}
/*
* Some applications, e.g. chrome, use a lot of executable file
* pages, map some of the pages with PROT_EXEC flag to simulate
* the behavior.
*/
start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
fd, 0);
if (start1 == MAP_FAILED) {
perror("mmap");
exit(1);
}
end1 = start1 + size / 2;
start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
if (start2 == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (i = 0; i < rounds; ++i) {
struct timeval before, after;
volatile char *ptr1 = start1, *ptr2 = start2;
gettimeofday(&before, NULL);
for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
sum += *ptr1 + *ptr2;
gettimeofday(&after, NULL);
printf("File access time, round %d: %f (sec)\n", i,
(after.tv_sec - before.tv_sec) +
(after.tv_usec - before.tv_usec) / 1000000.0);
}
return sum;
}
int main(int argc, char *argv[])
{
const long MB = 1024 * 1024;
long anon_mb, file_mb, file_rounds;
const char filename[] = "large";
long *ret1;
long ret2;
if (argc != 4) {
printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
exit(0);
}
anon_mb = atoi(argv[1]);
file_mb = atoi(argv[2]);
file_rounds = atoi(argv[3]);
fallocate_file(filename, file_mb * MB);
printf("Allocate %ld MB anonymous pages\n", anon_mb);
ret1 = alloc_anon(anon_mb * MB);
printf("Access %ld MB file pages\n", file_mb);
ret2 = access_file(filename, file_mb * MB, file_rounds);
printf("Print result to prevent optimization: %ld\n",
*ret1 + ret2);
return 0;
}
---8<---
Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the
program fills RAM with 2048 MB of anonymous memory and accesses a 200 MB
file 10 times. Without this patch, the file cache is dropped aggressively
and every access to the file is served from disk.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.489316 (sec)
File access time, round 1: 2.581277 (sec)
File access time, round 2: 2.487624 (sec)
File access time, round 3: 2.449100 (sec)
File access time, round 4: 2.420423 (sec)
File access time, round 5: 2.343411 (sec)
File access time, round 6: 2.454833 (sec)
File access time, round 7: 2.483398 (sec)
File access time, round 8: 2.572701 (sec)
File access time, round 9: 2.493014 (sec)
With this patch, these file pages can be cached.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.475189 (sec)
File access time, round 1: 2.440777 (sec)
File access time, round 2: 2.411671 (sec)
File access time, round 3: 1.955267 (sec)
File access time, round 4: 0.029924 (sec)
File access time, round 5: 0.000808 (sec)
File access time, round 6: 0.000771 (sec)
File access time, round 7: 0.000746 (sec)
File access time, round 8: 0.000738 (sec)
File access time, round 9: 0.000747 (sec)
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
Signed-off-by: Kuo-Hsin Yang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sonny Rao <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmscan.c~mm-vmscan-fix-not-scanning-anonymous-pages-when-detecting-file-refaults
+++ a/mm/vmscan.c
@@ -2151,7 +2151,7 @@ static bool inactive_list_is_low(struct
* rid of the stale workingset quickly.
*/
refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
- if (file && actual_reclaim && lruvec->refaults != refaults) {
+ if (file && lruvec->refaults != refaults) {
inactive_ratio = 0;
} else {
gb = (inactive + active) >> (30 - PAGE_SHIFT);
_
On Wed, Jun 19, 2019 at 04:08:35PM +0800, Kuo-Hsin Yang wrote:
> When file refaults are detected and there are many inactive file pages,
> the system never reclaim anonymous pages, the file pages are dropped
> aggressively when there are still a lot of cold anonymous pages and
> system thrashes. This issue impacts the performance of applications with
> large executable, e.g. chrome.
>
> When file refaults are detected. inactive_list_is_low() may return
> different values depends on the actual_reclaim parameter, the following
> 2 conditions could be satisfied at the same time.
>
> 1) inactive_list_is_low() returns false in get_scan_count() to trigger
> scanning file lists only.
> 2) inactive_list_is_low() returns true in shrink_list() to allow
> scanning active file list.
>
> In that case vmscan would only scan file lists, and as active file list
> is also scanned, inactive_list_is_low() may keep returning false in
> get_scan_count() until file cache is very low.
>
> Before commit 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in
> cache workingset transition"), inactive_list_is_low() never returns
> different value in get_scan_count() and shrink_list() in one
> shrink_node_memcg() run. The original design should be that when
> inactive_list_is_low() returns false for file lists, vmscan only scan
> inactive file list. As only inactive file list is scanned,
> inactive_list_is_low() would soon return true.
>
> This patch makes the return value of inactive_list_is_low() independent
> of actual_reclaim.
>
> The problem can be reproduced by the following test program.
>
> [...]
>
> Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
> Signed-off-by: Kuo-Hsin Yang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Your change makes sense - we should indeed not force cache trimming
only while the page cache is experiencing refaults.
I can't say I fully understand the changelog, though. The problem of
forcing cache trimming while there is enough page cache is older than
the commit you refer to. It could be argued that this commit is
incomplete - it could have added refault detection not just to
inactive:active file balancing, but also the file:anon balancing; but
it didn't *cause* this problem.
Shouldn't this be
Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")
instead?
Hi Johannes,
On Thu, Jun 27, 2019 at 02:41:23PM -0400, Johannes Weiner wrote:
> On Wed, Jun 19, 2019 at 04:08:35PM +0800, Kuo-Hsin Yang wrote:
> > [...]
>
> Acked-by: Johannes Weiner <[email protected]>
>
> Your change makes sense - we should indeed not force cache trimming
> only while the page cache is experiencing refaults.
>
> I can't say I fully understand the changelog, though. The problem of
I guess the point of the patch is that the "actual_reclaim" parameter
made the file vs. anon LRU balancing in get_scan_count() diverge. Thus,
it ends up scanning only the file LRU active/inactive lists in the
file-thrashing state.
So, Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
would make sense to me since it introduces the parameter.
> forcing cache trimming while there is enough page cache is older than
> the commit you refer to. It could be argued that this commit is
> incomplete - it could have added refault detection not just to
> inactive:active file balancing, but also the file:anon balancing; but
> it didn't *cause* this problem.
>
> Shouldn't this be
>
> Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
> Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")
Those would be affected too, but it would be troublesome to do a stable
backport since we don't have the refault machinery there.
Hi Kuo-Hsin,
On Wed, Jun 19, 2019 at 04:08:35PM +0800, Kuo-Hsin Yang wrote:
> When file refaults are detected and there are many inactive file pages,
> the system never reclaim anonymous pages, the file pages are dropped
> aggressively when there are still a lot of cold anonymous pages and
> system thrashes. This issue impacts the performance of applications with
> large executable, e.g. chrome.
>
> When file refaults are detected. inactive_list_is_low() may return
> different values depends on the actual_reclaim parameter, the following
> 2 conditions could be satisfied at the same time.
>
> 1) inactive_list_is_low() returns false in get_scan_count() to trigger
> scanning file lists only.
> 2) inactive_list_is_low() returns true in shrink_list() to allow
> scanning active file list.
>
> In that case vmscan would only scan file lists, and as active file list
> is also scanned, inactive_list_is_low() may keep returning false in
> get_scan_count() until file cache is very low.
>
> Before commit 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in
> cache workingset transition"), inactive_list_is_low() never returns
> different value in get_scan_count() and shrink_list() in one
> shrink_node_memcg() run. The original design should be that when
> inactive_list_is_low() returns false for file lists, vmscan only scan
> inactive file list. As only inactive file list is scanned,
> inactive_list_is_low() would soon return true.
>
> This patch makes the return value of inactive_list_is_low() independent
> of actual_reclaim.
>
> [...]
>
> Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
> Signed-off-by: Kuo-Hsin Yang <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7889f583ced9f..b95d05fe828d1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2151,7 +2151,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> * rid of the stale workingset quickly.
> */
> refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
> - if (file && actual_reclaim && lruvec->refaults != refaults) {
> + if (file && lruvec->refaults != refaults) {
Just a nit:
So, now "actual_reclaim" only serves the tracing purpose. In that case,
we could roll the naming back to "trace" again.
On Fri, Jun 28, 2019 at 03:51:38PM +0900, Minchan Kim wrote:
> Hi Johannes,
>
> On Thu, Jun 27, 2019 at 02:41:23PM -0400, Johannes Weiner wrote:
> >
> > Acked-by: Johannes Weiner <[email protected]>
> >
> > Your change makes sense - we should indeed not force cache trimming
> > only while the page cache is experiencing refaults.
> >
> > I can't say I fully understand the changelog, though. The problem of
>
> I guess the point of the patch is "actual_reclaim" paramter made divergency
> to balance file vs. anon LRU in get_scan_count. Thus, it ends up scanning
> file LRU active/inactive list at file thrashing state.
>
> So, Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
> would make sense to me since it introduces the parameter.
>
Thanks for the review and explanation, I will update the changelog to
make it clear.
> > forcing cache trimming while there is enough page cache is older than
> > the commit you refer to. It could be argued that this commit is
> > incomplete - it could have added refault detection not just to
> > inactive:active file balancing, but also the file:anon balancing; but
> > it didn't *cause* this problem.
> >
> > Shouldn't this be
> >
> > Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
> > Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")
>
> That would be affected too, but it would be trouble to do a stable
> backport since we don't have the refault machinery there.
When file refaults are detected and there are many inactive file pages,
the system never reclaims anonymous pages; the file pages are dropped
aggressively when there are still a lot of cold anonymous pages and the
system thrashes. This issue impacts the performance of applications
with large executables, e.g. chrome.
Commit 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache
workingset transition") introduced the actual_reclaim parameter. When
file refaults are detected, inactive_list_is_low() may return different
values depending on the actual_reclaim parameter. Vmscan would scan
only the active/inactive file lists in the file thrashing state when
the following 2 conditions are satisfied.
1) inactive_list_is_low() returns false in get_scan_count() to trigger
scanning file lists only.
2) inactive_list_is_low() returns true in shrink_list() to allow
scanning active file list.
This patch makes the return value of inactive_list_is_low() independent
of actual_reclaim and renames the parameter back to trace.
The problem can be reproduced by the following test program.
---8<---
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

void fallocate_file(const char *filename, off_t size)
{
	struct stat st;
	int fd;

	if (!stat(filename, &st) && st.st_size >= size)
		return;

	fd = open(filename, O_WRONLY | O_CREAT, 0600);
	if (fd < 0) {
		perror("create file");
		exit(1);
	}
	if (posix_fallocate(fd, 0, size)) {
		perror("fallocate");
		exit(1);
	}
	close(fd);
}

long *alloc_anon(long size)
{
	long *start = malloc(size);
	memset(start, 1, size);
	return start;
}

long access_file(const char *filename, long size, long rounds)
{
	int fd, i;
	volatile char *start1, *end1, *start2;
	const int page_size = getpagesize();
	long sum = 0;

	fd = open(filename, O_RDONLY);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	/*
	 * Some applications, e.g. chrome, use a lot of executable file
	 * pages, map some of the pages with PROT_EXEC flag to simulate
	 * the behavior.
	 */
	start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
		      fd, 0);
	if (start1 == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	end1 = start1 + size / 2;

	start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
	if (start2 == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (i = 0; i < rounds; ++i) {
		struct timeval before, after;
		volatile char *ptr1 = start1, *ptr2 = start2;
		gettimeofday(&before, NULL);
		for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
			sum += *ptr1 + *ptr2;
		gettimeofday(&after, NULL);
		printf("File access time, round %d: %f (sec)\n", i,
		       (after.tv_sec - before.tv_sec) +
		       (after.tv_usec - before.tv_usec) / 1000000.0);
	}
	return sum;
}

int main(int argc, char *argv[])
{
	const long MB = 1024 * 1024;
	long anon_mb, file_mb, file_rounds;
	const char filename[] = "large";
	long *ret1;
	long ret2;

	if (argc != 4) {
		printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
		exit(0);
	}
	anon_mb = atoi(argv[1]);
	file_mb = atoi(argv[2]);
	file_rounds = atoi(argv[3]);

	fallocate_file(filename, file_mb * MB);
	printf("Allocate %ld MB anonymous pages\n", anon_mb);
	ret1 = alloc_anon(anon_mb * MB);
	printf("Access %ld MB file pages\n", file_mb);
	ret2 = access_file(filename, file_mb * MB, file_rounds);
	printf("Print result to prevent optimization: %ld\n",
	       *ret1 + ret2);
	return 0;
}
---8<---
Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the
program fills RAM with 2048 MB of anonymous memory and accesses a
200 MB file 10 times. Without this patch, the file cache is dropped
aggressively and every access to the file is from disk.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.489316 (sec)
File access time, round 1: 2.581277 (sec)
File access time, round 2: 2.487624 (sec)
File access time, round 3: 2.449100 (sec)
File access time, round 4: 2.420423 (sec)
File access time, round 5: 2.343411 (sec)
File access time, round 6: 2.454833 (sec)
File access time, round 7: 2.483398 (sec)
File access time, round 8: 2.572701 (sec)
File access time, round 9: 2.493014 (sec)
With this patch, these file pages can be cached.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.475189 (sec)
File access time, round 1: 2.440777 (sec)
File access time, round 2: 2.411671 (sec)
File access time, round 3: 1.955267 (sec)
File access time, round 4: 0.029924 (sec)
File access time, round 5: 0.000808 (sec)
File access time, round 6: 0.000771 (sec)
File access time, round 7: 0.000746 (sec)
File access time, round 8: 0.000738 (sec)
File access time, round 9: 0.000747 (sec)
Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
Signed-off-by: Kuo-Hsin Yang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7889f583ced9f..da0b97204372e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2125,7 +2125,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
  *    10TB     320        32GB
  */
 static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
-				 struct scan_control *sc, bool actual_reclaim)
+				 struct scan_control *sc, bool trace)
 {
 	enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -2151,7 +2151,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	 * rid of the stale workingset quickly.
 	 */
 	refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
-	if (file && actual_reclaim && lruvec->refaults != refaults) {
+	if (file && lruvec->refaults != refaults) {
 		inactive_ratio = 0;
 	} else {
 		gb = (inactive + active) >> (30 - PAGE_SHIFT);
@@ -2161,7 +2161,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 		inactive_ratio = 1;
 	}
 
-	if (actual_reclaim)
+	if (trace)
 		trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx,
 			lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive,
 			lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active,
--
2.22.0.410.gd8fdbe21b5-goog
Hi Minchan,
On Fri, Jun 28, 2019 at 03:51:38PM +0900, Minchan Kim wrote:
> On Thu, Jun 27, 2019 at 02:41:23PM -0400, Johannes Weiner wrote:
> > On Wed, Jun 19, 2019 at 04:08:35PM +0800, Kuo-Hsin Yang wrote:
> > > Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
> > > Signed-off-by: Kuo-Hsin Yang <[email protected]>
> >
> > Acked-by: Johannes Weiner <[email protected]>
> >
> > Your change makes sense - we should indeed not force cache trimming
> > only while the page cache is experiencing refaults.
> >
> > I can't say I fully understand the changelog, though. The problem of
>
> I guess the point of the patch is that the "actual_reclaim" parameter
> made the file vs. anon LRU balancing in get_scan_count diverge. Thus, it
> ends up scanning the file LRU active/inactive lists in the file
> thrashing state.
Look at the patch again. The parameter was only added to retain
existing behavior. We *always* did file-only reclaim while thrashing -
all the way back to the two commits I mentioned below.
> So, Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
> would make sense to me since it introduces the parameter.
What is the observable behavior problem that this patch introduced?
> > forcing cache trimming while there is enough page cache is older than
> > the commit you refer to. It could be argued that this commit is
> > incomplete - it could have added refault detection not just to
> > inactive:active file balancing, but also the file:anon balancing; but
> > it didn't *cause* this problem.
> >
> > Shouldn't this be
> >
> > Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
> > Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")
>
> > That would be affected too, but it would be trouble to do a stable
> > backport since we don't have the refault machinery there.
Hm? The problematic behavior is that we force-scan file while file is
thrashing. We can obviously only solve this in kernels that can
actually detect thrashing.
On Fri, Jun 28, 2019 at 07:16:27PM +0800, Kuo-Hsin Yang wrote:
> When file refaults are detected and there are many inactive file pages,
> the system never reclaims anonymous pages; the file pages are dropped
> aggressively when there are still a lot of cold anonymous pages and the
> system thrashes. This issue impacts the performance of applications
> with large executables, e.g. chrome.
This is good.
> Commit 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache
> workingset transition") introduced the actual_reclaim parameter. When
> file refaults are detected, inactive_list_is_low() may return different
> values depending on the actual_reclaim parameter. Vmscan would scan
> only the active/inactive file lists in the file thrashing state when
> the following 2 conditions are satisfied.
>
> 1) inactive_list_is_low() returns false in get_scan_count() to trigger
> scanning file lists only.
> 2) inactive_list_is_low() returns true in shrink_list() to allow
> scanning active file list.
>
> This patch makes the return value of inactive_list_is_low() independent
> of actual_reclaim and renames the parameter back to trace.
This is not. The root cause for the problem you describe isn't the
patch you point to. The root cause is our decision to force-scan the
file LRU based on relative inactive:active size alone, without taking
file thrashing into account at all. This is a much older problem.
After the referenced patch, we're taking thrashing into account when
deciding whether to deactivate active file pages or not. To solve the
problem pointed out here, we can extend that same principle to the
decision whether to force-scan files and skip the anon LRUs.
The patch you're pointing to isn't the culprit. On the contrary, it
provides the infrastructure to solve a much older problem.
> The problem can be reproduced by the following test program.
>
> ---8<---
> void fallocate_file(const char *filename, off_t size)
> {
> struct stat st;
> int fd;
>
> if (!stat(filename, &st) && st.st_size >= size)
> return;
>
> fd = open(filename, O_WRONLY | O_CREAT, 0600);
> if (fd < 0) {
> perror("create file");
> exit(1);
> }
> if (posix_fallocate(fd, 0, size)) {
> perror("fallocate");
> exit(1);
> }
> close(fd);
> }
>
> long *alloc_anon(long size)
> {
> long *start = malloc(size);
> memset(start, 1, size);
> return start;
> }
>
> long access_file(const char *filename, long size, long rounds)
> {
> int fd, i;
> volatile char *start1, *end1, *start2;
> const int page_size = getpagesize();
> long sum = 0;
>
> fd = open(filename, O_RDONLY);
> if (fd == -1) {
> perror("open");
> exit(1);
> }
>
> /*
> * Some applications, e.g. chrome, use a lot of executable file
> * pages, map some of the pages with PROT_EXEC flag to simulate
> * the behavior.
> */
> start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
> fd, 0);
> if (start1 == MAP_FAILED) {
> perror("mmap");
> exit(1);
> }
> end1 = start1 + size / 2;
>
> start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
> if (start2 == MAP_FAILED) {
> perror("mmap");
> exit(1);
> }
>
> for (i = 0; i < rounds; ++i) {
> struct timeval before, after;
> volatile char *ptr1 = start1, *ptr2 = start2;
> gettimeofday(&before, NULL);
> for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
> sum += *ptr1 + *ptr2;
> gettimeofday(&after, NULL);
> printf("File access time, round %d: %f (sec)\n", i,
> (after.tv_sec - before.tv_sec) +
> (after.tv_usec - before.tv_usec) / 1000000.0);
> }
> return sum;
> }
>
> int main(int argc, char *argv[])
> {
> const long MB = 1024 * 1024;
> long anon_mb, file_mb, file_rounds;
> const char filename[] = "large";
> long *ret1;
> long ret2;
>
> if (argc != 4) {
> printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
> exit(0);
> }
> anon_mb = atoi(argv[1]);
> file_mb = atoi(argv[2]);
> file_rounds = atoi(argv[3]);
>
> fallocate_file(filename, file_mb * MB);
> printf("Allocate %ld MB anonymous pages\n", anon_mb);
> ret1 = alloc_anon(anon_mb * MB);
> printf("Access %ld MB file pages\n", file_mb);
> ret2 = access_file(filename, file_mb * MB, file_rounds);
> printf("Print result to prevent optimization: %ld\n",
> *ret1 + ret2);
> return 0;
> }
> ---8<---
>
> Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the
> program fills RAM with 2048 MB of anonymous memory and accesses a
> 200 MB file 10 times. Without this patch, the file cache is dropped
> aggressively and every access to the file is from disk.
>
> $ ./thrash 2048 200 10
> Allocate 2048 MB anonymous pages
> Access 200 MB file pages
> File access time, round 0: 2.489316 (sec)
> File access time, round 1: 2.581277 (sec)
> File access time, round 2: 2.487624 (sec)
> File access time, round 3: 2.449100 (sec)
> File access time, round 4: 2.420423 (sec)
> File access time, round 5: 2.343411 (sec)
> File access time, round 6: 2.454833 (sec)
> File access time, round 7: 2.483398 (sec)
> File access time, round 8: 2.572701 (sec)
> File access time, round 9: 2.493014 (sec)
>
> With this patch, these file pages can be cached.
>
> $ ./thrash 2048 200 10
> Allocate 2048 MB anonymous pages
> Access 200 MB file pages
> File access time, round 0: 2.475189 (sec)
> File access time, round 1: 2.440777 (sec)
> File access time, round 2: 2.411671 (sec)
> File access time, round 3: 1.955267 (sec)
> File access time, round 4: 0.029924 (sec)
> File access time, round 5: 0.000808 (sec)
> File access time, round 6: 0.000771 (sec)
> File access time, round 7: 0.000746 (sec)
> File access time, round 8: 0.000738 (sec)
> File access time, round 9: 0.000747 (sec)
This is all good again.
> Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
Please replace this line with the two Fixes: lines that I provided
earlier in this thread.
Thanks.
On Fri, Jun 28, 2019 at 10:22:52AM -0400, Johannes Weiner wrote:
> Hi Minchan,
>
> On Fri, Jun 28, 2019 at 03:51:38PM +0900, Minchan Kim wrote:
> > On Thu, Jun 27, 2019 at 02:41:23PM -0400, Johannes Weiner wrote:
> > > On Wed, Jun 19, 2019 at 04:08:35PM +0800, Kuo-Hsin Yang wrote:
> > > > Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
> > > > Signed-off-by: Kuo-Hsin Yang <[email protected]>
> > >
> > > Acked-by: Johannes Weiner <[email protected]>
> > >
> > > Your change makes sense - we should indeed not force cache trimming
> > > only while the page cache is experiencing refaults.
> > >
> > > I can't say I fully understand the changelog, though. The problem of
> >
> > I guess the point of the patch is that the "actual_reclaim" parameter
> > made the file vs. anon LRU balancing in get_scan_count diverge. Thus,
> > it ends up scanning the file LRU active/inactive lists in the file
> > thrashing state.
>
> Look at the patch again. The parameter was only added to retain
> existing behavior. We *always* did file-only reclaim while thrashing -
> all the way back to the two commits I mentioned below.
Yeah, I know that we did force file reclaim if we have enough file LRU
pages. What confused me in the description was the "actual_reclaim" part.
Thanks for pointing that out, Johannes. I confirmed it kept the old
behavior in get_scan_count.
>
> > So, Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
> > would make sense to me since it introduces the parameter.
>
> What is the observable behavior problem that this patch introduced?
>
> > > forcing cache trimming while there is enough page cache is older than
> > > the commit you refer to. It could be argued that this commit is
> > > incomplete - it could have added refault detection not just to
> > > inactive:active file balancing, but also the file:anon balancing; but
> > > it didn't *cause* this problem.
> > >
> > > Shouldn't this be
> > >
> > > Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
> > > Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")
> >
> > That would be affected too, but it would be trouble to do a stable
> > backport since we don't have the refault machinery there.
>
> Hm? The problematic behavior is that we force-scan file while file is
> thrashing. We can obviously only solve this in kernels that can
> actually detect thrashing.
What I meant is that I thought it was -stable material, but we don't
have the refault machinery back in v3.8.
I agree this patch fixes the two commits you mentioned above, so we
should use them.
On Fri, Jun 28, 2019 at 10:32:01AM -0400, Johannes Weiner wrote:
> On Fri, Jun 28, 2019 at 07:16:27PM +0800, Kuo-Hsin Yang wrote:
> > When file refaults are detected and there are many inactive file pages,
> > the system never reclaims anonymous pages; the file pages are dropped
> > aggressively when there are still a lot of cold anonymous pages and the
> > system thrashes. This issue impacts the performance of applications
> > with large executables, e.g. chrome.
>
> This is good.
>
> > Commit 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache
> > workingset transition") introduced the actual_reclaim parameter. When
> > file refaults are detected, inactive_list_is_low() may return different
> > values depending on the actual_reclaim parameter. Vmscan would scan
> > only the active/inactive file lists in the file thrashing state when
> > the following 2 conditions are satisfied.
> >
> > 1) inactive_list_is_low() returns false in get_scan_count() to trigger
> > scanning file lists only.
> > 2) inactive_list_is_low() returns true in shrink_list() to allow
> > scanning active file list.
> >
> > This patch makes the return value of inactive_list_is_low() independent
> > of actual_reclaim and renames the parameter back to trace.
>
> This is not. The root cause for the problem you describe isn't the
> patch you point to. The root cause is our decision to force-scan the
> file LRU based on relative inactive:active size alone, without taking
> file thrashing into account at all. This is a much older problem.
>
> After the referenced patch, we're taking thrashing into account when
> deciding whether to deactivate active file pages or not. To solve the
> problem pointed out here, we can extend that same principle to the
> decision whether to force-scan files and skip the anon LRUs.
>
> The patch you're pointing to isn't the culprit. On the contrary, it
> provides the infrastructure to solve a much older problem.
>
> > The problem can be reproduced by the following test program.
> >
> > ---8<---
> > void fallocate_file(const char *filename, off_t size)
> > {
> > struct stat st;
> > int fd;
> >
> > if (!stat(filename, &st) && st.st_size >= size)
> > return;
> >
> > fd = open(filename, O_WRONLY | O_CREAT, 0600);
> > if (fd < 0) {
> > perror("create file");
> > exit(1);
> > }
> > if (posix_fallocate(fd, 0, size)) {
> > perror("fallocate");
> > exit(1);
> > }
> > close(fd);
> > }
> >
> > long *alloc_anon(long size)
> > {
> > long *start = malloc(size);
> > memset(start, 1, size);
> > return start;
> > }
> >
> > long access_file(const char *filename, long size, long rounds)
> > {
> > int fd, i;
> > volatile char *start1, *end1, *start2;
> > const int page_size = getpagesize();
> > long sum = 0;
> >
> > fd = open(filename, O_RDONLY);
> > if (fd == -1) {
> > perror("open");
> > exit(1);
> > }
> >
> > /*
> > * Some applications, e.g. chrome, use a lot of executable file
> > * pages, map some of the pages with PROT_EXEC flag to simulate
> > * the behavior.
> > */
> > start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
> > fd, 0);
> > if (start1 == MAP_FAILED) {
> > perror("mmap");
> > exit(1);
> > }
> > end1 = start1 + size / 2;
> >
> > start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
> > if (start2 == MAP_FAILED) {
> > perror("mmap");
> > exit(1);
> > }
> >
> > for (i = 0; i < rounds; ++i) {
> > struct timeval before, after;
> > volatile char *ptr1 = start1, *ptr2 = start2;
> > gettimeofday(&before, NULL);
> > for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
> > sum += *ptr1 + *ptr2;
> > gettimeofday(&after, NULL);
> > printf("File access time, round %d: %f (sec)\n", i,
> > (after.tv_sec - before.tv_sec) +
> > (after.tv_usec - before.tv_usec) / 1000000.0);
> > }
> > return sum;
> > }
> >
> > int main(int argc, char *argv[])
> > {
> > const long MB = 1024 * 1024;
> > long anon_mb, file_mb, file_rounds;
> > const char filename[] = "large";
> > long *ret1;
> > long ret2;
> >
> > if (argc != 4) {
> > printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
> > exit(0);
> > }
> > anon_mb = atoi(argv[1]);
> > file_mb = atoi(argv[2]);
> > file_rounds = atoi(argv[3]);
> >
> > fallocate_file(filename, file_mb * MB);
> > printf("Allocate %ld MB anonymous pages\n", anon_mb);
> > ret1 = alloc_anon(anon_mb * MB);
> > printf("Access %ld MB file pages\n", file_mb);
> > ret2 = access_file(filename, file_mb * MB, file_rounds);
> > printf("Print result to prevent optimization: %ld\n",
> > *ret1 + ret2);
> > return 0;
> > }
> > ---8<---
> >
> > Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the
> > program fills RAM with 2048 MB of anonymous memory and accesses a
> > 200 MB file 10 times. Without this patch, the file cache is dropped
> > aggressively and every access to the file is from disk.
> >
> > $ ./thrash 2048 200 10
> > Allocate 2048 MB anonymous pages
> > Access 200 MB file pages
> > File access time, round 0: 2.489316 (sec)
> > File access time, round 1: 2.581277 (sec)
> > File access time, round 2: 2.487624 (sec)
> > File access time, round 3: 2.449100 (sec)
> > File access time, round 4: 2.420423 (sec)
> > File access time, round 5: 2.343411 (sec)
> > File access time, round 6: 2.454833 (sec)
> > File access time, round 7: 2.483398 (sec)
> > File access time, round 8: 2.572701 (sec)
> > File access time, round 9: 2.493014 (sec)
> >
> > With this patch, these file pages can be cached.
> >
> > $ ./thrash 2048 200 10
> > Allocate 2048 MB anonymous pages
> > Access 200 MB file pages
> > File access time, round 0: 2.475189 (sec)
> > File access time, round 1: 2.440777 (sec)
> > File access time, round 2: 2.411671 (sec)
> > File access time, round 3: 1.955267 (sec)
> > File access time, round 4: 0.029924 (sec)
> > File access time, round 5: 0.000808 (sec)
> > File access time, round 6: 0.000771 (sec)
> > File access time, round 7: 0.000746 (sec)
> > File access time, round 8: 0.000738 (sec)
> > File access time, round 9: 0.000747 (sec)
>
> This is all good again.
>
> > Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
>
> Please replace this line with the two Fixes: lines that I provided
> earlier in this thread.
Can't we have "Cc: <[email protected]> # 4.12+" so that we fix kernels
which have the thrashing/workingset transition detection?
On Fri, Jun 28, 2019 at 10:32:01AM -0400, Johannes Weiner wrote:
> On Fri, Jun 28, 2019 at 07:16:27PM +0800, Kuo-Hsin Yang wrote:
> > Commit 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache
> > workingset transition") introduced actual_reclaim parameter. When file
> > refaults are detected, inactive_list_is_low() may return different
> > values depends on the actual_reclaim parameter. Vmscan would only scan
> > active/inactive file lists at file thrashing state when the following 2
> > conditions are satisfied.
> >
> > 1) inactive_list_is_low() returns false in get_scan_count() to trigger
> > scanning file lists only.
> > 2) inactive_list_is_low() returns true in shrink_list() to allow
> > scanning active file list.
> >
> > This patch makes the return value of inactive_list_is_low() independent
> > of actual_reclaim and renames the parameter back to trace.
>
> This is not. The root cause for the problem you describe isn't the
> patch you point to. The root cause is our decision to force-scan the
> file LRU based on relative inactive:active size alone, without taking
> file thrashing into account at all. This is a much older problem.
>
> After the referenced patch, we're taking thrashing into account when
> deciding whether to deactivate active file pages or not. To solve the
> problem pointed out here, we can extend that same principle to the
> decision whether to force-scan files and skip the anon LRUs.
>
> The patch you're pointing to isn't the culprit. On the contrary, it
> provides the infrastructure to solve a much older problem.
>
> > Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
>
> Please replace this line with the two Fixes: lines that I provided
> earlier in this thread.
Thanks for your clarification, I will update the changelog.
When file refaults are detected and there are many inactive file pages,
the system never reclaims anonymous pages; the file pages are dropped
aggressively when there are still a lot of cold anonymous pages and the
system thrashes. This issue impacts the performance of applications
with large executables, e.g. chrome.
With this patch, when file refaults are detected, inactive_list_is_low()
always returns true for file pages in get_scan_count() to enable
scanning of anonymous pages.
The problem can be reproduced by the following test program.
---8<---
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

void fallocate_file(const char *filename, off_t size)
{
	struct stat st;
	int fd;

	if (!stat(filename, &st) && st.st_size >= size)
		return;

	fd = open(filename, O_WRONLY | O_CREAT, 0600);
	if (fd < 0) {
		perror("create file");
		exit(1);
	}
	if (posix_fallocate(fd, 0, size)) {
		perror("fallocate");
		exit(1);
	}
	close(fd);
}

long *alloc_anon(long size)
{
	long *start = malloc(size);
	memset(start, 1, size);
	return start;
}

long access_file(const char *filename, long size, long rounds)
{
	int fd, i;
	volatile char *start1, *end1, *start2;
	const int page_size = getpagesize();
	long sum = 0;

	fd = open(filename, O_RDONLY);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	/*
	 * Some applications, e.g. chrome, use a lot of executable file
	 * pages, map some of the pages with PROT_EXEC flag to simulate
	 * the behavior.
	 */
	start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
		      fd, 0);
	if (start1 == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	end1 = start1 + size / 2;

	start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
	if (start2 == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (i = 0; i < rounds; ++i) {
		struct timeval before, after;
		volatile char *ptr1 = start1, *ptr2 = start2;
		gettimeofday(&before, NULL);
		for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
			sum += *ptr1 + *ptr2;
		gettimeofday(&after, NULL);
		printf("File access time, round %d: %f (sec)\n", i,
		       (after.tv_sec - before.tv_sec) +
		       (after.tv_usec - before.tv_usec) / 1000000.0);
	}
	return sum;
}

int main(int argc, char *argv[])
{
	const long MB = 1024 * 1024;
	long anon_mb, file_mb, file_rounds;
	const char filename[] = "large";
	long *ret1;
	long ret2;

	if (argc != 4) {
		printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
		exit(0);
	}
	anon_mb = atoi(argv[1]);
	file_mb = atoi(argv[2]);
	file_rounds = atoi(argv[3]);

	fallocate_file(filename, file_mb * MB);
	printf("Allocate %ld MB anonymous pages\n", anon_mb);
	ret1 = alloc_anon(anon_mb * MB);
	printf("Access %ld MB file pages\n", file_mb);
	ret2 = access_file(filename, file_mb * MB, file_rounds);
	printf("Print result to prevent optimization: %ld\n",
	       *ret1 + ret2);
	return 0;
}
---8<---
Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the
program fills RAM with 2048 MB of anonymous memory and accesses a
200 MB file 10 times. Without this patch, the file cache is dropped
aggressively and every access to the file is from disk.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.489316 (sec)
File access time, round 1: 2.581277 (sec)
File access time, round 2: 2.487624 (sec)
File access time, round 3: 2.449100 (sec)
File access time, round 4: 2.420423 (sec)
File access time, round 5: 2.343411 (sec)
File access time, round 6: 2.454833 (sec)
File access time, round 7: 2.483398 (sec)
File access time, round 8: 2.572701 (sec)
File access time, round 9: 2.493014 (sec)
With this patch, these file pages can be cached.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.475189 (sec)
File access time, round 1: 2.440777 (sec)
File access time, round 2: 2.411671 (sec)
File access time, round 3: 1.955267 (sec)
File access time, round 4: 0.029924 (sec)
File access time, round 5: 0.000808 (sec)
File access time, round 6: 0.000771 (sec)
File access time, round 7: 0.000746 (sec)
File access time, round 8: 0.000738 (sec)
File access time, round 9: 0.000747 (sec)
Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")
Signed-off-by: Kuo-Hsin Yang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: <[email protected]> # 4.12+
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7889f583ced9f..da0b97204372e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2125,7 +2125,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
  *    10TB     320        32GB
  */
 static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
-				 struct scan_control *sc, bool actual_reclaim)
+				 struct scan_control *sc, bool trace)
 {
 	enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -2151,7 +2151,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	 * rid of the stale workingset quickly.
 	 */
 	refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
-	if (file && actual_reclaim && lruvec->refaults != refaults) {
+	if (file && lruvec->refaults != refaults) {
 		inactive_ratio = 0;
 	} else {
 		gb = (inactive + active) >> (30 - PAGE_SHIFT);
@@ -2161,7 +2161,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 		inactive_ratio = 1;
 	}
 
-	if (actual_reclaim)
+	if (trace)
 		trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx,
 			lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive,
 			lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active,
--
2.22.0.410.gd8fdbe21b5-goog
On Mon 01-07-19 16:10:38, Kuo-Hsin Yang wrote:
> When file refaults are detected and there are many inactive file pages,
> the system never reclaims anonymous pages; the file pages are dropped
> aggressively when there are still a lot of cold anonymous pages and the
> system thrashes. This issue impacts the performance of applications
> with large executables, e.g. chrome.
>
> With this patch, when file refaults are detected, inactive_list_is_low()
> always returns true for file pages in get_scan_count() to enable
> scanning of anonymous pages.
>
> The problem can be reproduced by the following test program.
>
> ---8<---
> void fallocate_file(const char *filename, off_t size)
> {
> struct stat st;
> int fd;
>
> if (!stat(filename, &st) && st.st_size >= size)
> return;
>
> fd = open(filename, O_WRONLY | O_CREAT, 0600);
> if (fd < 0) {
> perror("create file");
> exit(1);
> }
> if (posix_fallocate(fd, 0, size)) {
> perror("fallocate");
> exit(1);
> }
> close(fd);
> }
>
> long *alloc_anon(long size)
> {
> long *start = malloc(size);
> memset(start, 1, size);
> return start;
> }
>
> long access_file(const char *filename, long size, long rounds)
> {
> int fd, i;
> volatile char *start1, *end1, *start2;
> const int page_size = getpagesize();
> long sum = 0;
>
> fd = open(filename, O_RDONLY);
> if (fd == -1) {
> perror("open");
> exit(1);
> }
>
> /*
>  * Some applications, e.g. chrome, use a lot of executable file
>  * pages; map some of the pages with the PROT_EXEC flag to simulate
>  * that behavior.
>  */
> start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
> fd, 0);
> if (start1 == MAP_FAILED) {
> perror("mmap");
> exit(1);
> }
> end1 = start1 + size / 2;
>
> start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
> if (start2 == MAP_FAILED) {
> perror("mmap");
> exit(1);
> }
>
> for (i = 0; i < rounds; ++i) {
> struct timeval before, after;
> volatile char *ptr1 = start1, *ptr2 = start2;
> gettimeofday(&before, NULL);
> for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
> sum += *ptr1 + *ptr2;
> gettimeofday(&after, NULL);
> printf("File access time, round %d: %f (sec)\n", i,
> (after.tv_sec - before.tv_sec) +
> (after.tv_usec - before.tv_usec) / 1000000.0);
> }
> return sum;
> }
>
> int main(int argc, char *argv[])
> {
> const long MB = 1024 * 1024;
> long anon_mb, file_mb, file_rounds;
> const char filename[] = "large";
> long *ret1;
> long ret2;
>
> if (argc != 4) {
> printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
> exit(0);
> }
> anon_mb = atoi(argv[1]);
> file_mb = atoi(argv[2]);
> file_rounds = atoi(argv[3]);
>
> fallocate_file(filename, file_mb * MB);
> printf("Allocate %ld MB anonymous pages\n", anon_mb);
> ret1 = alloc_anon(anon_mb * MB);
> printf("Access %ld MB file pages\n", file_mb);
> ret2 = access_file(filename, file_mb * MB, file_rounds);
> printf("Print result to prevent optimization: %ld\n",
> *ret1 + ret2);
> return 0;
> }
> ---8<---
>
> Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the
> program fills RAM with 2048 MB of anonymous memory, then accesses a
> 200 MB file 10 times. Without this patch, the file cache is dropped
> aggressively and every access to the file is served from disk.
>
> $ ./thrash 2048 200 10
> Allocate 2048 MB anonymous pages
> Access 200 MB file pages
> File access time, round 0: 2.489316 (sec)
> File access time, round 1: 2.581277 (sec)
> File access time, round 2: 2.487624 (sec)
> File access time, round 3: 2.449100 (sec)
> File access time, round 4: 2.420423 (sec)
> File access time, round 5: 2.343411 (sec)
> File access time, round 6: 2.454833 (sec)
> File access time, round 7: 2.483398 (sec)
> File access time, round 8: 2.572701 (sec)
> File access time, round 9: 2.493014 (sec)
>
> With this patch, these file pages can be cached.
>
> $ ./thrash 2048 200 10
> Allocate 2048 MB anonymous pages
> Access 200 MB file pages
> File access time, round 0: 2.475189 (sec)
> File access time, round 1: 2.440777 (sec)
> File access time, round 2: 2.411671 (sec)
> File access time, round 3: 1.955267 (sec)
> File access time, round 4: 0.029924 (sec)
> File access time, round 5: 0.000808 (sec)
> File access time, round 6: 0.000771 (sec)
> File access time, round 7: 0.000746 (sec)
> File access time, round 8: 0.000738 (sec)
> File access time, round 9: 0.000747 (sec)
How does the reclaim behave with workloads whose file backed data set
does not fit into memory? Aren't we going to swap a lot - something
that the heuristic is protecting from?
> Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty")
> Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty")
> Signed-off-by: Kuo-Hsin Yang <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Cc: <[email protected]> # 4.12+
--
Michal Hocko
SUSE Labs
On Wed, Jul 03, 2019 at 04:30:57PM +0200, Michal Hocko wrote:
>
> How does the reclaim behave with workloads whose file backed data set
> does not fit into memory? Aren't we going to swap a lot - something
> that the heuristic is protecting from?
>
In the common case, most of the pages in a large file backed data set
are non-executable. When there are a lot of non-executable file pages,
usually more file pages are scanned because of the recent_scanned /
recent_rotated ratio.
I modified the test program to set the accessed sizes of the executable
and non-executable file pages separately. The test program runs on a
2GB RAM VM with kernel 5.2.0-rc7 and this patch, allocates 2000 MB of
anonymous memory, then accesses 100 MB of executable file pages and
2100 MB of non-executable file pages 10 times. The test also prints
the file and anonymous page sizes in kB from /proc/meminfo. There is
not much swapping in this test case. I got similar results without
this patch.
$ ./thrash 2000 100 2100 10
Allocate 2000 MB anonymous pages
Active(anon): 1850964, Inactive(anon): 133140, Active(file): 1528, Inactive(file): 1352
Access 100 MB executable file pages
Access 2100 MB regular file pages
File access time, round 0: 26.833665 (sec)
Active(anon): 1476084, Inactive(anon): 492060, Active(file): 2236, Inactive(file): 2224
File access time, round 1: 26.362102 (sec)
Active(anon): 1471364, Inactive(anon): 490464, Active(file): 8508, Inactive(file): 8172
File access time, round 2: 26.828894 (sec)
Active(anon): 1469184, Inactive(anon): 489688, Active(file): 10012, Inactive(file): 9840
File access time, round 3: 27.105603 (sec)
Active(anon): 1468128, Inactive(anon): 489408, Active(file): 11000, Inactive(file): 10388
File access time, round 4: 26.936500 (sec)
Active(anon): 1466380, Inactive(anon): 488788, Active(file): 12872, Inactive(file): 12504
File access time, round 5: 26.294687 (sec)
Active(anon): 1466384, Inactive(anon): 488780, Active(file): 13332, Inactive(file): 12396
File access time, round 6: 27.382404 (sec)
Active(anon): 1466344, Inactive(anon): 488772, Active(file): 13100, Inactive(file): 12276
File access time, round 7: 26.607976 (sec)
Active(anon): 1466392, Inactive(anon): 488764, Active(file): 12892, Inactive(file): 11928
File access time, round 8: 26.477663 (sec)
Active(anon): 1466344, Inactive(anon): 488760, Active(file): 12920, Inactive(file): 12092
File access time, round 9: 26.552859 (sec)
Active(anon): 1465820, Inactive(anon): 488748, Active(file): 13300, Inactive(file): 12372
On Thu 04-07-19 17:47:16, Kuo-Hsin Yang wrote:
> On Wed, Jul 03, 2019 at 04:30:57PM +0200, Michal Hocko wrote:
> >
> > How does the reclaim behave with workloads whose file backed data set
> > does not fit into memory? Aren't we going to swap a lot - something
> > that the heuristic is protecting from?
> >
>
> In the common case, most of the pages in a large file backed data set
> are non-executable. When there are a lot of non-executable file pages,
> usually more file pages are scanned because of the recent_scanned /
> recent_rotated ratio.
>
> I modified the test program to set the accessed sizes of the executable
> and non-executable file pages separately. The test program runs on a
> 2GB RAM VM with kernel 5.2.0-rc7 and this patch, allocates 2000 MB of
> anonymous memory, then accesses 100 MB of executable file pages and
> 2100 MB of non-executable file pages 10 times. The test also prints
> the file and anonymous page sizes in kB from /proc/meminfo. There is
> not much swapping in this test case. I got similar results without
> this patch.
Could you record swap out stats please? Also what happens if you have
multiple readers?
Thanks!
--
Michal Hocko
SUSE Labs
On Thu, Jul 04, 2019 at 01:04:25PM +0200, Michal Hocko wrote:
> On Thu 04-07-19 17:47:16, Kuo-Hsin Yang wrote:
> > On Wed, Jul 03, 2019 at 04:30:57PM +0200, Michal Hocko wrote:
> > >
> > > How does the reclaim behave with workloads whose file backed data set
> > > does not fit into memory? Aren't we going to swap a lot - something
> > > that the heuristic is protecting from?
> > >
> >
> > In the common case, most of the pages in a large file backed data set
> > are non-executable. When there are a lot of non-executable file pages,
> > usually more file pages are scanned because of the recent_scanned /
> > recent_rotated ratio.
> >
> > I modified the test program to set the accessed sizes of the executable
> > and non-executable file pages separately. The test program runs on a
> > 2GB RAM VM with kernel 5.2.0-rc7 and this patch, allocates 2000 MB of
> > anonymous memory, then accesses 100 MB of executable file pages and
> > 2100 MB of non-executable file pages 10 times. The test also prints
> > the file and anonymous page sizes in kB from /proc/meminfo. There is
> > not much swapping in this test case. I got similar results without
> > this patch.
>
> Could you record swap out stats please? Also what happens if you have
> multiple readers?
I checked the swap out stats during the test [1]: 19006 pages were
swapped out with this patch, 3418 pages without it. There is more
swapout, but I think it is within a reasonable range when the file
backed data set doesn't fit into memory.
$ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS PROCESSES
Allocate 2000 MB anonymous pages
active_anon: 1613644, inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB)
pswpout: 7972443, pgpgin: 478615246
Access 100 MB executable file pages
Access 2100 MB regular file pages
File access time, round 0: 12.165, (sec)
active_anon: 1433788, inactive_anon: 478116, active_file: 17896, inactive_file: 24328 (kB)
File access time, round 1: 11.493, (sec)
active_anon: 1430576, inactive_anon: 477144, active_file: 25440, inactive_file: 26172 (kB)
File access time, round 2: 11.455, (sec)
active_anon: 1427436, inactive_anon: 476060, active_file: 21112, inactive_file: 28808 (kB)
File access time, round 3: 11.454, (sec)
active_anon: 1420444, inactive_anon: 473632, active_file: 23216, inactive_file: 35036 (kB)
File access time, round 4: 11.479, (sec)
active_anon: 1413964, inactive_anon: 471460, active_file: 31728, inactive_file: 32224 (kB)
pswpout: 7991449 (+ 19006), pgpgin: 489924366 (+ 11309120)
With 4 processes accessing non-overlapping parts of a large file, 30316
pages were swapped out with this patch, 5152 pages without it. The
swapout number is small compared to pgpgin.
[1]: https://github.com/vovo/testing/blob/master/mem_thrash.c
On Fri 05-07-19 20:45:05, Kuo-Hsin Yang wrote:
> With 4 processes accessing non-overlapping parts of a large file, 30316
> pages swapped out with this patch, 5152 pages swapped out without this
> patch. The swapout number is small comparing to pgpgin.
which is 5 times more swapout. That may be seen as a lot for workloads
that prefer no swapping (e.g. large in-memory databases) with an
occasional heavy IO load (e.g. backup), and I am worried those would
regress. I do agree that the current behavior is far from optimal,
because the thrashing is real. I believe that we really need a
different approach. Johannes brought this up a few years back (sorry,
I do not have a link handy); it was essentially about implementing
refault logic for anonymous memory and swapping out based on the
refault price. If there is effectively no swapin then it simply makes
more sense to swap out rather than refault page cache.
That being said, I am not nacking the patch. Let's see whether
something regresses, as there is no clear cut for the proper behavior.
But I am bringing this up because we really need a better and more
robust plan for the future.
--
Michal Hocko
SUSE Labs