2016-04-04 17:14:00

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 0/3] mm: support bigger cache workingsets and protect against writes

Hi,

this is a follow-up to http://www.spinics.net/lists/linux-mm/msg101739.html
where Andres reported his database workingset being pushed out by the
minimum size enforcement of the inactive file list - currently 50% of cache
- as well as repeatedly written file pages that are never actually read.

Two changes fell out of the discussions. The first change observes that
pages that are only ever written don't benefit from caching beyond what the
writeback cache does for partial page writes, and so we shouldn't promote
them to the active file list where they compete with pages whose cached data
is actually accessed repeatedly. This change comes in two patches - one for
in-cache write accesses and one for refaults triggered by writes, neither of
which should promote a cache page.

Second, with the refault detection we don't need to set 50% of the cache
aside for used-once cache anymore since we can detect frequently used pages
even when they are evicted between accesses. We can allow the active list to
be bigger and thus protect a bigger workingset that isn't challenged by
streamers. Depending on the access patterns, this can increase major faults
during workingset transitions for better performance during stable phases.

Andres, I tried reproducing your postgres scenario, but I could never get
the WAL to interfere even with wal_log = hot_standby mode. It's a 8G
machine, I set shared_buffers = 2GB, ran pgbench -i -s 290, and then -c 32
-j 32 -M prepared -t 150000. Any input on how to trigger the thrashing you
observed would be appreciated. But it would be great if you could test these
patches on your known-problematic setup as well.

Thanks!

include/linux/memcontrol.h | 25 -----------
mm/filemap.c | 8 +++-
mm/page_alloc.c | 44 ------------------
mm/vmscan.c | 104 +++++++++++++++++--------------------------
4 files changed, 48 insertions(+), 133 deletions(-)


2016-04-04 17:14:07

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 1/3] mm: workingset: only do workingset activations on reads

From: Rik van Riel <[email protected]>

When rewriting a page, the data in that page is replaced with new
data. This means that evicting something else from the active file
list, in order to cache data that will be replaced by something else,
is likely to be a waste of memory.

It is better to save the active list for frequently read pages, because
reads actually use the data that is in the page.

This patch ignores partial writes, because it is unclear whether the
complexity of identifying those is worth any potential performance
gain obtained from better caching pages that see repeated partial
writes at large enough intervals to not get caught by the use-twice
promotion code used for the inactive file list.

Reported-by: Andres Freund <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/filemap.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a8c69c8..ca33816 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -713,8 +713,12 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
* The page might have been evicted from cache only
* recently, in which case it should be activated like
* any other repeatedly accessed page.
+ * The exception is pages getting rewritten; evicting other
+ * data from the working set, only to cache data that will
+ * get overwritten with something else, is a waste of memory.
*/
- if (shadow && workingset_refault(shadow)) {
+ if (!(gfp_mask & __GFP_WRITE) &&
+ shadow && workingset_refault(shadow)) {
SetPageActive(page);
workingset_activation(page);
} else
--
2.8.0

2016-04-04 17:14:05

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 2/3] mm: filemap: only do access activations on reads

Andres Freund observed that his database workload is struggling with
the transaction journal creating pressure on frequently read pages.

Access patterns like transaction journals frequently write the same
pages over and over, but in the majority of cases those pages are
never read back. There are no caching benefits to be had for those
pages, so activating them and having them put pressure on pages that
do benefit from caching is a bad choice.

Leave page activations to read accesses and don't promote pages based
on writes alone.

It could be said that partially written pages do contain cache-worthy
data, because even if *userspace* does not access the unwritten part,
the kernel still has to read it from the filesystem for correctness.
However, a counter argument is that these pages enjoy at least *some*
protection over other inactive file pages through the writeback cache,
in the sense that dirty pages are written back with a delay and cache
reclaim leaves them alone until they have been written back to
disk. Should that turn out to be insufficient and we see increased
read IO from partial writes under memory pressure, we can always go
back and update grab_cache_page_write_begin() to take (pos, len) so
that it can tell partial writes from pages that don't need partial
reads. But for now, keep it simple.

Reported-by: Andres Freund <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ca33816..edfec5e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2579,7 +2579,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags)
{
struct page *page;
- int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
+ int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;

if (flags & AOP_FLAG_NOFS)
fgp_flags |= FGP_NOFS;
--
2.8.0

2016-04-04 17:14:24

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 3/3] mm: vmscan: reduce size of inactive file list

From: Rik van Riel <[email protected]>

The inactive file list should still be large enough to contain
readahead windows and freshly written file data, but it no
longer is the only source for detecting multiple accesses to
file pages. The workingset refault measurement code causes
recently evicted file pages that get accessed again after a
shorter interval to be promoted directly to the active list.

With that mechanism in place, we can afford to (on a larger
system) dedicate more memory to the active file list, so we
can actually cache more of the frequently used file pages
in memory, and not have them pushed out by streaming writes,
once-used streaming file reads, etc.

This can help things like database workloads, where only
half the page cache can currently be used to cache the
database working set. This patch automatically increases
that fraction on larger systems, using the same ratio that
has already been used for anonymous memory.

Reported-by: Andres Freund <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
[[email protected]: cgroup-awareness]
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 25 -----------
mm/page_alloc.c | 44 -------------------
mm/vmscan.c | 104 ++++++++++++++++++---------------------------
3 files changed, 42 insertions(+), 131 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1191d79..3694f88 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -415,25 +415,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
return mz->lru_size[lru];
}

-static inline bool mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
-{
- unsigned long inactive_ratio;
- unsigned long inactive;
- unsigned long active;
- unsigned long gb;
-
- inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON);
- active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON);
-
- gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb)
- inactive_ratio = int_sqrt(10 * gb);
- else
- inactive_ratio = 1;
-
- return inactive * inactive_ratio < active;
-}
-
void mem_cgroup_handle_over_high(void);

void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -646,12 +627,6 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
return true;
}

-static inline bool
-mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
-{
- return true;
-}
-
static inline unsigned long
mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59de90d..67db15d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6395,49 +6395,6 @@ void setup_per_zone_wmarks(void)
}

/*
- * The inactive anon list should be small enough that the VM never has to
- * do too much work, but large enough that each inactive page has a chance
- * to be referenced again before it is swapped out.
- *
- * The inactive_anon ratio is the target ratio of ACTIVE_ANON to
- * INACTIVE_ANON pages on this zone's LRU, maintained by the
- * pageout code. A zone->inactive_ratio of 3 means 3:1 or 25% of
- * the anonymous pages are kept on the inactive list.
- *
- * total target max
- * memory ratio inactive anon
- * -------------------------------------
- * 10MB 1 5MB
- * 100MB 1 50MB
- * 1GB 3 250MB
- * 10GB 10 0.9GB
- * 100GB 31 3GB
- * 1TB 101 10GB
- * 10TB 320 32GB
- */
-static void __meminit calculate_zone_inactive_ratio(struct zone *zone)
-{
- unsigned int gb, ratio;
-
- /* Zone size in gigabytes */
- gb = zone->managed_pages >> (30 - PAGE_SHIFT);
- if (gb)
- ratio = int_sqrt(10 * gb);
- else
- ratio = 1;
-
- zone->inactive_ratio = ratio;
-}
-
-static void __meminit setup_per_zone_inactive_ratio(void)
-{
- struct zone *zone;
-
- for_each_zone(zone)
- calculate_zone_inactive_ratio(zone);
-}
-
-/*
* Initialise min_free_kbytes.
*
* For small machines we want it small (128k min). For large machines
@@ -6482,7 +6439,6 @@ int __meminit init_per_zone_wmark_min(void)
setup_per_zone_wmarks();
refresh_zone_stat_thresholds();
setup_per_zone_lowmem_reserve();
- setup_per_zone_inactive_ratio();
return 0;
}
module_init(init_per_zone_wmark_min)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b934223e..6f4f18c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1865,83 +1865,63 @@ static void shrink_active_list(unsigned long nr_to_scan,
free_hot_cold_page_list(&l_hold, true);
}

-#ifdef CONFIG_SWAP
-static bool inactive_anon_is_low_global(struct zone *zone)
-{
- unsigned long active, inactive;
-
- active = zone_page_state(zone, NR_ACTIVE_ANON);
- inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-
- return inactive * zone->inactive_ratio < active;
-}
-
-/**
- * inactive_anon_is_low - check if anonymous pages need to be deactivated
- * @lruvec: LRU vector to check
+/*
+ * The inactive anon list should be small enough that the VM never has
+ * to do too much work.
*
- * Returns true if the zone does not have enough inactive anon pages,
- * meaning some active anon pages need to be deactivated.
- */
-static bool inactive_anon_is_low(struct lruvec *lruvec)
-{
- /*
- * If we don't have swap space, anonymous page deactivation
- * is pointless.
- */
- if (!total_swap_pages)
- return false;
-
- if (!mem_cgroup_disabled())
- return mem_cgroup_inactive_anon_is_low(lruvec);
-
- return inactive_anon_is_low_global(lruvec_zone(lruvec));
-}
-#else
-static inline bool inactive_anon_is_low(struct lruvec *lruvec)
-{
- return false;
-}
-#endif
-
-/**
- * inactive_file_is_low - check if file pages need to be deactivated
- * @lruvec: LRU vector to check
+ * The inactive file list should be small enough to leave most memory
+ * to the established workingset on the scan-resistant active list,
+ * but large enough to avoid thrashing the aggregate readahead window.
*
- * When the system is doing streaming IO, memory pressure here
- * ensures that active file pages get deactivated, until more
- * than half of the file pages are on the inactive list.
+ * Both inactive lists should also be large enough that each inactive
+ * page has a chance to be referenced again before it is reclaimed.
*
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
+ * The inactive_ratio is the target ratio of ACTIVE to INACTIVE pages
+ * on this LRU, maintained by the pageout code. A zone->inactive_ratio
+ * of 3 means 3:1 or 25% of the pages are kept on the inactive list.
*
- * This uses a different ratio than the anonymous pages, because
- * the page cache uses a use-once replacement algorithm.
+ * total target max
+ * memory ratio inactive
+ * -------------------------------------
+ * 10MB 1 5MB
+ * 100MB 1 50MB
+ * 1GB 3 250MB
+ * 10GB 10 0.9GB
+ * 100GB 31 3GB
+ * 1TB 101 10GB
+ * 10TB 320 32GB
*/
-static bool inactive_file_is_low(struct lruvec *lruvec)
+static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
{
+ unsigned long inactive_ratio;
unsigned long inactive;
unsigned long active;
+ unsigned long gb;

- inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
- active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
+ /*
+ * If we don't have swap space, anonymous page deactivation
+ * is pointless.
+ */
+ if (!file && !total_swap_pages)
+ return false;

- return active > inactive;
-}
+ inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
+ active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);

-static bool inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
-{
- if (is_file_lru(lru))
- return inactive_file_is_low(lruvec);
+ gb = (inactive + active) >> (30 - PAGE_SHIFT);
+ if (gb)
+ inactive_ratio = int_sqrt(10 * gb);
else
- return inactive_anon_is_low(lruvec);
+ inactive_ratio = 1;
+
+ return inactive * inactive_ratio < active;
}

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
if (is_active_lru(lru)) {
- if (inactive_list_is_low(lruvec, lru))
+ if (inactive_list_is_low(lruvec, is_file_lru(lru)))
shrink_active_list(nr_to_scan, lruvec, sc, lru);
return 0;
}
@@ -2062,7 +2042,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* lruvec even if it has plenty of old anonymous pages unless the
* system is under heavy pressure.
*/
- if (!inactive_file_is_low(lruvec) &&
+ if (!inactive_list_is_low(lruvec, true) &&
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
@@ -2304,7 +2284,7 @@ static void shrink_zone_memcg(struct zone *zone, struct mem_cgroup *memcg,
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (inactive_anon_is_low(lruvec))
+ if (inactive_list_is_low(lruvec, false))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);

@@ -2965,7 +2945,7 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
do {
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

- if (inactive_anon_is_low(lruvec))
+ if (inactive_list_is_low(lruvec, false))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);

--
2.8.0

2016-04-04 18:52:50

by Andres Freund

[permalink] [raw]
Subject: Re: [PATCH 0/3] mm: support bigger cache workingsets and protect against writes

Hi Johannes,

On 2016-04-04 13:13:35 -0400, Johannes Weiner wrote:
> this is a follow-up to http://www.spinics.net/lists/linux-mm/msg101739.html
> where Andres reported his database workingset being pushed out by the
> minimum size enforcement of the inactive file list - currently 50% of cache
> - as well as repeatedly written file pages that are never actually read.

Thanks for following up!


> Andres, I tried reproducing your postgres scenario, but I could never get
> the WAL to interfere even with wal_log = hot_standby mode. It's a 8G
> machine, I set shared_buffers = 2GB, ran pgbench -i -s 290, and then -c 32
> -j 32 -M prepared -t 150000. Any input on how to trigger the thrashing you
> observed would be appreciated. But it would be great if you could test these
> patches on your known-problematic setup as well.

I'm unfortunately in the process of moving to the US (as in, I'm packing
boxes), so I can't get back to you just now. I'll try ASAP (early next
week).

Regards,

Andres

2016-04-04 21:22:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: filemap: only do access activations on reads

On Mon, 4 Apr 2016 13:13:37 -0400 Johannes Weiner <[email protected]> wrote:

> Andres Freund observed that his database workload is struggling with
> the transaction journal creating pressure on frequently read pages.
>
> Access patterns like transaction journals frequently write the same
> pages over and over, but in the majority of cases those pages are
> never read back. There are no caching benefits to be had for those
> pages, so activating them and having them put pressure on pages that
> do benefit from caching is a bad choice.

Read-after-write is a pretty common pattern: temporary files for
example. What are the opportunities for regressions here?

Did you consider providing userspace with a way to hint "this file is
probably write-then-not-read"?

2016-04-04 21:39:52

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: filemap: only do access activations on reads

On Mon, 2016-04-04 at 14:22 -0700, Andrew Morton wrote:
> On Mon,  4 Apr 2016 13:13:37 -0400 Johannes Weiner <[email protected]
> g> wrote:
>
> >
> > Andres Freund observed that his database workload is struggling
> > with
> > the transaction journal creating pressure on frequently read pages.
> >
> > Access patterns like transaction journals frequently write the same
> > pages over and over, but in the majority of cases those pages are
> > never read back. There are no caching benefits to be had for those
> > pages, so activating them and having them put pressure on pages
> > that
> > do benefit from caching is a bad choice.
> Read-after-write is a pretty common pattern: temporary files for
> example.  What are the opportunities for regressions here?
>
> Did you consider providing userspace with a way to hint "this file is
> probably write-then-not-read"?

I suspect the opportunity for regressions is fairly small,
considering that temporary files usually have a very short
life span, and will likely be read-after-written before they
get evicted from the inactive list.

As for hinting, I suspect it may make sense to differentiate
between whole page and partial page writes, where partial
page writes use FGP_ACCESSED, and whole page writes do not,
under the assumption that if we write a partial page, there
may be a higher chance that other parts of the page get
accessed again for other writes (or reads).

I do not know whether that assumption holds :)

--
All Rights Reversed.


Attachments:
signature.asc (473.00 B)
This is a digitally signed message part

2016-04-04 21:56:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: filemap: only do access activations on reads

On Mon, 04 Apr 2016 17:39:47 -0400 Rik van Riel <[email protected]> wrote:

> On Mon, 2016-04-04 at 14:22 -0700, Andrew Morton wrote:
> > On Mon,____4 Apr 2016 13:13:37 -0400 Johannes Weiner <[email protected]
> > g> wrote:
> >
> > >
> > > Andres Freund observed that his database workload is struggling
> > > with
> > > the transaction journal creating pressure on frequently read pages.
> > >
> > > Access patterns like transaction journals frequently write the same
> > > pages over and over, but in the majority of cases those pages are
> > > never read back. There are no caching benefits to be had for those
> > > pages, so activating them and having them put pressure on pages
> > > that
> > > do benefit from caching is a bad choice.
> > Read-after-write is a pretty common pattern: temporary files for
> > example.____What are the opportunities for regressions here?
> >
> > Did you consider providing userspace with a way to hint "this file is
> > probably write-then-not-read"?
>
> I suspect the opportunity for regressions is fairly small,
> considering that temporary files usually have a very short
> life span, and will likely be read-after-written before they
> get evicted from the inactive list.

The opportunity for regressions in the current code is fairly small,
but Andres found one :( If there's any possibility at all, someone will
hit it.

One possible way to move forward is to write testcases to deliberately
hit the predicted problem, gain an understanding of how hard it is to
hit, how bad the effects are.

> As for hinting, I suspect it may make sense to differentiate
> between whole page and partial page writes, where partial
> page writes use FGP_ACCESSED, and whole page writes do not,
> under the assumption that if we write a partial page, there
> may be a higher chance that other parts of the page get
> accessed again for other writes (or reads).

hm, the FGP_foo documentation is a mess. There's some placed randomly
at pagecache_get_page() and FGP_WRITE got missed altogether.

The ext4 journal would be a decent (but not very significant) candidate
for a "this is never read from" interface. I guess the fs could
manually deactivate (or even free?) the pages.

2016-04-04 22:48:03

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: filemap: only do access activations on reads

On Mon, Apr 04, 2016 at 02:22:33PM -0700, Andrew Morton wrote:
> On Mon, 4 Apr 2016 13:13:37 -0400 Johannes Weiner <[email protected]> wrote:
>
> > Andres Freund observed that his database workload is struggling with
> > the transaction journal creating pressure on frequently read pages.
> >
> > Access patterns like transaction journals frequently write the same
> > pages over and over, but in the majority of cases those pages are
> > never read back. There are no caching benefits to be had for those
> > pages, so activating them and having them put pressure on pages that
> > do benefit from caching is a bad choice.
>
> Read-after-write is a pretty common pattern: temporary files for
> example. What are the opportunities for regressions here?

The read(s) following the write will call mark_page_accessed() and so
promote the pages if their data is in fact repeatedly accessed. That
makes sense, because the writes really don't say anything about the
cache-worthiness. One write followed by one read shouldn't mean the
data is strongly benefiting from being cached. Only multiple reads.

What complicates that a little bit is that when the multiple reads do
happen on write-instantiated pages, the pages might have already been
aged somewhat in between, whereas fresh-faulting reads start counting
accesses from the head of the LRU right away. If both have re-use
distances shorter than memory, the LRU offset of pages instantiated by
writes could push the second access past eviction.

In that case, they would likely get picked up by refault detection and
promoted after all. So it would be one more IO, but nothing permanent.

This is also somewhat compensated by the dirty cache delaying reclaim
and giving these pages another round-trip anyway - unless dirty limits
cause the pages to be written back before they reach the LRU tail.

It's really hard to tell whether that would even be an issue since it
depends on whether a workload matching those parameters even exist. A
synthetic test doesn't really say us much about that. I think all we
can do here is decide whether the cache semantics make logical sense.

One thing I proposed in the thread that would compensate for the LRU
offset of write-instantiated pages would be to set PageReferenced on
these pages but never call mark_page_accessed() from the write. This
wouldn't be perfect because the distance between write and read does
not necessarily predict the distance between the subsequent reads, but
it would mean that the first read would promote the pages, whereas
repeatedly written files would never be activated or refault-activate.

Would that make sense? Is there something I'm missing?

> Did you consider providing userspace with a way to hint "this file is
> probably write-then-not-read"?

Yes, but I'm not too confident in that working out :(

2016-04-05 17:51:12

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: filemap: only do access activations on reads

On Mon, Apr 04, 2016 at 05:39:47PM -0400, Rik van Riel wrote:
> As for hinting, I suspect it may make sense to differentiate
> between whole page and partial page writes, where partial
> page writes use FGP_ACCESSED, and whole page writes do not,
> under the assumption that if we write a partial page, there
> may be a higher chance that other parts of the page get
> accessed again for other writes (or reads).

The writeback cache should handle at least the multiple subpage writes
case.

What I find a little weird about counting accesses from partial writes
only is when a write covers a full page and then parts of the next. We
would cache only a small piece of what's likely one coherent chunk.

Or when a user writes out several pages in a loop of subpage chunks.

This will get even worse to program against when we start having page
cache transparently backed by pages of different sizes.

Because of that I think it'd be better to apply LRU aging decisions
based on type of access rather based on specific request sizes.