Cc list similar to "congestion_wait() and GFP_NOFAIL" as they're loosely
related.
This is a prototype series that removes all calls to congestion_wait
in mm/ and deletes wait_iff_congested. It's not a clever
implementation but congestion_wait has been broken for a long time
(https://lore.kernel.org/linux-mm/[email protected]/).
Even if it worked, it was never a great idea. While excessive
dirty/writeback pages at the tail of the LRU is one reason that
reclaim may be slow, there is also the problem of too many pages being
isolated and of reclaim failing for other reasons (elevated references,
excessive LRU contention etc).
This series replaces the reclaim conditions with event-driven ones:
o If there are too many dirty/writeback pages, sleep until a timeout
or enough pages get cleaned
o If too many pages are isolated, sleep until enough isolated pages
are either reclaimed or put back on the LRU
o If no progress is being made, let direct reclaim tasks sleep until
another task makes progress
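
As a rough illustration of the first condition ("enough pages get
cleaned"), the check visible in the __acct_reclaim_writeback() hunk of
mm/vmscan.c in this thread can be modelled in userspace. This is only a
sketch: the function name writeback_should_wake() is made up for the
example, and SWAP_CLUSTER_MAX is assumed to be 32 as in
include/linux/swap.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value; SWAP_CLUSTER_MAX is 32 in include/linux/swap.h. */
#define SWAP_CLUSTER_MAX 32UL

/*
 * Userspace model (hypothetical helper, not kernel code) of the wake-up
 * test: throttled tasks are woken once more than SWAP_CLUSTER_MAX pages
 * per throttled task have been written since throttling started.
 */
bool writeback_should_wake(unsigned long nr_written,
			   unsigned long nr_throttled)
{
	assert(nr_throttled > 0);	/* only meaningful with waiters */
	return nr_written > SWAP_CLUSTER_MAX * nr_throttled;
}
```

With one throttled task the wake-up needs 33 completed pages, with two
it needs 65, so the threshold scales with the number of waiters.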
This has been lightly tested only and the testing was useless as the
relevant code was not executed. The workload configurations I had that
used to trigger these corner cases no longer work (yey?) and I'll need
to implement a new synthetic workload. If someone is aware of a realistic
workload that forces reclaim activity to the point where reclaim stalls
then kindly share the details.
Mel Gorman (5):
mm/vmscan: Throttle reclaim until some writeback completes if
congested
mm/vmscan: Throttle reclaim and compaction when too many pages are
isolated
mm/vmscan: Throttle reclaim when no progress is being made
mm/writeback: Throttle based on page writeback instead of congestion
mm/page_alloc: Remove the throttling logic from the page allocator
include/linux/backing-dev.h | 1 -
include/linux/mmzone.h | 12 ++++
include/trace/events/vmscan.h | 38 +++++++++++
include/trace/events/writeback.h | 7 --
mm/backing-dev.c | 48 --------------
mm/compaction.c | 2 +-
mm/filemap.c | 1 +
mm/internal.h | 11 ++++
mm/memcontrol.c | 10 +--
mm/page-writeback.c | 11 +++-
mm/page_alloc.c | 26 ++------
mm/vmscan.c | 110 ++++++++++++++++++++++++++++---
mm/vmstat.c | 1 +
13 files changed, 180 insertions(+), 98 deletions(-)
--
2.31.1
Page reclaim throttles on congestion if too many parallel reclaim instances
have isolated too many pages. This makes no sense as excessive
parallelisation has nothing to do with writeback or congestion.
This patch creates an additional wait queue to sleep on when too many
pages are isolated. The throttled tasks are woken when the number
of isolated pages is reduced or a timeout occurs. There may be
some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
the tasks will throttle again if necessary.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 4 +++-
include/trace/events/vmscan.h | 4 +++-
mm/compaction.c | 2 +-
mm/internal.h | 2 ++
mm/page_alloc.c | 6 +++++-
mm/vmscan.c | 22 ++++++++++++++++------
6 files changed, 30 insertions(+), 10 deletions(-)
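
The shape of the change (one wait queue per throttle reason, a timed
sleep, and a wake-all on the matching queue) can be sketched in
userspace with POSIX condition variables. This is an analogy under
stated assumptions, not the kernel implementation: struct
node_throttle, throttle_wait() and throttle_wake_all() are names
invented for this example, and the sequence counter stands in for the
kernel's wake-up semantics.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

enum throttle_reason { THROTTLE_WRITEBACK, THROTTLE_ISOLATED, NR_THROTTLE };

/* Hypothetical per-node state: one condition variable per reason. */
struct node_throttle {
	pthread_mutex_t lock;
	pthread_cond_t wait[NR_THROTTLE];
	unsigned long wake_seq[NR_THROTTLE];	/* bumped on every wake-all */
};

void node_throttle_init(struct node_throttle *nt)
{
	pthread_mutex_init(&nt->lock, NULL);
	for (int i = 0; i < NR_THROTTLE; i++) {
		pthread_cond_init(&nt->wait[i], NULL);
		nt->wake_seq[i] = 0;
	}
}

/* Sleep on the queue for `reason`; true if woken, false on timeout. */
bool throttle_wait(struct node_throttle *nt, enum throttle_reason reason,
		   long timeout_ms)
{
	struct timespec deadline;
	unsigned long seq;
	bool woken;
	int rc = 0;

	assert(reason < NR_THROTTLE && timeout_ms >= 0);

	/* Absolute deadline for pthread_cond_timedwait(). */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&nt->lock);
	seq = nt->wake_seq[reason];
	/* Loop to absorb spurious wake-ups; stop on event or timeout. */
	while (nt->wake_seq[reason] == seq && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&nt->wait[reason], &nt->lock,
					    &deadline);
	woken = nt->wake_seq[reason] != seq;
	pthread_mutex_unlock(&nt->lock);
	return woken;
}

/* Wake every task sleeping on the queue for `reason` only. */
void throttle_wake_all(struct node_throttle *nt, enum throttle_reason reason)
{
	assert(reason < NR_THROTTLE);
	pthread_mutex_lock(&nt->lock);
	nt->wake_seq[reason]++;
	pthread_cond_broadcast(&nt->wait[reason]);
	pthread_mutex_unlock(&nt->lock);
}
```

Indexing the queues by reason means a writeback completion never wakes
tasks that are waiting for isolated pages to drain, and vice versa.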
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef0a63ebd21d..ca65d6a64bdd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -275,6 +275,8 @@ enum lru_list {
enum vmscan_throttle_state {
VMSCAN_THROTTLE_WRITEBACK,
+ VMSCAN_THROTTLE_ISOLATED,
+ NR_VMSCAN_THROTTLE,
};
#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
@@ -846,7 +848,7 @@ typedef struct pglist_data {
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
- wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
+ wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
unsigned long nr_reclaim_start; /* nr pages written while throttled
* when throttling started. */
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c317f9fe0d17..d4905bd9e9c4 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -28,10 +28,12 @@
) : "RECLAIM_WB_NONE"
#define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
+#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED)
#define show_throttle_flags(flags) \
(flags) ? __print_flags(flags, "|", \
- {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"} \
+ {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
+ {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \
) : "VMSCAN_THROTTLE_NONE"
diff --git a/mm/compaction.c b/mm/compaction.c
index bfc93da1c2c7..221c9c10ad7e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -822,7 +822,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (cc->mode == MIGRATE_ASYNC)
return -EAGAIN;
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
if (fatal_signal_pending(current))
return -EINTR;
diff --git a/mm/internal.h b/mm/internal.h
index e25b3686bfab..e6cd22fb5a43 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -118,6 +118,8 @@ extern unsigned long highest_memmap_pfn;
*/
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+ long timeout);
/*
* in mm/rmap.c:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d849ddfc1e51..78e538067651 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7389,6 +7389,8 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
{
+ int i;
+
pgdat_resize_init(pgdat);
pgdat_init_split_queue(pgdat);
@@ -7396,7 +7398,9 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
- init_waitqueue_head(&pgdat->reclaim_wait);
+
+ for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
+ init_waitqueue_head(&pgdat->reclaim_wait[i]);
pgdat_page_ext_init(pgdat);
lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b58ea0b13286..eb81dcac15b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}
-static void
-reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
long timeout)
{
- wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+ wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
unsigned long start = jiffies;
long ret;
DEFINE_WAIT(wait);
@@ -1044,7 +1043,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
READ_ONCE(pgdat->nr_reclaim_start);
if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
- wake_up_interruptible_all(&pgdat->reclaim_wait);
+ wake_up_interruptible_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
}
/* possible outcome of pageout() */
@@ -2159,6 +2158,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
struct scan_control *sc)
{
unsigned long inactive, isolated;
+ bool too_many;
if (current_is_kswapd())
return 0;
@@ -2182,6 +2182,17 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
inactive >>= 3;
+ too_many = isolated > inactive;
+
+ /* Wake up tasks throttled due to too_many_isolated. */
+ if (!too_many) {
+ wait_queue_head_t *wqh;
+
+ wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
+ if (waitqueue_active(wqh))
+ wake_up_interruptible_all(wqh);
+ }
+
return isolated > inactive;
}
@@ -2291,8 +2302,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
return 0;
/* wait a bit for the reclaimer. */
- msleep(100);
- stalled = true;
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
--
2.31.1
The page allocator stalls based on the number of pages that are
waiting for writeback to start but this should now be redundant.
shrink_inactive_list() will wake flusher threads if the tail of the LRU
consists of unqueued dirty pages so the flusher should be active. If it fails to make
progress due to pages under writeback not being completed quickly then
it should stall on VMSCAN_THROTTLE_WRITEBACK.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78e538067651..8fa0109ff417 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4795,30 +4795,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
trace_reclaim_retry_zone(z, order, reclaimable,
available, min_wmark, *no_progress_loops, wmark);
if (wmark) {
- /*
- * If we didn't make any progress and have a lot of
- * dirty + writeback pages then we should wait for
- * an IO to complete to slow down the reclaim and
- * prevent from pre mature OOM
- */
- if (!did_some_progress) {
- unsigned long write_pending;
-
- write_pending = zone_page_state_snapshot(zone,
- NR_ZONE_WRITE_PENDING);
-
- if (2 * write_pending > reclaimable) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
- return true;
- }
- }
-
ret = true;
- goto out;
+ break;
}
}
-out:
/*
* Memory allocation/reclaim might be called from a WQ context and the
* current implementation of the WQ concurrency control doesn't
--
2.31.1
On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> This has been lightly tested only and the testing was useless as the
> relevant code was not executed. The workload configurations I had that
> used to trigger these corner cases no longer work (yey?) and I'll need
> to implement a new synthetic workload. If someone is aware of a realistic
> workload that forces reclaim activity to the point where reclaim stalls
> then kindly share the details.
The stereotypical "stalling on I/O" problem is to plug in one of the
crap USB drives you were given at a trade show and simply
dd if=/dev/zero of=/dev/sdb
sync
You can also set up qemu to have extremely slow I/O performance:
https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images
On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
>
> The stereotypical "stalling on I/O" problem is to plug in one of the
> crap USB drives you were given at a trade show and simply
> dd if=/dev/zero of=/dev/sdb
> sync
>
The test machines are 1500KM away so plugging in a USB stick is not an
option but, if worst comes to the worst, I could test it on a laptop. I
considered using the
IO controller but I'm not sure that would throttle background writeback.
I dismissed doing this for a few reasons though -- the dirtying should
be rate limited based on the speed of the BDI so it will not necessarily
trigger the condition. It also misses the other interesting cases --
throttling due to excessive isolation and throttling due to failing to
make progress.
I've prototyped a synthetic case that uses 4..(NR_CPUS*4) workers. 1
worker measures mmap/munmap latency. 1 worker under fio is randomly reading
files. The remaining workers are split between fio doing random write IO
on separate files and anonymous memory hogs reading large mappings every
5 seconds. The aggregate WSS is approximately totalmem*2 split between 60%
anon and 40% file-backed (40% to be 2xdirty_ratio). After a warmup period
based on the writeback speed, it runs for 5 minutes per number of workers.
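
For reference, the sizing arithmetic described above can be written
down as a small helper. This is only a sketch of the stated ratios
(aggregate WSS of twice total memory, split 60% anon and 40%
file-backed); struct wss_split and wss_for() are hypothetical names and
say nothing about how the real test harness is implemented.

```c
#include <assert.h>

/* Hypothetical container for the working set split, sizes in MB. */
struct wss_split {
	unsigned long total_mb;	/* aggregate working set */
	unsigned long anon_mb;	/* anonymous memory hogs */
	unsigned long file_mb;	/* file-backed fio data */
};

/* WSS is approximately totalmem*2, split 60% anon / 40% file-backed. */
struct wss_split wss_for(unsigned long totalmem_mb)
{
	struct wss_split s;

	assert(totalmem_mb > 0);
	s.total_mb = totalmem_mb * 2;
	s.file_mb = s.total_mb * 40 / 100;
	s.anon_mb = s.total_mb - s.file_mb;
	return s;
}
```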
The primary metric of "goodness" will be the mmap latency because it's
the smallest worker that should be able to make quick progress and I
want to see how much it is interfered with during reclaim. I'll be
graphing the throttling times to see what processes get throttled and
for how long.
I was hoping though that there was a canonical realistic case that the
FS people use to stress the paths where the allocator fails to return
memory. While my synthetic workload *might* work to trigger the cases,
I would prefer to have something that can compare this basic approach
with anything that is more clever.
Similarly, it would be nice to have a reasonable test case that phase
changes what memory is hot while there is heavy IO in the background to
detect whether the hot WSS is being properly protected. I used to use
memcached and a heavy writer to simulate this but it's weak because there
is no phase change so it's poor at evaluating vmscan.
> You can also set up qemu to have extremely slow I/O performance:
> https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images
>
Similar problem to the slow USB case, it's only catching one part of the
picture except now I have to worry about differences that are related
to the VM configuration (e.g. pinning virtual CPUs to physical CPUs
and replicating topology). Fine for a functional test, not so fine for
measuring if the patch is any good performance-wise.
--
Mel Gorman
SUSE Labs
On Mon, Sep 20, 2021 at 01:50:58PM +0100, Mel Gorman wrote:
> On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > > This has been lightly tested only and the testing was useless as the
> > > relevant code was not executed. The workload configurations I had that
> > > used to trigger these corner cases no longer work (yey?) and I'll need
> > > to implement a new synthetic workload. If someone is aware of a realistic
> > > workload that forces reclaim activity to the point where reclaim stalls
> > > then kindly share the details.
> >
> > The stereotypical "stalling on I/O" problem is to plug in one of the
> > crap USB drives you were given at a trade show and simply
> > dd if=/dev/zero of=/dev/sdb
> > sync
> >
>
> The test machines are 1500KM away so plugging in a USB stick but worst
> comes to the worst, I could test it on a laptop.
There's a device mapper target dm-delay [1] that as it says delays the
reads and writes, so you could try to emulate the slow USB that way.
[1] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html
On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
>
> The stereotypical "stalling on I/O" problem is to plug in one of the
> crap USB drives you were given at a trade show and simply
> dd if=/dev/zero of=/dev/sdb
> sync
>
> You can also set up qemu to have extremely slow I/O performance:
> https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images
>
Ok, I managed to get something working and nothing blew up.
The workload was similar to what I described except the dirty file data
is related to dirty_ratio, the memory hogs no longer sleep and I disabled
the parallel readers. There is still a configuration with the parallel
readers but I won't have the results till tomorrow.
Surprising no one, vanilla kernel throttling barely works.
1 writeback_wait_iff_congested: usec_delayed=4000
3 writeback_congestion_wait: usec_delayed=108000
196 writeback_congestion_wait: usec_delayed=104000
16697 writeback_wait_iff_congested: usec_delayed=0
too_many_isolated is not tracked at all so we don't know what that looks
like but kswapd "blocking" on dirty pages at the tail basically never
stalls. The few congestion_wait's that did happen stalled for the full
duration as the bdi is not tracking congestion at all.
With the series, the breakdown of reasons to stall was
5703 reason=VMSCAN_THROTTLE_WRITEBACK
29644 reason=VMSCAN_THROTTLE_NOPROGRESS
1979999 reason=VMSCAN_THROTTLE_ISOLATED
kswapd stalls were rare but they did happen and surprise surprise, it
was dirty pages
914 reason=VMSCAN_THROTTLE_WRITEBACK
All of them stalled for the full timeout so there might be a bug in
patch 1 because that sounds suspicious.
As "too many pages isolated" was the top reason, the frequency of each
stall time is as follows
1 usect_delayed=164000
1 usect_delayed=192000
1 usect_delayed=200000
1 usect_delayed=208000
1 usect_delayed=220000
1 usect_delayed=244000
1 usect_delayed=308000
1 usect_delayed=312000
1 usect_delayed=316000
1 usect_delayed=332000
1 usect_delayed=588000
1 usect_delayed=620000
1 usect_delayed=836000
3 usect_delayed=116000
4 usect_delayed=124000
4 usect_delayed=128000
6 usect_delayed=120000
9 usect_delayed=112000
11 usect_delayed=100000
13 usect_delayed=48000
13 usect_delayed=96000
14 usect_delayed=40000
15 usect_delayed=88000
15 usect_delayed=92000
16 usect_delayed=80000
18 usect_delayed=68000
19 usect_delayed=76000
22 usect_delayed=84000
23 usect_delayed=108000
23 usect_delayed=60000
25 usect_delayed=44000
25 usect_delayed=52000
29 usect_delayed=36000
30 usect_delayed=56000
30 usect_delayed=64000
33 usect_delayed=72000
57 usect_delayed=32000
91 usect_delayed=20000
107 usect_delayed=24000
125 usect_delayed=28000
131 usect_delayed=16000
180 usect_delayed=12000
186 usect_delayed=8000
1379 usect_delayed=104000
16493 usect_delayed=4000
1960837 usect_delayed=0
In other words, the vast majority of stalls were for 0 time and the task
was immediately woken again. The next most common stall time was 1 tick
but a sizable number reach the full timeout. Everything else is somewhere
in between so the event trigger appears to be ok.
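
To put numbers on "the vast majority", the shares can be computed from
the counts in the histogram above, which sum to the 1979999
VMSCAN_THROTTLE_ISOLATED stalls reported earlier. The helper below is
a sketch for illustration; share_bp() is a made-up name and the
constants are copied from the trace output, with 104000 usec taken to
be the "full timeout" bucket.

```c
#include <assert.h>

/* Counts copied from the VMSCAN_THROTTLE_ISOLATED histogram. */
#define STALLS_TOTAL	1979999ULL
#define STALLS_ZERO	1960837ULL	/* usect_delayed=0 */
#define STALLS_ONE_TICK	16493ULL	/* usect_delayed=4000 */
#define STALLS_TIMEOUT	1379ULL		/* usect_delayed=104000 (assumed full timeout) */

/* Share of count in total, in hundredths of a percent, rounded down. */
unsigned long long share_bp(unsigned long long count,
			    unsigned long long total)
{
	assert(total > 0 && count <= total);
	return count * 10000ULL / total;
}
```

By this measure roughly 99.03% of the stalls were for 0 time, about
0.83% lasted one tick and about 0.06% ran to the full timeout.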
I don't know how the application itself performed as I still have to
write the analysis script. Assuming I can look at this tomorrow, I'll
probably start with why VMSCAN_THROTTLE_WRITEBACK always stalled for the
full timeout.
--
Mel Gorman
SUSE Labs
On Mon, 20 Sep 2021, Mel Gorman wrote:
> @@ -2291,8 +2302,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> return 0;
>
> /* wait a bit for the reclaimer. */
> - msleep(100);
> - stalled = true;
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
Why drop the assignment to "stalled"?
Doing that changes the character of the loop - and makes the 'stalled'
variable always 'false'.
NeilBrown
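
The effect on the loop can be seen in a small userspace model. It is a
sketch only: isolated_wait_loop() and its clears_after parameter are
hypothetical, standing in for the while (too_many_isolated()) loop in
shrink_inactive_list() and for the point at which isolation pressure
goes away.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the too_many_isolated() wait loop. clears_after is
 * a hypothetical knob: the number of sleeps after which the condition
 * becomes false. Returns the number of sleeps taken; *gave_up is set
 * when the loop bails out while the condition is still true.
 */
int isolated_wait_loop(int clears_after, bool set_stalled, bool *gave_up)
{
	bool stalled = false;
	int sleeps = 0;

	assert(clears_after >= 0 && gave_up);
	*gave_up = false;

	while (sleeps < clears_after) {	/* models too_many_isolated() */
		if (stalled) {
			*gave_up = true;	/* "return 0" in the kernel */
			break;
		}
		sleeps++;		/* msleep(100) or reclaim_throttle() */
		if (set_stalled)
			stalled = true;
	}
	return sleeps;
}
```

With the assignment the caller sleeps at most once and then gives up;
without it the caller keeps sleeping until the condition clears.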
On Tue, Sep 21, 2021 at 09:27:56AM +1000, NeilBrown wrote:
> On Mon, 20 Sep 2021, Mel Gorman wrote:
> > @@ -2291,8 +2302,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > return 0;
> >
> > /* wait a bit for the reclaimer. */
> > - msleep(100);
> > - stalled = true;
> > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
>
> Why drop the assignment to "stalled"?
> Doing that changes the character of the loop - and makes the 'stalled'
> variable always 'false'.
>
This was a thought that was never completed. The intent was that, if
there are too many pages isolated, reclaim should not return prematurely
and do busy work elsewhere. It potentially means an allocation request
moves to lower zones or remote nodes prematurely but I never did the
full removal. Even if I had, on reflection, that type of behavioural
change does not belong in this series.
I've restored the "stalled = true".
--
Mel Gorman
SUSE Labs
On Mon, Sep 20, 2021 at 04:11:52PM +0200, David Sterba wrote:
> On Mon, Sep 20, 2021 at 01:50:58PM +0100, Mel Gorman wrote:
> > On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> > > On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > > > This has been lightly tested only and the testing was useless as the
> > > > relevant code was not executed. The workload configurations I had that
> > > > used to trigger these corner cases no longer work (yey?) and I'll need
> > > > to implement a new synthetic workload. If someone is aware of a realistic
> > > > workload that forces reclaim activity to the point where reclaim stalls
> > > > then kindly share the details.
> > >
> > > The stereotypical "stalling on I/O" problem is to plug in one of the
> > > crap USB drives you were given at a trade show and simply
> > > dd if=/dev/zero of=/dev/sdb
> > > sync
> > >
> >
> > The test machines are 1500KM away so plugging in a USB stick but worst
> > comes to the worst, I could test it on a laptop.
>
> There's a device mapper target dm-delay [1] that as it says delays the
> reads and writes, so you could try to emulate the slow USB that way.
>
> [1] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html
Ah, thanks for that tip. I wondered if something like this existed and
clearly did not search hard enough. I was able to reproduce the problem
without throttling but this could still be useful if examining cases
where there are 2 or more BDIs with variable speeds.
--
Mel Gorman
SUSE Labs
On Mon, Sep 20, 2021 at 1:55 AM Mel Gorman <[email protected]> wrote:
>
> Page reclaim throttles on congestion if too many parallel reclaim instances
> have isolated too many pages. This makes no sense, excessive parallelisation
> has nothing to do with writeback or congestion.
>
> This patch creates an additional workqueue to sleep on when too many
> pages are isolated. The throttled tasks are woken when the number
> of isolated pages is reduced or a timeout occurs. There may be
> some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
> the tasks will throttle again if necessary.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/mmzone.h | 4 +++-
> include/trace/events/vmscan.h | 4 +++-
> mm/compaction.c | 2 +-
> mm/internal.h | 2 ++
> mm/page_alloc.c | 6 +++++-
> mm/vmscan.c | 22 ++++++++++++++++------
> 6 files changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ef0a63ebd21d..ca65d6a64bdd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -275,6 +275,8 @@ enum lru_list {
>
> enum vmscan_throttle_state {
> VMSCAN_THROTTLE_WRITEBACK,
> + VMSCAN_THROTTLE_ISOLATED,
> + NR_VMSCAN_THROTTLE,
> };
>
> #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> @@ -846,7 +848,7 @@ typedef struct pglist_data {
> int node_id;
> wait_queue_head_t kswapd_wait;
> wait_queue_head_t pfmemalloc_wait;
> - wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
> + wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
> atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
> unsigned long nr_reclaim_start; /* nr pages written while throttled
> * when throttling started. */
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index c317f9fe0d17..d4905bd9e9c4 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -28,10 +28,12 @@
> ) : "RECLAIM_WB_NONE"
>
> #define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
> +#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED)
>
> #define show_throttle_flags(flags) \
> (flags) ? __print_flags(flags, "|", \
> - {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"} \
> + {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
> + {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \
> ) : "VMSCAN_THROTTLE_NONE"
>
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index bfc93da1c2c7..221c9c10ad7e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -822,7 +822,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> if (cc->mode == MIGRATE_ASYNC)
> return -EAGAIN;
>
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
It seems the wakeup is missing from compaction's too_many_isolated().
There are two too_many_isolated() functions, one for compaction and one
for reclaim. I saw that the wakeup code was added to the reclaim one
below. Or was the compaction one left out intentionally?
>
> if (fatal_signal_pending(current))
> return -EINTR;
> diff --git a/mm/internal.h b/mm/internal.h
> index e25b3686bfab..e6cd22fb5a43 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -118,6 +118,8 @@ extern unsigned long highest_memmap_pfn;
> */
> extern int isolate_lru_page(struct page *page);
> extern void putback_lru_page(struct page *page);
> +extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> + long timeout);
>
> /*
> * in mm/rmap.c:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d849ddfc1e51..78e538067651 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7389,6 +7389,8 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
>
> static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
> {
> + int i;
> +
> pgdat_resize_init(pgdat);
>
> pgdat_init_split_queue(pgdat);
> @@ -7396,7 +7398,9 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>
> init_waitqueue_head(&pgdat->kswapd_wait);
> init_waitqueue_head(&pgdat->pfmemalloc_wait);
> - init_waitqueue_head(&pgdat->reclaim_wait);
> +
> + for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
> + init_waitqueue_head(&pgdat->reclaim_wait[i]);
>
> pgdat_page_ext_init(pgdat);
> lruvec_init(&pgdat->__lruvec);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b58ea0b13286..eb81dcac15b2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
> unlock_page(page);
> }
>
> -static void
> -reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> long timeout)
> {
> - wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> + wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
> unsigned long start = jiffies;
> long ret;
> DEFINE_WAIT(wait);
> @@ -1044,7 +1043,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
> READ_ONCE(pgdat->nr_reclaim_start);
>
> if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> - wake_up_interruptible_all(&pgdat->reclaim_wait);
> + wake_up_interruptible_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
> }
>
> /* possible outcome of pageout() */
> @@ -2159,6 +2158,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
> struct scan_control *sc)
> {
> unsigned long inactive, isolated;
> + bool too_many;
>
> if (current_is_kswapd())
> return 0;
> @@ -2182,6 +2182,17 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
> if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
> inactive >>= 3;
>
> + too_many = isolated > inactive;
> +
> + /* Wake up tasks throttled due to too_many_isolated. */
> + if (!too_many) {
> + wait_queue_head_t *wqh;
> +
> + wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
> + if (waitqueue_active(wqh))
> + wake_up_interruptible_all(wqh);
> + }
> +
> return isolated > inactive;
Just return too_many?
> }
>
> @@ -2291,8 +2302,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> return 0;
>
> /* wait a bit for the reclaimer. */
> - msleep(100);
> - stalled = true;
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
> --
> 2.31.1
>
>
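
For illustration, the tail of the reclaim-side too_many_isolated()
with the "just return too_many" suggestion applied could look like the
userspace model below. It is a sketch under assumptions:
too_many_isolated_model() is a made-up name, the gfp test is reduced
to a boolean, and the wake-up is reported through a flag instead of
calling wake_up_interruptible_all().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model of the end of the reclaim-side too_many_isolated().
 * may_io_and_fs stands in for the __GFP_IO | __GFP_FS test; callers with
 * both flags set are held to an eighth of the inactive count.
 */
bool too_many_isolated_model(unsigned long inactive, unsigned long isolated,
			     bool may_io_and_fs, bool *wake_waiters)
{
	bool too_many;

	assert(wake_waiters != NULL);

	if (may_io_and_fs)
		inactive >>= 3;

	too_many = isolated > inactive;

	/* Waiters are woken whenever the caller is below its threshold. */
	*wake_waiters = !too_many;

	return too_many;	/* reuse the value instead of recomputing */
}
```

With 800 inactive and 150 isolated pages, a caller with both flags set
is throttled (150 > 100) while a GFP_NOIO/GFP_NOFS caller is not and
would wake the waiters.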
On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> Cc list similar to "congestion_wait() and GFP_NOFAIL" as they're loosely
> related.
>
> This is a prototype series that removes all calls to congestion_wait
> in mm/ and deletes wait_iff_congested. It's not a clever
> implementation but congestion_wait has been broken for a long time
> (https://lore.kernel.org/linux-mm/[email protected]/).
> Even if it worked, it was never a great idea. While excessive
> dirty/writeback pages at the tail of the LRU is one possibility that
> reclaim may be slow, there is also the problem of too many pages being
> isolated and reclaim failing for other reasons (elevated references,
> too many pages isolated, excessive LRU contention etc).
>
> This series replaces the reclaim conditions with event driven ones
>
> o If there are too many dirty/writeback pages, sleep until a timeout
> or enough pages get cleaned
> o If too many pages are isolated, sleep until enough isolated pages
> are either reclaimed or put back on the LRU
> o If no progress is being made, let direct reclaim tasks sleep until
> another task makes progress
>
> This has been lightly tested only and the testing was useless as the
> relevant code was not executed. The workload configurations I had that
> used to trigger these corner cases no longer work (yey?) and I'll need
> to implement a new synthetic workload. If someone is aware of a realistic
> workload that forces reclaim activity to the point where reclaim stalls
> then kindly share the details.
Got a git tree pointer so I can pull it into a test kernel so I can
see what impact it has on behaviour before I try to make sense of
the code?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Sep 21, 2021 at 11:45:19AM -0700, Yang Shi wrote:
> On Mon, Sep 20, 2021 at 1:55 AM Mel Gorman <[email protected]> wrote:
> >
> > Page reclaim throttles on congestion if too many parallel reclaim instances
> > have isolated too many pages. This makes no sense, excessive parallelisation
> > has nothing to do with writeback or congestion.
> >
> > This patch creates an additional workqueue to sleep on when too many
> > pages are isolated. The throttled tasks are woken when the number
> > of isolated pages is reduced or a timeout occurs. There may be
> > some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
> > the tasks will throttle again if necessary.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/linux/mmzone.h | 4 +++-
> > include/trace/events/vmscan.h | 4 +++-
> > mm/compaction.c | 2 +-
> > mm/internal.h | 2 ++
> > mm/page_alloc.c | 6 +++++-
> > mm/vmscan.c | 22 ++++++++++++++++------
> > 6 files changed, 30 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index ef0a63ebd21d..ca65d6a64bdd 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -275,6 +275,8 @@ enum lru_list {
> >
> > enum vmscan_throttle_state {
> > VMSCAN_THROTTLE_WRITEBACK,
> > + VMSCAN_THROTTLE_ISOLATED,
> > + NR_VMSCAN_THROTTLE,
> > };
> >
> > #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> > @@ -846,7 +848,7 @@ typedef struct pglist_data {
> > int node_id;
> > wait_queue_head_t kswapd_wait;
> > wait_queue_head_t pfmemalloc_wait;
> > - wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
> > + wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
> > atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
> > unsigned long nr_reclaim_start; /* nr pages written while throttled
> > * when throttling started. */
> > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> > index c317f9fe0d17..d4905bd9e9c4 100644
> > --- a/include/trace/events/vmscan.h
> > +++ b/include/trace/events/vmscan.h
> > @@ -28,10 +28,12 @@
> > ) : "RECLAIM_WB_NONE"
> >
> > #define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
> > +#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED)
> >
> > #define show_throttle_flags(flags) \
> > (flags) ? __print_flags(flags, "|", \
> > - {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"} \
> > + {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
> > + {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \
> > ) : "VMSCAN_THROTTLE_NONE"
> >
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index bfc93da1c2c7..221c9c10ad7e 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -822,7 +822,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > if (cc->mode == MIGRATE_ASYNC)
> > return -EAGAIN;
> >
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
>
> It seems waking up tasks is missed in compaction's
> too_many_isolated(). There are two too_many_isolated(), one is for
> compaction, the other is for reclaimer. I saw the waking up code was
> added to the reclaimer's in the below. Or the compaction one is left
> out intentionally?
>
The compaction one was left out accidentally. I'll fix it, thanks.
--
Mel Gorman
SUSE Labs
On Wed, Sep 22, 2021 at 06:46:21AM +1000, Dave Chinner wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > Cc list similar to "congestion_wait() and GFP_NOFAIL" as they're loosely
> > related.
> >
> > This is a prototype series that removes all calls to congestion_wait
> > in mm/ and deletes wait_iff_congested. It's not a clever
> > implementation but congestion_wait has been broken for a long time
> > (https://lore.kernel.org/linux-mm/[email protected]/).
> > Even if it worked, it was never a great idea. While excessive
> > dirty/writeback pages at the tail of the LRU is one possibility that
> > reclaim may be slow, there is also the problem of too many pages being
> > isolated and reclaim failing for other reasons (elevated references,
> > too many pages isolated, excessive LRU contention etc).
> >
> > This series replaces the reclaim conditions with event driven ones
> >
> > o If there are too many dirty/writeback pages, sleep until a timeout
> > or enough pages get cleaned
> > o If too many pages are isolated, sleep until enough isolated pages
> > are either reclaimed or put back on the LRU
> > o If no progress is being made, let direct reclaim tasks sleep until
> > another task makes progress
> >
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
>
> Got a git tree pointer so I can pull it into a test kernel so I can
> see what impact it has on behaviour before I try to make sense of
> the code?
>
The current version I'm testing is at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v2r5

Only one test has completed and I won't be able to analyse the results
in detail for a few days but it's doing *something* for the workload that
is hammering reclaim.

                        5.15.0-rc1              5.15.0-rc1
                           vanilla  mm-reclaimcongest-v2r5
Duration User             10891.30                 9945.59
Duration System            5673.78                 2649.43
Duration Elapsed           2402.85                 2407.96

System CPU usage dropped by a lot. The workload runs for a fixed
duration so a difference in elapsed time is not interesting.

Ops Direct pages scanned       518791317.00            219956338.00
Ops Kswapd pages scanned       128555233.00            165439373.00
Ops Kswapd pages reclaimed      87830801.00             72216420.00
Ops Direct pages reclaimed      16114049.00             10408389.00
Ops Kswapd efficiency %               68.32                   43.65
Ops Kswapd velocity                53501.15                68705.20
Ops Direct efficiency %                3.11                    4.73
Ops Direct velocity               215906.66                 91345.5
Ops Percentage direct scans           80.14                   57.07
Ops Page writes by reclaim       4225921.00              2032865.00

Large reductions in direct pages scanned. Direct velocity is down by more
than half (presumably because direct reclaimers are getting throttled)
whereas the rate kswapd scans (velocity) is somewhat higher, probably
because direct reclaimers were throttled. Pages written from reclaim
context are about halved. Reclaim efficiency is low but that's expected
given the workload is basically trying to make it as hard as possible
for reclaim to make progress.
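
For anyone checking the derived rows in the table above, the following
is a minimal sketch of how they appear to follow from the raw counters.
The struct and function names are made up for illustration and are not
taken from the reporting tool; the formulas are an assumption that
happens to reproduce the reported values.

```c
#include <stdio.h>

/* Raw counters for one kernel, as listed in the table (hypothetical struct). */
struct reclaim_stats {
	double direct_scanned;
	double kswapd_scanned;
	double direct_reclaimed;
	double kswapd_reclaimed;
	double elapsed_sec;
};

/* Efficiency: percentage of scanned pages that were actually reclaimed. */
double efficiency_pct(double reclaimed, double scanned)
{
	return scanned > 0 ? 100.0 * reclaimed / scanned : 0.0;
}

/* Velocity: pages scanned per second of elapsed time. */
double velocity(double scanned, double elapsed_sec)
{
	return elapsed_sec > 0 ? scanned / elapsed_sec : 0.0;
}

/* Share of all page scanning that was done by direct reclaimers. */
double direct_scan_pct(const struct reclaim_stats *s)
{
	double total = s->direct_scanned + s->kswapd_scanned;

	return total > 0 ? 100.0 * s->direct_scanned / total : 0.0;
}

/* Print the derived rows for one kernel. */
void report(const char *name, const struct reclaim_stats *s)
{
	printf("%s: kswapd eff %.2f%% direct eff %.2f%% direct scans %.2f%%\n",
	       name,
	       efficiency_pct(s->kswapd_reclaimed, s->kswapd_scanned),
	       efficiency_pct(s->direct_reclaimed, s->direct_scanned),
	       direct_scan_pct(s));
}
```

Feeding in the vanilla column (518791317 direct scanned, 128555233 kswapd
scanned, 87830801 kswapd reclaimed, 2402.85 seconds elapsed) gives a kswapd
efficiency of roughly 68.32% and a direct scan share of roughly 80.14%,
in line with the table.
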
Kswapd is only getting throttled on writeback and is being woken before
the timeout of 100000 usec:
1 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
17 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
129 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
205 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
The number of throttle events for direct reclaimers was:
16909 reason=VMSCAN_THROTTLE_ISOLATED
77844 reason=VMSCAN_THROTTLE_NOPROGRESS
113415 reason=VMSCAN_THROTTLE_WRITEBACK
For the throttle events, 33% of them were NOPROGRESS hitting the full
timeout and 33% were WRITEBACK hitting the full timeout. If anything,
that would suggest increasing the max timeout as presumably they woke up
uselessly like Neil had suggested.
--
Mel Gorman
SUSE Labs