2024-02-09 02:31:49

by Chris Down

Subject: MGLRU premature memcg OOM on slow writes

Hi Yu,

When running with MGLRU I'm encountering premature OOMs when transferring files
to a slow disk.

On non-MGLRU setups, writeback flushers are awakened and get to work. But on
MGLRU, one can see OOM killer outputs like the following when doing an rsync
with a memory.max of 32M:

---

% systemd-run --user -t -p MemoryMax=32M -- rsync -rv ... /mnt/usb
Running as unit: run-u640.service
Press ^] three times within 1s to disconnect TTY.
sending incremental file list
...
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(713) [generator=3.2.7]

---

[41368.535735] Memory cgroup out of memory: Killed process 128824 (rsync) total-vm:14008kB, anon-rss:256kB, file-rss:5504kB, shmem-rss:0kB, UID:1000 pgtables:64kB oom_score_adj:200
[41369.847965] rsync invoked oom-killer: gfp_mask=0x408d40(GFP_NOFS|__GFP_NOFAIL|__GFP_ZERO|__GFP_ACCOUNT), order=0, oom_score_adj=200
[41369.847972] CPU: 1 PID: 128826 Comm: rsync Tainted: G S OE 6.7.4-arch1-1 #1 20d30c48b78a04be2046f4b305b40455f0b5b38b
[41369.847975] Hardware name: LENOVO 20WNS23A0G/20WNS23A0G, BIOS N35ET53W (1.53 ) 03/22/2023
[41369.847977] Call Trace:
[41369.847978] <TASK>
[41369.847980] dump_stack_lvl+0x47/0x60
[41369.847985] dump_header+0x45/0x1b0
[41369.847988] oom_kill_process+0xfa/0x200
[41369.847990] out_of_memory+0x244/0x590
[41369.847992] mem_cgroup_out_of_memory+0x134/0x150
[41369.847995] try_charge_memcg+0x76d/0x870
[41369.847998] ? try_charge_memcg+0xcd/0x870
[41369.848000] obj_cgroup_charge+0xb8/0x1b0
[41369.848002] kmem_cache_alloc+0xaa/0x310
[41369.848005] ? alloc_buffer_head+0x1e/0x80
[41369.848007] alloc_buffer_head+0x1e/0x80
[41369.848009] folio_alloc_buffers+0xab/0x180
[41369.848012] ? __pfx_fat_get_block+0x10/0x10 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848021] create_empty_buffers+0x1d/0xb0
[41369.848023] __block_write_begin_int+0x524/0x600
[41369.848026] ? __pfx_fat_get_block+0x10/0x10 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848031] ? __filemap_get_folio+0x168/0x2e0
[41369.848033] ? __pfx_fat_get_block+0x10/0x10 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848038] block_write_begin+0x52/0x120
[41369.848040] fat_write_begin+0x34/0x80 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848046] ? __pfx_fat_get_block+0x10/0x10 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848051] generic_perform_write+0xd6/0x240
[41369.848054] generic_file_write_iter+0x65/0xd0
[41369.848056] vfs_write+0x23a/0x400
[41369.848060] ksys_write+0x6f/0xf0
[41369.848063] do_syscall_64+0x61/0xe0
[41369.848065] ? do_user_addr_fault+0x304/0x670
[41369.848069] ? exc_page_fault+0x7f/0x180
[41369.848071] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[41369.848074] RIP: 0033:0x7965df71a184
[41369.848116] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 3e 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
[41369.848117] RSP: 002b:00007fffee661738 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[41369.848119] RAX: ffffffffffffffda RBX: 0000570f66343bb0 RCX: 00007965df71a184
[41369.848121] RDX: 0000000000040000 RSI: 0000570f66343bb0 RDI: 0000000000000003
[41369.848122] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000570f66343b20
[41369.848122] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000649
[41369.848123] R13: 0000570f651f8b40 R14: 0000000000008000 R15: 0000570f6633bba0
[41369.848125] </TASK>
[41369.848126] memory: usage 32768kB, limit 32768kB, failcnt 21239
[41369.848126] swap: usage 2112kB, limit 9007199254740988kB, failcnt 0
[41369.848127] Memory cgroup stats for /user.slice/user-1000.slice/[email protected]/app.slice/run-u640.service:
[41369.848174] anon 0
[41369.848175] file 26927104
[41369.848176] kernel 6615040
[41369.848176] kernel_stack 32768
[41369.848177] pagetables 122880
[41369.848177] sec_pagetables 0
[41369.848177] percpu 480
[41369.848178] sock 0
[41369.848178] vmalloc 0
[41369.848178] shmem 0
[41369.848179] zswap 312451
[41369.848179] zswapped 1458176
[41369.848179] file_mapped 0
[41369.848180] file_dirty 26923008
[41369.848180] file_writeback 0
[41369.848180] swapcached 12288
[41369.848181] anon_thp 0
[41369.848181] file_thp 0
[41369.848181] shmem_thp 0
[41369.848182] inactive_anon 0
[41369.848182] active_anon 12288
[41369.848182] inactive_file 15908864
[41369.848183] active_file 11014144
[41369.848183] unevictable 0
[41369.848183] slab_reclaimable 5963640
[41369.848184] slab_unreclaimable 89048
[41369.848184] slab 6052688
[41369.848185] workingset_refault_anon 4031
[41369.848185] workingset_refault_file 9236
[41369.848185] workingset_activate_anon 691
[41369.848186] workingset_activate_file 2553
[41369.848186] workingset_restore_anon 691
[41369.848186] workingset_restore_file 0
[41369.848187] workingset_nodereclaim 0
[41369.848187] pgscan 40473
[41369.848187] pgsteal 20881
[41369.848188] pgscan_kswapd 0
[41369.848188] pgscan_direct 40473
[41369.848188] pgscan_khugepaged 0
[41369.848189] pgsteal_kswapd 0
[41369.848189] pgsteal_direct 20881
[41369.848190] pgsteal_khugepaged 0
[41369.848190] pgfault 6019
[41369.848190] pgmajfault 4033
[41369.848191] pgrefill 30578988
[41369.848191] pgactivate 2925
[41369.848191] pgdeactivate 0
[41369.848192] pglazyfree 0
[41369.848192] pglazyfreed 0
[41369.848192] zswpin 1520
[41369.848193] zswpout 1141
[41369.848193] thp_fault_alloc 0
[41369.848193] thp_collapse_alloc 0
[41369.848194] thp_swpout 0
[41369.848194] thp_swpout_fallback 0
[41369.848194] Tasks state (memory values in pages):
[41369.848195] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[41369.848195] [ 128825] 1000 128825 3449 864 65536 192 200 rsync
[41369.848198] [ 128826] 1000 128826 3523 288 57344 288 200 rsync
[41369.848199] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/user.slice/user-1000.slice/[email protected]/app.slice/run-u640.service,task_memcg=/user.slice/user-1000.slice/[email protected]/app.slice/run-u640.service,task=rsync,pid=128825,uid=1000
[41369.848207] Memory cgroup out of memory: Killed process 128825 (rsync) total-vm:13796kB, anon-rss:0kB, file-rss:3456kB, shmem-rss:0kB, UID:1000 pgtables:64kB oom_score_adj:200

---

Importantly, note that there appears to be no attempt to write back before
declaring OOM -- file_writeback is 0 when file_dirty is 26923008. The issue is
consistently reproducible (and thanks Johannes for looking at this with me).

On non-MGLRU, flushers are active and are making forward progress in preventing
OOM.

This is writing to a slow disk with roughly 10MiB/s of available write
bandwidth, so the CPU and the read side are far faster than the write speed
the disk can sustain.
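
A quick way to watch this live during the transfer (a sketch -- the cgroup
path is an assumption, substitute whatever run-uNNN.service unit systemd-run
printed):

# Path is an assumption: substitute the run-uNNN.service unit that
# systemd-run printed (find it with: systemctl --user status run-uNNN.service).
CG=/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/run-u640.service
while sleep 1; do
    # file, file_dirty and file_writeback from the unit's memory.stat
    grep -E '^file(_dirty|_writeback)? ' "$CG/memory.stat"
    echo ---
done

If the above is right, this should show file_dirty growing while
file_writeback stays pinned at 0 right up until the OOM fires.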

Is this a known problem in MGLRU? If not, could you point me to where MGLRU
tries to handle flusher wakeup on slow I/O? I didn't immediately find it.

Thanks,

Chris


2024-02-29 17:54:57

by Chris Down

Subject: Re: MGLRU premature memcg OOM on slow writes

Hi Yu,

Following up since it's been a few weeks since I reported this. If MGLRU
cannot handle writeback pressure on slow devices without OOMing, that seems
like a pretty significant problem, so I'd appreciate your opinion on the issue.

Thanks,

Chris

2024-02-29 23:51:46

by Axel Rasmussen

Subject: MGLRU premature memcg OOM on slow writes

Hi Chris,

A couple of dumb questions. In your test, do you have any of the following
configured / enabled?

/proc/sys/vm/laptop_mode
memory.low
memory.min

Besides that, it looks like the place non-MGLRU reclaim wakes up the
flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
looks like it simply will not do this.

Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
makes sense to me at least that doing writeback every time we age is too
aggressive, but doing it in evict_folios() makes some sense to me, basically to
copy the behavior the non-MGLRU path (shrink_inactive_list()) has.

I can send a patch which tries to implement this next week. In the meantime, Yu,
please let me know if what I've said here makes no sense for some reason. :)

[1]: https://lore.kernel.org/lkml/[email protected]/

2024-03-01 00:31:06

by Chris Down

Subject: Re: MGLRU premature memcg OOM on slow writes

Axel Rasmussen writes:
>A couple of dumb questions. In your test, do you have any of the following
>configured / enabled?
>
>/proc/sys/vm/laptop_mode
>memory.low
>memory.min

None of these are enabled. The issue is trivially reproducible by writing to
any slow device with memory.max enabled, but from the code it looks like MGLRU
is also susceptible to this on global reclaim (although it's less likely due to
page diversity).

>Besides that, it looks like the place non-MGLRU reclaim wakes up the
>flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
>Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
>looks like it simply will not do this.
>
>Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
>makes sense to me at least that doing writeback every time we age is too
>aggressive, but doing it in evict_folios() makes some sense to me, basically to
>copy the behavior the non-MGLRU path (shrink_inactive_list()) has.

Thanks! We may also need reclaim_throttle(), depending on how you implement it.
Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
(lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
thing at a time :-)

2024-03-01 11:26:57

by Hillf Danton

Subject: Re: MGLRU premature memcg OOM on slow writes

On Thu, 29 Feb 2024 15:51:33 -0800 Axel Rasmussen <[email protected]>
>
> Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> makes sense to me at least that doing writeback every time we age is too
> aggressive, but doing it in evict_folios() makes some sense to me, basically to
> copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
>
> I can send a patch which tries to implement this next week. In the meantime, Yu,

Better to first work out why the flusher failed to do its job, given
background writeback and balance_dirty_pages_ratelimited().
If giving kswapd a push makes any sense, what prevents you from pushing the
flusher instead, given that they are two different things by definition?
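
For reference, the global knobs that decide when background flushing and
dirty throttling kick in can be dumped like this (a sketch, nothing
MGLRU-specific about it):

sysctl vm.dirty_background_ratio vm.dirty_background_bytes \
       vm.dirty_ratio vm.dirty_bytes \
       vm.dirty_expire_centisecs vm.dirty_writeback_centisecs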

> please let me know if what I've said here makes no sense for some reason. :)
>
> [1]: https://lore.kernel.org/lkml/[email protected]/

2024-03-08 19:19:20

by Axel Rasmussen

Subject: Re: MGLRU premature memcg OOM on slow writes

On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
>
> Axel Rasmussen writes:
> >A couple of dumb questions. In your test, do you have any of the following
> >configured / enabled?
> >
> >/proc/sys/vm/laptop_mode
> >memory.low
> >memory.min
>
> None of these are enabled. The issue is trivially reproducible by writing to
> any slow device with memory.max enabled, but from the code it looks like MGLRU
> is also susceptible to this on global reclaim (although it's less likely due to
> page diversity).
>
> >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> >looks like it simply will not do this.
> >
> >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> >makes sense to me at least that doing writeback every time we age is too
> >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
>
> Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> thing at a time :-)


Hmm, so I have a patch which I think will help with this situation,
but I'm having some trouble reproducing the problem on 6.8-rc7 (so
that I can verify the patch fixes it).

If I understand the issue right, all we should need to do is get a
slow filesystem, and then generate a bunch of dirty file pages on it,
while running in a tightly constrained memcg. To that end, I tried the
following script. But, in reality I seem to get little or no
accumulation of dirty file pages.

I thought maybe fio does something different than rsync which you said
you originally tried, so I also tried rsync (copying /usr/bin into
this loop mount) and didn't run into an OOM situation either.

Maybe some dirty ratio settings need tweaking or something to get the
behavior you see? Or maybe my test has a dumb mistake in it. :)



#!/usr/bin/env bash

echo 0 > /proc/sys/vm/laptop_mode || exit 1
echo y > /sys/kernel/mm/lru_gen/enabled || exit 1

echo "Allocate disk image"
IMAGE_SIZE_MIB=1024
IMAGE_PATH=/tmp/slow.img
dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1

echo "Setup loop device"
LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1

echo "Create dm-slow"
DM_NAME=dm-slow
DM_DEV=/dev/mapper/$DM_NAME
echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1

echo "Create fs"
mkfs.ext4 "$DM_DEV" || exit 1

echo "Mount fs"
MOUNT_PATH="/tmp/$DM_NAME"
mkdir -p "$MOUNT_PATH" || exit 1
mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1

echo "Generate dirty file pages"
systemd-run --wait --pipe --collect -p MemoryMax=32M \
    fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
        -numjobs=10 -nrfiles=90 -filesize=1048576 \
        -fallocate=posix \
        -blocksize=4k -ioengine=mmap \
        -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
        -runtime=300 -time_based
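
A quick sanity check for whether the fio run above is dirtying anything at
all (a sketch using global counters, run alongside the workload):

while sleep 1; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    grep -E '^(nr_dirty|nr_writeback) ' /proc/vmstat
done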

2024-03-08 21:22:31

by Johannes Weiner

Subject: Re: MGLRU premature memcg OOM on slow writes

On Fri, Mar 08, 2024 at 11:18:28AM -0800, Axel Rasmussen wrote:
> On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
> >
> > Axel Rasmussen writes:
> > >A couple of dumb questions. In your test, do you have any of the following
> > >configured / enabled?
> > >
> > >/proc/sys/vm/laptop_mode
> > >memory.low
> > >memory.min
> >
> > None of these are enabled. The issue is trivially reproducible by writing to
> > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > is also susceptible to this on global reclaim (although it's less likely due to
> > page diversity).
> >
> > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > >looks like it simply will not do this.
> > >
> > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > >makes sense to me at least that doing writeback every time we age is too
> > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> >
> > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > thing at a time :-)
>
>
> Hmm, so I have a patch which I think will help with this situation,
> but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> then I can verify the patch fixes it).
>
> If I understand the issue right, all we should need to do is get a
> slow filesystem, and then generate a bunch of dirty file pages on it,
> while running in a tightly constrained memcg. To that end, I tried the
> following script. But, in reality I seem to get little or no
> accumulation of dirty file pages.
>
> I thought maybe fio does something different than rsync which you said
> you originally tried, so I also tried rsync (copying /usr/bin into
> this loop mount) and didn't run into an OOM situation either.
>
> Maybe some dirty ratio settings need tweaking or something to get the
> behavior you see? Or maybe my test has a dumb mistake in it. :)
>
>
>
> #!/usr/bin/env bash
>
> echo 0 > /proc/sys/vm/laptop_mode || exit 1
> echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
>
> echo "Allocate disk image"
> IMAGE_SIZE_MIB=1024
> IMAGE_PATH=/tmp/slow.img
> dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
>
> echo "Setup loop device"
> LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
>
> echo "Create dm-slow"
> DM_NAME=dm-slow
> DM_DEV=/dev/mapper/$DM_NAME
> echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
>
> echo "Create fs"
> mkfs.ext4 "$DM_DEV" || exit 1
>
> echo "Mount fs"
> MOUNT_PATH="/tmp/$DM_NAME"
> mkdir -p "$MOUNT_PATH" || exit 1
> mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
>
> echo "Generate dirty file pages"
> systemd-run --wait --pipe --collect -p MemoryMax=32M \
> fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> -numjobs=10 -nrfiles=90 -filesize=1048576 \
> -fallocate=posix \
> -blocksize=4k -ioengine=mmap \
> -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> -runtime=300 -time_based

By doing only the writes in the cgroup, you might just be running into
balance_dirty_pages(), which wakes the flushers and slows the
writing/allocating task before hitting the cg memory limit.

I think the key to what happens in Chris's case is:

1) The cgroup has a certain share of dirty pages, but in aggregate
they are below the cgroup dirty limit (dirty < mdtc->avail * ratio)
such that no writeback/dirty throttling is triggered from
balance_dirty_pages().

2) An unthrottled burst of (non-dirtying) allocations causes reclaim
demand that suddenly exceeds the reclaimable clean pages on the LRU.

Now you get into a situation where allocation and reclaim rate exceeds
the writeback rate and the only reclaimable pages left on the LRU are
dirty. In this case reclaim needs to wake the flushers and wait for
writeback instead of blowing through the priority cycles and OOMing.

Chris might be causing 2) from the read side of the copy also being in
the cgroup. Especially if he's copying larger files that can saturate
the readahead window and cause bigger allocation bursts. Those
readahead pages are accounted to the cgroup and on the LRU as soon as
they're allocated, but remain locked and unreclaimable until the read
IO finishes.
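
One way to poke at that theory is to vary the readahead window on the source
device and see whether the OOMs track it (a sketch; /dev/sdX is a placeholder
for wherever the source files live):

blockdev --getra /dev/sdX        # current readahead, in 512-byte sectors
blockdev --setra 8192 /dev/sdX   # 4MiB readahead -> bigger allocation bursts
blockdev --setra 0 /dev/sdX      # minimal readahead -> smaller bursts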

2024-03-11 09:11:55

by Yafang Shao

Subject: Re: MGLRU premature memcg OOM on slow writes

On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <[email protected]> wrote:
>
> On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
> >
> > Axel Rasmussen writes:
> > >A couple of dumb questions. In your test, do you have any of the following
> > >configured / enabled?
> > >
> > >/proc/sys/vm/laptop_mode
> > >memory.low
> > >memory.min
> >
> > None of these are enabled. The issue is trivially reproducible by writing to
> > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > is also susceptible to this on global reclaim (although it's less likely due to
> > page diversity).
> >
> > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > >looks like it simply will not do this.
> > >
> > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > >makes sense to me at least that doing writeback every time we age is too
> > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> >
> > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > thing at a time :-)
>
>
> Hmm, so I have a patch which I think will help with this situation,
> but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> then I can verify the patch fixes it).

We encountered the same premature OOM issue caused by numerous dirty pages.
The issue disappears after we revert commit 14aa8b2d5c2e
("mm/mglru: don't sync disk for each aging cycle").

To aid in replicating the issue, we've developed a straightforward
script, which consistently reproduces it, even on the latest kernel.
You can find the script provided below:

```
#!/bin/bash

MEMCG="/sys/fs/cgroup/memory/mglru"
ENABLE=$1

# Avoid waking up the flusher
sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 4))
sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4))

if [ ! -d ${MEMCG} ]; then
	mkdir -p ${MEMCG}
fi

echo $$ > ${MEMCG}/cgroup.procs
echo 1g > ${MEMCG}/memory.limit_in_bytes

if [ $ENABLE -eq 0 ]; then
	echo 0 > /sys/kernel/mm/lru_gen/enabled
else
	echo 0x7 > /sys/kernel/mm/lru_gen/enabled
fi

dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
rm -rf /data0/mglru.test
```
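
For clarity, the script takes the MGLRU toggle as its first argument, so a
run would look something like this (run as root; repro.sh is just a
placeholder name, and /data0 sits on the disk under test):

```
./repro.sh 1    # dd with MGLRU enabled (0x7)
./repro.sh 0    # dd with MGLRU disabled
```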

The issue also disappears after we disable MGLRU.

We hope this script proves helpful in identifying and addressing the
root cause. We eagerly await your insights and proposed fixes.

>
> If I understand the issue right, all we should need to do is get a
> slow filesystem, and then generate a bunch of dirty file pages on it,
> while running in a tightly constrained memcg. To that end, I tried the
> following script. But, in reality I seem to get little or no
> accumulation of dirty file pages.
>
> I thought maybe fio does something different than rsync which you said
> you originally tried, so I also tried rsync (copying /usr/bin into
> this loop mount) and didn't run into an OOM situation either.
>
> Maybe some dirty ratio settings need tweaking or something to get the
> behavior you see? Or maybe my test has a dumb mistake in it. :)
>
>
>
> #!/usr/bin/env bash
>
> echo 0 > /proc/sys/vm/laptop_mode || exit 1
> echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
>
> echo "Allocate disk image"
> IMAGE_SIZE_MIB=1024
> IMAGE_PATH=/tmp/slow.img
> dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
>
> echo "Setup loop device"
> LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
>
> echo "Create dm-slow"
> DM_NAME=dm-slow
> DM_DEV=/dev/mapper/$DM_NAME
> echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
>
> echo "Create fs"
> mkfs.ext4 "$DM_DEV" || exit 1
>
> echo "Mount fs"
> MOUNT_PATH="/tmp/$DM_NAME"
> mkdir -p "$MOUNT_PATH" || exit 1
> mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
>
> echo "Generate dirty file pages"
> systemd-run --wait --pipe --collect -p MemoryMax=32M \
> fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> -numjobs=10 -nrfiles=90 -filesize=1048576 \
> -fallocate=posix \
> -blocksize=4k -ioengine=mmap \
> -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> -runtime=300 -time_based
>


--
Regards
Yafang

2024-03-12 16:45:09

by Axel Rasmussen

Subject: Re: MGLRU premature memcg OOM on slow writes

On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <[email protected]> wrote:
>
> On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <[email protected]> wrote:
> >
> > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
> > >
> > > Axel Rasmussen writes:
> > > >A couple of dumb questions. In your test, do you have any of the following
> > > >configured / enabled?
> > > >
> > > >/proc/sys/vm/laptop_mode
> > > >memory.low
> > > >memory.min
> > >
> > > None of these are enabled. The issue is trivially reproducible by writing to
> > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > is also susceptible to this on global reclaim (although it's less likely due to
> > > page diversity).
> > >
> > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > >looks like it simply will not do this.
> > > >
> > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > >makes sense to me at least that doing writeback every time we age is too
> > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > >
> > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > thing at a time :-)
> >
> >
> > Hmm, so I have a patch which I think will help with this situation,
> > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > then I can verify the patch fixes it).
>
> We encountered the same premature OOM issue caused by numerous dirty pages.
> The issue disappears after we revert the commit 14aa8b2d5c2e
> "mm/mglru: don't sync disk for each aging cycle"
>
> To aid in replicating the issue, we've developed a straightforward
> script, which consistently reproduces it, even on the latest kernel.
> You can find the script provided below:
>
> ```
> #!/bin/bash
>
> MEMCG="/sys/fs/cgroup/memory/mglru"
> ENABLE=$1
>
> # Avoid waking up the flusher
> sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
>
> if [ ! -d ${MEMCG} ]; then
> mkdir -p ${MEMCG}
> fi
>
> echo $$ > ${MEMCG}/cgroup.procs
> echo 1g > ${MEMCG}/memory.limit_in_bytes
>
> if [ $ENABLE -eq 0 ]; then
> echo 0 > /sys/kernel/mm/lru_gen/enabled
> else
> echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> fi
>
> dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> rm -rf /data0/mglru.test
> ```
>
> This issue disappears as well after we disable the mglru.
>
> We hope this script proves helpful in identifying and addressing the
> root cause. We eagerly await your insights and proposed fixes.

Thanks Yafang, I was able to reproduce the issue using this script.

Perhaps interestingly, I was not able to reproduce it with cgroupv2
memcgs. I know writeback semantics are quite a bit different there, so
perhaps that explains why.

Unfortunately, it also reproduces even with the commit I had in mind
(basically stealing the "if (all isolated pages are unqueued dirty) {
wakeup_flusher_threads(); reclaim_throttle(); }" from
shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
I'll need to spend some more time on this; I'm planning to send
something out for testing next week.

>
> >
> > If I understand the issue right, all we should need to do is get a
> > slow filesystem, and then generate a bunch of dirty file pages on it,
> > while running in a tightly constrained memcg. To that end, I tried the
> > following script. But, in reality I seem to get little or no
> > accumulation of dirty file pages.
> >
> > I thought maybe fio does something different than rsync which you said
> > you originally tried, so I also tried rsync (copying /usr/bin into
> > this loop mount) and didn't run into an OOM situation either.
> >
> > Maybe some dirty ratio settings need tweaking or something to get the
> > behavior you see? Or maybe my test has a dumb mistake in it. :)
> >
> >
> >
> > #!/usr/bin/env bash
> >
> > echo 0 > /proc/sys/vm/laptop_mode || exit 1
> > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> >
> > echo "Allocate disk image"
> > IMAGE_SIZE_MIB=1024
> > IMAGE_PATH=/tmp/slow.img
> > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> >
> > echo "Setup loop device"
> > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> >
> > echo "Create dm-slow"
> > DM_NAME=dm-slow
> > DM_DEV=/dev/mapper/$DM_NAME
> > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> >
> > echo "Create fs"
> > mkfs.ext4 "$DM_DEV" || exit 1
> >
> > echo "Mount fs"
> > MOUNT_PATH="/tmp/$DM_NAME"
> > mkdir -p "$MOUNT_PATH" || exit 1
> > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> >
> > echo "Generate dirty file pages"
> > systemd-run --wait --pipe --collect -p MemoryMax=32M \
> > fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> > -numjobs=10 -nrfiles=90 -filesize=1048576 \
> > -fallocate=posix \
> > -blocksize=4k -ioengine=mmap \
> > -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> > -runtime=300 -time_based
> >
>
>
> --
> Regards
> Yafang

2024-03-12 20:07:20

by Yu Zhao

Subject: Re: MGLRU premature memcg OOM on slow writes

On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote:
> On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <[email protected]> wrote:
> >
> > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <[email protected]> wrote:
> > >
> > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
> > > >
> > > > Axel Rasmussen writes:
> > > > >A couple of dumb questions. In your test, do you have any of the following
> > > > >configured / enabled?
> > > > >
> > > > >/proc/sys/vm/laptop_mode
> > > > >memory.low
> > > > >memory.min
> > > >
> > > > None of these are enabled. The issue is trivially reproducible by writing to
> > > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > > is also susceptible to this on global reclaim (although it's less likely due to
> > > > page diversity).
> > > >
> > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > > >looks like it simply will not do this.
> > > > >
> > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > > >makes sense to me at least that doing writeback every time we age is too
> > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > > >
> > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > > thing at a time :-)
> > >
> > >
> > > Hmm, so I have a patch which I think will help with this situation,
> > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > > then I can verify the patch fixes it).
> >
> > We encountered the same premature OOM issue caused by numerous dirty pages.
> > The issue disappears after we revert the commit 14aa8b2d5c2e
> > "mm/mglru: don't sync disk for each aging cycle"
> >
> > To aid in replicating the issue, we've developed a straightforward
> > script, which consistently reproduces it, even on the latest kernel.
> > You can find the script provided below:
> >
> > ```
> > #!/bin/bash
> >
> > MEMCG="/sys/fs/cgroup/memory/mglru"
> > ENABLE=$1
> >
> > # Avoid waking up the flusher
> > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
> >
> > if [ ! -d ${MEMCG} ]; then
> > mkdir -p ${MEMCG}
> > fi
> >
> > echo $$ > ${MEMCG}/cgroup.procs
> > echo 1g > ${MEMCG}/memory.limit_in_bytes
> >
> > if [ $ENABLE -eq 0 ]; then
> > echo 0 > /sys/kernel/mm/lru_gen/enabled
> > else
> > echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> > fi
> >
> > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> > rm -rf /data0/mglru.test
> > ```
> >
> > This issue disappears as well after we disable the mglru.
> >
> > We hope this script proves helpful in identifying and addressing the
> > root cause. We eagerly await your insights and proposed fixes.
>
> Thanks Yafang, I was able to reproduce the issue using this script.
>
> Perhaps interestingly, I was not able to reproduce it with cgroupv2
> memcgs. I know writeback semantics are quite a bit different there, so
> perhaps that explains why.
>
> Unfortunately, it also reproduces even with the commit I had in mind
> (basically stealing the "if (all isolated pages are unqueued dirty) {
> wakeup_flusher_threads(); reclaim_throttle(); }" from
> shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
> I'll need to spend some more time on this; I'm planning to send
> something out for testing next week.

Hi Chris,

My apologies for not getting back to you sooner.

And thanks everyone for all the input!

My take is that Chris' premature OOM kills were NOT really due to
the flusher not waking up or missing throttling.

Yes, these two are among the differences between the active/inactive
LRU and MGLRU, but their roles, IMO, are not as important as the LRU
positions of dirty pages. The active/inactive LRU moves dirty pages
all the way to the end of the line (reclaim happens at the front)
whereas MGLRU moves them into the middle, during direct reclaim. The
rationale for MGLRU was that this way those dirty pages would still
be counted as "inactive" (or cold).

This theory can be quickly verified by comparing how much
nr_vmscan_immediate_reclaim grows, i.e.,

Before the copy
grep nr_vmscan_immediate_reclaim /proc/vmstat
And then after the copy
grep nr_vmscan_immediate_reclaim /proc/vmstat

The growth should be trivial for MGLRU and nontrivial for the
active/inactive LRU.
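
Concretely, that check could look like the following (a sketch; the rsync
line just mirrors the original report, with SRC standing in for the source
directory):

grep nr_vmscan_immediate_reclaim /proc/vmstat                        # before
systemd-run --user -t -p MemoryMax=32M -- rsync -rv SRC/ /mnt/usb/   # the copy
grep nr_vmscan_immediate_reclaim /proc/vmstat                        # after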

If this is indeed the case, I'd appreciate very much if anyone could
try the following (I'll try it myself too later next week).

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4255619a1a31..020f5d98b9a1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* waiting for writeback */
-	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
-	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
-		gen = folio_inc_gen(lruvec, folio, true);
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
+	if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+		DEFINE_MAX_SEQ(lruvec);
+		int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+		list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 		return true;
 	}

> > > If I understand the issue right, all we should need to do is get a
> > > slow filesystem, and then generate a bunch of dirty file pages on it,
> > > while running in a tightly constrained memcg. To that end, I tried the
> > > following script. But, in reality I seem to get little or no
> > > accumulation of dirty file pages.
> > >
> > > I thought maybe fio does something different than rsync which you said
> > > you originally tried, so I also tried rsync (copying /usr/bin into
> > > this loop mount) and didn't run into an OOM situation either.
> > >
> > > Maybe some dirty ratio settings need tweaking or something to get the
> > > behavior you see? Or maybe my test has a dumb mistake in it. :)
> > >
> > >
> > >
> > > #!/usr/bin/env bash
> > >
> > > echo 0 > /proc/sys/vm/laptop_mode || exit 1
> > > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> > >
> > > echo "Allocate disk image"
> > > IMAGE_SIZE_MIB=1024
> > > IMAGE_PATH=/tmp/slow.img
> > > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> > >
> > > echo "Setup loop device"
> > > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> > > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> > >
> > > echo "Create dm-slow"
> > > DM_NAME=dm-slow
> > > DM_DEV=/dev/mapper/$DM_NAME
> > > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> > >
> > > echo "Create fs"
> > > mkfs.ext4 "$DM_DEV" || exit 1
> > >
> > > echo "Mount fs"
> > > MOUNT_PATH="/tmp/$DM_NAME"
> > > mkdir -p "$MOUNT_PATH" || exit 1
> > > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> > >
> > > echo "Generate dirty file pages"
> > > systemd-run --wait --pipe --collect -p MemoryMax=32M \
> > > fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> > > -numjobs=10 -nrfiles=90 -filesize=1048576 \
> > > -fallocate=posix \
> > > -blocksize=4k -ioengine=mmap \
> > > -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> > > -runtime=300 -time_based

2024-03-12 20:11:41

by Yu Zhao

Subject: Re: MGLRU premature memcg OOM on slow writes

On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote:
> > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <[email protected]> wrote:
> > >
> > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
> > > > >
> > > > > Axel Rasmussen writes:
> > > > > >A couple of dumb questions. In your test, do you have any of the following
> > > > > >configured / enabled?
> > > > > >
> > > > > >/proc/sys/vm/laptop_mode
> > > > > >memory.low
> > > > > >memory.min
> > > > >
> > > > > None of these are enabled. The issue is trivially reproducible by writing to
> > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > > > is also susceptible to this on global reclaim (although it's less likely due to
> > > > > page diversity).
> > > > >
> > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > > > >looks like it simply will not do this.
> > > > > >
> > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > > > >makes sense to me at least that doing writeback every time we age is too
> > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > > > >
> > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > > > thing at a time :-)
> > > >
> > > >
> > > > Hmm, so I have a patch which I think will help with this situation,
> > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > > > then I can verify the patch fixes it).
> > >
> > > We encountered the same premature OOM issue caused by numerous dirty pages.
> > > The issue disappears after we revert the commit 14aa8b2d5c2e
> > > "mm/mglru: don't sync disk for each aging cycle"
> > >
> > > To aid in replicating the issue, we've developed a straightforward
> > > script, which consistently reproduces it, even on the latest kernel.
> > > You can find the script provided below:
> > >
> > > ```
> > > #!/bin/bash
> > >
> > > MEMCG="/sys/fs/cgroup/memory/mglru"
> > > ENABLE=$1
> > >
> > > # Avoid waking up the flusher
> > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
> > >
> > > if [ ! -d ${MEMCG} ]; then
> > > mkdir -p ${MEMCG}
> > > fi
> > >
> > > echo $$ > ${MEMCG}/cgroup.procs
> > > echo 1g > ${MEMCG}/memory.limit_in_bytes
> > >
> > > if [ $ENABLE -eq 0 ]; then
> > > echo 0 > /sys/kernel/mm/lru_gen/enabled
> > > else
> > > echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> > > fi
> > >
> > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> > > rm -rf /data0/mglru.test
> > > ```
> > >
> > > This issue disappears as well after we disable the mglru.
> > >
> > > We hope this script proves helpful in identifying and addressing the
> > > root cause. We eagerly await your insights and proposed fixes.
> >
> > Thanks Yafang, I was able to reproduce the issue using this script.
> >
> > Perhaps interestingly, I was not able to reproduce it with cgroupv2
> > memcgs. I know writeback semantics are quite a bit different there, so
> > perhaps that explains why.
> >
> > Unfortunately, it also reproduces even with the commit I had in mind
> > (basically stealing the "if (all isolated pages are unqueued dirty) {
> > wakeup_flusher_threads(); reclaim_throttle(); }" from
> > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
> > I'll need to spend some more time on this; I'm planning to send
> > something out for testing next week.
>
> Hi Chris,
>
> My apologies for not getting back to you sooner.
>
> And thanks everyone for all the input!
>
> My take is that Chris' premature OOM kills were NOT really due to
> the flusher not waking up or missing throttling.
>
> Yes, these two are among the differences between the active/inactive
> LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> positions of dirty pages. The active/inactive LRU moves dirty pages
> all the way to the end of the line (reclaim happens at the front)
> whereas MGLRU moves them into the middle, during direct reclaim. The
> rationale for MGLRU was that this way those dirty pages would still
> be counted as "inactive" (or cold).
>
> This theory can be quickly verified by comparing how much
> nr_vmscan_immediate_reclaim grows, i.e.,
>
> Before the copy
> grep nr_vmscan_immediate_reclaim /proc/vmstat
> And then after the copy
> grep nr_vmscan_immediate_reclaim /proc/vmstat
>
> The growth should be trivial for MGLRU and nontrivial for the
> active/inactive LRU.
>
> If this is indeed the case, I'd appreciate very much if anyone could
> try the following (I'll try it myself too later next week).
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..020f5d98b9a1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> }
>
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> - gen = folio_inc_gen(lruvec, folio, true);
> - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + DEFINE_MAX_SEQ(lruvec);
> + int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> + old_gen = folio_update_gen(folio, new_gen);
> + lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);

Sorry missing one line here:

+ folio_set_reclaim(folio);

> return true;
> }
>
> > > > If I understand the issue right, all we should need to do is get a
> > > > slow filesystem, and then generate a bunch of dirty file pages on it,
> > > > while running in a tightly constrained memcg. To that end, I tried the
> > > > following script. But, in reality I seem to get little or no
> > > > accumulation of dirty file pages.
> > > >
> > > > I thought maybe fio does something different than rsync which you said
> > > > you originally tried, so I also tried rsync (copying /usr/bin into
> > > > this loop mount) and didn't run into an OOM situation either.
> > > >
> > > > Maybe some dirty ratio settings need tweaking or something to get the
> > > > behavior you see? Or maybe my test has a dumb mistake in it. :)
> > > >
> > > >
> > > >
> > > > #!/usr/bin/env bash
> > > >
> > > > echo 0 > /proc/sys/vm/laptop_mode || exit 1
> > > > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> > > >
> > > > echo "Allocate disk image"
> > > > IMAGE_SIZE_MIB=1024
> > > > IMAGE_PATH=/tmp/slow.img
> > > > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> > > >
> > > > echo "Setup loop device"
> > > > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> > > > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> > > >
> > > > echo "Create dm-slow"
> > > > DM_NAME=dm-slow
> > > > DM_DEV=/dev/mapper/$DM_NAME
> > > > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> > > >
> > > > echo "Create fs"
> > > > mkfs.ext4 "$DM_DEV" || exit 1
> > > >
> > > > echo "Mount fs"
> > > > MOUNT_PATH="/tmp/$DM_NAME"
> > > > mkdir -p "$MOUNT_PATH" || exit 1
> > > > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> > > >
> > > > echo "Generate dirty file pages"
> > > > systemd-run --wait --pipe --collect -p MemoryMax=32M \
> > > > fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> > > > -numjobs=10 -nrfiles=90 -filesize=1048576 \
> > > > -fallocate=posix \
> > > > -blocksize=4k -ioengine=mmap \
> > > > -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> > > > -runtime=300 -time_based

2024-03-12 21:08:44

by Johannes Weiner

Subject: Re: MGLRU premature memcg OOM on slow writes

On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> Yes, these two are among the differences between the active/inactive
> LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> positions of dirty pages. The active/inactive LRU moves dirty pages
> all the way to the end of the line (reclaim happens at the front)
> whereas MGLRU moves them into the middle, during direct reclaim. The
> rationale for MGLRU was that this way those dirty pages would still
> be counted as "inactive" (or cold).

Note that activating the page is not a statement on the page's
hotness. It's simply to park it away from the scanner. We could as
well have moved it to the unevictable list - this is just easier.

folio_end_writeback() will call folio_rotate_reclaimable() and move it
back to the inactive tail, to make it the very next reclaim target as
soon as it's clean.

> This theory can be quickly verified by comparing how much
> nr_vmscan_immediate_reclaim grows, i.e.,
>
> Before the copy
> grep nr_vmscan_immediate_reclaim /proc/vmstat
> And then after the copy
> grep nr_vmscan_immediate_reclaim /proc/vmstat
>
> The growth should be trivial for MGLRU and nontrivial for the
> active/inactive LRU.
>
> If this is indeed the case, I'd appreciate very much if anyone could
> try the following (I'll try it myself too later next week).
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..020f5d98b9a1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> }
>
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> - gen = folio_inc_gen(lruvec, folio, true);
> - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + DEFINE_MAX_SEQ(lruvec);
> + int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> + old_gen = folio_update_gen(folio, new_gen);
> + lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> return true;

Right, because MGLRU sorts these pages out before calling the scanner,
so they never get marked for immediate reclaim.

But that also implies they won't get rotated back to the tail when
writeback finishes. Doesn't that mean that you now have pages that

a) came from the oldest generation and were only deferred due to their
writeback state, and

b) are now clean and should be reclaimed. But since they're
permanently advanced to the next gen, you'll instead reclaim pages
that were originally ahead of them, and likely hotter.

Isn't that an age inversion?

Back to the broader question though: if reclaim demand outstrips clean
pages and the only viable candidates are dirty ones (e.g. an
allocation spike in the presence of dirty/writeback pages), there only
seem to be 3 options:

1) sleep-wait for writeback
2) continue scanning, aka busy-wait for writeback + age inversions
3) find nothing and declare OOM

Since you're not doing 1), it must be one of the other two, no? One
way or another it has to either pace-match to IO completions, or OOM.

2024-03-13 02:09:03

by Yu Zhao

Subject: Re: MGLRU premature memcg OOM on slow writes

On Tue, Mar 12, 2024 at 5:08 PM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> > Yes, these two are among the differences between the active/inactive
> > LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> > positions of dirty pages. The active/inactive LRU moves dirty pages
> > all the way to the end of the line (reclaim happens at the front)
> > whereas MGLRU moves them into the middle, during direct reclaim. The
> > rationale for MGLRU was that this way those dirty pages would still
> > be counted as "inactive" (or cold).
>
> Note that activating the page is not a statement on the page's
> hotness. It's simply to park it away from the scanner. We could as
> well have moved it to the unevictable list - this is just easier.
>
> folio_end_writeback() will call folio_rotate_reclaimable() and move it
> back to the inactive tail, to make it the very next reclaim target as
> soon as it's clean.
>
> > This theory can be quickly verified by comparing how much
> > nr_vmscan_immediate_reclaim grows, i.e.,
> >
> > Before the copy
> > grep nr_vmscan_immediate_reclaim /proc/vmstat
> > And then after the copy
> > grep nr_vmscan_immediate_reclaim /proc/vmstat
> >
> > The growth should be trivial for MGLRU and nontrivial for the
> > active/inactive LRU.
> >
> > If this is indeed the case, I'd appreciate very much if anyone could
> > try the following (I'll try it myself too later next week).
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4255619a1a31..020f5d98b9a1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > }
> >
> > /* waiting for writeback */
> > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > - gen = folio_inc_gen(lruvec, folio, true);
> > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > + DEFINE_MAX_SEQ(lruvec);
> > + int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > +
> > + old_gen = folio_update_gen(folio, new_gen);
> > + lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> > return true;
>
> Right, because MGLRU sorts these pages out before calling the scanner,
> so they never get marked for immediate reclaim.
>
> But that also implies they won't get rotated back to the tail when
> writeback finishes.

Those dirty pages are marked by PG_reclaim either by

folio_inc_gen()
{
	...
	if (reclaiming)
		new_flags |= BIT(PG_reclaim);
	...
}

or [1], which I missed initially. So they should be rotated on writeback
finishing up.

[1] https://lore.kernel.org/linux-mm/[email protected]/

> Doesn't that mean that you now have pages that
>
> a) came from the oldest generation and were only deferred due to their
> writeback state, and
>
> b) are now clean and should be reclaimed. But since they're
> permanently advanced to the next gen, you'll instead reclaim pages
> that were originally ahead of them, and likely hotter.
>
> Isn't that an age inversion?
>
> Back to the broader question though: if reclaim demand outstrips clean
> pages and the only viable candidates are dirty ones (e.g. an
> allocation spike in the presence of dirty/writeback pages), there only
> seem to be 3 options:
>
> 1) sleep-wait for writeback
> 2) continue scanning, aka busy-wait for writeback + age inversions
> 3) find nothing and declare OOM
>
> Since you're not doing 1), it must be one of the other two, no? One
> way or another it has to either pace-match to IO completions, or OOM.

Yes, and in this case, 2) is possible but 3) is very likely.

MGLRU certainly doesn't do 1) (in the reclaim path, that is). I didn't find
any throttling on dirty pages for cgroup v2 in the active/inactive LRU
either -- I assume Chris was on v2, hence my take that throttling on dirty
pages in the reclaim path is not the key to his case.

With the above change, I'm hoping balance_dirty_pages() will wake up the
flusher, again for Chris' case, so that MGLRU won't have to call
wakeup_flusher_threads(), since the latter can wake the flusher too often
and in turn cause excessive IO, which matters for SSD wearout.
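
And whether the flusher actually gets going during the copy can be watched
from the global counters (a sketch; nr_written growing while nr_writeback is
nonzero means writeback is being issued and completed):

while sleep 1; do
    grep -E '^(nr_dirty|nr_writeback|nr_written) ' /proc/vmstat
done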

2024-03-13 03:23:12

by Johannes Weiner

Subject: Re: MGLRU premature memcg OOM on slow writes

On Tue, Mar 12, 2024 at 10:08:13PM -0400, Yu Zhao wrote:
> On Tue, Mar 12, 2024 at 5:08 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> > > Yes, these two are among the differences between the active/inactive
> > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> > > positions of dirty pages. The active/inactive LRU moves dirty pages
> > > all the way to the end of the line (reclaim happens at the front)
> > > whereas MGLRU moves them into the middle, during direct reclaim. The
> > > rationale for MGLRU was that this way those dirty pages would still
> > > be counted as "inactive" (or cold).
> >
> > Note that activating the page is not a statement on the page's
> > hotness. It's simply to park it away from the scanner. We could as
> > well have moved it to the unevictable list - this is just easier.
> >
> > folio_end_writeback() will call folio_rotate_reclaimable() and move it
> > back to the inactive tail, to make it the very next reclaim target as
> > soon as it's clean.
> >
> > > This theory can be quickly verified by comparing how much
> > > nr_vmscan_immediate_reclaim grows, i.e.,
> > >
> > > Before the copy
> > > grep nr_vmscan_immediate_reclaim /proc/vmstat
> > > And then after the copy
> > > grep nr_vmscan_immediate_reclaim /proc/vmstat
> > >
> > > The growth should be trivial for MGLRU and nontrivial for the
> > > active/inactive LRU.
> > >
> > > If this is indeed the case, I'd appreciate very much if anyone could
> > > try the following (I'll try it myself too later next week).
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 4255619a1a31..020f5d98b9a1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > }
> > >
> > > /* waiting for writeback */
> > > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > - gen = folio_inc_gen(lruvec, folio, true);
> > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > + DEFINE_MAX_SEQ(lruvec);
> > > + int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > > +
> > > + old_gen = folio_update_gen(folio, new_gen);
> > > + lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> > > return true;
> >
> > Right, because MGLRU sorts these pages out before calling the scanner,
> > so they never get marked for immediate reclaim.
> >
> > But that also implies they won't get rotated back to the tail when
> > writeback finishes.
>
> Those dirty pages are marked by PG_reclaim either by
>
> folio_inc_gen()
> {
> ...
> if (reclaiming)
> new_flags |= BIT(PG_reclaim);
> ...
> }
>
> or [1], which I missed initially. So they should be rotated on writeback
> finishing up.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/

Ah, I missed that! Thanks.

> > Doesn't that mean that you now have pages that
> >
> > a) came from the oldest generation and were only deferred due to their
> > writeback state, and
> >
> > b) are now clean and should be reclaimed. But since they're
> > permanently advanced to the next gen, you'll instead reclaim pages
> > that were originally ahead of them, and likely hotter.
> >
> > Isn't that an age inversion?
> >
> > Back to the broader question though: if reclaim demand outstrips clean
> > pages and the only viable candidates are dirty ones (e.g. an
> > allocation spike in the presence of dirty/writeback pages), there only
> > seem to be 3 options:
> >
> > 1) sleep-wait for writeback
> > 2) continue scanning, aka busy-wait for writeback + age inversions
> > 3) find nothing and declare OOM
> >
> > Since you're not doing 1), it must be one of the other two, no? One
> > way or another it has to either pace-match to IO completions, or OOM.
>
> Yes, and in this case, 2) is possible but 3) is very likely.
>
> MGLRU doesn't do 1) for sure (in the reclaim path of course). I didn't
> find any throttling on dirty pages for cgroup v2 either in the
> active/inactive LRU -- I assume Chris was on v2, and hence my take on
> throttling on dirty pages in the reclaim path not being the key for
> his case.

It's kind of spread out, but it's there:

shrink_folio_list() will bump nr_dirty on dirty pages, and
nr_congested if immediate reclaim folios cycle back around.

shrink_inactive_list() will wake the flushers if all the dirty pages
it encountered are still unqueued.

shrink_node() will set LRUVEC_CGROUP_CONGESTED, and then call
reclaim_throttle() on it. (As Chris points out, though, the throttle
call was not long ago changed from VMSCAN_THROTTLE_WRITEBACK to
VMSCAN_THROTTLE_CONGESTED, and appears a bit more fragile now than it
used to be. Probably worth following up on this.)
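
A condensed sketch of that chain, for anyone who doesn't have vmscan.c
open (paraphrased and trimmed from memory, so treat the exact guards as
approximate rather than authoritative):

/* shrink_folio_list(): account for what the scan ran into */
if (dirty || writeback)
	stat->nr_dirty += nr_pages;
if (dirty && !writeback)
	stat->nr_unqueued_dirty += nr_pages;
if (writeback && folio_test_reclaim(folio))
	stat->nr_congested += nr_pages;	/* came back around still in flight */

/* shrink_inactive_list(): every folio we isolated was dirty, none queued */
if (stat.nr_unqueued_dirty == nr_taken) {
	wakeup_flusher_threads(WB_REASON_VMSCAN);
	if (!writeback_throttling_sane(sc))	/* cgroup v1 */
		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
}

/* shrink_node(): everything dirty was also congested -> back off */
if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
	set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
if (!current_is_kswapd() && current_may_throttle() &&
    test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags))
	reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);

MGLRU defers dirty file folios in sort_folio() before shrink_folio_list()
ever sees them, and it doesn't go through shrink_inactive_list() or the
sc->nr bookkeeping, so none of these back-pressure points fire for it.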

2024-03-13 03:34:11

by Yafang Shao

[permalink] [raw]
Subject: Re: MGLRU premature memcg OOM on slow writes

On Wed, Mar 13, 2024 at 4:11 AM Yu Zhao <[email protected]> wrote:
>
> On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> > On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote:
> > > > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <[email protected]> wrote:
> > > >
> > > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <[email protected]> wrote:
> > > > >
> > > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
> > > > > >
> > > > > > Axel Rasmussen writes:
> > > > > > >A couple of dumb questions. In your test, do you have any of the following
> > > > > > >configured / enabled?
> > > > > > >
> > > > > > >/proc/sys/vm/laptop_mode
> > > > > > >memory.low
> > > > > > >memory.min
> > > > > >
> > > > > > None of these are enabled. The issue is trivially reproducible by writing to
> > > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > > > > is also susceptible to this on global reclaim (although it's less likely due to
> > > > > > page diversity).
> > > > > >
> > > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > > > > >looks like it simply will not do this.
> > > > > > >
> > > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > > > > >makes sense to me at least that doing writeback every time we age is too
> > > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > > > > >
> > > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > > > > thing at a time :-)
> > > > >
> > > > >
> > > > > Hmm, so I have a patch which I think will help with this situation,
> > > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > > > > then I can verify the patch fixes it).
> > > >
> > > > We encountered the same premature OOM issue caused by numerous dirty pages.
> > > > The issue disappears after we revert the commit 14aa8b2d5c2e
> > > > "mm/mglru: don't sync disk for each aging cycle"
> > > >
> > > > To aid in replicating the issue, we've developed a straightforward
> > > > script, which consistently reproduces it, even on the latest kernel.
> > > > You can find the script provided below:
> > > >
> > > > ```
> > > > #!/bin/bash
> > > >
> > > > MEMCG="/sys/fs/cgroup/memory/mglru"
> > > > ENABLE=$1
> > > >
> > > > # Avoid waking up the flusher
> > > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> > > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
> > > >
> > > > if [ ! -d ${MEMCG} ]; then
> > > > mkdir -p ${MEMCG}
> > > > fi
> > > >
> > > > echo $$ > ${MEMCG}/cgroup.procs
> > > > echo 1g > ${MEMCG}/memory.limit_in_bytes
> > > >
> > > > if [ $ENABLE -eq 0 ]; then
> > > > echo 0 > /sys/kernel/mm/lru_gen/enabled
> > > > else
> > > > echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> > > > fi
> > > >
> > > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> > > > rm -rf /data0/mglru.test
> > > > ```
> > > >
> > > > This issue disappears as well after we disable the mglru.
> > > >
> > > > We hope this script proves helpful in identifying and addressing the
> > > > root cause. We eagerly await your insights and proposed fixes.
> > >
> > > Thanks Yafang, I was able to reproduce the issue using this script.
> > >
> > > Perhaps interestingly, I was not able to reproduce it with cgroupv2
> > > memcgs. I know writeback semantics are quite a bit different there, so
> > > perhaps that explains why.
> > >
> > > Unfortunately, it also reproduces even with the commit I had in mind
> > > (basically stealing the "if (all isolated pages are unqueued dirty) {
> > > wakeup_flusher_threads(); reclaim_throttle(); }" from
> > > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
> > > I'll need to spend some more time on this; I'm planning to send
> > > something out for testing next week.
> >
> > Hi Chris,
> >
> > My apologies for not getting back to you sooner.
> >
> > And thanks everyone for all the input!
> >
> > My take is that Chris' premature OOM kills were NOT really due to
> > the flusher not waking up or missing throttling.
> >
> > Yes, these two are among the differences between the active/inactive
> > LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> > positions of dirty pages. The active/inactive LRU moves dirty pages
> > all the way to the end of the line (reclaim happens at the front)
> > whereas MGLRU moves them into the middle, during direct reclaim. The
> > rationale for MGLRU was that this way those dirty pages would still
> > be counted as "inactive" (or cold).
> >
> > This theory can be quickly verified by comparing how much
> > nr_vmscan_immediate_reclaim grows, i.e.,
> >
> > Before the copy
> > grep nr_vmscan_immediate_reclaim /proc/vmstat
> > And then after the copy
> > grep nr_vmscan_immediate_reclaim /proc/vmstat
> >
> > The growth should be trivial for MGLRU and nontrivial for the
> > active/inactive LRU.
> >
> > If this is indeed the case, I'd appreciate very much if anyone could
> > try the following (I'll try it myself too later next week).
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4255619a1a31..020f5d98b9a1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > }
> >
> > /* waiting for writeback */
> > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > - gen = folio_inc_gen(lruvec, folio, true);
> > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > + DEFINE_MAX_SEQ(lruvec);
> > + int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > +
> > + old_gen = folio_update_gen(folio, new_gen);
> > + lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
>
> Sorry missing one line here:
>
> + folio_set_reclaim(folio);
>
> > return true;
> > }

Hi Yu,

I have validated it using the script provided for Axel, but
unfortunately, it still triggers an OOM error with your patch applied.
Here are the results with nr_vmscan_immediate_reclaim:

- non-MGLRU
$ grep nr_vmscan_immediate_reclaim /proc/vmstat
nr_vmscan_immediate_reclaim 47411776

$ ./test.sh 0
1023+0 records in
1023+0 records out
1072693248 bytes (1.1 GB, 1023 MiB) copied, 0.538058 s, 2.0 GB/s

$ grep nr_vmscan_immediate_reclaim /proc/vmstat
nr_vmscan_immediate_reclaim 47412544

- MGLRU
$ grep nr_vmscan_immediate_reclaim /proc/vmstat
nr_vmscan_immediate_reclaim 47412544

$ ./test.sh 1
Killed

$ grep nr_vmscan_immediate_reclaim /proc/vmstat
nr_vmscan_immediate_reclaim 115455600


The detailed OOM info is as follows:

[Wed Mar 13 11:16:48 2024] dd invoked oom-killer:
gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE),
order=3, oom_score_adj=0
[Wed Mar 13 11:16:48 2024] CPU: 12 PID: 6911 Comm: dd Not tainted 6.8.0-rc6+ #24
[Wed Mar 13 11:16:48 2024] Hardware name: Tencent Cloud CVM, BIOS
seabios-1.9.1-qemu-project.org 04/01/2014
[Wed Mar 13 11:16:48 2024] Call Trace:
[Wed Mar 13 11:16:48 2024] <TASK>
[Wed Mar 13 11:16:48 2024] dump_stack_lvl+0x6e/0x90
[Wed Mar 13 11:16:48 2024] dump_stack+0x10/0x20
[Wed Mar 13 11:16:48 2024] dump_header+0x47/0x2d0
[Wed Mar 13 11:16:48 2024] oom_kill_process+0x101/0x2e0
[Wed Mar 13 11:16:48 2024] out_of_memory+0xfc/0x430
[Wed Mar 13 11:16:48 2024] mem_cgroup_out_of_memory+0x13d/0x160
[Wed Mar 13 11:16:48 2024] try_charge_memcg+0x7be/0x850
[Wed Mar 13 11:16:48 2024] ? get_mem_cgroup_from_mm+0x5e/0x420
[Wed Mar 13 11:16:48 2024] ? rcu_read_unlock+0x25/0x70
[Wed Mar 13 11:16:48 2024] __mem_cgroup_charge+0x49/0x90
[Wed Mar 13 11:16:48 2024] __filemap_add_folio+0x277/0x450
[Wed Mar 13 11:16:48 2024] ? __pfx_workingset_update_node+0x10/0x10
[Wed Mar 13 11:16:48 2024] filemap_add_folio+0x3c/0xa0
[Wed Mar 13 11:16:48 2024] __filemap_get_folio+0x13d/0x2f0
[Wed Mar 13 11:16:48 2024] iomap_get_folio+0x4c/0x60
[Wed Mar 13 11:16:48 2024] iomap_write_begin+0x1bb/0x2e0
[Wed Mar 13 11:16:48 2024] iomap_write_iter+0xff/0x290
[Wed Mar 13 11:16:48 2024] iomap_file_buffered_write+0x91/0xf0
[Wed Mar 13 11:16:48 2024] xfs_file_buffered_write+0x9f/0x2d0 [xfs]
[Wed Mar 13 11:16:48 2024] ? vfs_write+0x261/0x530
[Wed Mar 13 11:16:48 2024] ? debug_smp_processor_id+0x17/0x20
[Wed Mar 13 11:16:48 2024] xfs_file_write_iter+0xe9/0x120 [xfs]
[Wed Mar 13 11:16:48 2024] vfs_write+0x37d/0x530
[Wed Mar 13 11:16:48 2024] ksys_write+0x6d/0xf0
[Wed Mar 13 11:16:48 2024] __x64_sys_write+0x19/0x20
[Wed Mar 13 11:16:48 2024] do_syscall_64+0x79/0x1a0
[Wed Mar 13 11:16:48 2024] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[Wed Mar 13 11:16:48 2024] RIP: 0033:0x7f63ea33e927
[Wed Mar 13 11:16:48 2024] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff
ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10
b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54
24 18 48 89 74 24
[Wed Mar 13 11:16:48 2024] RSP: 002b:00007ffc0e874768 EFLAGS: 00000246
ORIG_RAX: 0000000000000001
[Wed Mar 13 11:16:48 2024] RAX: ffffffffffffffda RBX: 0000000000100000
RCX: 00007f63ea33e927
[Wed Mar 13 11:16:48 2024] RDX: 0000000000100000 RSI: 00007f63dcafe000
RDI: 0000000000000001
[Wed Mar 13 11:16:48 2024] RBP: 00007f63dcafe000 R08: 00007f63dcafe000
R09: 0000000000000000
[Wed Mar 13 11:16:48 2024] R10: 0000000000000022 R11: 0000000000000246
R12: 0000000000000000
[Wed Mar 13 11:16:48 2024] R13: 0000000000000000 R14: 0000000000000000
R15: 00007f63dcafe000
[Wed Mar 13 11:16:48 2024] </TASK>
[Wed Mar 13 11:16:48 2024] memory: usage 1048556kB, limit 1048576kB, failcnt 153
[Wed Mar 13 11:16:48 2024] memory+swap: usage 1048556kB, limit
9007199254740988kB, failcnt 0
[Wed Mar 13 11:16:48 2024] kmem: usage 200kB, limit
9007199254740988kB, failcnt 0
[Wed Mar 13 11:16:48 2024] Memory cgroup stats for /mglru:
[Wed Mar 13 11:16:48 2024] cache 1072365568
[Wed Mar 13 11:16:48 2024] rss 1150976
[Wed Mar 13 11:16:48 2024] rss_huge 0
[Wed Mar 13 11:16:48 2024] shmem 0
[Wed Mar 13 11:16:48 2024] mapped_file 0
[Wed Mar 13 11:16:48 2024] dirty 1072365568
[Wed Mar 13 11:16:48 2024] writeback 0
[Wed Mar 13 11:16:48 2024] workingset_refault_anon 0
[Wed Mar 13 11:16:48 2024] workingset_refault_file 0
[Wed Mar 13 11:16:48 2024] swap 0
[Wed Mar 13 11:16:48 2024] swapcached 0
[Wed Mar 13 11:16:48 2024] pgpgin 2783
[Wed Mar 13 11:16:48 2024] pgpgout 1444
[Wed Mar 13 11:16:48 2024] pgfault 885
[Wed Mar 13 11:16:48 2024] pgmajfault 0
[Wed Mar 13 11:16:48 2024] inactive_anon 1146880
[Wed Mar 13 11:16:48 2024] active_anon 4096
[Wed Mar 13 11:16:48 2024] inactive_file 802357248
[Wed Mar 13 11:16:48 2024] active_file 270008320
[Wed Mar 13 11:16:48 2024] unevictable 0
[Wed Mar 13 11:16:48 2024] hierarchical_memory_limit 1073741824
[Wed Mar 13 11:16:48 2024] hierarchical_memsw_limit 9223372036854771712
[Wed Mar 13 11:16:48 2024] total_cache 1072365568
[Wed Mar 13 11:16:48 2024] total_rss 1150976
[Wed Mar 13 11:16:48 2024] total_rss_huge 0
[Wed Mar 13 11:16:48 2024] total_shmem 0
[Wed Mar 13 11:16:48 2024] total_mapped_file 0
[Wed Mar 13 11:16:48 2024] total_dirty 1072365568
[Wed Mar 13 11:16:48 2024] total_writeback 0
[Wed Mar 13 11:16:48 2024] total_workingset_refault_anon 0
[Wed Mar 13 11:16:48 2024] total_workingset_refault_file 0
[Wed Mar 13 11:16:48 2024] total_swap 0
[Wed Mar 13 11:16:48 2024] total_swapcached 0
[Wed Mar 13 11:16:48 2024] total_pgpgin 2783
[Wed Mar 13 11:16:48 2024] total_pgpgout 1444
[Wed Mar 13 11:16:48 2024] total_pgfault 885
[Wed Mar 13 11:16:48 2024] total_pgmajfault 0
[Wed Mar 13 11:16:48 2024] total_inactive_anon 1146880
[Wed Mar 13 11:16:48 2024] total_active_anon 4096
[Wed Mar 13 11:16:48 2024] total_inactive_file 802357248
[Wed Mar 13 11:16:48 2024] total_active_file 270008320
[Wed Mar 13 11:16:48 2024] total_unevictable 0
[Wed Mar 13 11:16:48 2024] Tasks state (memory values in pages):
[Wed Mar 13 11:16:48 2024] [ pid ] uid tgid total_vm rss
rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[Wed Mar 13 11:16:48 2024] [ 6911] 0 6911 55506 640
256 384 0 73728 0 0 dd
[Wed Mar 13 11:16:48 2024]
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/mglru,task_memcg=/mglru,task=dd,pid=6911,uid=0

The key information extracted from the OOM info is as follows:

[Wed Mar 13 11:16:48 2024] cache 1072365568
[Wed Mar 13 11:16:48 2024] dirty 1072365568

This information reveals that all file pages are dirty pages.

As of now, it appears that the most effective solution to address this
issue is to revert the commit 14aa8b2d5c2e. Regarding this commit
14aa8b2d5c2e, its original intention was to eliminate potential SSD
wearout, although there's no concrete data available on how it might
impact SSD longevity. If the concern about SSD wearout is purely
theoretical, it might be reasonable to consider reverting this commit.
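
For context on the size of that revert: if I'm remembering the commit
right (worth double-checking against the actual 14aa8b2d5c2e diff rather
than taking this as authoritative), all it removed was a single
system-wide flusher kick at the end of each MGLRU aging cycle, i.e. a
revert would essentially reinstate one call along these lines:

/* end of an aging cycle -- exact placement from memory, unverified */
wakeup_flusher_threads(WB_REASON_VMSCAN);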

2024-03-13 11:00:27

by Hillf Danton

[permalink] [raw]
Subject: Re: MGLRU premature memcg OOM on slow writes

On Tue, 12 Mar 2024 17:08:22 -0400 Johannes Weiner <[email protected]>
>
> Back to the broader question though: if reclaim demand outstrips clean
> pages and the only viable candidates are dirty ones (e.g. an
> allocation spike in the presence of dirty/writeback pages), there only
> seem to be 3 options:
>
> 1) sleep-wait for writeback
> 2) continue scanning, aka busy-wait for writeback + age inversions
> 3) find nothing and declare OOM

4) make dirty ratio match your writeback bandwidth [1]

[1] Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
https://lore.kernel.org/lkml/CA+55aFzNe=3e=cDig+vEzZS5jm2c6apPV4s5NKG4eYL4_jxQjQ@mail.gmail.com/

2024-03-14 22:23:34

by Yu Zhao

[permalink] [raw]
Subject: Re: MGLRU premature memcg OOM on slow writes

On Wed, Mar 13, 2024 at 11:33:21AM +0800, Yafang Shao wrote:
> On Wed, Mar 13, 2024 at 4:11 AM Yu Zhao <[email protected]> wrote:
> >
> > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> > > On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote:
> > > > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <[email protected]> wrote:
> > > > >
> > > > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
> > > > > > >
> > > > > > > Axel Rasmussen writes:
> > > > > > > >A couple of dumb questions. In your test, do you have any of the following
> > > > > > > >configured / enabled?
> > > > > > > >
> > > > > > > >/proc/sys/vm/laptop_mode
> > > > > > > >memory.low
> > > > > > > >memory.min
> > > > > > >
> > > > > > > None of these are enabled. The issue is trivially reproducible by writing to
> > > > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > > > > > is also susceptible to this on global reclaim (although it's less likely due to
> > > > > > > page diversity).
> > > > > > >
> > > > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > > > > > >looks like it simply will not do this.
> > > > > > > >
> > > > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > > > > > >makes sense to me at least that doing writeback every time we age is too
> > > > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > > > > > >
> > > > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > > > > > thing at a time :-)
> > > > > >
> > > > > >
> > > > > > Hmm, so I have a patch which I think will help with this situation,
> > > > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > > > > > then I can verify the patch fixes it).
> > > > >
> > > > > We encountered the same premature OOM issue caused by numerous dirty pages.
> > > > > The issue disappears after we revert the commit 14aa8b2d5c2e
> > > > > "mm/mglru: don't sync disk for each aging cycle"
> > > > >
> > > > > To aid in replicating the issue, we've developed a straightforward
> > > > > script, which consistently reproduces it, even on the latest kernel.
> > > > > You can find the script provided below:
> > > > >
> > > > > ```
> > > > > #!/bin/bash
> > > > >
> > > > > MEMCG="/sys/fs/cgroup/memory/mglru"
> > > > > ENABLE=$1
> > > > >
> > > > > # Avoid waking up the flusher
> > > > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> > > > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
> > > > >
> > > > > if [ ! -d ${MEMCG} ]; then
> > > > > mkdir -p ${MEMCG}
> > > > > fi
> > > > >
> > > > > echo $$ > ${MEMCG}/cgroup.procs
> > > > > echo 1g > ${MEMCG}/memory.limit_in_bytes
> > > > >
> > > > > if [ $ENABLE -eq 0 ]; then
> > > > > echo 0 > /sys/kernel/mm/lru_gen/enabled
> > > > > else
> > > > > echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> > > > > fi
> > > > >
> > > > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> > > > > rm -rf /data0/mglru.test
> > > > > ```
> > > > >
> > > > > This issue disappears as well after we disable the mglru.
> > > > >
> > > > > We hope this script proves helpful in identifying and addressing the
> > > > > root cause. We eagerly await your insights and proposed fixes.
> > > >
> > > > Thanks Yafang, I was able to reproduce the issue using this script.
> > > >
> > > > Perhaps interestingly, I was not able to reproduce it with cgroupv2
> > > > memcgs. I know writeback semantics are quite a bit different there, so
> > > > perhaps that explains why.
> > > >
> > > > Unfortunately, it also reproduces even with the commit I had in mind
> > > > (basically stealing the "if (all isolated pages are unqueued dirty) {
> > > > wakeup_flusher_threads(); reclaim_throttle(); }" from
> > > > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
> > > > I'll need to spend some more time on this; I'm planning to send
> > > > something out for testing next week.
> > >
> > > Hi Chris,
> > >
> > > My apologies for not getting back to you sooner.
> > >
> > > And thanks everyone for all the input!
> > >
> > > My take is that Chris' premature OOM kills were NOT really due to
> > > the flusher not waking up or missing throttling.
> > >
> > > Yes, these two are among the differences between the active/inactive
> > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> > > positions of dirty pages. The active/inactive LRU moves dirty pages
> > > all the way to the end of the line (reclaim happens at the front)
> > > whereas MGLRU moves them into the middle, during direct reclaim. The
> > > rationale for MGLRU was that this way those dirty pages would still
> > > be counted as "inactive" (or cold).
> > >
> > > This theory can be quickly verified by comparing how much
> > > nr_vmscan_immediate_reclaim grows, i.e.,
> > >
> > > Before the copy
> > > grep nr_vmscan_immediate_reclaim /proc/vmstat
> > > And then after the copy
> > > grep nr_vmscan_immediate_reclaim /proc/vmstat
> > >
> > > The growth should be trivial for MGLRU and nontrivial for the
> > > active/inactive LRU.
> > >
> > > If this is indeed the case, I'd appreciate very much if anyone could
> > > try the following (I'll try it myself too later next week).
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 4255619a1a31..020f5d98b9a1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > }
> > >
> > > /* waiting for writeback */
> > > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > - gen = folio_inc_gen(lruvec, folio, true);
> > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > + DEFINE_MAX_SEQ(lruvec);
> > > + int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > > +
> > > + old_gen = folio_update_gen(folio, new_gen);
> > > + lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> >
> > Sorry missing one line here:
> >
> > + folio_set_reclaim(folio);
> >
> > > return true;
> > > }
>
> Hi Yu,
>
> I have validated it using the script provided for Axel, but
> unfortunately, it still triggers an OOM error with your patch applied.
> Here are the results with nr_vmscan_immediate_reclaim:

Thanks for debunking it!

> - non-MGLRU
> $ grep nr_vmscan_immediate_reclaim /proc/vmstat
> nr_vmscan_immediate_reclaim 47411776
>
> $ ./test.sh 0
> 1023+0 records in
> 1023+0 records out
> 1072693248 bytes (1.1 GB, 1023 MiB) copied, 0.538058 s, 2.0 GB/s
>
> $ grep nr_vmscan_immediate_reclaim /proc/vmstat
> nr_vmscan_immediate_reclaim 47412544
>
> - MGLRU
> $ grep nr_vmscan_immediate_reclaim /proc/vmstat
> nr_vmscan_immediate_reclaim 47412544
>
> $ ./test.sh 1
> Killed
>
> $ grep nr_vmscan_immediate_reclaim /proc/vmstat
> nr_vmscan_immediate_reclaim 115455600

The delta is ~260GB; I'm still trying to understand how that could happen -- is this reliably reproducible?

> The detailed OOM info is as follows:
>
> [Wed Mar 13 11:16:48 2024] dd invoked oom-killer:
> gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE),
> order=3, oom_score_adj=0
> [Wed Mar 13 11:16:48 2024] CPU: 12 PID: 6911 Comm: dd Not tainted 6.8.0-rc6+ #24
> [Wed Mar 13 11:16:48 2024] Hardware name: Tencent Cloud CVM, BIOS
> seabios-1.9.1-qemu-project.org 04/01/2014
> [Wed Mar 13 11:16:48 2024] Call Trace:
> [Wed Mar 13 11:16:48 2024] <TASK>
> [Wed Mar 13 11:16:48 2024] dump_stack_lvl+0x6e/0x90
> [Wed Mar 13 11:16:48 2024] dump_stack+0x10/0x20
> [Wed Mar 13 11:16:48 2024] dump_header+0x47/0x2d0
> [Wed Mar 13 11:16:48 2024] oom_kill_process+0x101/0x2e0
> [Wed Mar 13 11:16:48 2024] out_of_memory+0xfc/0x430
> [Wed Mar 13 11:16:48 2024] mem_cgroup_out_of_memory+0x13d/0x160
> [Wed Mar 13 11:16:48 2024] try_charge_memcg+0x7be/0x850
> [Wed Mar 13 11:16:48 2024] ? get_mem_cgroup_from_mm+0x5e/0x420
> [Wed Mar 13 11:16:48 2024] ? rcu_read_unlock+0x25/0x70
> [Wed Mar 13 11:16:48 2024] __mem_cgroup_charge+0x49/0x90
> [Wed Mar 13 11:16:48 2024] __filemap_add_folio+0x277/0x450
> [Wed Mar 13 11:16:48 2024] ? __pfx_workingset_update_node+0x10/0x10
> [Wed Mar 13 11:16:48 2024] filemap_add_folio+0x3c/0xa0
> [Wed Mar 13 11:16:48 2024] __filemap_get_folio+0x13d/0x2f0
> [Wed Mar 13 11:16:48 2024] iomap_get_folio+0x4c/0x60
> [Wed Mar 13 11:16:48 2024] iomap_write_begin+0x1bb/0x2e0
> [Wed Mar 13 11:16:48 2024] iomap_write_iter+0xff/0x290
> [Wed Mar 13 11:16:48 2024] iomap_file_buffered_write+0x91/0xf0
> [Wed Mar 13 11:16:48 2024] xfs_file_buffered_write+0x9f/0x2d0 [xfs]
> [Wed Mar 13 11:16:48 2024] ? vfs_write+0x261/0x530
> [Wed Mar 13 11:16:48 2024] ? debug_smp_processor_id+0x17/0x20
> [Wed Mar 13 11:16:48 2024] xfs_file_write_iter+0xe9/0x120 [xfs]
> [Wed Mar 13 11:16:48 2024] vfs_write+0x37d/0x530
> [Wed Mar 13 11:16:48 2024] ksys_write+0x6d/0xf0
> [Wed Mar 13 11:16:48 2024] __x64_sys_write+0x19/0x20
> [Wed Mar 13 11:16:48 2024] do_syscall_64+0x79/0x1a0
> [Wed Mar 13 11:16:48 2024] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [Wed Mar 13 11:16:48 2024] RIP: 0033:0x7f63ea33e927
> [Wed Mar 13 11:16:48 2024] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff
> ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10
> b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54
> 24 18 48 89 74 24
> [Wed Mar 13 11:16:48 2024] RSP: 002b:00007ffc0e874768 EFLAGS: 00000246
> ORIG_RAX: 0000000000000001
> [Wed Mar 13 11:16:48 2024] RAX: ffffffffffffffda RBX: 0000000000100000
> RCX: 00007f63ea33e927
> [Wed Mar 13 11:16:48 2024] RDX: 0000000000100000 RSI: 00007f63dcafe000
> RDI: 0000000000000001
> [Wed Mar 13 11:16:48 2024] RBP: 00007f63dcafe000 R08: 00007f63dcafe000
> R09: 0000000000000000
> [Wed Mar 13 11:16:48 2024] R10: 0000000000000022 R11: 0000000000000246
> R12: 0000000000000000
> [Wed Mar 13 11:16:48 2024] R13: 0000000000000000 R14: 0000000000000000
> R15: 00007f63dcafe000
> [Wed Mar 13 11:16:48 2024] </TASK>
> [Wed Mar 13 11:16:48 2024] memory: usage 1048556kB, limit 1048576kB, failcnt 153
> [Wed Mar 13 11:16:48 2024] memory+swap: usage 1048556kB, limit

I see you were actually on cgroup v1 -- this might be a different
problem than Chris' since he was on v2.

For v1, the throttling is done by commit 81a70c21d9
("mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1").
IOW, the active/inactive LRU throttles in both v1 and v2 (done
in different ways) whereas MGLRU doesn't in either case.
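
To make the v1/v2 split concrete, the helper that gates it looks roughly
like this (paraphrased; part of it also shows up in the diff below):

/* mm/vmscan.c */
static bool writeback_throttling_sane(struct scan_control *sc)
{
	if (!cgroup_reclaim(sc))
		return true;	/* global reclaim: balance_dirty_pages() throttles writers */
#ifdef CONFIG_CGROUP_WRITEBACK
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return true;	/* cgroup v2: cgroup-aware writeback handles it */
#endif
	return false;		/* cgroup v1: reclaim has to wait for writeback itself */
}

Commit 81a70c21d9 keys the extra reclaim_throttle(pgdat,
VMSCAN_THROTTLE_WRITEBACK) call in shrink_inactive_list() off the
"return false" case above -- which, again, MGLRU never reaches.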

> 9007199254740988kB, failcnt 0
> [Wed Mar 13 11:16:48 2024] kmem: usage 200kB, limit
> 9007199254740988kB, failcnt 0
> [Wed Mar 13 11:16:48 2024] Memory cgroup stats for /mglru:
> [Wed Mar 13 11:16:48 2024] cache 1072365568
> [Wed Mar 13 11:16:48 2024] rss 1150976
> [Wed Mar 13 11:16:48 2024] rss_huge 0
> [Wed Mar 13 11:16:48 2024] shmem 0
> [Wed Mar 13 11:16:48 2024] mapped_file 0
> [Wed Mar 13 11:16:48 2024] dirty 1072365568
> [Wed Mar 13 11:16:48 2024] writeback 0
> [Wed Mar 13 11:16:48 2024] workingset_refault_anon 0
> [Wed Mar 13 11:16:48 2024] workingset_refault_file 0
> [Wed Mar 13 11:16:48 2024] swap 0
> [Wed Mar 13 11:16:48 2024] swapcached 0
> [Wed Mar 13 11:16:48 2024] pgpgin 2783
> [Wed Mar 13 11:16:48 2024] pgpgout 1444
> [Wed Mar 13 11:16:48 2024] pgfault 885
> [Wed Mar 13 11:16:48 2024] pgmajfault 0
> [Wed Mar 13 11:16:48 2024] inactive_anon 1146880
> [Wed Mar 13 11:16:48 2024] active_anon 4096
> [Wed Mar 13 11:16:48 2024] inactive_file 802357248
> [Wed Mar 13 11:16:48 2024] active_file 270008320
> [Wed Mar 13 11:16:48 2024] unevictable 0
> [Wed Mar 13 11:16:48 2024] hierarchical_memory_limit 1073741824
> [Wed Mar 13 11:16:48 2024] hierarchical_memsw_limit 9223372036854771712
> [Wed Mar 13 11:16:48 2024] total_cache 1072365568
> [Wed Mar 13 11:16:48 2024] total_rss 1150976
> [Wed Mar 13 11:16:48 2024] total_rss_huge 0
> [Wed Mar 13 11:16:48 2024] total_shmem 0
> [Wed Mar 13 11:16:48 2024] total_mapped_file 0
> [Wed Mar 13 11:16:48 2024] total_dirty 1072365568
> [Wed Mar 13 11:16:48 2024] total_writeback 0
> [Wed Mar 13 11:16:48 2024] total_workingset_refault_anon 0
> [Wed Mar 13 11:16:48 2024] total_workingset_refault_file 0
> [Wed Mar 13 11:16:48 2024] total_swap 0
> [Wed Mar 13 11:16:48 2024] total_swapcached 0
> [Wed Mar 13 11:16:48 2024] total_pgpgin 2783
> [Wed Mar 13 11:16:48 2024] total_pgpgout 1444
> [Wed Mar 13 11:16:48 2024] total_pgfault 885
> [Wed Mar 13 11:16:48 2024] total_pgmajfault 0
> [Wed Mar 13 11:16:48 2024] total_inactive_anon 1146880
> [Wed Mar 13 11:16:48 2024] total_active_anon 4096
> [Wed Mar 13 11:16:48 2024] total_inactive_file 802357248
> [Wed Mar 13 11:16:48 2024] total_active_file 270008320
> [Wed Mar 13 11:16:48 2024] total_unevictable 0
> [Wed Mar 13 11:16:48 2024] Tasks state (memory values in pages):
> [Wed Mar 13 11:16:48 2024] [ pid ] uid tgid total_vm rss
> rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
> [Wed Mar 13 11:16:48 2024] [ 6911] 0 6911 55506 640
> 256 384 0 73728 0 0 dd
> [Wed Mar 13 11:16:48 2024]
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/mglru,task_memcg=/mglru,task=dd,pid=6911,uid=0
>
> The key information extracted from the OOM info is as follows:
>
> [Wed Mar 13 11:16:48 2024] cache 1072365568
> [Wed Mar 13 11:16:48 2024] dirty 1072365568
>
> This information reveals that all file pages are dirty pages.

I'm surprised to see there were 0 pages under writeback:
[Wed Mar 13 11:16:48 2024] total_writeback 0
What's your dirty limit?

It's unfortunate that the mainline has no per-memcg dirty limit. (We
do at Google.)

> As of now, it appears that the most effective solution to address this
> issue is to revert the commit 14aa8b2d5c2e. Regarding this commit
> 14aa8b2d5c2e, its original intention was to eliminate potential SSD
> wearout, although there's no concrete data available on how it might
> impact SSD longevity. If the concern about SSD wearout is purely
> theoretical, it might be reasonable to consider reverting this commit.

The SSD wearout problem was real -- it wasn't really due to
wakeup_flusher_threads() itself; rather, the original MGLRU code called
the function improperly. It needs to be called under more restricted
conditions so that it doesn't cause the SSD wearout problem again.
However, IMO, wakeup_flusher_threads() is just another bandaid trying
to work around a more fundamental problem. There is no guarantee that
the flusher will target the dirty pages in the memcg under reclaim,
right?
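
For reference, the wakeup interface is global -- neither declaration
below takes anything that could scope the flush to the memcg under
reclaim (the bdi variant at least narrows it to one backing device):

/* include/linux/writeback.h */
void wakeup_flusher_threads(enum wb_reason reason);
void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
				enum wb_reason reason);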

Do you mind trying the following first to see if we can get around
the problem without calling wakeup_flusher_threads().

Thanks!

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4255619a1a31..d3cfbd95996d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -225,7 +225,7 @@ static bool writeback_throttling_sane(struct scan_control *sc)
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
return true;
#endif
- return false;
+ return lru_gen_enabled();
}
#else
static bool cgroup_reclaim(struct scan_control *sc)
@@ -4273,8 +4273,10 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}

/* waiting for writeback */
- if (folio_test_locked(folio) || folio_test_writeback(folio) ||
- (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+ if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+ sc->nr.dirty += delta;
+ if (!folio_test_reclaim(folio))
+ sc->nr.congested += delta;
gen = folio_inc_gen(lruvec, folio, true);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;

2024-03-15 02:39:20

by Yafang Shao

[permalink] [raw]
Subject: Re: MGLRU premature memcg OOM on slow writes

On Fri, Mar 15, 2024 at 6:23 AM Yu Zhao <[email protected]> wrote:
>
> On Wed, Mar 13, 2024 at 11:33:21AM +0800, Yafang Shao wrote:
> > On Wed, Mar 13, 2024 at 4:11 AM Yu Zhao <[email protected]> wrote:
> > >
> > > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> > > > On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote:
> > > > > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <[email protected]> wrote:
> > > > > >
> > > > > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Axel Rasmussen writes:
> > > > > > > > >A couple of dumb questions. In your test, do you have any of the following
> > > > > > > > >configured / enabled?
> > > > > > > > >
> > > > > > > > >/proc/sys/vm/laptop_mode
> > > > > > > > >memory.low
> > > > > > > > >memory.min
> > > > > > > >
> > > > > > > > None of these are enabled. The issue is trivially reproducible by writing to
> > > > > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > > > > > > is also susceptible to this on global reclaim (although it's less likely due to
> > > > > > > > page diversity).
> > > > > > > >
> > > > > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > > > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > > > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > > > > > > >looks like it simply will not do this.
> > > > > > > > >
> > > > > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > > > > > > >makes sense to me at least that doing writeback every time we age is too
> > > > > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > > > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > > > > > > >
> > > > > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > > > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > > > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > > > > > > thing at a time :-)
> > > > > > >
> > > > > > >
> > > > > > > Hmm, so I have a patch which I think will help with this situation,
> > > > > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > > > > > > then I can verify the patch fixes it).
> > > > > >
> > > > > > We encountered the same premature OOM issue caused by numerous dirty pages.
> > > > > > The issue disappears after we revert the commit 14aa8b2d5c2e
> > > > > > "mm/mglru: don't sync disk for each aging cycle"
> > > > > >
> > > > > > To aid in replicating the issue, we've developed a straightforward
> > > > > > script, which consistently reproduces it, even on the latest kernel.
> > > > > > You can find the script provided below:
> > > > > >
> > > > > > ```
> > > > > > #!/bin/bash
> > > > > >
> > > > > > MEMCG="/sys/fs/cgroup/memory/mglru"
> > > > > > ENABLE=$1
> > > > > >
> > > > > > # Avoid waking up the flusher
> > > > > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> > > > > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
> > > > > >
> > > > > > if [ ! -d ${MEMCG} ]; then
> > > > > > mkdir -p ${MEMCG}
> > > > > > fi
> > > > > >
> > > > > > echo $$ > ${MEMCG}/cgroup.procs
> > > > > > echo 1g > ${MEMCG}/memory.limit_in_bytes
> > > > > >
> > > > > > if [ $ENABLE -eq 0 ]; then
> > > > > > echo 0 > /sys/kernel/mm/lru_gen/enabled
> > > > > > else
> > > > > > echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> > > > > > fi
> > > > > >
> > > > > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> > > > > > rm -rf /data0/mglru.test
> > > > > > ```
> > > > > >
> > > > > > This issue disappears as well after we disable the mglru.
> > > > > >
> > > > > > We hope this script proves helpful in identifying and addressing the
> > > > > > root cause. We eagerly await your insights and proposed fixes.
> > > > >
> > > > > Thanks Yafang, I was able to reproduce the issue using this script.
> > > > >
> > > > > Perhaps interestingly, I was not able to reproduce it with cgroupv2
> > > > > memcgs. I know writeback semantics are quite a bit different there, so
> > > > > perhaps that explains why.
> > > > >
> > > > > Unfortunately, it also reproduces even with the commit I had in mind
> > > > > (basically stealing the "if (all isolated pages are unqueued dirty) {
> > > > > wakeup_flusher_threads(); reclaim_throttle(); }" from
> > > > > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
> > > > > I'll need to spend some more time on this; I'm planning to send
> > > > > something out for testing next week.
> > > >
> > > > Hi Chris,
> > > >
> > > > My apologies for not getting back to you sooner.
> > > >
> > > > And thanks everyone for all the input!
> > > >
> > > > My take is that Chris' premature OOM kills were NOT really due to
> > > > the flusher not waking up or missing throttling.
> > > >
> > > > Yes, these two are among the differences between the active/inactive
> > > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> > > > positions of dirty pages. The active/inactive LRU moves dirty pages
> > > > all the way to the end of the line (reclaim happens at the front)
> > > > whereas MGLRU moves them into the middle, during direct reclaim. The
> > > > rationale for MGLRU was that this way those dirty pages would still
> > > > be counted as "inactive" (or cold).
> > > >
> > > > This theory can be quickly verified by comparing how much
> > > > nr_vmscan_immediate_reclaim grows, i.e.,
> > > >
> > > > Before the copy
> > > > grep nr_vmscan_immediate_reclaim /proc/vmstat
> > > > And then after the copy
> > > > grep nr_vmscan_immediate_reclaim /proc/vmstat
> > > >
> > > > The growth should be trivial for MGLRU and nontrivial for the
> > > > active/inactive LRU.
> > > >
> > > > If this is indeed the case, I'd appreciate very much if anyone could
> > > > try the following (I'll try it myself too later next week).
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 4255619a1a31..020f5d98b9a1 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > > }
> > > >
> > > > /* waiting for writeback */
> > > > - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > > - gen = folio_inc_gen(lruvec, folio, true);
> > > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > > + DEFINE_MAX_SEQ(lruvec);
> > > > + int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > > > +
> > > > + old_gen = folio_update_gen(folio, new_gen);
> > > > + lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > > > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> > >
> > > Sorry missing one line here:
> > >
> > > + folio_set_reclaim(folio);
> > >
> > > > return true;
> > > > }
> >
> > Hi Yu,
> >
> > I have validated it using the script provided for Axel, but
> > unfortunately, it still triggers an OOM error with your patch applied.
> > Here are the results with nr_vmscan_immediate_reclaim:
>
> Thanks for debunking it!
>
> > - non-MGLRU
> > $ grep nr_vmscan_immediate_reclaim /proc/vmstat
> > nr_vmscan_immediate_reclaim 47411776
> >
> > $ ./test.sh 0
> > 1023+0 records in
> > 1023+0 records out
> > 1072693248 bytes (1.1 GB, 1023 MiB) copied, 0.538058 s, 2.0 GB/s
> >
> > $ grep nr_vmscan_immediate_reclaim /proc/vmstat
> > nr_vmscan_immediate_reclaim 47412544
> >
> > - MGLRU
> > $ grep nr_vmscan_immediate_reclaim /proc/vmstat
> > nr_vmscan_immediate_reclaim 47412544
> >
> > $ ./test.sh 1
> > Killed
> >
> > $ grep nr_vmscan_immediate_reclaim /proc/vmstat
> > nr_vmscan_immediate_reclaim 115455600
>
> The delta is ~260GB; I'm still trying to understand how that could happen -- is this reliably reproducible?

Yes, it is reliably reproducible on cgroup1 with the script provided earlier, run as follows:

$ ./test.sh 1

>
> > The detailed OOM info is as follows:
> >
> > [Wed Mar 13 11:16:48 2024] dd invoked oom-killer:
> > gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE),
> > order=3, oom_score_adj=0
> > [Wed Mar 13 11:16:48 2024] CPU: 12 PID: 6911 Comm: dd Not tainted 6.8.0-rc6+ #24
> > [Wed Mar 13 11:16:48 2024] Hardware name: Tencent Cloud CVM, BIOS
> > seabios-1.9.1-qemu-project.org 04/01/2014
> > [Wed Mar 13 11:16:48 2024] Call Trace:
> > [Wed Mar 13 11:16:48 2024] <TASK>
> > [Wed Mar 13 11:16:48 2024] dump_stack_lvl+0x6e/0x90
> > [Wed Mar 13 11:16:48 2024] dump_stack+0x10/0x20
> > [Wed Mar 13 11:16:48 2024] dump_header+0x47/0x2d0
> > [Wed Mar 13 11:16:48 2024] oom_kill_process+0x101/0x2e0
> > [Wed Mar 13 11:16:48 2024] out_of_memory+0xfc/0x430
> > [Wed Mar 13 11:16:48 2024] mem_cgroup_out_of_memory+0x13d/0x160
> > [Wed Mar 13 11:16:48 2024] try_charge_memcg+0x7be/0x850
> > [Wed Mar 13 11:16:48 2024] ? get_mem_cgroup_from_mm+0x5e/0x420
> > [Wed Mar 13 11:16:48 2024] ? rcu_read_unlock+0x25/0x70
> > [Wed Mar 13 11:16:48 2024] __mem_cgroup_charge+0x49/0x90
> > [Wed Mar 13 11:16:48 2024] __filemap_add_folio+0x277/0x450
> > [Wed Mar 13 11:16:48 2024] ? __pfx_workingset_update_node+0x10/0x10
> > [Wed Mar 13 11:16:48 2024] filemap_add_folio+0x3c/0xa0
> > [Wed Mar 13 11:16:48 2024] __filemap_get_folio+0x13d/0x2f0
> > [Wed Mar 13 11:16:48 2024] iomap_get_folio+0x4c/0x60
> > [Wed Mar 13 11:16:48 2024] iomap_write_begin+0x1bb/0x2e0
> > [Wed Mar 13 11:16:48 2024] iomap_write_iter+0xff/0x290
> > [Wed Mar 13 11:16:48 2024] iomap_file_buffered_write+0x91/0xf0
> > [Wed Mar 13 11:16:48 2024] xfs_file_buffered_write+0x9f/0x2d0 [xfs]
> > [Wed Mar 13 11:16:48 2024] ? vfs_write+0x261/0x530
> > [Wed Mar 13 11:16:48 2024] ? debug_smp_processor_id+0x17/0x20
> > [Wed Mar 13 11:16:48 2024] xfs_file_write_iter+0xe9/0x120 [xfs]
> > [Wed Mar 13 11:16:48 2024] vfs_write+0x37d/0x530
> > [Wed Mar 13 11:16:48 2024] ksys_write+0x6d/0xf0
> > [Wed Mar 13 11:16:48 2024] __x64_sys_write+0x19/0x20
> > [Wed Mar 13 11:16:48 2024] do_syscall_64+0x79/0x1a0
> > [Wed Mar 13 11:16:48 2024] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > [Wed Mar 13 11:16:48 2024] RIP: 0033:0x7f63ea33e927
> > [Wed Mar 13 11:16:48 2024] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff
> > ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10
> > b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54
> > 24 18 48 89 74 24
> > [Wed Mar 13 11:16:48 2024] RSP: 002b:00007ffc0e874768 EFLAGS: 00000246
> > ORIG_RAX: 0000000000000001
> > [Wed Mar 13 11:16:48 2024] RAX: ffffffffffffffda RBX: 0000000000100000
> > RCX: 00007f63ea33e927
> > [Wed Mar 13 11:16:48 2024] RDX: 0000000000100000 RSI: 00007f63dcafe000
> > RDI: 0000000000000001
> > [Wed Mar 13 11:16:48 2024] RBP: 00007f63dcafe000 R08: 00007f63dcafe000
> > R09: 0000000000000000
> > [Wed Mar 13 11:16:48 2024] R10: 0000000000000022 R11: 0000000000000246
> > R12: 0000000000000000
> > [Wed Mar 13 11:16:48 2024] R13: 0000000000000000 R14: 0000000000000000
> > R15: 00007f63dcafe000
> > [Wed Mar 13 11:16:48 2024] </TASK>
> > [Wed Mar 13 11:16:48 2024] memory: usage 1048556kB, limit 1048576kB, failcnt 153
> > [Wed Mar 13 11:16:48 2024] memory+swap: usage 1048556kB, limit
>
> I see you were actually on cgroup v1 -- this might be a different
> problem than Chris' since he was on v2.

Right, we are still using cgroup1. They might not be the same issue.

>
> For v1, the throttling is done by commit 81a70c21d9
> ("mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1").
> IOW, the active/inactive LRU throttles in both v1 and v2 (done
> in different ways) whereas MGLRU doesn't in either case.
>
> > 9007199254740988kB, failcnt 0
> > [Wed Mar 13 11:16:48 2024] kmem: usage 200kB, limit
> > 9007199254740988kB, failcnt 0
> > [Wed Mar 13 11:16:48 2024] Memory cgroup stats for /mglru:
> > [Wed Mar 13 11:16:48 2024] cache 1072365568
> > [Wed Mar 13 11:16:48 2024] rss 1150976
> > [Wed Mar 13 11:16:48 2024] rss_huge 0
> > [Wed Mar 13 11:16:48 2024] shmem 0
> > [Wed Mar 13 11:16:48 2024] mapped_file 0
> > [Wed Mar 13 11:16:48 2024] dirty 1072365568
> > [Wed Mar 13 11:16:48 2024] writeback 0
> > [Wed Mar 13 11:16:48 2024] workingset_refault_anon 0
> > [Wed Mar 13 11:16:48 2024] workingset_refault_file 0
> > [Wed Mar 13 11:16:48 2024] swap 0
> > [Wed Mar 13 11:16:48 2024] swapcached 0
> > [Wed Mar 13 11:16:48 2024] pgpgin 2783
> > [Wed Mar 13 11:16:48 2024] pgpgout 1444
> > [Wed Mar 13 11:16:48 2024] pgfault 885
> > [Wed Mar 13 11:16:48 2024] pgmajfault 0
> > [Wed Mar 13 11:16:48 2024] inactive_anon 1146880
> > [Wed Mar 13 11:16:48 2024] active_anon 4096
> > [Wed Mar 13 11:16:48 2024] inactive_file 802357248
> > [Wed Mar 13 11:16:48 2024] active_file 270008320
> > [Wed Mar 13 11:16:48 2024] unevictable 0
> > [Wed Mar 13 11:16:48 2024] hierarchical_memory_limit 1073741824
> > [Wed Mar 13 11:16:48 2024] hierarchical_memsw_limit 9223372036854771712
> > [Wed Mar 13 11:16:48 2024] total_cache 1072365568
> > [Wed Mar 13 11:16:48 2024] total_rss 1150976
> > [Wed Mar 13 11:16:48 2024] total_rss_huge 0
> > [Wed Mar 13 11:16:48 2024] total_shmem 0
> > [Wed Mar 13 11:16:48 2024] total_mapped_file 0
> > [Wed Mar 13 11:16:48 2024] total_dirty 1072365568
> > [Wed Mar 13 11:16:48 2024] total_writeback 0
> > [Wed Mar 13 11:16:48 2024] total_workingset_refault_anon 0
> > [Wed Mar 13 11:16:48 2024] total_workingset_refault_file 0
> > [Wed Mar 13 11:16:48 2024] total_swap 0
> > [Wed Mar 13 11:16:48 2024] total_swapcached 0
> > [Wed Mar 13 11:16:48 2024] total_pgpgin 2783
> > [Wed Mar 13 11:16:48 2024] total_pgpgout 1444
> > [Wed Mar 13 11:16:48 2024] total_pgfault 885
> > [Wed Mar 13 11:16:48 2024] total_pgmajfault 0
> > [Wed Mar 13 11:16:48 2024] total_inactive_anon 1146880
> > [Wed Mar 13 11:16:48 2024] total_active_anon 4096
> > [Wed Mar 13 11:16:48 2024] total_inactive_file 802357248
> > [Wed Mar 13 11:16:48 2024] total_active_file 270008320
> > [Wed Mar 13 11:16:48 2024] total_unevictable 0
> > [Wed Mar 13 11:16:48 2024] Tasks state (memory values in pages):
> > [Wed Mar 13 11:16:48 2024] [ pid ] uid tgid total_vm rss
> > rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
> > [Wed Mar 13 11:16:48 2024] [ 6911] 0 6911 55506 640
> > 256 384 0 73728 0 0 dd
> > [Wed Mar 13 11:16:48 2024]
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/mglru,task_memcg=/mglru,task=dd,pid=6911,uid=0
> >
> > The key information extracted from the OOM info is as follows:
> >
> > [Wed Mar 13 11:16:48 2024] cache 1072365568
> > [Wed Mar 13 11:16:48 2024] dirty 1072365568
> >
> > This information reveals that all file pages are dirty pages.
>
> I'm surprised to see there were 0 pages under writeback:
> [Wed Mar 13 11:16:48 2024] total_writeback 0
> What's your dirty limit?

The background dirty threshold is 2G, and the dirty threshold is 4G.

sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 2))
sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4))

>
> It's unfortunate that the mainline has no per-memcg dirty limit. (We
> do at Google.)

Per-memcg dirty limit is a useful feature. We also support it in our
local kernel, but we didn't enable it for this test case.
It is unclear why the memcg maintainers insist on rejecting the
per-memcg dirty limit :(

>
> > As of now, it appears that the most effective solution to address this
> > issue is to revert the commit 14aa8b2d5c2e. Regarding this commit
> > 14aa8b2d5c2e, its original intention was to eliminate potential SSD
> > wearout, although there's no concrete data available on how it might
> > impact SSD longevity. If the concern about SSD wearout is purely
> > theoretical, it might be reasonable to consider reverting this commit.
>
> The SSD wearout problem was real -- it wasn't really due to
> wakeup_flusher_threads() itself; rather, the original MGLRU code called
> the function improperly. It needs to be called under more restricted
> conditions so that it doesn't cause the SSD wearout problem again.
> However, IMO, wakeup_flusher_threads() is just another bandaid trying
> to work around a more fundamental problem. There is no guarantee that
> the flusher will target the dirty pages in the memcg under reclaim,
> right?

Right, it is a system-wide flusher.

>
> Do you mind trying the following first to see if we can get around
> the problem without calling wakeup_flusher_threads().

I have tried it, but it still triggers the OOM. Below is the information.

[ 71.713649] dd invoked oom-killer:
gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE),
order=3, oom_score_adj=0
[ 71.716317] CPU: 60 PID: 7218 Comm: dd Not tainted 6.8.0-rc6+ #26
[ 71.717677] Call Trace:
[ 71.717917] <TASK>
[ 71.718137] dump_stack_lvl+0x6e/0x90
[ 71.718485] dump_stack+0x10/0x20
[ 71.718799] dump_header+0x47/0x2d0
[ 71.719147] oom_kill_process+0x101/0x2e0
[ 71.719523] out_of_memory+0xfc/0x430
[ 71.719868] mem_cgroup_out_of_memory+0x13d/0x160
[ 71.720322] try_charge_memcg+0x7be/0x850
[ 71.720701] ? get_mem_cgroup_from_mm+0x5e/0x420
[ 71.721137] ? rcu_read_unlock+0x25/0x70
[ 71.721506] __mem_cgroup_charge+0x49/0x90
[ 71.721887] __filemap_add_folio+0x277/0x450
[ 71.722304] ? __pfx_workingset_update_node+0x10/0x10
[ 71.722773] filemap_add_folio+0x3c/0xa0
[ 71.723149] __filemap_get_folio+0x13d/0x2f0
[ 71.723551] iomap_get_folio+0x4c/0x60
[ 71.723911] iomap_write_begin+0x1bb/0x2e0
[ 71.724309] iomap_write_iter+0xff/0x290
[ 71.724683] iomap_file_buffered_write+0x91/0xf0
[ 71.725140] xfs_file_buffered_write+0x9f/0x2d0 [xfs]
[ 71.725793] ? vfs_write+0x261/0x530
[ 71.726148] ? debug_smp_processor_id+0x17/0x20
[ 71.726574] xfs_file_write_iter+0xe9/0x120 [xfs]
[ 71.727161] vfs_write+0x37d/0x530
[ 71.727501] ksys_write+0x6d/0xf0
[ 71.727821] __x64_sys_write+0x19/0x20
[ 71.728181] do_syscall_64+0x79/0x1a0
[ 71.728529] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 71.729002] RIP: 0033:0x7fd77053e927
[ 71.729340] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7
0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
74 24
[ 71.730988] RSP: 002b:00007fff032b7218 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[ 71.731664] RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007fd77053e927
[ 71.732308] RDX: 0000000000100000 RSI: 00007fd762cfe000 RDI: 0000000000000001
[ 71.732955] RBP: 00007fd762cfe000 R08: 00007fd762cfe000 R09: 0000000000000000
[ 71.733592] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
[ 71.734237] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fd762cfe000
[ 71.735175] </TASK>
[ 71.736115] memory: usage 1048548kB, limit 1048576kB, failcnt 114
[ 71.736123] memory+swap: usage 1048548kB, limit 9007199254740988kB, failcnt 0
[ 71.736127] kmem: usage 184kB, limit 9007199254740988kB, failcnt 0
[ 71.736131] Memory cgroup stats for /mglru:
[ 71.736364] cache 1072300032
[ 71.736370] rss 1224704
[ 71.736373] rss_huge 0
[ 71.736376] shmem 0
[ 71.736380] mapped_file 0
[ 71.736383] dirty 1072300032
[ 71.736386] writeback 0
[ 71.736389] workingset_refault_anon 0
[ 71.736393] workingset_refault_file 0
[ 71.736396] swap 0
[ 71.736400] swapcached 0
[ 71.736403] pgpgin 2782
[ 71.736406] pgpgout 1427
[ 71.736410] pgfault 882
[ 71.736414] pgmajfault 0
[ 71.736417] inactive_anon 0
[ 71.736421] active_anon 1220608
[ 71.736424] inactive_file 0
[ 71.736428] active_file 1072300032
[ 71.736431] unevictable 0
[ 71.736435] hierarchical_memory_limit 1073741824
[ 71.736438] hierarchical_memsw_limit 9223372036854771712
[ 71.736442] total_cache 1072300032
[ 71.736445] total_rss 1224704
[ 71.736448] total_rss_huge 0
[ 71.736451] total_shmem 0
[ 71.736455] total_mapped_file 0
[ 71.736458] total_dirty 1072300032
[ 71.736462] total_writeback 0
[ 71.736465] total_workingset_refault_anon 0
[ 71.736469] total_workingset_refault_file 0
[ 71.736472] total_swap 0
[ 71.736475] total_swapcached 0
[ 71.736478] total_pgpgin 2782
[ 71.736482] total_pgpgout 1427
[ 71.736485] total_pgfault 882
[ 71.736488] total_pgmajfault 0
[ 71.736491] total_inactive_anon 0
[ 71.736494] total_active_anon 1220608
[ 71.736497] total_inactive_file 0
[ 71.736501] total_active_file 1072300032
[ 71.736504] total_unevictable 0
[ 71.736508] Tasks state (memory values in pages):
[ 71.736512] [ pid ] uid tgid total_vm rss rss_anon
rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[ 71.736522] [ 7215] 0 7215 55663 768 0
768 0 81920 0 0 test.sh
[ 71.736586] [ 7218] 0 7218 55506 640 256
384 0 69632 0 0 dd
[ 71.736596] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/mglru,task_memcg=/mglru,task=test.sh,pid=7215,uid=0
[ 71.736766] Memory cgroup out of memory: Killed process 7215
(test.sh) total-vm:222652kB, anon-rss:0kB, file-rss:3072kB,
shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0

And the key information:

[ 71.736442] total_cache 1072300032
[ 71.736458] total_dirty 1072300032
[ 71.736462] total_writeback 0

>
> Thanks!
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..d3cfbd95996d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -225,7 +225,7 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return true;
> #endif
> - return false;
> + return lru_gen_enabled();
> }
> #else
> static bool cgroup_reclaim(struct scan_control *sc)
> @@ -4273,8 +4273,10 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> }
>
> /* waiting for writeback */
> - if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> - (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + sc->nr.dirty += delta;
> + if (!folio_test_reclaim(folio))
> + sc->nr.congested += delta;
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;


--
Regards
Yafang

2024-03-15 14:27:31

by Johannes Weiner

[permalink] [raw]
Subject: Re: MGLRU premature memcg OOM on slow writes

On Fri, Mar 15, 2024 at 10:38:31AM +0800, Yafang Shao wrote:
> On Fri, Mar 15, 2024 at 6:23 AM Yu Zhao <[email protected]> wrote:
> > I'm surprised to see there were 0 pages under writeback:
> > [Wed Mar 13 11:16:48 2024] total_writeback 0
> > What's your dirty limit?
>
> The background dirty threshold is 2G, and the dirty threshold is 4G.
>
> sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 2))
> sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4))
>
> >
> > It's unfortunate that the mainline has no per-memcg dirty limit. (We
> > do at Google.)
>
> Per-memcg dirty limit is a useful feature. We also support it in our
> local kernel, but we didn't enable it for this test case.
> It is unclear why the memcg maintainers insist on rejecting the
> per-memcg dirty limit :(

I don't think that assessment is fair. It's just that nobody has
seriously proposed it (at least not that I remember) since the
cgroup-aware writeback was merged in 2015.

We run millions of machines with different workloads, memory sizes,
and IO devices, and don't feel the need to tune the settings for the
global dirty limits away from the defaults.

Cgroups allot those allowances in proportion to observed writeback
speed and available memory in the container. We set IO rate and memory
limits per container, and it adapts as necessary.

If you have an actual usecase, I'm more than willing to hear you
out. I'm sure that the other maintainers feel the same.

If you're proposing it as a workaround for cgroup1 being
architecturally unable to implement proper writeback cache management,
then it's a more difficult argument. That's one of the big reasons why
cgroup2 exists after all.

> > > As of now, it appears that the most effective solution to address this
> > > issue is to revert the commit 14aa8b2d5c2e. Regarding this commit
> > > 14aa8b2d5c2e, its original intention was to eliminate potential SSD
> > > wearout, although there's no concrete data available on how it might
> > > impact SSD longevity. If the concern about SSD wearout is purely
> > > theoretical, it might be reasonable to consider reverting this commit.
> >
> > The SSD wearout problem was real -- it wasn't really due to
> > wakeup_flusher_threads() itself; rather, the original MGLRU code called
> > the function improperly. It needs to be called under more restricted
> > conditions so that it doesn't cause the SSD wearout problem again.
> > However, IMO, wakeup_flusher_threads() is just another bandaid trying
> > to work around a more fundamental problem. There is no guarantee that
> > the flusher will target the dirty pages in the memcg under reclaim,
> > right?
>
> Right, it is a system-wide flusher.

Is it possible it was woken up just too frequently?

Conventional reclaim wakes it based on actually observed dirty pages
off the LRU. I'm not super familiar with MGLRU, but it looks like it
woke it on every generational bump? That might indeed be too frequent,
and doesn't seem related to the writeback cache state.

We're monitoring write rates quite closely due to wearout concerns as
well, especially because we use disk swap too. This is the first time
I'm hearing about reclaim-driven wakeups being a concern. (The direct
writepage calls were a huge problem. But not waking the flushers.)

Frankly, I don't think the issue is fixable without bringing the
wakeup back in some form. Even if you had per-cgroup dirty limits. As
soon as you have non-zero dirty pages, you can produce allocation
patterns that drive reclaim into them before background writeback
kicks in.

If reclaim doesn't wake the flushers and waits for writeback, the
premature OOM margin is the size of the background limit - 1.
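
To put numbers on that using this thread's repro (the script sets both
dirty knobs to 4G; the 2G background value mentioned elsewhere in the
thread doesn't change the picture):

  memory.limit_in_bytes      = 1G
  vm.dirty_background_bytes  = 2G or 4G
  vm.dirty_bytes             = 4G

The cgroup hits its 1G limit long before the global background threshold
is ever crossed, so nothing queues the IO unless reclaim does -- which is
exactly the "total_dirty == total_cache, total_writeback 0" state in the
OOM dumps above.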

Yes, cgroup1 and cgroup2 react differently to seeing pages under
writeback: cgroup1 does wait_on_page_writeback(); cgroup2 samples
batches of pages and throttles at a higher level. But both of them
need the flushers woken, or there is nothing to wait for.
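
For completeness, the shrink_folio_list() branch that implements that
split looks roughly like this (paraphrased from memory, with
folio_wait_writeback() as the folio-era name for that wait; the exact
guards may be off):

if (folio_test_writeback(folio)) {
	if (current_is_kswapd() && folio_test_reclaim(folio) &&
	    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
		/* kswapd met this folio before: count it and move on */
		stat->nr_immediate += nr_pages;
		goto activate_locked;
	} else if (writeback_throttling_sane(sc) ||
		   !folio_test_reclaim(folio) ||
		   !may_enter_fs(folio, sc->gfp_mask)) {
		/* cgroup v2 / global: tag it, defer, throttle elsewhere */
		folio_set_reclaim(folio);
		stat->nr_writeback += nr_pages;
		goto activate_locked;
	} else {
		/* cgroup v1: sleep until this folio's IO completes, retry */
		folio_unlock(folio);
		folio_wait_writeback(folio);
		list_add_tail(&folio->lru, folio_list);
		continue;
	}
}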

Unless you want to wait for dirty expiration :)