mkfs -t ext4 invoked the oom-killer on an i386 kernel running on an x86_64
device. This started happening on the linux-next master branch with kernel
tags next-20200430 and next-20200501. We have not bisected this problem.
metadata
git branch: master
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
git describe: next-20200430
make_kernelversion: 5.7.0-rc3
kernel-config:
https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
Steps to reproduce: (always reproducible)
---------------------------
mkfs -t ext4 <external-STORAGE_DEV>
Test log:
------------
+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 05e8451c-1dd6-4d94-b030-0f806653e4b4
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables: 0/7453 done
Writing inode tables: 0/7453 done
Creating journal (262144 blocks): [ 34.739137] mkfs.ext4 invoked
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
oom_score_adj=0
[ 34.748889] CPU: 0 PID: 393 Comm: mkfs.ext4 Not tainted
5.7.0-rc3-next-20200430 #1
[ 34.756450] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[ 34.763844] Call Trace:
[ 34.766305] dump_stack+0x54/0x6e
[ 34.769629] dump_header+0x3d/0x1c6
[ 34.773126] ? oom_badness.part.0+0x10/0x120
[ 34.777397] ? ___ratelimit+0x8f/0xdc
[ 34.781056] oom_kill_process.cold+0x9/0xe
[ 34.785152] out_of_memory+0x1ab/0x260
[ 34.788898] __alloc_pages_nodemask+0xe0e/0xec0
[ 34.793430] pagecache_get_page+0xae/0x260
[ 34.797521] grab_cache_page_write_begin+0x1c/0x30
[ 34.802303] block_write_begin+0x1e/0x90
[ 34.806222] blkdev_write_begin+0x1e/0x20
[ 34.810225] ? bdev_evict_inode+0xd0/0xd0
[ 34.814230] generic_perform_write+0x97/0x180
[ 34.818579] __generic_file_write_iter+0x140/0x1f0
[ 34.823365] blkdev_write_iter+0xc0/0x190
[ 34.827376] __vfs_write+0x132/0x1e0
[ 34.830947] ? __audit_syscall_entry+0xa8/0xe0
[ 34.835385] vfs_write+0xa1/0x1a0
[ 34.838696] ksys_pwrite64+0x50/0x80
[ 34.842267] __ia32_sys_ia32_pwrite64+0x16/0x20
[ 34.846798] do_fast_syscall_32+0x6b/0x270
[ 34.850890] entry_SYSENTER_32+0xa5/0xf8
[ 34.854805] EIP: 0xb7f0d549
[ 34.857596] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
8d 76
[ 34.876334] EAX: ffffffda EBX: 00000003 ECX: b7801010 EDX: 00400000
[ 34.882591] ESI: 38400000 EDI: 00000074 EBP: 07438400 ESP: bfd266f0
[ 34.888847] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[ 34.895630] Mem-Info:
[ 34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0
[ 34.897923] active_file:4151 inactive_file:212494 isolated_file:0
[ 34.897923] unevictable:0 dirty:16505 writeback:6520 unstable:0
[ 34.897923] slab_reclaimable:5855 slab_unreclaimable:3531
[ 34.897923] mapped:6321 shmem:2236 pagetables:178 bounce:0
[ 34.897923] free:264202 free_pcp:1082 free_cma:0
[ 34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
active_file:16604kB inactive_file:849976kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
[ 34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:11964kB unevictable:0kB
writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[ 34.983385] lowmem_reserve[]: 0 825 1947 825
[ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1096kB inactive_file:786400kB unevictable:0kB
writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
local_pcp:500kB free_cma:0kB
[ 35.017427] lowmem_reserve[]: 0 0 8980 0
[ 35.021362] HighMem free:1049496kB min:512kB low:1748kB high:2984kB
reserved_highatomic:0KB active_anon:21464kB inactive_anon:8688kB
active_file:15508kB inactive_file:51612kB unevictable:0kB
writepending:0kB present:1149540kB managed:1149540kB mlocked:0kB
kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:1524kB
local_pcp:292kB free_cma:0kB
[ 35.051717] lowmem_reserve[]: 0 0 0 0
[ 35.055374] DMA: 8*4kB (UE) 1*8kB (E) 1*16kB (E) 0*32kB 0*64kB
0*128kB 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB =
3384kB
[ 35.067446] Normal: 27*4kB (U) 23*8kB (U) 12*16kB (UE) 12*32kB (U)
4*64kB (UE) 2*128kB (U) 2*256kB (UE) 1*512kB (E) 0*1024kB 1*2048kB (U)
0*4096kB = 4452kB
[ 35.081347] HighMem: 2*4kB (UM) 0*8kB 1*16kB (M) 2*32kB (UM) 1*64kB
(U) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 256*4096kB (M) =
1049496kB
[ 35.094634] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=4096kB
[ 35.103059] 218892 total pagecache pages
[ 35.106985] 0 pages in swap cache
[ 35.110303] Swap cache stats: add 0, delete 0, find 0/0
[ 35.115519] Free swap = 0kB
[ 35.118396] Total swap = 0kB
[ 35.121274] 512558 pages RAM
[ 35.124151] 287385 pages HighMem/MovableOnly
[ 35.128418] 9810 pages reserved
[ 35.131563] Tasks state (memory values in pages):
[ 35.136260] [ pid ] uid tgid total_vm rss pgtables_bytes
swapents oom_score_adj name
[ 35.144866] [ 224] 0 224 3425 1273 28672
0 0 systemd-journal
[ 35.153932] [ 241] 0 241 3260 828 20480
0 -1000 systemd-udevd
[ 35.162797] [ 244] 994 244 3929 456 24576
0 0 systemd-timesyn
[ 35.171837] [ 277] 993 277 1569 786 20480
0 0 systemd-network
[ 35.180891] [ 279] 992 279 1729 825 20480
0 0 systemd-resolve
[ 35.189948] [ 283] 0 283 2032 1087 24576
0 0 haveged
[ 35.198312] [ 284] 0 284 810 457 16384
0 0 crond
[ 35.206485] [ 285] 996 285 1175 812 20480
0 -900 dbus-daemon
[ 35.215177] [ 286] 0 286 11786 2558 49152
0 0 NetworkManager
[ 35.224121] [ 287] 0 287 922 174 12288
0 0 klogd
[ 35.232293] [ 288] 0 288 1468 1001 20480
0 0 systemd-logind
[ 35.241247] [ 289] 995 289 1213 791 20480
0 0 avahi-daemon
[ 35.250026] [ 290] 0 290 677 435 16384
0 0 atd
[ 35.258040] [ 302] 0 302 921 420 16384
0 0 syslogd
[ 35.266380] [ 303] 0 303 5638 1558 32768
0 0 thermald
[ 35.274828] [ 305] 995 305 1182 58 20480
0 0 avahi-daemon
[ 35.283659] [ 306] 0 306 594 16 16384
0 0 acpid
[ 35.291848] [ 320] 0 320 1347 334 20480
0 0 systemd-hostnam
[ 35.300906] [ 336] 65534 336 729 32 16384
0 0 dnsmasq
[ 35.309253] [ 337] 0 337 666 443 16384
0 0 agetty
[ 35.317528] [ 338] 0 338 947 710 16384
0 0 login
[ 35.325693] [ 339] 0 339 666 458 16384
0 0 agetty
[ 35.333994] [ 350] 998 350 19521 2816 73728
0 0 polkitd
[ 35.342330] [ 358] 0 358 1892 1149 20480
0 0 systemd
[ 35.350668] [ 359] 0 359 2341 329 20480
0 0 (sd-pam)
[ 35.359093] [ 363] 0 363 971 711 16384
0 0 sh
[ 35.367023] [ 367] 0 367 920 627 20480
0 0 su
[ 35.374937] [ 368] 0 368 971 668 16384
0 0 sh
[ 35.382864] [ 373] 0 373 903 613 16384
0 0 lava-test-runne
[ 35.391897] [ 383] 0 383 903 518 16384
0 0 lava-test-shell
[ 35.400935] [ 384] 0 384 903 612 16384
0 0 sh
[ 35.408847] [ 393] 0 393 1976 1713 20480
0 0 mkfs.ext4
[ 35.417384] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=polkitd,pid=350,uid=998
[ 35.429982] Out of memory: Killed process 350 (polkitd)
total-vm:78084kB, anon-rss:2976kB, file-rss:8288kB, shmem-rss:0kB,
UID:998 pgtables:72kB oom_score_adj:0
[ 35.444646] oom_reaper: reaped process 350 (polkitd), now
anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 35.444648] mkfs.ext4 invoked oom-killer:
gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[ 35.463429] CPU: 0 PID: 393 Comm: mkfs.ext4 Not tainted
5.7.0-rc3-next-20200430 #1
[ 35.470991] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[ 35.478377] Call Trace:
[ 35.480822] dump_stack+0x54/0x6e
[ 35.484139] dump_header+0x3d/0x1c6
[ 35.487634] ? oom_badness.part.0+0x10/0x120
[ 35.491922] ? ___ratelimit+0x8f/0xdc
[ 35.495578] oom_kill_process.cold+0x9/0xe
[ 35.499669] out_of_memory+0x1ab/0x260
[ 35.503414] __alloc_pages_nodemask+0xe0e/0xec0
[ 35.507939] pagecache_get_page+0xae/0x260
Git log of recent changes in fs and mm:
# fs$ git log --oneline ext4 | head
5868dada23f7 ext4: pass the inode to ext4_mpage_readpages
0c855f1fc999 ext4: convert from readpages to readahead
ebc0198b60e9 mm: add page_cache_readahead_unbounded
907ea529fc4c ext4: convert BUG_ON's to WARN_ON's in mballoc.c
a17a9d935dc4 ext4: increase wait time needed before reuse of deleted
inode numbers
648814111af2 ext4: remove set but not used variable 'es' in ext4_jbd2.c
05ca87c149ae ext4: remove set but not used variable 'es'
801674f34ecf ext4: do not zeroout extents beyond i_disksize
9033783c8cfd ext4: fix return-value types in several function comments
d87f639258a6 ext4: use non-movable memory for superblock readahead
# fs/f2fs$ git log --oneline . | head
a4928e314c45 Merge branch 'akpm-current/current'
f1c6758147a8 f2fs: pass the inode to f2fs_mpage_readpages
272e45338126 f2fs: convert from readpages to readahead
ebc0198b60e9 mm: add page_cache_readahead_unbounded
435cbab95e39 f2fs: fix quota_sync failure due to f2fs_lock_op
8b83ac81f428 f2fs: support read iostat
df4233997575 f2fs: Fix the accounting of dcc->undiscard_blks
ce4c638cdd52 f2fs: fix to handle error path of f2fs_ra_meta_pages()
3fa6a8c5b55d f2fs: report the discard cmd errors properly
141af6ba5216 f2fs: fix long latency due to discard during umount
Full test log links:
https://lkft.validation.linaro.org/scheduler/job/1406110#L1223
https://lkft.validation.linaro.org/scheduler/job/1408508#L1250
--
Linaro LKFT
https://lkft.linaro.org
On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <[email protected]> wrote:
> mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> and started happening on linux -next master branch kernel tag next-20200430
> and next-20200501. We did not bisect this problem.
It would be wonderful if you could do so, please. I can't immediately see
any MM change in this area which might cause this.
> metadata
> git branch: master
> git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
> git describe: next-20200430
> make_kernelversion: 5.7.0-rc3
> kernel-config:
> https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
>
> Steps to reproduce: (always reproducible)
Reproducibility helps!
> oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> [ 34.793430] pagecache_get_page+0xae/0x260
> [ 34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0
> [ 34.897923] active_file:4151 inactive_file:212494 isolated_file:0
> [ 34.897923] unevictable:0 dirty:16505 writeback:6520 unstable:0
> [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:1096kB inactive_file:786400kB unevictable:0kB
> writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> local_pcp:500kB free_cma:0kB
ZONE_NORMAL has a huge amount of clean pagecache stuck on the
inactive list, not being reclaimed.
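
To put a number on that observation, here is a minimal Python sketch that parses the quoted ZONE_NORMAL line. The helper name `parse_zone_line` is made up for illustration, and treating "inactive_file minus writepending" as clean page cache is an assumption: it is only a rough estimate, since write-pending pages need not all sit on the inactive list.

```python
import re

# Zone line copied from the i386 report (mail line wraps rejoined).
ZONE_LINE = (
    "Normal free:3948kB min:7732kB low:8640kB high:9548kB "
    "reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB "
    "active_file:1096kB inactive_file:786400kB unevictable:0kB "
    "writepending:65432kB present:884728kB managed:845576kB mlocked:0kB"
)


def parse_zone_line(line):
    """Return {field: value_in_kB} for every 'name:<n>kB' pair in the line."""
    # The log prints both 'kB' and 'KB', so accept either spelling.
    return {name: int(val) for name, val in re.findall(r"(\w+):(\d+)[kK]B", line)}


zone = parse_zone_line(ZONE_LINE)

# Assumption: inactive file pages that are not write-pending are clean.
clean_inactive_kb = zone["inactive_file"] - zone["writepending"]
below_min = zone["free"] < zone["min"]

print(f"free={zone['free']}kB min={zone['min']}kB below_min={below_min}")
print(f"estimated clean inactive page cache: {clean_inactive_kb}kB")
```

Under that assumption, roughly 700MB of the zone looks like reclaimable clean page cache while free memory sits below the min watermark.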
Thanks for looking into this problem.
On Sat, 2 May 2020 at 02:28, Andrew Morton <[email protected]> wrote:
>
> On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <[email protected]> wrote:
>
> > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > and started happening on linux -next master branch kernel tag next-20200430
> > and next-20200501. We did not bisect this problem.
>
> It would be wonderful if you could do so, please. I can't immediately see
> any MM change in this area which might cause this.
We plan to bisect this problem soon.
>
> > metadata
> > git branch: master
> > git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> > git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
> > git describe: next-20200430
> > make_kernelversion: 5.7.0-rc3
> > kernel-config:
> > https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
> >
> > Steps to reproduce: (always reproducible)
>
> Reproducibility helps!
>
> > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
>
> > [ 34.793430] pagecache_get_page+0xae/0x260
>
> > [ 34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0
> > [ 34.897923] active_file:4151 inactive_file:212494 isolated_file:0
> > [ 34.897923] unevictable:0 dirty:16505 writeback:6520 unstable:0
>
> > [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:1096kB inactive_file:786400kB unevictable:0kB
> > writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> > kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> > local_pcp:500kB free_cma:0kB
>
> ZONE_NORMAL has a huge amount of clean pagecache stuck on the
> inactive list, not being reclaimed.
FYI,
This issue has already been reported here.
The problem now also occurs, and is easily reproducible, on i386
and on arm beagleboard x15 devices.
mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190703A01414
mke2fs 1.43.8 (1-Jan-2018)
Discarding device blocks: 4096/29306880
2625536/29306880
9441280/29306880 16257024/29306880
23072768/29306880
done
Creating filesystem with 29306880 4k blocks and 7331840 inodes
Filesystem UUID: a838d994-0a1e-403a-88d5-444d75aecc5a
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: 0/895 done
Writing inode tables: 0/895 done
Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
oom_score_adj=0
[ 31.261172] CPU: 0 PID: 397 Comm: mkfs.ext4 Not tainted
5.7.0-rc6-next-20200518 #1
[ 31.268771] Hardware name: Generic DRA74X (Flattened Device Tree)
[ 31.274904] [<c0411500>] (unwind_backtrace) from [<c040b66c>]
(show_stack+0x10/0x14)
[ 31.282685] [<c040b66c>] (show_stack) from [<c08b1b14>]
(dump_stack+0xc4/0xd8)
[ 31.289940] [<c08b1b14>] (dump_stack) from [<c0547bf8>]
(dump_header+0x54/0x1ec)
[ 31.297367] [<c0547bf8>] (dump_header) from [<c0547008>]
(oom_kill_process+0x18c/0x198)
[ 31.305405] [<c0547008>] (oom_kill_process) from [<c0547a0c>]
(out_of_memory+0x250/0x368)
[ 31.313619] [<c0547a0c>] (out_of_memory) from [<c0599d80>]
(__alloc_pages_nodemask+0xce8/0x10bc)
[ 31.322445] [<c0599d80>] (__alloc_pages_nodemask) from [<c0541bb4>]
(pagecache_get_page+0x128/0x358)
[ 31.331619] [<c0541bb4>] (pagecache_get_page) from [<c0543a8c>]
(grab_cache_page_write_begin+0x18/0x2c)
[ 31.341054] [<c0543a8c>] (grab_cache_page_write_begin) from
[<c0619fb0>] (block_write_begin+0x20/0xc4)
[ 31.350401] [<c0619fb0>] (block_write_begin) from [<c053e718>]
(generic_perform_write+0xb8/0x1d8)
[ 31.359312] [<c053e718>] (generic_perform_write) from [<c054496c>]
(__generic_file_write_iter+0x164/0x1ec)
[ 31.369007] [<c054496c>] (__generic_file_write_iter) from
[<c061c8a4>] (blkdev_write_iter+0xc8/0x1a4)
[ 31.378269] [<c061c8a4>] (blkdev_write_iter) from [<c05d50d0>]
(__vfs_write+0x13c/0x1cc)
[ 31.386397] [<c05d50d0>] (__vfs_write) from [<c05d81d4>]
(vfs_write+0xb0/0x1bc)
[ 31.393738] [<c05d81d4>] (vfs_write) from [<c05d85e4>]
(ksys_pwrite64+0x60/0x8c)
[ 31.401167] [<c05d85e4>] (ksys_pwrite64) from [<c04001a0>]
(ret_fast_syscall+0x0/0x4c)
[ 31.409115] Exception stack(0xe810dfa8 to 0xe810dff0)
[ 31.414185] dfa0: a2000000 0000000d 00000003
b6952008 00400000 00000000
[ 31.422395] dfc0: a2000000 0000000d a2000000 000000b5 00400000
0003b768 b6952008 00da2000
[ 31.430604] dfe0: 00000064 beb891b8 b6f85108 b6e38f2c
[ 31.435809] Mem-Info:
[ 31.438098] active_anon:5813 inactive_anon:4129 isolated_anon:0
[ 31.438098] active_file:6080 inactive_file:118548 isolated_file:0
[ 31.438098] unevictable:0 dirty:13674 writeback:7440 unstable:0
[ 31.438098] slab_reclaimable:5651 slab_unreclaimable:4566
[ 31.438098] mapped:5585 shmem:4468 pagetables:182 bounce:0
[ 31.438098] free:347556 free_pcp:608 free_cma:57235
[ 31.472362] Node 0 active_anon:23252kB inactive_anon:16516kB
active_file:24320kB inactive_file:474192kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:22340kB dirty:54696kB
writeback:11196kB shmem:17872kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:4736kB inactive_file:431688kB unevictable:0kB
writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
local_pcp:216kB free_cma:163840kB
[ 31.531339] lowmem_reserve[]: 0 0 1216 0
[ 31.535289] HighMem free:1203904kB min:512kB low:11592kB
high:22672kB reserved_highatomic:0KB active_anon:23252kB
inactive_anon:16516kB active_file:19584kB inactive_file:42420kB
unevictable:0kB writepending:0kB present:1310720kB managed:1310720kB
mlocked:0kB kernel_stack:0kB pagetables:728kB bounce:0kB
free_pcp:1584kB local_pcp:1232kB free_cma:65100kB
[ 31.566540] lowmem_reserve[]: 0 0 0 0
[ 31.570244] DMA: 87*4kB (UME) 53*8kB (UME) 26*16kB (UE) 6*32kB (UM)
1*64kB (E) 1*128kB (U) 5*256kB (ME) 5*512kB (ME) 4*1024kB (ME)
5*2048kB (M) 1*4096kB (M) 20*8192kB (C) = 187684kB
[ 31.586520] HighMem: 2*4kB (MC) 1*8kB (C) 1*16kB (M) 5*32kB (UM)
4*64kB (UMC) 2*128kB (UM) 2*256kB (UM) 1*512kB (C) 2*1024kB (MC)
2*2048kB (MC) 2*4096kB (UC) 145*8192kB (MC) = 1203904kB
[ 31.603150] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[ 31.611637] 129102 total pagecache pages
[ 31.615577] 0 pages in swap cache
[ 31.618902] Swap cache stats: add 0, delete 0, find 0/0
[ 31.624162] Free swap = 0kB
[ 31.627053] Total swap = 0kB
[ 31.629955] 523520 pages RAM
[ 31.632846] 327680 pages HighMem/MovableOnly
[ 31.637128] 28774 pages reserved
[ 31.640381] 57344 pages cma reserved
[ 31.643971] Tasks state (memory values in pages):
[ 31.648691] [ pid ] uid tgid total_vm rss pgtables_bytes
swapents oom_score_adj name
[ 31.657367] [ 183] 0 183 7370 1082 36864
0 0 systemd-journal
[ 31.666466] [ 209] 994 209 3742 326 40960
0 0 systemd-timesyn
[ 31.675570] [ 217] 0 217 3398 817 32768
0 -1000 systemd-udevd
[ 31.684498] [ 230] 993 230 1411 737 32768
0 0 systemd-network
[ 31.693598] [ 231] 992 231 1496 712 32768
0 0 systemd-resolve
[ 31.702702] [ 236] 996 236 1112 742 24576
0 -900 dbus-daemon
[ 31.711454] [ 241] 0 241 1895 1045 36864
0 0 haveged
[ 31.719857] [ 242] 0 242 1362 906 28672
0 0 systemd-logind
[ 31.728855] [ 243] 0 243 13412 2571 69632
0 0 NetworkManager
[ 31.737867] [ 244] 995 244 1197 608 28672
0 0 avahi-daemon
[ 31.746707] [ 245] 995 245 1164 59 28672
0 0 avahi-daemon
[ 31.755545] [ 246] 0 246 594 332 28672
0 0 atd
[ 31.763601] [ 248] 0 248 699 99 24576
0 0 syslogd
[ 31.772001] [ 251] 0 251 699 102 24576
0 0 klogd
[ 31.780231] [ 252] 0 252 676 365 24576
0 0 crond
[ 31.788443] [ 254] 0 254 1172 240 32768
0 0 systemd-hostnam
[ 31.797547] [ 264] 65534 264 605 32 24576
0 0 dnsmasq
[ 31.805948] [ 265] 0 265 556 357 28672
0 0 agetty
[ 31.814262] [ 266] 0 266 1131 613 32768
0 0 login
[ 31.822492] [ 268] 998 268 18201 2629 81920
0 0 polkitd
[ 31.830895] [ 350] 0 350 1840 1161 32768
0 0 systemd
[ 31.839286] [ 351] 0 351 2403 473 36864
0 0 (sd-pam)
[ 31.847774] [ 355] 0 355 827 611 24576
0 0 sh
[ 31.855742] [ 364] 0 364 7341 1145 53248
0 0 nm-dispatcher
[ 31.864667] [ 377] 0 377 711 510 28672
0 0 lava-test-runne
[ 31.873770] [ 387] 0 387 711 138 20480
0 0 lava-test-shell
[ 31.882869] [ 388] 0 388 711 523 20480
0 0 sh
[ 31.890837] [ 397] 0 397 1785 1518 36864
0 0 mkfs.ext4
[ 31.899397] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=polkitd,pid=268,uid=998
[ 31.910012] Out of memory: Killed process 268 (polkitd)
total-vm:72804kB, anon-rss:2948kB, file-rss:7568kB, shmem-rss:0kB,
UID:998 pgtables:80kB oom_score_adj:0
[ 31.927948] oom_reaper: reaped process 268 (polkitd), now
anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 31.937461] mkfs.ext4 invoked oom-killer:
gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[ 31.947273] CPU: 1 PID: 397 Comm: mkfs.ext4 Not tainted
5.7.0-rc6-next-20200518 #1
[ 31.954871] Hardware name: Generic DRA74X (Flattened Device Tree)
[ 31.961000] [<c0411500>] (unwind_backtrace) from [<c040b66c>]
(show_stack+0x10/0x14)
[ 31.968778] [<c040b66c>] (show_stack) from [<c08b1b14>]
(dump_stack+0xc4/0xd8)
[ 31.976032] [<c08b1b14>] (dump_stack) from [<c0547bf8>]
(dump_header+0x54/0x1ec)
[ 31.983458] [<c0547bf8>] (dump_header) from [<c0547008>]
(oom_kill_process+0x18c/0x198)
[ 31.991495] [<c0547008>] (oom_kill_process) from [<c0547a0c>]
(out_of_memory+0x250/0x368)
[ 31.999706] [<c0547a0c>] (out_of_memory) from [<c0599d80>]
(__alloc_pages_nodemask+0xce8/0x10bc)
[ 32.008532] [<c0599d80>] (__alloc_pages_nodemask) from [<c0541bb4>]
(pagecache_get_page+0x128/0x358)
[ 32.017704] [<c0541bb4>] (pagecache_get_page) from [<c0543a8c>]
(grab_cache_page_write_begin+0x18/0x2c)
[ 32.027138] [<c0543a8c>] (grab_cache_page_write_begin) from
[<c0619fb0>] (block_write_begin+0x20/0xc4)
[ 32.036484] [<c0619fb0>] (block_write_begin) from [<c053e718>]
(generic_perform_write+0xb8/0x1d8)
[ 32.045395] [<c053e718>] (generic_perform_write) from [<c054496c>]
(__generic_file_write_iter+0x164/0x1ec)
[ 32.055090] [<c054496c>] (__generic_file_write_iter) from
[<c061c8a4>] (blkdev_write_iter+0xc8/0x1a4)
[ 32.064350] [<c061c8a4>] (blkdev_write_iter) from [<c05d50d0>]
(__vfs_write+0x13c/0x1cc)
[ 32.072476] [<c05d50d0>] (__vfs_write) from [<c05d81d4>]
(vfs_write+0xb0/0x1bc)
[ 32.079814] [<c05d81d4>] (vfs_write) from [<c05d85e4>]
(ksys_pwrite64+0x60/0x8c)
[ 32.087241] [<c05d85e4>] (ksys_pwrite64) from [<c04001a0>]
(ret_fast_syscall+0x0/0x4c)
[ 32.095187] Exception stack(0xe810dfa8 to 0xe810dff0)
[ 32.100256] dfa0: a2000000 0000000d 00000003
b6952008 00400000 00000000
[ 32.108466] dfc0: a2000000 0000000d a2000000 000000b5 00400000
0003b768 b6952008 00da2000
[ 32.116673] dfe0: 00000064 beb891b8 b6f85108 b6e38f2c
[ 32.121786] Mem-Info:
[ 32.124070] active_anon:5056 inactive_anon:4129 isolated_anon:0
[ 32.124070] active_file:6289 inactive_file:118790 isolated_file:0
[ 32.124070] unevictable:0 dirty:14118 writeback:6 unstable:0
[ 32.124070] slab_reclaimable:5653 slab_unreclaimable:4209
[ 32.124070] mapped:4839 shmem:4468 pagetables:165 bounce:0
[ 32.124070] free:348249 free_pcp:562 free_cma:57235
[ 32.158031] Node 0 active_anon:20224kB inactive_anon:16516kB
active_file:25156kB inactive_file:475160kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:19356kB dirty:56472kB
writeback:24kB shmem:17872kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[ 32.186324] DMA free:186320kB min:22528kB low:28160kB high:33792kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:4736kB inactive_file:433580kB unevictable:0kB
writepending:56468kB present:783360kB managed:668264kB mlocked:0kB
kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:420kB
local_pcp:220kB free_cma:163840kB
[ 32.216693] lowmem_reserve[]: 0 0 1216 0
[ 32.220652] HighMem free:1206676kB min:512kB low:11592kB
high:22672kB reserved_highatomic:0KB active_anon:20224kB
inactive_anon:16516kB active_file:20420kB inactive_file:41584kB
unevictable:0kB writepending:0kB present:1310720kB managed:1310720kB
mlocked:0kB kernel_stack:0kB pagetables:660kB bounce:0kB
free_pcp:1816kB local_pcp:340kB free_cma:65100kB
[ 32.251805] lowmem_reserve[]: 0 0 0 0
[ 32.255482] DMA: 2*4kB (UM) 3*8kB (UME) 1*16kB (U) 1*32kB (M)
0*64kB 1*128kB (U) 5*256kB (ME) 5*512kB (ME) 4*1024kB (ME) 5*2048kB
(M) 1*4096kB (M) 20*8192kB (C) = 186320kB
[ 32.270871] HighMem: 183*4kB (UMC) 65*8kB (UMC) 21*16kB (M) 11*32kB
(UM) 6*64kB (UMC) 3*128kB (UM) 3*256kB (UM) 2*512kB (MC) 2*1024kB (MC)
2*2048kB (MC) 2*4096kB (UC) 145*8192kB (MC) = 1206676kB
[ 32.288273] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[ 32.296751] 129546 total pagecache pages
[ 32.300695] 0 pages in swap cache
[ 32.304019] Swap cache stats: add 0, delete 0, find 0/0
[ 32.309260] Free swap = 0kB
[ 32.312155] Total swap = 0kB
[ 32.315045] 523520 pages RAM
[ 32.317932] 327680 pages HighMem/MovableOnly
[ 32.322221] 28774 pages reserved
[ 32.325457] 57344 pages cma reserved
[ 32.329043] Tasks state (memory values in pages):
[ 32.333771] [ pid ] uid tgid total_vm rss pgtables_bytes
swapents oom_score_adj name
[ 32.342436] [ 183] 0 183 7370 1082 36864
0 0 systemd-journal
[ 32.351529] [ 209] 994 209 3742 326 40960
0 0 systemd-timesyn
[ 32.360620] [ 217] 0 217 3398 817 32768
0 -1000 systemd-udevd
[ 32.369528] [ 230] 993 230 1411 737 32768
0 0 systemd-network
[ 32.378620] [ 231] 992 231 1496 712 32768
0 0 systemd-resolve
[ 32.387713] [ 236] 996 236 1112 742 24576
0 -900 dbus-daemon
[ 32.396456] [ 241] 0 241 1895 1045 36864
0 0 haveged
[ 32.404850] [ 242] 0 242 1362 906 28672
0 0 systemd-logind
[ 32.413852] [ 243] 0 243 13412 2571 69632
0 0 NetworkManager
[ 32.422858] [ 244] 995 244 1197 608 28672
0 0 avahi-daemon
[ 32.431687] [ 245] 995 245 1164 59 28672
0 0 avahi-daemon
[ 32.440518] [ 246] 0 246 594 332 28672
0 0 atd
[ 32.448553] [ 248] 0 248 699 99 24576
0 0 syslogd
[ 32.456945] [ 251] 0 251 699 102 24576
0 0 klogd
[ 32.465171] [ 252] 0 252 676 365 24576
0 0 crond
[ 32.473390] [ 254] 0 254 1172 240 32768
0 0 systemd-hostnam
[ 32.482481] [ 264] 65534 264 605 32 24576
0 0 dnsmasq
[ 32.490876] [ 265] 0 265 556 357 28672
0 0 agetty
[ 32.499175] [ 266] 0 266 1131 613 32768
0 0 login
[ 32.507394] [ 350] 0 350 1840 1161 32768
0 0 systemd
[ 32.515788] [ 351] 0 351 2403 473 36864
0 0 (sd-pam)
[ 32.524268] [ 355] 0 355 827 611 24576
0 0 sh
[ 32.532227] [ 364] 0 364 7341 1145 53248
0 0 nm-dispatcher
[ 32.541142] [ 377] 0 377 711 510 28672
0 0 lava-test-runne
[ 32.550234] [ 387] 0 387 711 138 20480
0 0 lava-test-shell
[ 32.559316] [ 388] 0 388 711 523 20480
0 0 sh
[ 32.567273] [ 397] 0 397 1785 1518 36864
0 0 mkfs.ext4
ref:
https://lkft.validation.linaro.org/scheduler/job/1436647#L4261
https://lkft.validation.linaro.org/scheduler/job/1436562#L1247
--
Linaro LKFT
https://lkft.linaro.org
On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> Thanks for looking into this problem.
>
> On Sat, 2 May 2020 at 02:28, Andrew Morton <[email protected]> wrote:
> >
> > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <[email protected]> wrote:
> >
> > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > and started happening on linux -next master branch kernel tag next-20200430
> > > and next-20200501. We did not bisect this problem.
[...]
> Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked
> oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> oom_score_adj=0
[...]
> [ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:4736kB inactive_file:431688kB unevictable:0kB
> writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> local_pcp:216kB free_cma:163840kB
This is really unexpected. You say this is a regular i386, where the DMA
zone should be the bottom 16MB, yet yours is 780MB, and the Normal zone,
which should hold the rest of lowmem, is missing entirely. How did you get
to that configuration? I have to say I haven't seen anything like that
on i386.
The failing request is GFP_USER, so highmem is not allowed, but free
pages are well above the watermarks, so the allocation should simply
have succeeded.
--
Michal Hocko
SUSE Labs
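
Both points in the reply above (no highmem for this request, free pages above the watermarks) can be checked against the quoted numbers with a small Python sketch. The flag bit values are an assumption taken from include/linux/gfp.h of 5.x kernels and differ between kernel versions. `above_watermark` is a deliberately simplified, hypothetical model; the real allocator check has more terms.

```python
# Flag bits assumed from include/linux/gfp.h in 5.x kernels (version dependent).
GFP_BITS = {
    0x02: "__GFP_HIGHMEM",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x400: "__GFP_DIRECT_RECLAIM",
    0x800: "__GFP_KSWAPD_RECLAIM",
    0x1000: "__GFP_WRITE",
    0x100000: "__GFP_HARDWALL",
}


def decode_gfp(mask):
    """Return (names of known flags set in mask, bits not covered by the table)."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]
    leftover = mask & ~sum(GFP_BITS)
    return names, leftover


def above_watermark(free_kb, mark_kb, reserve_kb=0):
    """Simplified model: free memory must exceed watermark plus reserve (in kB)."""
    return free_kb > mark_kb + reserve_kb


names, leftover = decode_gfp(0x101CC0)  # gfp_mask from the OOM report
print(names, hex(leftover))

# DMA zone of the ARM report: free:187396kB min:22528kB
print(above_watermark(187396, 22528))
# Normal zone of the i386 report: free:3948kB min:7732kB
print(above_watermark(3948, 7732))
```

With these assumed bit values the mask decodes without __GFP_HIGHMEM and with no unexplained bits. The ARM DMA zone passes the simplified check, while the i386 Normal zone does not.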
On Tue, May 19, 2020 at 9:52 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> > Thanks for looking into this problem.
> >
> > On Sat, 2 May 2020 at 02:28, Andrew Morton <[email protected]> wrote:
> > >
> > > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <[email protected]> wrote:
> > >
> > > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > > and started happening on linux -next master branch kernel tag next-20200430
> > > > and next-20200501. We did not bisect this problem.
> [...]
> > Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked
> > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> > oom_score_adj=0
> [...]
> > [ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:4736kB inactive_file:431688kB unevictable:0kB
> > writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> > kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> > local_pcp:216kB free_cma:163840kB
>
> This is really unexpected. You are saying this is a regular i386 and DMA
> should be bottom 16MB while yours is 780MB and the rest of the low mem
> is in the Normal zone which is completely missing here. How have you got
> to that configuration? I have to say I haven't seen anything like that
> on i386.
I think that line comes from an ARM32 beaglebone-X15 machine showing
the same symptom. The i386 line from the log file that Naresh linked to at
https://lkft.validation.linaro.org/scheduler/job/1406110#L1223 is less
unusual:
[ 34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
active_file:16604kB inactive_file:849976kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
[ 34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:11964kB unevictable:0kB
writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[ 34.983385] lowmem_reserve[]: 0 825 1947 825
[ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1096kB inactive_file:786400kB unevictable:0kB
writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
local_pcp:500kB free_cma:0kB
[ 35.017427] lowmem_reserve[]: 0 0 8980 0
[ 35.021362] HighMem free:1049496kB min:512kB low:1748kB high:2984kB
reserved_highatomic:0KB active_anon:21464kB inactive_anon:8688kB
active_file:15508kB inactive_file:51612kB unevictable:0kB
writepending:0kB present:1149540kB managed:1149540kB mlocked:0kB
kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:1524kB
local_pcp:292kB free_cma:0kB
[ 35.051717] lowmem_reserve[]: 0 0 0 0
[ 35.055374] DMA: 8*4kB (UE) 1*8kB (E) 1*16kB (E) 0*32kB 0*64kB
0*128kB 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB =
3384kB
[ 35.067446] Normal: 27*4kB (U) 23*8kB (U) 12*16kB (UE) 12*32kB (U)
4*64kB (UE) 2*128kB (U) 2*256kB (UE) 1*512kB (E) 0*1024kB 1*2048kB (U)
0*4096kB = 4452kB
[ 35.081347] HighMem: 2*4kB (UM) 0*8kB 1*16kB (M) 2*32kB (UM) 1*64kB
(U) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 256*4096kB (M) =
1049496kB
Arnd
On Tue 19-05-20 10:11:25, Arnd Bergmann wrote:
> On Tue, May 19, 2020 at 9:52 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> > > Thanks for looking into this problem.
> > >
> > > On Sat, 2 May 2020 at 02:28, Andrew Morton <[email protected]> wrote:
> > > >
> > > > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <[email protected]> wrote:
> > > >
> > > > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > > > and started happening on linux -next master branch kernel tag next-20200430
> > > > > and next-20200501. We did not bisect this problem.
> > [...]
> > > Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked
> > > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> > > oom_score_adj=0
> > [...]
> > > [ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> > > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > > active_file:4736kB inactive_file:431688kB unevictable:0kB
> > > writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> > > kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> > > local_pcp:216kB free_cma:163840kB
> >
> > This is really unexpected. You are saying this is a regular i386 and DMA
> > should be bottom 16MB while yours is 780MB and the rest of the low mem
> > is in the Normal zone which is completely missing here. How have you got
> > to that configuration? I have to say I haven't seen anything like that
> > on i386.
>
> I think that line comes from an ARM32 beaglebone-X15 machine showing
> the same symptom. The i386 line from the log file that Naresh linked to at
> https://lkft.validation.linaro.org/scheduler/job/1406110#L1223 is less
> unusual:
OK, that makes more sense! At least for the memory layout.
> [ 34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
> active_file:16604kB inactive_file:849976kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
> writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
> all_unreclaimable? yes
> [ 34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:11964kB unevictable:0kB
> writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
> kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> free_cma:0kB
> [ 34.983385] lowmem_reserve[]: 0 825 1947 825
> [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:1096kB inactive_file:786400kB unevictable:0kB
> writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> local_pcp:500kB free_cma:0kB
The lowmem is really low (way below the min watermark), so even memory
reserves for high-priority and atomic requests are depleted. There is
still 786MB of inactive page cache to be reclaimed. It doesn't seem to
be dirty or under writeback, but it still might be pinned by the
filesystem. I would suggest watching the vmscan reclaim tracepoints to
check why reclaim fails to reclaim anything.
--
Michal Hocko
SUSE Labs
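The watermark observation above can be checked by hand from the quoted
report. The following is only a simplified sketch, not the kernel's actual
check: the struct and function names are hypothetical, the page size is
assumed to be 4kB, and the request is assumed to be limited to the DMA and
Normal zones because GFP_USER does not include __GFP_HIGHMEM. The reserve
values are the second entries of the lowmem_reserve[] lines printed after
each zone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_KB 4L /* assumed page size in kB, matching the "4kB" buddy entries */

/* Hypothetical container for the per-zone numbers printed in the OOM report. */
struct zone_report {
    const char *name;
    long free_kb;       /* "free:" value */
    long min_kb;        /* "min:" watermark */
    long reserve_pages; /* lowmem_reserve[] entry for a Normal-class request */
};

/*
 * Simplified min-watermark test: a zone can serve the request only if its
 * free pages exceed the min watermark plus the lowmem reserve that shields
 * the zone from requests able to use a higher zone. The real kernel check
 * applies further adjustments that are ignored here.
 */
static bool zone_above_min(const struct zone_report *z)
{
    assert(z != NULL);
    long free_pages = z->free_kb / PAGE_KB;
    long min_pages = z->min_kb / PAGE_KB;
    return free_pages > min_pages + z->reserve_pages;
}

/* Values copied from the i386 report quoted above. */
static const struct zone_report dma_zone = { "DMA", 3356, 68, 825 };
static const struct zone_report normal_zone = { "Normal", 3948, 7732, 0 };
```

With these numbers neither zone passes: DMA has 839 free pages against a
threshold of 17 + 825 = 842, and Normal has 987 free pages against 1933.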
On Wed, 20 May 2020 at 17:26, Naresh Kamboju <[email protected]> wrote:
>
>
> This issue is specific on 32-bit architectures i386 and arm on linux-next tree.
> As per the test results history this problem started happening from
> Bad : next-20200430
> Good : next-20200429
>
> steps to reproduce:
> dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573
> of=/dev/null bs=1M count=2048
> or
> mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
>
>
> Problem:
> [ 38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> order=0, oom_score_adj=0
As part of the investigation of this issue, LKFT teammate Anders Roxell
git bisected the problem and found the bad commit(s) that caused it.
The following two patches were reverted on next-20200519; we then reran the
reproduction steps and confirmed that the mkfs -t ext4 test case PASSes
(the oom-killer is no longer invoked).
Revert "mm, memcg: avoid stale protection values when cgroup is above
protection"
This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection
checks"
This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
i386 test log shows mkfs -t ext4 pass
https://lkft.validation.linaro.org/scheduler/job/1443405#L1200
ref:
https://lore.kernel.org/linux-mm/[email protected]/
https://lore.kernel.org/linux-mm/CA+G9fYvzLm7n1BE7AJXd8_49fOgPgWWTiQ7sXkVre_zoERjQKg@mail.gmail.com/T/#t
--
Linaro LKFT
https://lkft.linaro.org
Hi Naresh,
Naresh Kamboju writes:
>As a part of investigation on this issue LKFT teammate Anders Roxell
>git bisected the problem and found bad commit(s) which caused this problem.
>
>The following two patches have been reverted on next-20200519 and retested the
>reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
>( invoked oom-killer is gone now)
>
>Revert "mm, memcg: avoid stale protection values when cgroup is above
>protection"
> This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
>
>Revert "mm, memcg: decouple e{low,min} state mutations from protection
>checks"
> This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in
either of those commits from a (very) cursory glance, but they should only be
taking effect if protections are set.
Since you have i386 hardware available, and I don't, could you please apply
only "avoid stale protection" again and check if it only happens with that
commit, or requires both? That would help narrow down the suspects.
Do you use any memcg protections in these tests?
Thank you!
Chris
On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju
<[email protected]> wrote:
>
> On Wed, 20 May 2020 at 17:26, Naresh Kamboju <[email protected]> wrote:
> >
> >
> > This issue is specific on 32-bit architectures i386 and arm on linux-next tree.
> > As per the test results history this problem started happening from
> > Bad : next-20200430
> > Good : next-20200429
> >
> > steps to reproduce:
> > dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573
> > of=/dev/null bs=1M count=2048
> > or
> > mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
> >
> >
> > Problem:
> > [ 38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> > order=0, oom_score_adj=0
>
> As a part of investigation on this issue LKFT teammate Anders Roxell
> git bisected the problem and found bad commit(s) which caused this problem.
>
> The following two patches have been reverted on next-20200519 and retested the
> reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> ( invoked oom-killer is gone now)
>
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
> This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
>
> Revert "mm, memcg: decouple e{low,min} state mutations from protection
> checks"
> This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
>
My guess is that we made the same mistake in commit "mm, memcg:
decouple e{low,min} state mutations from protection
checks", in that it reads a stale memcg protection in
mem_cgroup_below_low() and mem_cgroup_below_min().
Below is a possible fix:
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a2c56fc..6591b71 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -391,20 +391,28 @@ static inline unsigned long
mem_cgroup_protection(struct mem_cgroup *root,
void mem_cgroup_calculate_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg);
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
{
if (mem_cgroup_disabled())
return false;
+ if (root == memcg)
+ return false;
+
return READ_ONCE(memcg->memory.elow) >=
page_counter_read(&memcg->memory);
}
-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
{
if (mem_cgroup_disabled())
return false;
+ if (root == memcg)
+ return false;
+
return READ_ONCE(memcg->memory.emin) >=
page_counter_read(&memcg->memory);
}
@@ -896,12 +904,14 @@ static inline void
mem_cgroup_calculate_protection(struct mem_cgroup *root,
{
}
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
{
return false;
}
-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
{
return false;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c71660e..fdcdd88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2637,13 +2637,13 @@ static void shrink_node_memcgs(pg_data_t
*pgdat, struct scan_control *sc)
mem_cgroup_calculate_protection(target_memcg, memcg);
- if (mem_cgroup_below_min(memcg)) {
+ if (mem_cgroup_below_min(target_memcg, memcg)) {
/*
* Hard protection.
* If there is no reclaimable memory, OOM.
*/
continue;
- } else if (mem_cgroup_below_low(memcg)) {
+ } else if (mem_cgroup_below_low(target_memcg, memcg)) {
/*
* Soft protection.
* Respect the protection only as long as
> i386 test log shows mkfs -t ext4 pass
> https://lkft.validation.linaro.org/scheduler/job/1443405#L1200
>
> ref:
> https://lore.kernel.org/linux-mm/[email protected]/
> https://lore.kernel.org/linux-mm/CA+G9fYvzLm7n1BE7AJXd8_49fOgPgWWTiQ7sXkVre_zoERjQKg@mail.gmail.com/T/#t
>
> --
> Linaro LKFT
> https://lkft.linaro.org
--
Thanks
Yafang
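To make the intent of the proposed change easier to follow, here is a
minimal userspace model of the two patched helpers. It is only a sketch
under stated assumptions: struct memcg_model and the model_* names are
hypothetical stand-ins for struct mem_cgroup and the real helpers, and the
mem_cgroup_disabled() test and READ_ONCE() from the patch are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few struct mem_cgroup fields used here. */
struct memcg_model {
    unsigned long elow;  /* effective memory.low protection, in pages */
    unsigned long emin;  /* effective memory.min protection, in pages */
    unsigned long usage; /* current charge, as page_counter_read() would give */
};

/* Models the patched mem_cgroup_below_low(root, memcg). */
static bool model_below_low(const struct memcg_model *root,
                            const struct memcg_model *memcg)
{
    assert(memcg != NULL);
    /* The reclaim target itself is never treated as protected. */
    if (root == memcg)
        return false;
    return memcg->elow >= memcg->usage;
}

/* Models the patched mem_cgroup_below_min(root, memcg). */
static bool model_below_min(const struct memcg_model *root,
                            const struct memcg_model *memcg)
{
    assert(memcg != NULL);
    if (root == memcg)
        return false;
    return memcg->emin >= memcg->usage;
}
```

In this model, a cgroup with elow=100 and usage=80 counts as below low
when reclaim targets its parent, but not when the cgroup is itself the
reclaim target.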
On Thu, 21 May 2020 at 08:10, Yafang Shao <[email protected]> wrote:
>
> On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju
> <[email protected]> wrote:
> >
> > On Wed, 20 May 2020 at 17:26, Naresh Kamboju <[email protected]> wrote:
> > >
> > >
> > > This issue is specific on 32-bit architectures i386 and arm on linux-next tree.
> > > As per the test results history this problem started happening from
> > > mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
> > >
> > >
> > > Problem:
> > > [ 38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> > > order=0, oom_score_adj=0
> >
> My guess is that we made the same mistake in commit "mm, memcg:
> decouple e{low,min} state mutations from protection
> checks" that it read a stale memcg protection in
> mem_cgroup_below_low() and mem_cgroup_below_min().
>
> Below is a possible fix:
Sorry, the proposed fix did not work.
I applied your patch on top of the linux-next master branch and tested it;
mkfs -t ext4 still invoked the oom-killer.
Test log link with the patch applied:
https://lkft.validation.linaro.org/scheduler/job/1443936#L1168
Test log:
+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: ab107250-bf18-4357-a06a-67f2bfcc1048
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables: 0/7453 done
Writing inode tables: 0/7453 done
Creating journal (262144 blocks): [ 34.423940] mkfs.ext4 invoked
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
oom_score_adj=0
[ 34.433694] CPU: 0 PID: 402 Comm: mkfs.ext4 Not tainted
5.7.0-rc6-next-20200519+ #1
[ 34.441342] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[ 34.448734] Call Trace:
[ 34.451196] dump_stack+0x54/0x76
[ 34.454517] dump_header+0x40/0x1f0
[ 34.458008] ? oom_badness+0x1f/0x120
[ 34.461673] ? ___ratelimit+0x6c/0xe0
[ 34.465332] oom_kill_process+0xc9/0x110
[ 34.469255] out_of_memory+0xd7/0x2f0
[ 34.472916] __alloc_pages_nodemask+0xdd1/0xe90
[ 34.477446] ? set_bh_page+0x33/0x50
[ 34.481016] ? __xa_set_mark+0x4d/0x70
[ 34.484762] pagecache_get_page+0xbe/0x250
[ 34.488859] grab_cache_page_write_begin+0x1a/0x30
[ 34.493645] block_write_begin+0x25/0x90
[ 34.497569] blkdev_write_begin+0x1e/0x20
[ 34.501574] ? bdev_evict_inode+0xc0/0xc0
[ 34.505578] generic_perform_write+0x95/0x190
[ 34.509927] __generic_file_write_iter+0xe0/0x1a0
[ 34.514626] blkdev_write_iter+0xbf/0x1c0
[ 34.518630] __vfs_write+0x122/0x1e0
[ 34.522200] vfs_write+0x8f/0x1b0
[ 34.525510] ksys_pwrite64+0x60/0x80
[ 34.529081] __ia32_sys_ia32_pwrite64+0x16/0x20
[ 34.533604] do_fast_syscall_32+0x66/0x240
[ 34.537697] entry_SYSENTER_32+0xa5/0xf8
[ 34.541613] EIP: 0xb7f3c549
[ 34.544403] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
8d 76
[ 34.563140] EAX: ffffffda EBX: 00000003 ECX: b7830010 EDX: 00400000
[ 34.569397] ESI: 38400000 EDI: 00000074 EBP: 07438400 ESP: bff1e650
[ 34.575654] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[ 34.582453] Mem-Info:
[ 34.584732] active_anon:5713 inactive_anon:2169 isolated_anon:0
[ 34.584732] active_file:4040 inactive_file:211204 isolated_file:0
[ 34.584732] unevictable:0 dirty:17270 writeback:6240 unstable:0
[ 34.584732] slab_reclaimable:5856 slab_unreclaimable:3439
[ 34.584732] mapped:6192 shmem:2258 pagetables:178 bounce:0
[ 34.584732] free:265105 free_pcp:1330 free_cma:0
[ 34.618483] Node 0 active_anon:22852kB inactive_anon:8676kB
active_file:16160kB inactive_file:844816kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:24768kB dirty:69080kB
writeback:19628kB shmem:9032kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
[ 34.642354] DMA free:3588kB min:68kB low:84kB high:100kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:11848kB unevictable:0kB
writepending:11856kB present:15964kB managed:15876kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[ 34.670194] lowmem_reserve[]: 0 824 1947 824
[ 34.674483] Normal free:4228kB min:3636kB low:4544kB high:5452kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1136kB inactive_file:786456kB unevictable:0kB
writepending:68084kB present:884728kB managed:845324kB mlocked:0kB
kernel_stack:1104kB pagetables:0kB bounce:0kB free_pcp:3056kB
local_pcp:388kB free_cma:0kB
[ 34.704243] lowmem_reserve[]: 0 0 8980 0
[ 34.708189] HighMem free:1053028kB min:512kB low:1748kB high:2984kB
reserved_highatomic:0KB active_anon:22852kB inactive_anon:8676kB
active_file:15024kB inactive_file:46596kB unevictable:0kB
writepending:0kB present:1149544kB managed:1149544kB mlocked:0kB
kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:2160kB
local_pcp:736kB free_cma:0kB
[ 34.738563] lowmem_reserve[]: 0 0 0 0
[ 34.742245] DMA: 23*4kB (U) 2*8kB (U) 3*16kB (U) 2*32kB (UE) 2*64kB
(U) 1*128kB (U) 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB
= 3804kB
[ 34.755479] Normal: 25*4kB (UM) 27*8kB (UME) 16*16kB (UME) 14*32kB
(UME) 7*64kB (UME) 2*128kB (UM) 1*256kB (E) 1*512kB (E) 0*1024kB
1*2048kB (M) 0*4096kB = 4540kB
[ 34.770004] HighMem: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (M)
2*128kB (UM) 2*256kB (UM) 1*512kB (U) 1*1024kB (U) 1*2048kB (U)
256*4096kB (M) = 1053028kB
[ 34.784010] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=4096kB
[ 34.792466] 217507 total pagecache pages
[ 34.796387] 0 pages in swap cache
[ 34.799704] Swap cache stats: add 0, delete 0, find 0/0
[ 34.804923] Free swap = 0kB
[ 34.807834] Total swap = 0kB
[ 34.810738] 512559 pages RAM
[ 34.813640] 287386 pages HighMem/MovableOnly
[ 34.817931] 9873 pages reserved
- Naresh
On Thu, 21 May 2020 at 00:39, Chris Down <[email protected]> wrote:
>
> Hi Naresh,
>
> Naresh Kamboju writes:
> >As a part of investigation on this issue LKFT teammate Anders Roxell
> >git bisected the problem and found bad commit(s) which caused this problem.
> >
> >The following two patches have been reverted on next-20200519 and retested the
> >reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> >( invoked oom-killer is gone now)
> >
> >Revert "mm, memcg: avoid stale protection values when cgroup is above
> >protection"
> > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> >
> >Revert "mm, memcg: decouple e{low,min} state mutations from protection
> >checks"
> > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
>
> Thanks Anders and Naresh for tracking this down and reverting.
>
> I'll take a look tomorrow. I don't see anything immediately obviously wrong in
> either of those commits from a (very) cursory glance, but they should only be
> taking effect if protections are set.
>
> Since you have i386 hardware available, and I don't, could you please apply
> only "avoid stale protection" again and check if it only happens with that
> commit, or requires both? That would help narrow down the suspects.
Not both.
The bad commit is
"mm, memcg: decouple e{low,min} state mutations from protection checks"
>
> Do you use any memcg protections in these tests?
I see three MEMCG config options enabled; please see the kernel config link
for more details.
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
kernel config link,
https://builds.tuxbuild.com/8lg6WQibcwtQRRtIa0bcFA/kernel.config
- Naresh
On Thu, May 21, 2020 at 11:22 AM Naresh Kamboju
<[email protected]> wrote:
> On Thu, 21 May 2020 at 00:39, Chris Down <[email protected]> wrote:
> > Since you have i386 hardware available, and I don't, could you please apply
> > only "avoid stale protection" again and check if it only happens with that
> > commit, or requires both? That would help narrow down the suspects.
Note that Naresh is running an i386 kernel on regular 64-bit hardware that
most people have access to.
> kernel config link,
> https://builds.tuxbuild.com/8lg6WQibcwtQRRtIa0bcFA/kernel.config
Do you know if the same bug shows up running a kernel with that
configuration in qemu? I would expect it to, and that would make
it much easier to reproduce.
I would also not be surprised if it happens on all architectures but only
shows up on the 32-bit arm and x86 machines first because they have
a rather limited amount of lowmem. Maybe booting a 64-bit kernel
with "mem=512M" and then running "dd if=/dev/sda of=/dev/null bs=1M"
will also trigger it. I did not attempt to run this myself.
Arnd
On Wed 20-05-20 20:09:06, Chris Down wrote:
> Hi Naresh,
>
> Naresh Kamboju writes:
> > As a part of investigation on this issue LKFT teammate Anders Roxell
> > git bisected the problem and found bad commit(s) which caused this problem.
> >
> > The following two patches have been reverted on next-20200519 and retested the
> > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > ( invoked oom-killer is gone now)
> >
> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > protection"
> > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> >
> > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > checks"
> > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
>
> Thanks Anders and Naresh for tracking this down and reverting.
>
> I'll take a look tomorrow. I don't see anything immediately obviously wrong
> in either of those commits from a (very) cursory glance, but they should
> only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be
effectively a nop. Btw. do you see the problem when booting with
cgroup_disable=memory kernel command line parameter?
I suspect that something might be initialized for memcg incorrectly and
the patch just makes it more visible for some reason.
--
Michal Hocko
SUSE Labs
On Thu, 21 May 2020 at 15:25, Michal Hocko <[email protected]> wrote:
>
> On Wed 20-05-20 20:09:06, Chris Down wrote:
> > Hi Naresh,
> >
> > Naresh Kamboju writes:
> > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > git bisected the problem and found bad commit(s) which caused this problem.
> > >
> > > The following two patches have been reverted on next-20200519 and retested the
> > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > ( invoked oom-killer is gone now)
> > >
> > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > protection"
> > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > >
> > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > checks"
> > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> >
> > Thanks Anders and Naresh for tracking this down and reverting.
> >
> > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > in either of those commits from a (very) cursory glance, but they should
> > only be taking effect if protections are set.
>
> Agreed. If memory.{low,min} is not used then the patch should be
> effectively a nop. Btw. do you see the problem when booting with
> cgroup_disable=memory kernel command line parameter?
With the extra kernel command line parameter cgroup_disable=memory,
I have noticed a different problem now.
+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables: 0/7453 done
Writing inode tables: 0/7453 done
Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL
pointer dereference, address: 000000c8
[ 35.508372] #PF: supervisor read access in kernel mode
[ 35.513506] #PF: error_code(0x0000) - not-present page
[ 35.518638] *pde = 00000000
[ 35.521514] Oops: 0000 [#1] SMP
[ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
5.7.0-rc6-next-20200519+ #1
[ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[ 35.544724] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[ 35.563461] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000
[ 35.569718] ESI: 00000000 EDI: f4e13ea8 EBP: f4e13e10 ESP: f4e13e04
[ 35.575976] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207
[ 35.582751] CR0: 80050033 CR2: 000000c8 CR3: 0bef4000 CR4: 003406d0
[ 35.589010] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 35.595266] DR6: fffe0ff0 DR7: 00000400
[ 35.599096] Call Trace:
[ 35.601544] shrink_lruvec+0x447/0x630
[ 35.605294] ? newidle_balance.isra.100+0x8e/0x3f0
[ 35.610080] ? pick_next_task_fair+0x3a/0x320
[ 35.614437] ? deactivate_task+0xcf/0x100
[ 35.618442] ? put_prev_entity+0x1a/0xd0
[ 35.622359] ? deactivate_task+0xcf/0x100
[ 35.626363] shrink_node+0x1be/0x640
[ 35.629932] ? shrink_node+0x1be/0x640
[ 35.633676] kswapd+0x32c/0x890
[ 35.636815] ? deactivate_task+0xcf/0x100
[ 35.640820] kthread+0xf1/0x110
[ 35.643963] ? do_try_to_free_pages+0x3b0/0x3b0
[ 35.648489] ? kthread_park+0xa0/0xa0
[ 35.652147] ret_from_fork+0x1c/0x28
[ 35.655726] Modules linked in: x86_pkg_temp_thermal
[ 35.660605] CR2: 00000000000000c8
[ 35.663916] ---[ end trace d85b8564ea55fb0d ]---
[ 35.663917] BUG: kernel NULL pointer dereference, address: 000000c8
[ 35.663918] #PF: supervisor read access in kernel mode
[ 35.668534] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[ 35.674792] #PF: error_code(0x0000) - not-present page
[ 35.674792] *pde = 00000000
[ 35.679921] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[ 35.685140] Oops: 0000 [#2] SMP
[ 35.685142] CPU: 2 PID: 391 Comm: mkfs.ext4 Tainted: G D
5.7.0-rc6-next-20200519+ #1
[ 35.690278] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000
[ 35.690279] ESI: 00000000 EDI: f4e13ea8 EBP: f4e13e10 ESP: f4e13e04
[ 35.693155] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[ 35.693158] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[ 35.711893] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207
[ 35.711894] CR0: 80050033 CR2: 000000c8 CR3: 0bef4000 CR4: 003406d0
[ 35.715031] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[ 35.724061] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 35.730317] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000
[ 35.730318] ESI: 00000000 EDI: f2d73c14 EBP: f2d73b78 ESP: f2d73b6c
[ 35.736576] DR6: fffe0ff0 DR7: 00000400
[ 35.803603] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207
[ 35.810380] CR0: 80050033 CR2: 000000c8 CR3: 33241000 CR4: 003406d0
[ 35.816636] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 35.822893] DR6: fffe0ff0 DR7: 00000400
[ 35.826725] Call Trace:
[ 35.829171] shrink_lruvec+0x447/0x630
[ 35.832921] ? check_preempt_curr+0x75/0x80
[ 35.837100] shrink_node+0x1be/0x640
[ 35.840670] ? shrink_node+0x1be/0x640
[ 35.844412] do_try_to_free_pages+0xc1/0x3b0
[ 35.848677] try_to_free_pages+0xba/0x1d0
[ 35.852683] __alloc_pages_nodemask+0x573/0xe90
[ 35.857232] ? set_bh_page+0x33/0x50
[ 35.860829] ? xas_load+0xf/0x70
[ 35.864050] ? __xa_set_mark+0x4d/0x70
[ 35.867795] ? find_get_entry+0x47/0x110
[ 35.871714] pagecache_get_page+0xbe/0x250
[ 35.875805] grab_cache_page_write_begin+0x1a/0x30
[ 35.880588] block_write_begin+0x25/0x90
[ 35.884504] blkdev_write_begin+0x1e/0x20
[ 35.888507] ? bdev_evict_inode+0xc0/0xc0
[ 35.892513] generic_perform_write+0x95/0x190
[ 35.896863] __generic_file_write_iter+0xe0/0x1a0
[ 35.901562] blkdev_write_iter+0xbf/0x1c0
[ 35.905564] __vfs_write+0x122/0x1e0
[ 35.909136] vfs_write+0x8f/0x1b0
[ 35.912454] ksys_pwrite64+0x60/0x80
[ 35.916024] __ia32_sys_ia32_pwrite64+0x16/0x20
[ 35.920549] do_fast_syscall_32+0x66/0x240
[ 35.924641] entry_SYSENTER_32+0xa5/0xf8
[ 35.928567] EIP: 0xb7f72549
[ 35.931357] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
8d 76
[ 35.950093] EAX: ffffffda EBX: 00000003 ECX: b7866010 EDX: 00400000
[ 35.956351] ESI: 39000000 EDI: 00000074 EBP: 07439000 ESP: bf973700
[ 35.962607] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[ 35.969384] Modules linked in: x86_pkg_temp_thermal
[ 35.974269] CR2: 00000000000000c8
[ 35.977582] ---[ end trace d85b8564ea55fb0e ]---
[ 35.977583] BUG: kernel NULL pointer dereference, address: 000000c8
[ 35.977584] #PF: supervisor read access in kernel mode
[ 35.982193] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[ 35.982195] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[ 35.988450] #PF: error_code(0x0000) - not-present page
[ 35.988451] *pde = 00000000
full test log link,
https://lkft.validation.linaro.org/scheduler/job/1443939#L1170
- Naresh
On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> On Thu, 21 May 2020 at 15:25, Michal Hocko <[email protected]> wrote:
> >
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > >
> > > > The following two patches have been reverted on next-20200519 and retested the
> > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > ( invoked oom-killer is gone now)
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop. Btw. do you see the problem when booting with
> > cgroup_disable=memory kernel command line parameter?
>
> With extra kernel command line parameters, cgroup_disable=memory
> I have noticed a different problem now.
>
> + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> mke2fs 1.43.8 (1-Jan-2018)
> Creating filesystem with 244190646 4k blocks and 61054976 inodes
> Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848
> Allocating group tables: 0/7453 done
> Writing inode tables: 0/7453 done
> Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL
> pointer dereference, address: 000000c8
> [ 35.508372] #PF: supervisor read access in kernel mode
> [ 35.513506] #PF: error_code(0x0000) - not-present page
> [ 35.518638] *pde = 00000000
> [ 35.521514] Oops: 0000 [#1] SMP
> [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> 5.7.0-rc6-next-20200519+ #1
> [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.2 05/23/2018
> [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Could you get faddr2line for this offset?
--
Michal Hocko
SUSE Labs
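For readers unfamiliar with the notation, the EIP line uses the
symbol+offset/size form that scripts/faddr2line in the kernel tree takes as
its argument (alongside the vmlinux image): the fault is 0x28 bytes into a
function that is 0x60 bytes long. A small sketch of splitting that string
follows; the struct and function names are hypothetical and chosen only for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical holder for the three parts of "symbol+offset/size". */
struct sym_offset {
    char symbol[64];
    unsigned long offset; /* bytes from the start of the function */
    unsigned long size;   /* total size of the function in bytes */
};

/* Returns 0 on success, -1 if the text is not in symbol+offset/size form. */
static int parse_sym_offset(const char *text, struct sym_offset *out)
{
    assert(text != NULL && out != NULL);
    /* %lx accepts an optional 0x prefix, as printed in the oops. */
    int n = sscanf(text, "%63[^+]+%lx/%lx",
                   out->symbol, &out->offset, &out->size);
    return n == 3 ? 0 : -1;
}
```

Parsing "mem_cgroup_get_nr_swap_pages+0x28/0x60" this way yields offset 40
and size 96 in decimal.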
On Thu, 21 May 2020, Michal Hocko wrote:
> On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > On Thu, 21 May 2020 at 15:25, Michal Hocko <[email protected]> wrote:
> > >
> > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > Hi Naresh,
> > > >
> > > > Naresh Kamboju writes:
> > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > > >
> > > > > The following two patches have been reverted on next-20200519 and retested the
> > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > ( invoked oom-killer is gone now)
> > > > >
> > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > protection"
> > > > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > >
> > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > checks"
> > > > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > >
> > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > >
> > > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > > in either of those commits from a (very) cursory glance, but they should
> > > > only be taking effect if protections are set.
> > >
> > > Agreed. If memory.{low,min} is not used then the patch should be
> > > effectively a nop. Btw. do you see the problem when booting with
> > > cgroup_disable=memory kernel command line parameter?
> >
> > With the extra kernel command line parameter cgroup_disable=memory,
> > I have noticed a different problem now.
> >
> > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > mke2fs 1.43.8 (1-Jan-2018)
> > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > Superblock backups stored on blocks:
> > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > 102400000, 214990848
> > Allocating group tables: 0/7453 done
> > Writing inode tables: 0/7453 done
> > Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL
> > pointer dereference, address: 000000c8
> > [ 35.508372] #PF: supervisor read access in kernel mode
> > [ 35.513506] #PF: error_code(0x0000) - not-present page
> > [ 35.518638] *pde = 00000000
> > [ 35.521514] Oops: 0000 [#1] SMP
> > [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > 5.7.0-rc6-next-20200519+ #1
> > [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > 2.2 05/23/2018
> > [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
>
> Could you get faddr2line for this offset?
No need for that, I can help with the "cgroup_disable=memory" crash:
I've been happily running with the fixup below, but haven't got to
send it in yet (and wouldn't normally be reading mail at this time!)
because I've been busy chasing a couple of other bugs (not necessarily
mm); and maybe the fix would be better with an explicit
mem_cgroup_disabled() test, or maybe that should be where
cgroup_memory_noswap is decided - up to Johannes.
---

 mm/memcontrol.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 5.7-rc6-mm1/mm/memcontrol.c 2020-05-20 12:21:56.109693740 -0700
+++ linux/mm/memcontrol.c 2020-05-20 12:26:15.500478753 -0700
@@ -6954,7 +6954,8 @@ long mem_cgroup_get_nr_swap_pages(struct
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!memcg || cgroup_memory_noswap ||
+	    !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return nr_swap_pages;
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
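
To make the failure mode concrete, here is a minimal userspace sketch of
the hierarchy walk in mem_cgroup_get_nr_swap_pages() with the !memcg
guard from the fixup above. It is an illustration under assumptions, not
kernel code: struct memcg_model, root_model, model_noswap, model_on_dfl
and model_get_nr_swap_pages() are hypothetical stand-ins, and the global
free-swap count is passed in by the caller instead of coming from
get_nr_swap_pages().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel's layout. */
struct memcg_model {
	struct memcg_model *parent;	/* models parent_mem_cgroup() */
	long swap_max;			/* models memcg->swap.max */
	long swap_usage;		/* models page_counter_read(&memcg->swap) */
};

/* Models root_mem_cgroup: the walk stops when it reaches this object. */
static struct memcg_model root_model = { NULL, 0, 0 };

/* Model the two globals tested by the early return. */
static int model_noswap;	/* cgroup_memory_noswap */
static int model_on_dfl = 1;	/* cgroup_subsys_on_dfl(memory_cgrp_subsys) */

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

long model_get_nr_swap_pages(struct memcg_model *memcg, long global_free)
{
	long nr = global_free;

	/*
	 * The !memcg test is the guard under discussion: without it, a
	 * NULL memcg would enter the loop below (NULL != &root_model)
	 * and be dereferenced.
	 */
	if (!memcg || model_noswap || !model_on_dfl)
		return nr;

	for (; memcg != &root_model; memcg = memcg->parent) {
		assert(memcg != NULL);	/* every non-root node has a parent */
		nr = min_long(nr, memcg->swap_max - memcg->swap_usage);
	}
	return nr;
}
```

With a NULL memcg the model returns the global count untouched; with a
two-level hierarchy it returns the tightest remaining headroom found on
the path to the root.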
On Thu 21-05-20 05:24:27, Hugh Dickins wrote:
> On Thu, 21 May 2020, Michal Hocko wrote:
> > On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > > On Thu, 21 May 2020 at 15:25, Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > > Hi Naresh,
> > > > >
> > > > > Naresh Kamboju writes:
> > > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > > > >
> > > > > > The following two patches have been reverted on next-20200519 and retested the
> > > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > > ( invoked oom-killer is gone now)
> > > > > >
> > > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > > protection"
> > > > > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > > >
> > > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > > checks"
> > > > > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > > >
> > > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > > >
> > > > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > > > in either of those commits from a (very) cursory glance, but they should
> > > > > only be taking effect if protections are set.
> > > >
> > > > Agreed. If memory.{low,min} is not used then the patch should be
> > > > effectively a nop. Btw. do you see the problem when booting with
> > > > cgroup_disable=memory kernel command line parameter?
> > >
> > > With the extra kernel command line parameter cgroup_disable=memory,
> > > I have noticed a different problem now.
> > >
> > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > mke2fs 1.43.8 (1-Jan-2018)
> > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > Superblock backups stored on blocks:
> > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > 102400000, 214990848
> > > Allocating group tables: 0/7453 done
> > > Writing inode tables: 0/7453 done
> > > Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL
> > > pointer dereference, address: 000000c8
> > > [ 35.508372] #PF: supervisor read access in kernel mode
> > > [ 35.513506] #PF: error_code(0x0000) - not-present page
> > > [ 35.518638] *pde = 00000000
> > > [ 35.521514] Oops: 0000 [#1] SMP
> > > [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > 5.7.0-rc6-next-20200519+ #1
> > > [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > 2.2 05/23/2018
> > > [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> >
> > Could you get faddr2line for this offset?
>
> No need for that, I can help with the "cgroup_disable=memory" crash:
> I've been happily running with the fixup below, but haven't got to
> send it in yet (and wouldn't normally be reading mail at this time!)
> because of busy chasing a couple of other bugs (not necessarily mm);
> and maybe the fix would be better with explicit mem_cgroup_disabled()
> test, or maybe that should be where cgroup_memory_noswap is decided -
> up to Johannes.
Thanks Hugh. I can see what the problem is now. I was looking at
Linus' tree, where we have different code:

	long nr_swap_pages = get_nr_swap_pages();

	if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return nr_swap_pages;

which would be impossible to crash, so I was really wondering what was
going on here. But there are other changes in mmotm which I haven't
reviewed yet. Looking at the next tree now, it is fallout from "mm:
memcontrol: prepare swap controller setup for integration".

The !memcg check is slightly more cryptic than an explicit
mem_cgroup_disabled(), but I would just leave it to Johannes as well.
>
> ---
>
> mm/memcontrol.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 5.7-rc6-mm1/mm/memcontrol.c 2020-05-20 12:21:56.109693740 -0700
> +++ linux/mm/memcontrol.c 2020-05-20 12:26:15.500478753 -0700
> @@ -6954,7 +6954,8 @@ long mem_cgroup_get_nr_swap_pages(struct
> {
> long nr_swap_pages = get_nr_swap_pages();
>
> - if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + if (!memcg || cgroup_memory_noswap ||
> + !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return nr_swap_pages;
> for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
> nr_swap_pages = min_t(long, nr_swap_pages,
--
Michal Hocko
SUSE Labs
On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> On Wed 20-05-20 20:09:06, Chris Down wrote:
> > Hi Naresh,
> >
> > Naresh Kamboju writes:
> > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > git bisected the problem and found bad commit(s) which caused this problem.
> > >
> > > The following two patches have been reverted on next-20200519 and retested the
> > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > ( invoked oom-killer is gone now)
> > >
> > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > protection"
> > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > >
> > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > checks"
> > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> >
> > Thanks Anders and Naresh for tracking this down and reverting.
> >
> > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > in either of those commits from a (very) cursory glance, but they should
> > only be taking effect if protections are set.
>
> Agreed. If memory.{low,min} is not used then the patch should be
> effectively a nop.
I was staring at the code and did not see anything. Could you give the
following debugging patch a try and see whether it triggers?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..df2e8df0eb71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
* sc->priority further than desirable.
*/
scan = max(scan, SWAP_CLUSTER_MAX);
+
+ trace_printk("scan:%lu protection:%lu\n", scan, protection);
} else {
scan = lruvec_size;
}
@@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
mem_cgroup_calculate_protection(target_memcg, memcg);
if (mem_cgroup_below_min(memcg)) {
+ trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
/*
* Hard protection.
* If there is no reclaimable memory, OOM.
@@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
* there is an unprotected supply
* of reclaimable memory from other cgroups.
*/
+ trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
--
Michal Hocko
SUSE Labs
On Thu, 21 May 2020 at 22:04, Michal Hocko <[email protected]> wrote:
>
> On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > >
> > > > The following two patches have been reverted on next-20200519 and retested the
> > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > ( invoked oom-killer is gone now)
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop.
>
> I was staring into the code and did not see anything. Could you give the
> following debugging patch a try and see whether it triggers?
It seems these code paths were not hit, but I still see the reported
problem.
Please find the detailed test log output at [1], and one more test log
with cgroup_disable=memory at [2].
Test log links:
[1] https://pastebin.com/XJU7We1g
[2] https://pastebin.com/BZ0BMUVt
On Thu, May 21, 2020 at 02:44:44PM +0200, Michal Hocko wrote:
> On Thu 21-05-20 05:24:27, Hugh Dickins wrote:
> > On Thu, 21 May 2020, Michal Hocko wrote:
> > > On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > > > On Thu, 21 May 2020 at 15:25, Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > > > Hi Naresh,
> > > > > >
> > > > > > Naresh Kamboju writes:
> > > > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > > > > >
> > > > > > > The following two patches have been reverted on next-20200519 and retested the
> > > > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > > > ( invoked oom-killer is gone now)
> > > > > > >
> > > > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > > > protection"
> > > > > > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > > > >
> > > > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > > > checks"
> > > > > > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > > > >
> > > > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > > > >
> > > > > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > > > > in either of those commits from a (very) cursory glance, but they should
> > > > > > only be taking effect if protections are set.
> > > > >
> > > > > Agreed. If memory.{low,min} is not used then the patch should be
> > > > > effectively a nop. Btw. do you see the problem when booting with
> > > > > cgroup_disable=memory kernel command line parameter?
> > > >
> > > > With the extra kernel command line parameter cgroup_disable=memory,
> > > > I have noticed a different problem now.
> > > >
> > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > 102400000, 214990848
> > > > Allocating group tables: 0/7453 done
> > > > Writing inode tables: 0/7453 done
> > > > Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL
> > > > pointer dereference, address: 000000c8
> > > > [ 35.508372] #PF: supervisor read access in kernel mode
> > > > [ 35.513506] #PF: error_code(0x0000) - not-present page
> > > > [ 35.518638] *pde = 00000000
> > > > [ 35.521514] Oops: 0000 [#1] SMP
> > > > [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > 5.7.0-rc6-next-20200519+ #1
> > > > [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.2 05/23/2018
> > > > [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> > >
> > > Could you get faddr2line for this offset?
> >
> > No need for that, I can help with the "cgroup_disable=memory" crash:
> > I've been happily running with the fixup below, but haven't got to
> > send it in yet (and wouldn't normally be reading mail at this time!)
> > because of busy chasing a couple of other bugs (not necessarily mm);
> > and maybe the fix would be better with explicit mem_cgroup_disabled()
> > test, or maybe that should be where cgroup_memory_noswap is decided -
> > up to Johannes.
>
> Thanks Hugh. I can see what is the problem now. I was looking at the
> Linus' tree and we have a different code there
>
> long nr_swap_pages = get_nr_swap_pages();
>
> if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return nr_swap_pages;
>
> which would be impossible to crash so I was really wondering what is
> going on here. But there are other changes in the mmotm which I haven't
> reviewed yet. Looking at the next tree now it is a fallout from "mm:
> memcontrol: prepare swap controller setup for integration".
>
> !memcg check slightly more cryptic than an explicit mem_cgroup_disabled
> but I would just leave it to Johannes as well.
Very much appreciate you guys tracking it down so quickly. Sorry about
the breakage.
I think mem_cgroup_disabled() checks are pretty good markers of public
entry points to the memcg API, so I'd prefer that even if a bit more
verbose. What do you think?
---
From cd373ec232942a9bc43ee5e7d2171352019a58fb Mon Sep 17 00:00:00 2001
From: Hugh Dickins <[email protected]>
Date: Thu, 21 May 2020 14:58:36 -0400
Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
fix
Fix crash with cgroup_disable=memory:
> > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > 102400000, 214990848
> > > > Allocating group tables: 0/7453 done
> > > > Writing inode tables: 0/7453 done
> > > > Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL
> > > > pointer dereference, address: 000000c8
> > > > [ 35.508372] #PF: supervisor read access in kernel mode
> > > > [ 35.513506] #PF: error_code(0x0000) - not-present page
> > > > [ 35.518638] *pde = 00000000
> > > > [ 35.521514] Oops: 0000 [#1] SMP
> > > > [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > 5.7.0-rc6-next-20200519+ #1
> > > > [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.2 05/23/2018
> > > > [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
do_memsw_account() used to be automatically false when the cgroup
controller was disabled. Now that it's replaced by
cgroup_memory_noswap, for which this isn't true, make the
mem_cgroup_disabled() checks explicit in the swap control API.
[[email protected]: use mem_cgroup_disabled() in all API functions]
Reported-by: Naresh Kamboju <[email protected]>
Debugged-by: Hugh Dickins <[email protected]>
Debugged-by: Michal Hocko <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------
1 file changed, 41 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..850bca380562 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6811,6 +6811,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page), page);
+ if (mem_cgroup_disabled())
+ return;
+
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
@@ -6876,6 +6879,10 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
struct mem_cgroup *memcg;
unsigned short oldid;
+ if (mem_cgroup_disabled())
+ return 0;
+
+ /* Only cgroup2 has swap.max */
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return 0;
@@ -6920,6 +6927,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
struct mem_cgroup *memcg;
unsigned short id;
+ if (mem_cgroup_disabled())
+ return;
+
id = swap_cgroup_record(entry, 0, nr_pages);
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
@@ -6940,12 +6950,25 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
long nr_swap_pages = get_nr_swap_pages();
- if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
- return nr_swap_pages;
+ if (mem_cgroup_disabled())
+ goto out;
+
+ /* Swap control disabled */
+ if (cgroup_memory_noswap)
+ goto out;
+
+ /*
+ * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
+ * which does not place restrictions specifically on swap.
+ */
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ goto out;
+
for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
nr_swap_pages = min_t(long, nr_swap_pages,
READ_ONCE(memcg->swap.max) -
page_counter_read(&memcg->swap));
+out:
return nr_swap_pages;
}
@@ -6957,18 +6980,30 @@ bool mem_cgroup_swap_full(struct page *page)
if (vm_swap_full())
return true;
- if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
- return false;
+
+ if (mem_cgroup_disabled())
+ goto out;
+
+ /* Swap control disabled */
+ if (cgroup_memory_noswap)
+ goto out;
+
+ /*
+ * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
+ * which does not place restrictions specifically on swap.
+ */
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ goto out;
memcg = page->mem_cgroup;
if (!memcg)
- return false;
+ goto out;
for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
if (page_counter_read(&memcg->swap) * 2 >=
READ_ONCE(memcg->swap.max))
return true;
-
+out:
return false;
}
--
2.26.2
On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
> On Thu, 21 May 2020, Johannes Weiner wrote:
> > do_memsw_account() used to be automatically false when the cgroup
> > controller was disabled. Now that it's replaced by
> > cgroup_memory_noswap, for which this isn't true, make the
> > mem_cgroup_disabled() checks explicit in the swap control API.
> >
> > [[email protected]: use mem_cgroup_disabled() in all API functions]
> > Reported-by: Naresh Kamboju <[email protected]>
> > Debugged-by: Hugh Dickins <[email protected]>
> > Debugged-by: Michal Hocko <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 41 insertions(+), 6 deletions(-)
>
> I'm certainly not against a mem_cgroup_disabled() check in the only
> place that's been observed to need it, as a fixup to merge into your
> original patch; but this seems rather an over-reaction - and I'm a
> little surprised that setting mem_cgroup_disabled() doesn't just
> force cgroup_memory_noswap, saving repetitious checks elsewhere
> (perhaps there's a difficulty in that, I haven't looked).
Fair enough, I changed it to set the flag at initialization time if
mem_cgroup_disabled(). I was never a fan of the old flags, where it
was never clear what was commandline, and what was internal runtime
state - do_swap_account? really_do_swap_account? But I think it's
straight-forward in this case now.
> Historically, I think we've added mem_cgroup_disabled() checks
> (accessing a cacheline we'd rather avoid) where they're necessary,
> rather than at every "interface".
To me that always seemed like bugs waiting to happen. Like this one!
It's a jump label nowadays, so I've been liberal with these to avoid
subtle bugs.
> And you seem to be in a very "goto out" mood today - we all have
> our "goto out" days, alternating with our "return 0" days :)
:-)
But I agree, best to keep this fixup self-contained and defer anything
else to separate cleanup patches.
How about the below? It survives a swaptest with cgroup_disable=memory
for me.
Hugh, I started with your patch, which is why I kept you as the
author, but as the patch now (and arguably the previous one) is
sufficiently different, I dropped that now. I hope that's okay.
---
From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Thu, 21 May 2020 17:44:25 -0400
Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
fix
Fix crash with cgroup_disable=memory:
> > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > 102400000, 214990848
> > > > Allocating group tables: 0/7453 done
> > > > Writing inode tables: 0/7453 done
> > > > Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL
> > > > pointer dereference, address: 000000c8
> > > > [ 35.508372] #PF: supervisor read access in kernel mode
> > > > [ 35.513506] #PF: error_code(0x0000) - not-present page
> > > > [ 35.518638] *pde = 00000000
> > > > [ 35.521514] Oops: 0000 [#1] SMP
> > > > [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > 5.7.0-rc6-next-20200519+ #1
> > > > [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.2 05/23/2018
> > > > [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Swap accounting used to be implied-disabled when the cgroup controller
was disabled. Restore that for the new cgroup_memory_noswap, so that
we bail out of this function instead of dereferencing a NULL memcg.
Reported-by: Naresh Kamboju <[email protected]>
Debugged-by: Hugh Dickins <[email protected]>
Debugged-by: Michal Hocko <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..e3b785d6e771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
 
 static int __init mem_cgroup_swap_init(void)
 {
-	if (mem_cgroup_disabled() || cgroup_memory_noswap)
+	/* No memory control -> no swap control */
+	if (mem_cgroup_disabled())
+		cgroup_memory_noswap = true;
+
+	if (cgroup_memory_noswap)
 		return 0;
 
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
--
2.26.2
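
The approach of this version can be sketched in userspace as below: the
"controller disabled" state is folded into the noswap flag once at init
time, so any later check of that single flag also covers the disabled
case. This is only a sketch under assumptions. The names
sketch_memcg_disabled, sketch_noswap, sketch_swap_init() and
sketch_swap_limits_apply() are hypothetical and merely model
mem_cgroup_disabled(), cgroup_memory_noswap, mem_cgroup_swap_init() and
a caller such as mem_cgroup_get_nr_swap_pages().

```c
#include <assert.h>
#include <stdbool.h>

static bool sketch_memcg_disabled;	/* models mem_cgroup_disabled() */
static bool sketch_noswap;		/* models cgroup_memory_noswap */

/*
 * Models mem_cgroup_swap_init(): returns true when the swap control
 * files would be registered, false when swap control stays off.
 */
bool sketch_swap_init(void)
{
	/* No memory control -> no swap control */
	if (sketch_memcg_disabled)
		sketch_noswap = true;

	if (sketch_noswap)
		return false;

	return true;
}

/*
 * Models a later caller: it only needs to test the one flag, because
 * init already folded the "controller disabled" state into it.
 */
bool sketch_swap_limits_apply(void)
{
	return !sketch_noswap;
}
```

The trade-off discussed in the thread is visible here: one assignment at
init time replaces a separate disabled check in each caller.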
[Sorry for the late reply - was offline for a few days]
On Thu 21-05-20 17:58:55, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
[...]
> From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Thu, 21 May 2020 17:44:25 -0400
> Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
> fix
>
> Fix crash with cgroup_disable=memory:
>
> > > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > > 102400000, 214990848
> > > > > Allocating group tables: 0/7453 done
> > > > > Writing inode tables: 0/7453 done
> > > > > Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL
> > > > > pointer dereference, address: 000000c8
> > > > > [ 35.508372] #PF: supervisor read access in kernel mode
> > > > > [ 35.513506] #PF: error_code(0x0000) - not-present page
> > > > > [ 35.518638] *pde = 00000000
> > > > > [ 35.521514] Oops: 0000 [#1] SMP
> > > > > [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > > 5.7.0-rc6-next-20200519+ #1
> > > > > [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > 2.2 05/23/2018
> > > > > [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
>
> Swap accounting used to be implied-disabled when the cgroup controller
> was disabled. Restore that for the new cgroup_memory_noswap, so that
> we bail out of this function instead of dereferencing a NULL memcg.
>
> Reported-by: Naresh Kamboju <[email protected]>
> Debugged-by: Hugh Dickins <[email protected]>
> Debugged-by: Michal Hocko <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Yes this looks better. I hope to get to your series soon to have the
full picture finally.
> ---
> mm/memcontrol.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3e000a316b59..e3b785d6e771 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
>
> static int __init mem_cgroup_swap_init(void)
> {
> - if (mem_cgroup_disabled() || cgroup_memory_noswap)
> + /* No memory control -> no swap control */
> + if (mem_cgroup_disabled())
> + cgroup_memory_noswap = true;
> +
> + if (cgroup_memory_noswap)
> return 0;
>
> WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
> --
> 2.26.2
--
Michal Hocko
SUSE Labs
On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> My apology !
> As per the test results history this problem started happening from
> Bad : next-20200430 (still reproducible on next-20200519)
> Good : next-20200429
>
> The git tree / tag used for testing is from linux next-20200430 tag and reverted
> following three patches and oom-killer problem fixed.
>
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
> Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
> Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
The discussion has fragmented and I got lost TBH.
In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@mail.gmail.com
you have said that none of the added tracing output has triggered. Does
this still hold? Because I still have a hard time understanding how
those three patches could have the observed effects.
--
Michal Hocko
SUSE Labs
On Thu, 28 May 2020 at 20:33, Michal Hocko <[email protected]> wrote:
>
> On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> > My apology !
> > As per the test results history this problem started happening from
> > Bad : next-20200430 (still reproducible on next-20200519)
> > Good : next-20200429
> >
> > The git tree / tag used for testing is from linux next-20200430 tag and reverted
> > following three patches and oom-killer problem fixed.
> >
> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > protection"
> > Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
> > Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
>
> The discussion has fragmented and I got lost TBH.
> In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@mail.gmail.com
> you have said that none of the added tracing output has triggered. Does
> this still hold? Because I still have a hard time to understand how
> those three patches could have the observed effects.
In the other email thread [1], this issue has been concluded.
Yafang wrote on May 22, 2020:
Regarding the root cause, my guess is it makes a similar mistake that
I tried to fix in the previous patch that the direct reclaimer read a
stale protection value. But I don't think it is worth to add another
fix. The best way is to revert this commit.
[1] [PATCH v3 2/2] mm, memcg: Decouple e{low,min} state mutations
from protection checks
https://lore.kernel.org/linux-mm/CALOAHbArZ3NsuR3mCnx_kbSF8ktpjhUF2kaaTa7Mb7ocJajsQg@mail.gmail.com/
- Naresh
> --
> Michal Hocko
> SUSE Labs
Naresh Kamboju writes:
>On Thu, 28 May 2020 at 20:33, Michal Hocko <[email protected]> wrote:
>>
>> On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
>> > My apology !
>> > As per the test results history this problem started happening from
>> > Bad : next-20200430 (still reproducible on next-20200519)
>> > Good : next-20200429
>> >
>> > The git tree / tag used for testing is from linux next-20200430 tag and reverted
>> > following three patches and oom-killer problem fixed.
>> >
>> > Revert "mm, memcg: avoid stale protection values when cgroup is above
>> > protection"
>> > Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
>> > Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
>>
>> The discussion has fragmented and I got lost TBH.
>> In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@mail.gmail.com
>> you have said that none of the added tracing output has triggered. Does
>> this still hold? Because I still have a hard time to understand how
>> those three patches could have the observed effects.
>
>On the other email thread [1] this issue is concluded.
>
>Yafang wrote on May 22 2020,
>
>Regarding the root cause, my guess is it makes a similar mistake that
>I tried to fix in the previous patch that the direct reclaimer read a
>stale protection value. But I don't think it is worth to add another
>fix. The best way is to revert this commit.
This isn't a conclusion, just a guess (and one I think is unlikely). For this
to reliably happen, it implies that the same race happens the same way each
time.
On Fri, May 29, 2020 at 12:41 AM Chris Down <[email protected]> wrote:
>
> Naresh Kamboju writes:
> >On Thu, 28 May 2020 at 20:33, Michal Hocko <[email protected]> wrote:
> >>
> >> On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> >> > My apology !
> >> > As per the test results history this problem started happening from
> >> > Bad : next-20200430 (still reproducible on next-20200519)
> >> > Good : next-20200429
> >> >
> >> > The git tree / tag used for testing is from linux next-20200430 tag and reverted
> >> > following three patches and oom-killer problem fixed.
> >> >
> >> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> >> > protection"
> >> > Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
> >> > Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
> >>
> >> The discussion has fragmented and I got lost TBH.
> >> In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@mail.gmail.com
> >> you have said that none of the added tracing output has triggered. Does
> >> this still hold? Because I still have a hard time to understand how
> >> those three patches could have the observed effects.
> >
> >On the other email thread [1] this issue is concluded.
> >
> >Yafang wrote on May 22 2020,
> >
> >Regarding the root cause, my guess is it makes a similar mistake that
> >I tried to fix in the previous patch that the direct reclaimer read a
> >stale protection value. But I don't think it is worth to add another
> >fix. The best way is to revert this commit.
>
> This isn't a conclusion, just a guess (and one I think is unlikely). For this
> to reliably happen, it implies that the same race happens the same way each
> time.
Hi Chris,
If you look at this patch [1] carefully, you will find that it introduces
the same issue that I tried to fix in another patch [2]. Even sadder,
these two patches are in the same patchset. Although this issue isn't
related to the issue found by Naresh, we have to ask ourselves why
we always make the same mistake.
One possible answer is that we always forget the lifecycle of
memory.emin before we read it. memory.emin doesn't have the same
lifecycle as the memcg; it really has the same lifecycle as
the reclaimer. IOW, once a reclaimer begins, the protection value should
be set to 0; after we traverse the memcg tree we calculate a
protection value for this reclaimer; finally it disappears after the
reclaimer stops. That is why I strongly suggested adding a new protection
member to scan_control before.
[1]. https://lore.kernel.org/linux-mm/[email protected]/
[2]. https://lore.kernel.org/linux-mm/[email protected]/
--
Thanks
Yafang
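The lifecycle argument above can be sketched in userspace C. This is only a rough illustration, not kernel code: struct scan_control_model, its protection field, struct memcg_model and calc_protection() are hypothetical names, and the min(usage, min) formula is a placeholder for the real effective-protection calculation. The point is only that the value starts at 0, is computed during the walk, and goes away with the reclaimer's on-stack state.

```c
#include <assert.h>

/* Hypothetical per-reclaimer state modelling the member proposed above. */
struct scan_control_model {
	unsigned long protection;
};

/* Hypothetical, much reduced view of a memcg. */
struct memcg_model {
	unsigned long min;	/* configured memory.min */
	unsigned long usage;	/* pages charged to the group */
};

/* Placeholder calculation (assumption): protect at most what is in use. */
static inline unsigned long calc_protection(const struct memcg_model *m)
{
	return m->usage < m->min ? m->usage : m->min;
}

/* One reclaim run: protection starts at 0, is computed, and dies with sc. */
static inline unsigned long reclaimable_pages(const struct memcg_model *m)
{
	struct scan_control_model sc = { .protection = 0 };

	sc.protection = calc_protection(m);
	return m->usage - sc.protection;	/* protection never exceeds usage */
}

/* Small self-check of the model. */
static inline void check_lifecycle_model(void)
{
	struct memcg_model partly = { .min = 100, .usage = 300 };
	struct memcg_model fully = { .min = 500, .usage = 300 };

	assert(reclaimable_pages(&partly) == 200);
	assert(reclaimable_pages(&fully) == 0);
}
```

Because the value lives in the reclaimer's own state, a second reclaimer cannot observe a stale result left behind by the first one.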
Yafang Shao writes:
>If you look at this patch [1] carefully, you will find that it introduces
>the same issue that I tried to fix in another patch [2]. Even sadder,
>these two patches are in the same patchset. Although this issue isn't
>related to the issue found by Naresh, we have to ask ourselves why
>we always make the same mistake.
>One possible answer is that we always forget the lifecycle of
>memory.emin before we read it. memory.emin doesn't have the same
>lifecycle as the memcg; it really has the same lifecycle as
>the reclaimer. IOW, once a reclaimer begins, the protection value should
>be set to 0; after we traverse the memcg tree we calculate a
>protection value for this reclaimer; finally it disappears after the
>reclaimer stops. That is why I strongly suggested adding a new protection
>member to scan_control before.
I agree with you that the e{min,low} lifecycle is confusing for everyone -- the
only thing I've not seen confirmation of is any confirmed correlation with the
i386 oom killer issue. If you've validated that, I'd like to see the data :-)
On Fri 29-05-20 02:56:44, Chris Down wrote:
> Yafang Shao writes:
> > If you look at this patch [1] carefully, you will find that it introduces
> > the same issue that I tried to fix in another patch [2]. Even sadder,
> > these two patches are in the same patchset. Although this issue isn't
> > related to the issue found by Naresh, we have to ask ourselves why
> > we always make the same mistake.
> > One possible answer is that we always forget the lifecycle of
> > memory.emin before we read it. memory.emin doesn't have the same
> > lifecycle as the memcg; it really has the same lifecycle as
> > the reclaimer. IOW, once a reclaimer begins, the protection value should
> > be set to 0; after we traverse the memcg tree we calculate a
> > protection value for this reclaimer; finally it disappears after the
> > reclaimer stops. That is why I strongly suggested adding a new protection
> > member to scan_control before.
>
> I agree with you that the e{min,low} lifecycle is confusing for everyone --
> the only thing I've not seen confirmation of is any confirmed correlation
> with the i386 oom killer issue. If you've validated that, I'd like to see
> the data :-)
Agreed. Even if e{low,min} might still have some rough edges, I am
completely puzzled how we could end up OOM if none of the protection
paths triggers, which the additional debugging should confirm. Maybe my
debugging patch is incomplete or used incorrectly (maybe it would be
easier to use printk rather than trace_printk?).
--
Michal Hocko
SUSE Labs
On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> On Fri 29-05-20 02:56:44, Chris Down wrote:
> > Yafang Shao writes:
> > > If you look at this patch [1] carefully, you will find that it introduces
> > > the same issue that I tried to fix in another patch [2]. Even sadder,
> > > these two patches are in the same patchset. Although this issue isn't
> > > related to the issue found by Naresh, we have to ask ourselves why
> > > we always make the same mistake.
> > > One possible answer is that we always forget the lifecycle of
> > > memory.emin before we read it. memory.emin doesn't have the same
> > > lifecycle as the memcg; it really has the same lifecycle as
> > > the reclaimer. IOW, once a reclaimer begins, the protection value should
> > > be set to 0; after we traverse the memcg tree we calculate a
> > > protection value for this reclaimer; finally it disappears after the
> > > reclaimer stops. That is why I strongly suggested adding a new protection
> > > member to scan_control before.
> >
> > I agree with you that the e{min,low} lifecycle is confusing for everyone --
> > the only thing I've not seen confirmation of is any confirmed correlation
> > with the i386 oom killer issue. If you've validated that, I'd like to see
> > the data :-)
>
> Agreed. Even if e{low,min} might still have some rough edges I am
> completely puzzled how we could end up oom if none of the protection
> path triggers which the additional debugging should confirm. Maybe my
> debugging patch is incomplete or used incorrectly (maybe it would be
> easier to use printk rather than trace_printk?).
It would be really great if we could move forward. While the fix (which
has been dropped from mmotm) is not super urgent I would really like to
understand how it could hit the observed behavior. Can we double check
that the debugging patch really doesn't trigger (e.g.
s@trace_printk@printk in the first step)? I have checked it again but
do not see any potential code path which would be affected by the patch
yet not trigger any output. But another pair of eyes would be really
great.
--
Michal Hocko
SUSE Labs
On Thu, 11 Jun 2020 at 15:25, Michal Hocko <[email protected]> wrote:
>
> On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> > On Fri 29-05-20 02:56:44, Chris Down wrote:
> > > Yafang Shao writes:
> > Agreed. Even if e{low,min} might still have some rough edges I am
> > completely puzzled how we could end up oom if none of the protection
> > path triggers which the additional debugging should confirm. Maybe my
> > debugging patch is incomplete or used incorrectly (maybe it would be
> > easier to use printk rather than trace_printk?).
>
> It would be really great if we could move forward. While the fix (which
> has been dropped from mmotm) is not super urgent I would really like to
> understand how it could hit the observed behavior. Can we double check
> that the debugging patch really doesn't trigger (e.g.
> s@trace_printk@printk in the first step)?
Please suggest a way to get more debug information, for example
kernel debug patches and extra kernel configs.
I applied your debug patch and tested on top of linux-next 20200612
but did not find any printk output while running the mkfs -t ext4 /drive test case.
> I have checked it again but
> do not see any potential code path which would be affected by the patch
> yet not trigger any output. But another pair of eyes would be really
> great.
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..d13ce7b02de4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2375,6 +2375,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
* sc->priority further than desirable.
*/
scan = max(scan, SWAP_CLUSTER_MAX);
+
+ trace_printk("scan:%lu protection:%lu\n", scan, protection);
} else {
scan = lruvec_size;
}
@@ -2618,6 +2620,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
switch (mem_cgroup_protected(target_memcg, memcg)) {
case MEMCG_PROT_MIN:
+ trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
/*
* Hard protection.
* If there is no reclaimable memory, OOM.
@@ -2630,6 +2633,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
* there is an unprotected supply
* of reclaimable memory from other cgroups.
*/
+ trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
--
2.23.0
ref:
test output:
https://lkft.validation.linaro.org/scheduler/job/1489767#L1388
Test artifacts link (kernel / modules):
https://builds.tuxbuild.com/5rRNgQqF_wHsSRptdj4A1A/
- Naresh
On Fri 12-06-20 15:13:22, Naresh Kamboju wrote:
> On Thu, 11 Jun 2020 at 15:25, Michal Hocko <[email protected]> wrote:
> >
> > On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> > > On Fri 29-05-20 02:56:44, Chris Down wrote:
> > > > Yafang Shao writes:
> > > Agreed. Even if e{low,min} might still have some rough edges I am
> > > completely puzzled how we could end up oom if none of the protection
> > > path triggers which the additional debugging should confirm. Maybe my
> > > debugging patch is incomplete or used incorrectly (maybe it would be
> > > easier to use printk rather than trace_printk?).
> >
> > It would be really great if we could move forward. While the fix (which
> > has been dropped from mmotm) is not super urgent I would really like to
> > understand how it could hit the observed behavior. Can we double check
> > that the debugging patch really doesn't trigger (e.g.
> > s@trace_printk@printk in the first step)?
>
> Please suggest to me the way to get more debug information
> by providing kernel debug patches and extra kernel configs.
>
> I have applied your debug patch and tested on top on linux next 20200612
> but did not find any printk output while running mkfs -t ext4 /drive test case.
Have you tried s@trace_printk@printk@ in the patch? AFAIK trace_printk
doesn't dump anything into the printk ring buffer. You would have to
look into the trace ring buffer.
--
Michal Hocko
SUSE Labs
On Thu, 21 May 2020 at 22:04, Michal Hocko <[email protected]> wrote:
>
> On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > >
> > > > The following two patches have been reverted on next-20200519 and retested the
> > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > ( invoked oom-killer is gone now)
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop.
>
> I was staring into the code and do not see anything. Could you give the
> following debugging patch a try and see whether it triggers?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc555903a332..df2e8df0eb71 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> * sc->priority further than desirable.
> */
> scan = max(scan, SWAP_CLUSTER_MAX);
> +
> + trace_printk("scan:%lu protection:%lu\n", scan, protection);
> } else {
> scan = lruvec_size;
> }
> @@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> mem_cgroup_calculate_protection(target_memcg, memcg);
>
> if (mem_cgroup_below_min(memcg)) {
> + trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
> /*
> * Hard protection.
> * If there is no reclaimable memory, OOM.
> @@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> * there is an unprotected supply
> * of reclaimable memory from other cgroups.
> */
> + trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
> if (!sc->memcg_low_reclaim) {
> sc->memcg_low_skipped = 1;
> continue;
As per your suggestion on debugging this problem, I replaced
trace_printk with printk in your patch and applied it on top of the
problematic kernel. Here is the test output and link.
mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables: 0/7453 done
Writing inode tables: 0/7453 done
Creating journal (262144 blocks): [ 51.544525] under min:0 emin:0
[ 51.845304] under min:0 emin:0
[ 51.848738] under min:0 emin:0
[ 51.858147] under min:0 emin:0
[ 51.861333] under min:0 emin:0
[ 51.862034] under min:0 emin:0
[ 51.862442] under min:0 emin:0
[ 51.862763] under min:0 emin:0
Full test log link,
https://lkft.validation.linaro.org/scheduler/job/1497412#L1451
- Naresh
Naresh Kamboju writes:
>mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
>mke2fs 1.43.8 (1-Jan-2018)
>Creating filesystem with 244190646 4k blocks and 61054976 inodes
>Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
>Superblock backups stored on blocks:
>32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>102400000, 214990848
>Allocating group tables: 0/7453 done
>Writing inode tables: 0/7453 done
>Creating journal (262144 blocks): [ 51.544525] under min:0 emin:0
>[ 51.845304] under min:0 emin:0
>[ 51.848738] under min:0 emin:0
>[ 51.858147] under min:0 emin:0
>[ 51.861333] under min:0 emin:0
>[ 51.862034] under min:0 emin:0
>[ 51.862442] under min:0 emin:0
>[ 51.862763] under min:0 emin:0
Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even when
min/emin is 0 (which should indeed be the case if you haven't set them in the
hierarchy).
My guess is that page_counter_read(&memcg->memory) is 0, which means
mem_cgroup_below_min will return 1.
However, I don't know for sure why that should then result in the OOM killer
coming along. My guess is that since this memcg has 0 pages to scan anyway, we
enter premature OOM under some conditions. I don't know why we wouldn't have
hit that with the old version of mem_cgroup_protected that returned
MEMCG_PROT_* members, though.
Can you please try the patch with the `>=` checks in mem_cgroup_below_min and
mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a strong
hint about what's going on here.
Thanks for your help!
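The difference between the two comparison forms can be shown with a small userspace sketch. This is not kernel code: struct counter_model and the helper names are invented for illustration, and only the `>=` versus `>` comparison itself is taken from the discussion. It assumes, as guessed above, that both the usage and emin read as 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counter fields involved (not the kernel struct). */
struct counter_model {
	unsigned long usage;	/* models page_counter_read(&memcg->memory) */
	unsigned long emin;	/* models memcg->memory.emin */
};

/* The ">=" form: true whenever emin == usage, including 0 >= 0. */
static inline bool below_min_ge(const struct counter_model *c)
{
	return c->emin >= c->usage;
}

/* The ">" form proposed as an experiment: 0 > 0 is false. */
static inline bool below_min_gt(const struct counter_model *c)
{
	return c->emin > c->usage;
}

/* Small self-check showing where the two forms disagree. */
static inline void check_counter_model(void)
{
	struct counter_model empty = { .usage = 0, .emin = 0 };
	struct counter_model busy = { .usage = 4096, .emin = 0 };

	assert(below_min_ge(&empty));	/* reported as protected */
	assert(!below_min_gt(&empty));	/* not reported as protected */
	assert(!below_min_ge(&busy));	/* the forms agree once usage > 0 */
	assert(!below_min_gt(&busy));
}
```

With no protection configured, the two forms differ only when the usage counter reads 0, which is why the experiment is a useful hint rather than a fix by itself.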
On Wed 17-06-20 19:07:20, Naresh Kamboju wrote:
> On Thu, 21 May 2020 at 22:04, Michal Hocko <[email protected]> wrote:
> >
> > On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > Hi Naresh,
> > > >
> > > > Naresh Kamboju writes:
> > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > > >
> > > > > The following two patches have been reverted on next-20200519 and retested the
> > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > ( invoked oom-killer is gone now)
> > > > >
> > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > protection"
> > > > > This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > >
> > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > checks"
> > > > > This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > >
> > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > >
> > > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > > in either of those commits from a (very) cursory glance, but they should
> > > > only be taking effect if protections are set.
> > >
> > > Agreed. If memory.{low,min} is not used then the patch should be
> > > effectively a nop.
> >
> > I was staring into the code and do not see anything. Could you give the
> > following debugging patch a try and see whether it triggers?
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cc555903a332..df2e8df0eb71 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > * sc->priority further than desirable.
> > */
> > scan = max(scan, SWAP_CLUSTER_MAX);
> > +
> > + trace_printk("scan:%lu protection:%lu\n", scan, protection);
> > } else {
> > scan = lruvec_size;
> > }
> > @@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > mem_cgroup_calculate_protection(target_memcg, memcg);
> >
> > if (mem_cgroup_below_min(memcg)) {
> > + trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
> > /*
> > * Hard protection.
> > * If there is no reclaimable memory, OOM.
> > @@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > * there is an unprotected supply
> > * of reclaimable memory from other cgroups.
> > */
> > + trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
> > if (!sc->memcg_low_reclaim) {
> > sc->memcg_low_skipped = 1;
> > continue;
>
> As per your suggestions on debugging this problem,
> trace_printk is replaced with printk and applied to your patch on top of the
> problematic kernel and here is the test output and link.
>
> mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> mke2fs 1.43.8 (1-Jan-2018)
> Creating filesystem with 244190646 4k blocks and 61054976 inodes
> Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848
> Allocating group tables: 0/7453 done
> Writing inode tables: 0/7453 done
> Creating journal (262144 blocks): [ 51.544525] under min:0 emin:0
> [ 51.845304] under min:0 emin:0
> [ 51.848738] under min:0 emin:0
> [ 51.858147] under min:0 emin:0
> [ 51.861333] under min:0 emin:0
> [ 51.862034] under min:0 emin:0
> [ 51.862442] under min:0 emin:0
> [ 51.862763] under min:0 emin:0
>
> Full test log link,
> https://lkft.validation.linaro.org/scheduler/job/1497412#L1451
Thanks a lot. So it is clear that mem_cgroup_below_min got confused and
reported a protected cgroup. Both the effective and the real limits are 0,
so there is no garbage in them. The problem is in mem_cgroup_below_* and
it is quite obvious.
We are doing the following
+static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+{
+ if (mem_cgroup_disabled())
+ return false;
+
+ return READ_ONCE(memcg->memory.emin) >=
+ page_counter_read(&memcg->memory);
+}
and it makes some sense. Except for the root memcg where we do not
account any memory. Adding if (mem_cgroup_is_root(memcg)) return false;
should do the trick. The same is the case for mem_cgroup_below_low.
Could you give it a try please just to confirm?
--
Michal Hocko
SUSE Labs
Michal Hocko writes:
>and it makes some sense. Except for the root memcg where we do not
>account any memory. Adding if (mem_cgroup_is_root(memcg)) return false;
>should do the trick. The same is the case for mem_cgroup_below_low.
>Could you give it a try please just to confirm?
Oh, of course :-) This seems more likely than what I proposed, and would be
great to test.
On Wed, 17 Jun 2020 at 19:41, Michal Hocko <[email protected]> wrote:
>
> [Our emails have crossed]
>
> On Wed 17-06-20 14:57:58, Chris Down wrote:
> > Naresh Kamboju writes:
> > > mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> > > mke2fs 1.43.8 (1-Jan-2018)
> > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> > > Superblock backups stored on blocks:
> > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > 102400000, 214990848
> > > Allocating group tables: 0/7453 done
> > > Writing inode tables: 0/7453 done
> > > Creating journal (262144 blocks): [ 51.544525] under min:0 emin:0
> > > [ 51.845304] under min:0 emin:0
> > > [ 51.848738] under min:0 emin:0
> > > [ 51.858147] under min:0 emin:0
> > > [ 51.861333] under min:0 emin:0
> > > [ 51.862034] under min:0 emin:0
> > > [ 51.862442] under min:0 emin:0
> > > [ 51.862763] under min:0 emin:0
> >
> > Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even
> > when min/emin is 0 (which should indeed be the case if you haven't set them
> > in the hierarchy).
> >
> > My guess is that page_counter_read(&memcg->memory) is 0, which means
> > mem_cgroup_below_min will return 1.
>
> Yes this is the case because this is likely the root memcg which skips
> all charges.
>
> > However, I don't know for sure why that should then result in the OOM killer
> > coming along. My guess is that since this memcg has 0 pages to scan anyway,
> > we enter premature OOM under some conditions. I don't know why we wouldn't
> > have hit that with the old version of mem_cgroup_protected that returned
> > MEMCG_PROT_* members, though.
>
> Not really. There is likely no other memcg to reclaim from and assuming
> min limit protection will result in no reclaimable memory and thus the
> OOM killer.
>
> > Can you please try the patch with the `>=` checks in mem_cgroup_below_min
> > and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a
> > strong hint about what's going on here.
>
> This would work but I believe an explicit check for the root memcg would
> be easier to spot the reasoning.
Could you please send the debugging or proposed fix patches here?
I am happy to do more testing.
FYI,
Here is my repository for testing.
git: https://github.com/nareshkamboju/linux/tree/printk
branch: printk
- Naresh
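The reasoning quoted above (the root memcg skips charges, so it looks protected, so nothing is reclaimable) can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel's reclaim loop: struct group_model, group_below_min() and reclaim_pass() are hypothetical names, and the root_check flag stands in for the explicit root memcg check being proposed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one memcg as seen by a reclaim pass. */
struct group_model {
	bool is_root;			/* root group: charges are not accounted */
	unsigned long usage;		/* accounted pages (0 for the root) */
	unsigned long emin;		/* effective memory.min */
	unsigned long reclaimable;	/* pages actually sitting on its LRUs */
};

/* The ">=" check, optionally preceded by an explicit root check. */
static inline bool group_below_min(const struct group_model *g, bool root_check)
{
	if (root_check && g->is_root)
		return false;
	return g->emin >= g->usage;
}

/* Walk all groups, skipping those that appear to be under hard protection. */
static inline unsigned long reclaim_pass(const struct group_model *groups,
					 size_t n, bool root_check)
{
	unsigned long reclaimed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (group_below_min(&groups[i], root_check))
			continue;	/* treated as protected: nothing scanned */
		reclaimed += groups[i].reclaimable;
	}
	return reclaimed;
}

/* Self-check: a system where only the root group exists. */
static inline void check_reclaim_model(void)
{
	const struct group_model only_root[] = {
		{ .is_root = true, .usage = 0, .emin = 0, .reclaimable = 512 },
	};

	assert(reclaim_pass(only_root, 1, false) == 0);	/* no progress */
	assert(reclaim_pass(only_root, 1, true) == 512);	/* progress */
}
```

In this model a pass that makes no progress corresponds to the situation where the allocator eventually gives up and invokes the OOM killer.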
On Wed 17-06-20 21:23:05, Naresh Kamboju wrote:
> On Wed, 17 Jun 2020 at 19:41, Michal Hocko <[email protected]> wrote:
> >
> > [Our emails have crossed]
> >
> > On Wed 17-06-20 14:57:58, Chris Down wrote:
> > > Naresh Kamboju writes:
> > > > mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > 102400000, 214990848
> > > > Allocating group tables: 0/7453 done
> > > > Writing inode tables: 0/7453 done
> > > > Creating journal (262144 blocks): [ 51.544525] under min:0 emin:0
> > > > [ 51.845304] under min:0 emin:0
> > > > [ 51.848738] under min:0 emin:0
> > > > [ 51.858147] under min:0 emin:0
> > > > [ 51.861333] under min:0 emin:0
> > > > [ 51.862034] under min:0 emin:0
> > > > [ 51.862442] under min:0 emin:0
> > > > [ 51.862763] under min:0 emin:0
> > >
> > > Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even
> > > when min/emin is 0 (which should indeed be the case if you haven't set them
> > > in the hierarchy).
> > >
> > > My guess is that page_counter_read(&memcg->memory) is 0, which means
> > > mem_cgroup_below_min will return 1.
> >
> > Yes this is the case because this is likely the root memcg which skips
> > all charges.
> >
> > > However, I don't know for sure why that should then result in the OOM killer
> > > coming along. My guess is that since this memcg has 0 pages to scan anyway,
> > > we enter premature OOM under some conditions. I don't know why we wouldn't
> > > have hit that with the old version of mem_cgroup_protected that returned
> > > MEMCG_PROT_* members, though.
> >
> > Not really. There is likely no other memcg to reclaim from and assuming
> > min limit protection will result in no reclaimable memory and thus the
> > OOM killer.
> >
> > > Can you please try the patch with the `>=` checks in mem_cgroup_below_min
> > > and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a
> > > strong hint about what's going on here.
> >
> > This would work but I believe an explicit check for the root memcg would
> > be easier to spot the reasoning.
>
> May I request you to send debugging or proposed fix patches here.
> I am happy to do more testing.
Sure, here is the diff to test.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c74a8f2323f1..6b5a31672fbe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -392,6 +392,13 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
if (mem_cgroup_disabled())
return false;
+ /*
+ * Root memcg doesn't account charges and doesn't support
+ * protection
+ */
+ if (mem_cgroup_is_root(memcg))
+ return false;
+
return READ_ONCE(memcg->memory.elow) >=
page_counter_read(&memcg->memory);
}
@@ -401,6 +408,13 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
if (mem_cgroup_disabled())
return false;
+ /*
+ * Root memcg doesn't account charges and doesn't support
+ * protection
+ */
+ if (mem_cgroup_is_root(memcg))
+ return false;
+
return READ_ONCE(memcg->memory.emin) >=
page_counter_read(&memcg->memory);
}
--
Michal Hocko
SUSE Labs
On Wed, 17 Jun 2020 at 21:36, Michal Hocko <[email protected]> wrote:
>
> On Wed 17-06-20 21:23:05, Naresh Kamboju wrote:
> > On Wed, 17 Jun 2020 at 19:41, Michal Hocko <[email protected]> wrote:
> > >
> > > [Our emails have crossed]
> > >
> > > On Wed 17-06-20 14:57:58, Chris Down wrote:
> > > > Naresh Kamboju writes:
> > > > > mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > > 102400000, 214990848
> > > > > Allocating group tables: 0/7453 done
> > > > > Writing inode tables: 0/7453 done
> > > > > Creating journal (262144 blocks): [ 51.544525] under min:0 emin:0
> > > > > [ 51.845304] under min:0 emin:0
> > > > > [ 51.848738] under min:0 emin:0
> > > > > [ 51.858147] under min:0 emin:0
> > > > > [ 51.861333] under min:0 emin:0
> > > > > [ 51.862034] under min:0 emin:0
> > > > > [ 51.862442] under min:0 emin:0
> > > > > [ 51.862763] under min:0 emin:0
> > > >
> > > > Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even
> > > > when min/emin is 0 (which should indeed be the case if you haven't set them
> > > > in the hierarchy).
> > > >
> > > > My guess is that page_counter_read(&memcg->memory) is 0, which means
> > > > mem_cgroup_below_min will return 1.
> > >
> > > Yes this is the case because this is likely the root memcg which skips
> > > all charges.
> > >
> > > > However, I don't know for sure why that should then result in the OOM killer
> > > > coming along. My guess is that since this memcg has 0 pages to scan anyway,
> > > > we enter premature OOM under some conditions. I don't know why we wouldn't
> > > > have hit that with the old version of mem_cgroup_protected that returned
> > > > MEMCG_PROT_* members, though.
> > >
> > > Not really. There is likely no other memcg to reclaim from and assuming
> > > min limit protection will result in no reclaimable memory and thus the
> > > OOM killer.
> > >
> > > > Can you please try the patch with the `>=` checks in mem_cgroup_below_min
> > > > and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a
> > > > strong hint about what's going on here.
> > >
> > > This would work but I believe an explicit check for the root memcg would
> > > be easier to spot the reasoning.
> >
> > May I request you to send debugging or proposed fix patches here.
> > I am happy to do more testing.
>
> Sure, here is the diff to test.
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c74a8f2323f1..6b5a31672fbe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -392,6 +392,13 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> if (mem_cgroup_disabled())
> return false;
>
> + /*
> + * Root memcg doesn't account charges and doesn't support
> + * protection
> + */
> + if (mem_cgroup_is_root(memcg))
> + return false;
> +
> return READ_ONCE(memcg->memory.elow) >=
> page_counter_read(&memcg->memory);
> }
> @@ -401,6 +408,13 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
> if (mem_cgroup_disabled())
> return false;
>
> + /*
> + * Root memcg doesn't account charges and doesn't support
> + * protection
> + */
> + if (mem_cgroup_is_root(memcg))
> + return false;
> +
> return READ_ONCE(memcg->memory.emin) >=
> page_counter_read(&memcg->memory);
> }
After applying this patch, the reported issue is fixed.
test log link,
https://lkft.validation.linaro.org/scheduler/job/1505417#L1429
- Naresh
Naresh Kamboju writes:
>After this patch applied the reported issue got fixed.
Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
I'll send out a new version tomorrow with the fixes applied and both of you
credited in the changelog for the detection and fix.
Yafang Shao writes:
>On Thu, Jun 18, 2020 at 5:09 AM Chris Down <[email protected]> wrote:
>>
>> Naresh Kamboju writes:
>> >After this patch applied the reported issue got fixed.
>>
>> Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
>>
>> I'll send out a new version tomorrow with the fixes applied and both of you
>> credited in the changelog for the detection and fix.
>
>As we have already found that the usage around memory.{emin, elow} has
>many limitations, I think memory.{emin, elow} should be used for
>memcg-tree internally only, that means they can only be used to
>calculate the protection of a memcg in a specified memcg-tree but
>should not be exposed to other MM parts.
I agree that the current semantics are mentally taxing and we should generally
avoid exposing the implementation details outside of memcg where possible. Do
you have a suggested rework? :-)
On Thu 18-06-20 13:37:43, Chris Down wrote:
> Yafang Shao writes:
> > On Thu, Jun 18, 2020 at 5:09 AM Chris Down <[email protected]> wrote:
> > >
> > > Naresh Kamboju writes:
> > > >With this patch applied, the reported issue is fixed.
> > >
> > > Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
> > >
> > > I'll send out a new version tomorrow with the fixes applied and both of you
> > > credited in the changelog for the detection and fix.
> >
> > As we have already found that the usage around memory.{emin, elow} has
> > many limitations, I think memory.{emin, elow} should be used for
> > memcg-tree internally only, that means they can only be used to
> > calculate the protection of a memcg in a specified memcg-tree but
> > should not be exposed to other MM parts.
>
> I agree that the current semantics are mentally taxing and we should
> generally avoid exposing the implementation details outside of memcg where
> possible. Do you have a suggested rework? :-)
I would really prefer to do that work on top of the fixes we (used to)
have in mmotm (with the fixup).
--
Michal Hocko
SUSE Labs
Michal Hocko writes:
>I would really prefer to do that work on top of the fixes we (used to)
>have in mmotm (with the fixup).
Oh, for sure. We should reintroduce the patches with the fix, and then look at
longer-term solutions once that's in :-)
On Thu, Jun 18, 2020 at 8:37 PM Chris Down <[email protected]> wrote:
>
> Yafang Shao writes:
> >On Thu, Jun 18, 2020 at 5:09 AM Chris Down <[email protected]> wrote:
> >>
> >> Naresh Kamboju writes:
> >> >With this patch applied, the reported issue is fixed.
> >>
> >> Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
> >>
> >> I'll send out a new version tomorrow with the fixes applied and both of you
> >> credited in the changelog for the detection and fix.
> >
> >As we have already found that the usage around memory.{emin, elow} has
> >many limitations, I think memory.{emin, elow} should be used for
> >memcg-tree internally only, that means they can only be used to
> >calculate the protection of a memcg in a specified memcg-tree but
> >should not be exposed to other MM parts.
>
> I agree that the current semantics are mentally taxing and we should generally
> avoid exposing the implementation details outside of memcg where possible. Do
> you have a suggested rework? :-)
My suggestion is to keep mem_cgroup_protected() as-is. In any case, I
think it is bad to scatter references to memory.{emin, elow} around
the code.
If we don't have a better idea for now, putting all the references to
memory.{emin, elow} into one wrapper (mem_cgroup_protected()) is a
reasonable solution.
--
Thanks
Yafang
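
The single-wrapper idea above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel's mem_cgroup_protected(): `enum prot_level`, `struct memcg_prot_model`, and `memcg_protection_level()` are hypothetical names, and the sketch only compares already-computed effective values against usage. It does not model how emin and elow are calculated within a memcg tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: the only thing callers outside memcg would see. */
enum prot_level { PROT_NONE, PROT_LOW, PROT_MIN };

/* Hypothetical userspace stand-in for the fields the wrapper reads. */
struct memcg_prot_model {
	bool is_root;         /* models mem_cgroup_is_root(memcg) */
	unsigned long emin;   /* models memcg->memory.emin */
	unsigned long elow;   /* models memcg->memory.elow */
	unsigned long usage;  /* models page_counter_read(&memcg->memory) */
};

/*
 * One place that touches emin/elow. Callers get a protection level
 * instead of reading the effective values themselves.
 */
enum prot_level memcg_protection_level(const struct memcg_prot_model *memcg)
{
	assert(memcg != NULL);
	if (memcg->is_root)
		return PROT_NONE;   /* root is never treated as protected */
	if (memcg->emin >= memcg->usage)
		return PROT_MIN;    /* the stronger guarantee is checked first */
	if (memcg->elow >= memcg->usage)
		return PROT_LOW;
	return PROT_NONE;
}
```

A reclaim-side caller would then switch on the returned level rather than compare memory.{emin, elow} directly, which keeps those fields private to the memcg code.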