2016-11-21 21:50:43

by Vlastimil Babka

[permalink] [raw]
Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On 11/21/2016 04:43 PM, Marc MERLIN wrote:
> Howdy,
>
> As a followup to https://plus.google.com/u/0/+MarcMERLIN/posts/A3FrLVo3kc6
>
> http://pastebin.com/yJybSHNq and http://pastebin.com/B6xEH4Dw
> show a system with plenty of RAM (24GB) falling over and killing innocent
> user space apps, a few hours after I start a 9TB copy between 2 raid5 arrays
> both hosting bcache, dmcrypt and btrfs (yes, that's 3 layers under btrfs)
>
> This kind of stuff worked until 4.6 if I'm not mistaken and started failing
> with 4.8 (I didn't try 4.7)
>
> I tried applying
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9f7e3387939b036faacf4e7f32de7bb92a6635d6
> to 4.8.8 and it didn't help
> http://pastebin.com/2LUicF3k
>
> 4.9rc5 however seems to be doing better, and is still running after 18
> hours. However, I got a few page allocation failures as per below, but the
> system seems to recover.
> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> or is that good enough, and i should go back to 4.8.8 with that patch applied?
> https://marc.info/?l=linux-mm&m=147423605024993

Hi, I think it's enough for 4.9 for now and I would appreciate trying
4.8 with that patch, yeah.

The failures below are in a GFP_NOWAIT context, which cannot do any
reclaim, so they are not affected by the OOM rewrite. If it's a regression, it
has to be caused by something else. But it seems the code in
cfq_get_queue() intentionally doesn't want to reclaim or use any atomic
reserves, and has a fallback scenario for allocation failure, in which
case I would argue that it should add __GFP_NOWARN, as these warnings
can't help anyone. CCing Tejun as author of commit d4aad7ff0.

>
> Thanks,
> Marc
>
>
> bash: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
> CPU: 4 PID: 16706 Comm: bash Not tainted 4.9.0-rc5-amd64-volpreempt-sysrq-20161108 #1
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> ffff9812088ff680 ffffffff9a36f697 0000000000000000 ffffffff9aababe8
> ffff9812088ff710 ffffffff9a13ae2b 0220400000000012 ffffffff9aababe8
> ffff9812088ff6a8 ffffffff00000010 ffff9812088ff720 ffff9812088ff6c0
> Call Trace:
> [<ffffffff9a36f697>] dump_stack+0x61/0x7d
> [<ffffffff9a13ae2b>] warn_alloc+0x107/0x11b
> [<ffffffff9a13b5cc>] __alloc_pages_slowpath+0x727/0x8f2
> [<ffffffff9a13abb8>] ? get_page_from_freelist+0x62e/0x66f
> [<ffffffff9a13b8f3>] __alloc_pages_nodemask+0x15c/0x220
> [<ffffffff9a18036d>] cache_grow_begin+0xb2/0x308
> [<ffffffff9a180a2b>] fallback_alloc+0x137/0x19f
> [<ffffffff9a1808e9>] ____cache_alloc_node+0xd3/0xde
> [<ffffffff9a180b21>] kmem_cache_alloc_node+0x8e/0x163
> [<ffffffff9a36ad08>] cfq_get_queue+0x162/0x29d
> [<ffffffff9a1811d4>] ? kmem_cache_alloc+0xd7/0x14b
> [<ffffffff9a136495>] ? mempool_alloc_slab+0x15/0x17
> [<ffffffff9a13659f>] ? mempool_alloc+0x69/0x132
> [<ffffffff9a36af84>] cfq_set_request+0x141/0x2be
> [<ffffffff9a0bd9dc>] ? timekeeping_get_ns+0x1e/0x32
> [<ffffffff9a0bdb8c>] ? ktime_get+0x41/0x52
> [<ffffffff9a367188>] ? ktime_get_ns+0x9/0xb
> [<ffffffff9a3671bf>] ? cfq_init_icq+0x12/0x19
> [<ffffffff9a346046>] elv_set_request+0x1f/0x24
> [<ffffffff9a3495ca>] get_request+0x324/0x5aa
> [<ffffffff9a0945b0>] ? wake_up_atomic_t+0x2c/0x2c
> [<ffffffff9a34bc5c>] blk_queue_bio+0x19f/0x28c
> [<ffffffff9a34a525>] generic_make_request+0xbd/0x160
> [<ffffffff9a34a6c8>] submit_bio+0x100/0x11d
> [<ffffffff9a17177a>] ? map_swap_page+0x12/0x14
> [<ffffffff9a16e875>] ? get_swap_bio+0x57/0x6c
> [<ffffffff9a16edfb>] swap_readpage+0x106/0x10e
> [<ffffffff9a16f3e0>] read_swap_cache_async+0x26/0x2d
> [<ffffffff9a16f501>] swapin_readahead+0x11a/0x16a
> [<ffffffff9a15de97>] do_swap_page+0x9c/0x42e
> [<ffffffff9a15de97>] ? do_swap_page+0x9c/0x42e
> [<ffffffff9a1601ff>] handle_mm_fault+0xa51/0xb71
> [<ffffffff9a6a61a5>] ? _raw_spin_lock_irq+0x1c/0x1e
> [<ffffffff9a052091>] __do_page_fault+0x29e/0x425
> [<ffffffff9a05223d>] do_page_fault+0x25/0x27
> [<ffffffff9a6a7818>] page_fault+0x28/0x30
> Mem-Info:
> active_anon:563129 inactive_anon:140630 isolated_anon:0
> active_file:4036325 inactive_file:448954 isolated_file:288
> unevictable:1760 dirty:9197 writeback:446395 unstable:0
> slab_reclaimable:47810 slab_unreclaimable:120834
> mapped:534180 shmem:627708 pagetables:5647 bounce:0
> free:90108 free_pcp:218 free_cma:78
> Node 0 active_anon:2252516kB inactive_anon:562520kB active_file:16145300kB inactive_file:1795816kB unevictable:7040kB isolated(anon):0kB isolated(file):1152kB mapped:2136720kB dirty:36788kB writeback:1785580kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2510832kB writeback_tmp:0kB unstable:0kB pages_scanned:32 all_unreclaimable? no
> Node 0 DMA free:15884kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 3199 23767 23767 23767
> Node 0 DMA32 free:117656kB min:35424kB low:44280kB high:53136kB active_anon:38004kB inactive_anon:13540kB active_file:2221420kB inactive_file:307236kB unevictable:0kB writepending:311780kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:47992kB slab_unreclaimable:25360kB kernel_stack:512kB pagetables:796kB bounce:0kB free_pcp:96kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 0 20567 20567 20567
> Node 0 Normal free:226892kB min:226544kB low:283180kB high:339816kB active_anon:2214684kB inactive_anon:549272kB active_file:13923880kB inactive_file:1488092kB unevictable:7040kB writepending:1510452kB present:21485568kB managed:21080820kB mlocked:7040kB slab_reclaimable:143248kB slab_unreclaimable:457968kB kernel_stack:44384kB pagetables:21792kB bounce:0kB free_pcp:776kB local_pcp:132kB free_cma:312kB
> lowmem_reserve[]: 0 0 0 0 0
> Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
> Node 0 DMA32: 11805*4kB (UME) 8876*8kB (UM) 5*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 118308kB
> Node 0 Normal: 55077*4kB (UMEC) 843*8kB (UMC) 2*16kB (C) 1*32kB (C) 3*64kB (C) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 227308kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 5121498 total pagecache pages
> 7334 pages in swap cache
> Swap cache stats: add 1513475, delete 1506141, find 949827/1257375
> Free swap = 14492876kB
> Total swap = 15616764kB
> 6215903 pages RAM
> 0 pages HighMem/MovableOnly
> 117600 pages reserved
> 4096 pages cma reserved
> 0 pages hwpoisoned
>
> kworker/4:197: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
> CPU: 4 PID: 7411 Comm: kworker/4:197 Not tainted 4.9.0-rc5-amd64-volpreempt-sysrq-20161108 #1
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> Workqueue: bcache cache_lookup [bcache]
> ffff98121672f590 ffffffff9a36f697 0000000000000000 ffffffff9aababe8
> ffff98121672f620 ffffffff9a13ae2b 0220400000000012 ffffffff9aababe8
> ffff98121672f5b8 ffffffff00000010 ffff98121672f630 ffff98121672f5d0
> Call Trace:
> [<ffffffff9a36f697>] dump_stack+0x61/0x7d
> [<ffffffff9a13ae2b>] warn_alloc+0x107/0x11b
> [<ffffffff9a13b5cc>] __alloc_pages_slowpath+0x727/0x8f2
> [<ffffffff9a13abb8>] ? get_page_from_freelist+0x62e/0x66f
> [<ffffffff9a13b8f3>] __alloc_pages_nodemask+0x15c/0x220
> [<ffffffff9a18036d>] cache_grow_begin+0xb2/0x308
> [<ffffffff9a180a2b>] fallback_alloc+0x137/0x19f
> [<ffffffff9a1808e9>] ____cache_alloc_node+0xd3/0xde
> [<ffffffff9a180b21>] kmem_cache_alloc_node+0x8e/0x163
> [<ffffffff9a36ad08>] cfq_get_queue+0x162/0x29d
> [<ffffffff9a136495>] ? mempool_alloc_slab+0x15/0x17
> [<ffffffff9a13659f>] ? mempool_alloc+0x69/0x132
> [<ffffffff9a36af84>] cfq_set_request+0x141/0x2be
> [<ffffffff9a0bd9dc>] ? timekeeping_get_ns+0x1e/0x32
> [<ffffffff9a0bdb8c>] ? ktime_get+0x41/0x52
> [<ffffffff9a367188>] ? ktime_get_ns+0x9/0xb
> [<ffffffff9a3671bf>] ? cfq_init_icq+0x12/0x19
> [<ffffffff9a346046>] elv_set_request+0x1f/0x24
> [<ffffffff9a3495ca>] get_request+0x324/0x5aa
> [<ffffffff9a0945b0>] ? wake_up_atomic_t+0x2c/0x2c
> [<ffffffff9a34bc5c>] blk_queue_bio+0x19f/0x28c
> [<ffffffff9a34a525>] generic_make_request+0xbd/0x160
> [<ffffffffc062cd53>] cached_dev_cache_miss+0x20c/0x21b [bcache]
> [<ffffffffc062c9ca>] cache_lookup_fn+0xe2/0x25f [bcache]
> [<ffffffffc06227b7>] ? bch_ptr_bad+0xa/0xc [bcache]
> [<ffffffffc062c8e8>] ? bio_next_split.constprop.22+0x20/0x20 [bcache]
> [<ffffffffc0625377>] bch_btree_map_keys_recurse+0x7b/0x151 [bcache]
> [<ffffffffc06250e8>] ? bch_btree_node_get+0xc2/0x1c8 [bcache]
> [<ffffffffc062c8e8>] ? bio_next_split.constprop.22+0x20/0x20 [bcache]
> [<ffffffffc06253c4>] bch_btree_map_keys_recurse+0xc8/0x151 [bcache]
> [<ffffffff9a08acb4>] ? set_next_entity+0x51/0xbc
> [<ffffffff9a08f309>] ? pick_next_task_fair+0x12c/0x348
> [<ffffffffc062784c>] bch_btree_map_keys+0x8f/0xdb [bcache]
> [<ffffffffc062c8e8>] ? bio_next_split.constprop.22+0x20/0x20 [bcache]
> [<ffffffffc062c814>] cache_lookup+0x84/0xb7 [bcache]
> [<ffffffff9a0771da>] process_one_work+0x17f/0x28d
> [<ffffffff9a0777b6>] worker_thread+0x1ee/0x2c1
> [<ffffffff9a0775c8>] ? rescuer_thread+0x2ad/0x2ad
> [<ffffffff9a07bbb6>] kthread+0xa6/0xae
> [<ffffffff9a07bb10>] ? init_completion+0x24/0x24
> [<ffffffff9a6a66b5>] ret_from_fork+0x25/0x30
> Mem-Info:
> active_anon:557191 inactive_anon:139781 isolated_anon:0
> active_file:4043120 inactive_file:390969 isolated_file:0
> unevictable:1760 dirty:4811 writeback:387556 unstable:0
> slab_reclaimable:47779 slab_unreclaimable:120741
> mapped:561961 shmem:627643 pagetables:5533 bounce:0
> free:90159 free_pcp:411 free_cma:78
> Node 0 active_anon:2228764kB inactive_anon:559124kB active_file:16172480kB inactive_file:1563876kB unevictable:7040kB isolated(anon):0kB isolated(file):0kB mapped:2247844kB dirty:19244kB writeback:1550224kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2510572kB writeback_tmp:0kB unstable:0kB pages_scanned:96 all_unreclaimable? no
> Node 0 DMA free:15884kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 3199 23767 23767 23767
> Node 0 DMA32 free:117816kB min:35424kB low:44280kB high:53136kB active_anon:35856kB inactive_anon:12996kB active_file:2225180kB inactive_file:289328kB unevictable:0kB writepending:290420kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:47664kB slab_unreclaimable:25036kB kernel_stack:528kB pagetables:692kB bounce:0kB free_pcp:244kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 0 20567 20567 20567
> Node 0 Normal free:226936kB min:226544kB low:283180kB high:339816kB active_anon:2192908kB inactive_anon:546128kB active_file:13947204kB inactive_file:1274320kB unevictable:7040kB writepending:1279356kB present:21485568kB managed:21080820kB mlocked:7040kB slab_reclaimable:143452kB slab_unreclaimable:457920kB kernel_stack:44224kB pagetables:21440kB bounce:0kB free_pcp:1396kB local_pcp:0kB free_cma:312kB
> lowmem_reserve[]: 0 0 0 0 0
> Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
> Node 0 DMA32: 9086*4kB (ME) 9579*8kB (UME) 306*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 117872kB
> Node 0 Normal: 54290*4kB (UMEC) 1198*8kB (UMC) 2*16kB (C) 1*32kB (C) 3*64kB (C) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 227000kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 5067527 total pagecache pages
> 4955 pages in swap cache
> Swap cache stats: add 1516061, delete 1511106, find 950848/1258840
> Free swap = 14488852kB
> Total swap = 15616764kB
> 6215903 pages RAM
> 0 pages HighMem/MovableOnly
> 117600 pages reserved
> 4096 pages cma reserved
> 0 pages hwpoisoned
>
> hpet1: lost 7439 rtc interrupts
> kworker/0:203: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
> CPU: 0 PID: 10557 Comm: kworker/0:203 Not tainted 4.9.0-rc5-amd64-volpreempt-sysrq-20161108 #1
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> Workqueue: bcache cache_lookup [bcache]
> ffff98121ac275f0 ffffffff9a36f697 0000000000000000 ffffffff9aababe8
> ffff98121ac27680 ffffffff9a13ae2b 0220400000000012 ffffffff9aababe8
> ffff98121ac27618 ffffffff00000010 ffff98121ac27690 ffff98121ac27630
> Call Trace:
> [<ffffffff9a36f697>] dump_stack+0x61/0x7d
> [<ffffffff9a13ae2b>] warn_alloc+0x107/0x11b
> [<ffffffff9a13b5cc>] __alloc_pages_slowpath+0x727/0x8f2
> [<ffffffff9a13abb8>] ? get_page_from_freelist+0x62e/0x66f
> [<ffffffff9a13b8f3>] __alloc_pages_nodemask+0x15c/0x220
> [<ffffffff9a18036d>] cache_grow_begin+0xb2/0x308
> [<ffffffff9a180a2b>] fallback_alloc+0x137/0x19f
> [<ffffffff9a1808e9>] ____cache_alloc_node+0xd3/0xde
> [<ffffffff9a180b21>] kmem_cache_alloc_node+0x8e/0x163
> [<ffffffff9a36ad08>] cfq_get_queue+0x162/0x29d
> [<ffffffff9a367188>] ? ktime_get_ns+0x9/0xb
> [<ffffffff9a36bcb7>] ? cfq_dispatch_requests+0x124/0x81f
> [<ffffffffc01ad443>] ? sil24_qc_issue+0x1e/0x55 [sata_sil24]
> [<ffffffff9a51fa7c>] ? ata_qc_issue+0x278/0x2b9
> [<ffffffff9a36af84>] cfq_set_request+0x141/0x2be
> [<ffffffff9a347c0a>] ? alloc_request_struct+0x19/0x1b
> [<ffffffff9a13659f>] ? mempool_alloc+0x69/0x132
> [<ffffffff9a029dff>] ? native_sched_clock+0x1f/0x3a
> [<ffffffff9a346046>] elv_set_request+0x1f/0x24
> [<ffffffff9a3495ca>] get_request+0x324/0x5aa
> [<ffffffff9a0945b0>] ? wake_up_atomic_t+0x2c/0x2c
> [<ffffffff9a34bc5c>] blk_queue_bio+0x19f/0x28c
> [<ffffffff9a34a525>] generic_make_request+0xbd/0x160
> [<ffffffffc06296d6>] __bch_submit_bbio+0x5f/0x62 [bcache]
> [<ffffffffc0629704>] bch_submit_bbio+0x2b/0x30 [bcache]
> [<ffffffffc06241b9>] bch_btree_node_read+0xca/0x16e [bcache]
> [<ffffffffc06250db>] bch_btree_node_get+0xb5/0x1c8 [bcache]
> [<ffffffffc062c8e8>] ? bio_next_split.constprop.22+0x20/0x20 [bcache]
> [<ffffffffc062539d>] bch_btree_map_keys_recurse+0xa1/0x151 [bcache]
> [<ffffffff9a08acb4>] ? set_next_entity+0x51/0xbc
> [<ffffffff9a08f309>] ? pick_next_task_fair+0x12c/0x348
> [<ffffffffc062784c>] bch_btree_map_keys+0x8f/0xdb [bcache]
> [<ffffffffc062c8e8>] ? bio_next_split.constprop.22+0x20/0x20 [bcache]
> [<ffffffffc062c814>] cache_lookup+0x84/0xb7 [bcache]
> [<ffffffff9a0771da>] process_one_work+0x17f/0x28d
> [<ffffffff9a0777b6>] worker_thread+0x1ee/0x2c1
> [<ffffffff9a0775c8>] ? rescuer_thread+0x2ad/0x2ad
> [<ffffffff9a07bbb6>] kthread+0xa6/0xae
> [<ffffffff9a07bb10>] ? init_completion+0x24/0x24
> [<ffffffff9a6a66b5>] ret_from_fork+0x25/0x30
> Mem-Info:
> active_anon:561047 inactive_anon:140273 isolated_anon:0
> active_file:3993560 inactive_file:546181 isolated_file:0
> unevictable:1891 dirty:16100 writeback:540602 unstable:0
> slab_reclaimable:47776 slab_unreclaimable:122101
> mapped:646066 shmem:628914 pagetables:5797 bounce:0
> free:106427 free_pcp:1437 free_cma:126
> Node 0 active_anon:2244188kB inactive_anon:561092kB active_file:15974240kB inactive_file:2184724kB unevictable:7564kB isolated(anon):0kB isolated(file):0kB mapped:2584264kB dirty:64400kB writeback:2162408kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2515656kB writeback_tmp:0kB unstable:0kB pages_scanned:96 all_unreclaimable? no
> Node 0 DMA free:15884kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 3199 23767 23767 23767
> Node 0 DMA32 free:145580kB min:35424kB low:44280kB high:53136kB active_anon:28040kB inactive_anon:19732kB active_file:2166240kB inactive_file:458248kB unevictable:252kB writepending:469368kB present:3362068kB managed:3296500kB mlocked:252kB slab_reclaimable:48544kB slab_unreclaimable:25364kB kernel_stack:480kB pagetables:604kB bounce:0kB free_pcp:2896kB local_pcp:120kB free_cma:0kB
> lowmem_reserve[]: 0 0 20567 20567 20567
> Node 0 Normal free:264244kB min:226544kB low:283180kB high:339816kB active_anon:2216160kB inactive_anon:541360kB active_file:13808316kB inactive_file:1726796kB unevictable:7312kB writepending:1757972kB present:21485568kB managed:21080820kB mlocked:7312kB slab_reclaimable:142560kB slab_unreclaimable:463032kB kernel_stack:44544kB pagetables:22584kB bounce:0kB free_pcp:2688kB local_pcp:0kB free_cma:504kB
> lowmem_reserve[]: 0 0 0 0 0
> Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
> Node 0 DMA32: 9458*4kB (UME) 12888*8kB (UM) 297*16kB (UM) 1*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 145720kB
> Node 0 Normal: 56076*4kB (UMEC) 4824*8kB (UMC) 106*16kB (UMC) 16*32kB (UC) 3*64kB (C) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 265296kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 5177157 total pagecache pages
> 7486 pages in swap cache
> Swap cache stats: add 1529461, delete 1521975, find 963233/1276841
> Free swap = 14488104kB
> Total swap = 15616764kB
> 6215903 pages RAM
> 0 pages HighMem/MovableOnly
> 117600 pages reserved
> 4096 pages cma reserved
> 0 pages hwpoisoned
> hpet1: lost 876 rtc interrupts
>


2016-11-21 22:19:26

by Marc MERLIN

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> > 4.9rc5 however seems to be doing better, and is still running after 18
> > hours. However, I got a few page allocation failures as per below, but the
> > system seems to recover.
> > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> > or is that good enough, and i should go back to 4.8.8 with that patch applied?
> > https://marc.info/?l=linux-mm&m=147423605024993
>
> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> 4.8 with that patch, yeah.

So the good news is that it's been running for almost 5H and so far so good.

> The failures below are in a GFP_NOWAIT context, which cannot do any
> reclaim, so they are not affected by the OOM rewrite. If it's a regression, it
> has to be caused by something else. But it seems the code in
> cfq_get_queue() intentionally doesn't want to reclaim or use any atomic
> reserves, and has a fallback scenario for allocation failure, in which
> case I would argue that it should add __GFP_NOWARN, as these warnings
> can't help anyone. CCing Tejun as author of commit d4aad7ff0.

No, that's not a regression, I get those on occasion. The good news is that they're not
fatal. Just got another one with 4.8.8.
No idea if they're actual errors I should worry about, or just warnings that spam
the console a bit, but things retry, recover and succeed, so I can ignore them.

Another one from 4.8.8 below. I'll report back tomorrow to see if this has run for a day
and if so, I'll call your patch a fix for my problem (but at this point, it's already
looking very good).

Thanks, Marc

cron: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
CPU: 4 PID: 9748 Comm: cron Tainted: G U 4.8.8-amd64-volpreempt-sysrq-20161108vb2 #9
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
0000000000000000 ffffa1e37429f6d0 ffffffff9a36a0bb 0000000000000000
0000000000000000 ffffa1e37429f768 ffffffff9a1359d4 022040009f5e8d00
0000000000000012 0000000000000000 0000000000000000 ffffffff9a140770
Call Trace:
[<ffffffff9a36a0bb>] dump_stack+0x61/0x7d
[<ffffffff9a1359d4>] warn_alloc_failed+0x11c/0x132
[<ffffffff9a140770>] ? wakeup_kswapd+0x8e/0x153
[<ffffffff9a1362d8>] __alloc_pages_nodemask+0x87b/0xb02
[<ffffffff9a1362d8>] ? __alloc_pages_nodemask+0x87b/0xb02
[<ffffffff9a17b388>] cache_grow_begin+0xb2/0x30b
[<ffffffff9a17ba49>] fallback_alloc+0x137/0x19f
[<ffffffff9a17b907>] ____cache_alloc_node+0xd3/0xde
[<ffffffff9a17bb3f>] kmem_cache_alloc_node+0x8e/0x163
[<ffffffff9a36578f>] cfq_get_queue+0x162/0x29d
[<ffffffff9a17c1ea>] ? kmem_cache_alloc+0xd7/0x14b
[<ffffffff9a17acc0>] ? slab_post_alloc_hook+0x5b/0x66
[<ffffffff9a365a0b>] cfq_set_request+0x141/0x2be
[<ffffffff9a0ba1ed>] ? timekeeping_get_ns+0x1e/0x32
[<ffffffff9a0ba39d>] ? ktime_get+0x41/0x52
[<ffffffff9a361bd8>] ? ktime_get_ns+0x9/0xb
[<ffffffff9a361c0f>] ? cfq_init_icq+0x12/0x19
[<ffffffff9a340407>] elv_set_request+0x1f/0x24
[<ffffffff9a34367d>] get_request+0x324/0x5aa
[<ffffffff9a091aed>] ? wake_up_atomic_t+0x2c/0x2c
[<ffffffff9a346005>] blk_queue_bio+0x19f/0x28c
[<ffffffff9a3448e0>] generic_make_request+0xbd/0x160
[<ffffffff9a344a83>] submit_bio+0x100/0x11d
[<ffffffff9a16c7cb>] ? map_swap_page+0x12/0x14
[<ffffffff9a169835>] ? get_swap_bio+0x57/0x6c
[<ffffffff9a169dd0>] swap_readpage+0x110/0x118
[<ffffffff9a16a39d>] read_swap_cache_async+0x26/0x2d
[<ffffffff9a16a4be>] swapin_readahead+0x11a/0x16a
[<ffffffff9a159167>] do_swap_page+0x9c/0x431
[<ffffffff9a159167>] ? do_swap_page+0x9c/0x431
[<ffffffff9a15b4af>] handle_mm_fault+0xa4d/0xb3d
[<ffffffff9a19ce81>] ? vfs_getattr_nosec+0x26/0x37
[<ffffffff9a050e06>] __do_page_fault+0x267/0x43d
[<ffffffff9a051001>] do_page_fault+0x25/0x27
[<ffffffff9a698c18>] page_fault+0x28/0x30
Mem-Info:
active_anon:532194 inactive_anon:133376 isolated_anon:0
active_file:4118244 inactive_file:382010 isolated_file:0
unevictable:1687 dirty:3502 writeback:386111 unstable:0
slab_reclaimable:41767 slab_unreclaimable:106595
mapped:512496 shmem:582026 pagetables:5352 bounce:0
free:92092 free_pcp:176 free_cma:2072
Node 0 active_anon:2128776kB inactive_anon:533504kB active_file:16472976kB inactive_file:1528040kB unevictable:6748kB isolated(anon):0kB isolated(file):0kB mapped:2049984kB dirty:14008kB writeback:1544444kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2328104kB writeback_tmp:0kB unstable:0kB pages_scanned:1 all_unreclaimable? no
Node 0 DMA free:15884kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 3200 23767 23767 23767
Node 0 DMA32 free:117580kB min:35424kB low:44280kB high:53136kB active_anon:3980kB inactive_anon:400kB active_file:2632672kB inactive_file:286956kB unevictable:0kB writepending:288296kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:41632kB slab_unreclaimable:19512kB kernel_stack:880kB pagetables:676kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 20567 20567 20567
Node 0 Normal free:234904kB min:226544kB low:283180kB high:339816kB active_anon:2124796kB inactive_anon:533104kB active_file:13840304kB inactive_file:1241268kB unevictable:6748kB writepending:1270156kB present:21485568kB managed:21080636kB mlocked:6748kB slab_reclaimable:125436kB slab_unreclaimable:406860kB kernel_stack:12432kB pagetables:20732kB bounce:0kB free_pcp:704kB local_pcp:108kB free_cma:8288kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
Node 0 DMA32: 10970*4kB (UME) 5760*8kB (UME) 1737*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 117752kB
Node 0 Normal: 32844*4kB (UMEHC) 12063*8kB (UMHC) 54*16kB (MHC) 23*32kB (MHC) 14*64kB (MC) 12*128kB (C) 2*256kB (C) 0*512kB 1*1024kB (C) 1*2048kB (C) 0*4096kB = 235496kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
5095318 total pagecache pages
12175 pages in swap cache
Swap cache stats: add 959044, delete 946869, find 485396/573209
Free swap = 14575420kB
Total swap = 15616764kB
6215903 pages RAM
0 pages HighMem/MovableOnly
117646 pages reserved
4096 pages cma reserved
0 pages hwpoisoned

--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

2016-11-21 23:03:36

by Tejun Heo

Subject: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

blkcg allocates some per-cgroup data structures with GFP_NOWAIT and,
when that fails, falls back to operations which aren't specific to the
cgroup. Occasional failures are expected under pressure and falling
back to non-cgroup operation is the right thing to do.

Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
these expected failures end up creating a lot of noise. Add
__GFP_NOWARN.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Marc MERLIN <[email protected]>
Reported-by: Vlastimil Babka <[email protected]>
---
block/blk-cgroup.c | 9 +++++----
block/cfq-iosched.c | 3 ++-
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b08ccbb..8ba0af7 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -185,7 +185,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
}

wb_congested = wb_congested_get_create(&q->backing_dev_info,
- blkcg->css.id, GFP_NOWAIT);
+ blkcg->css.id,
+ GFP_NOWAIT | __GFP_NOWARN);
if (!wb_congested) {
ret = -ENOMEM;
goto err_put_css;
@@ -193,7 +194,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,

/* allocate */
if (!new_blkg) {
- new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT);
+ new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);
if (unlikely(!new_blkg)) {
ret = -ENOMEM;
goto err_put_congested;
@@ -1022,7 +1023,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
}

spin_lock_init(&blkcg->lock);
- INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT);
+ INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
INIT_HLIST_HEAD(&blkcg->blkg_list);
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&blkcg->cgwb_list);
@@ -1240,7 +1241,7 @@ int blkcg_activate_policy(struct request_queue *q,
if (blkg->pd[pol->plid])
continue;

- pd = pol->pd_alloc_fn(GFP_NOWAIT, q->node);
+ pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q->node);
if (!pd)
swap(pd, pd_prealloc);
if (!pd) {
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5e24d88..b4c3b6c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3854,7 +3854,8 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
goto out;
}

- cfqq = kmem_cache_alloc_node(cfq_pool, GFP_NOWAIT | __GFP_ZERO,
+ cfqq = kmem_cache_alloc_node(cfq_pool,
+ GFP_NOWAIT | __GFP_ZERO | __GFP_NOWARN,
cfqd->queue->node);
if (!cfqq) {
cfqq = &cfqd->oom_cfqq;

2016-11-22 15:48:10

by Vlastimil Babka

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On 11/22/2016 12:03 AM, Tejun Heo wrote:
> blkcg allocates some per-cgroup data structures with GFP_NOWAIT and,
> when that fails, falls back to operations which aren't specific to the
> cgroup. Occasional failures are expected under pressure and falling
> back to non-cgroup operation is the right thing to do.
>
> Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
> these expected failures end up creating a lot of noise. Add
> __GFP_NOWARN.

Thanks. Makes me wonder whether we should e.g. add __GFP_NOWARN to
GFP_NOWAIT globally at some point.

> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Marc MERLIN <[email protected]>
> Reported-by: Vlastimil Babka <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> block/blk-cgroup.c | 9 +++++----
> block/cfq-iosched.c | 3 ++-
> 2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index b08ccbb..8ba0af7 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -185,7 +185,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> }
>
> wb_congested = wb_congested_get_create(&q->backing_dev_info,
> - blkcg->css.id, GFP_NOWAIT);
> + blkcg->css.id,
> + GFP_NOWAIT | __GFP_NOWARN);
> if (!wb_congested) {
> ret = -ENOMEM;
> goto err_put_css;
> @@ -193,7 +194,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>
> /* allocate */
> if (!new_blkg) {
> - new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT);
> + new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);
> if (unlikely(!new_blkg)) {
> ret = -ENOMEM;
> goto err_put_congested;
> @@ -1022,7 +1023,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
> }
>
> spin_lock_init(&blkcg->lock);
> - INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT);
> + INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
> INIT_HLIST_HEAD(&blkcg->blkg_list);
> #ifdef CONFIG_CGROUP_WRITEBACK
> INIT_LIST_HEAD(&blkcg->cgwb_list);
> @@ -1240,7 +1241,7 @@ int blkcg_activate_policy(struct request_queue *q,
> if (blkg->pd[pol->plid])
> continue;
>
> - pd = pol->pd_alloc_fn(GFP_NOWAIT, q->node);
> + pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q->node);
> if (!pd)
> swap(pd, pd_prealloc);
> if (!pd) {
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 5e24d88..b4c3b6c 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3854,7 +3854,8 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
> goto out;
> }
>
> - cfqq = kmem_cache_alloc_node(cfq_pool, GFP_NOWAIT | __GFP_ZERO,
> + cfqq = kmem_cache_alloc_node(cfq_pool,
> + GFP_NOWAIT | __GFP_ZERO | __GFP_NOWARN,
> cfqd->queue->node);
> if (!cfqq) {
> cfqq = &cfqd->oom_cfqq;
>

2016-11-22 16:00:41

by Jens Axboe

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On 11/21/2016 04:03 PM, Tejun Heo wrote:
> blkcg allocates some per-cgroup data structures with GFP_NOWAIT and,
> when that fails, falls back to operations which aren't specific to the
> cgroup. Occasional failures are expected under pressure and falling
> back to non-cgroup operation is the right thing to do.
>
> Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
> these expected failures end up creating a lot of noise. Add
> __GFP_NOWARN.

Thanks Tejun, added for 4.10.

--
Jens Axboe

2016-11-22 16:06:39

by Marc MERLIN

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> > > 4.9rc5 however seems to be doing better, and is still running after 18
> > > hours. However, I got a few page allocation failures as per below, but the
> > > system seems to recover.
> > > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> > > or is that good enough, and i should go back to 4.8.8 with that patch applied?
> > > https://marc.info/?l=linux-mm&m=147423605024993
> >
> > Hi, I think it's enough for 4.9 for now and I would appreciate trying
> > 4.8 with that patch, yeah.
>
> So the good news is that it's been running for almost 5H and so far so good.

And the better news is that the copy is still going strong, 4.4TB and
going. So 4.8.8 is fixed with that one single patch as far as I'm
concerned.

So thanks for that, looks good to me to merge.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

2016-11-22 16:14:22

by Vlastimil Babka

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
>> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
>>>> 4.9rc5 however seems to be doing better, and is still running after 18
>>>> hours. However, I got a few page allocation failures as per below, but the
>>>> system seems to recover.
>>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
>>>> or is that good enough, and i should go back to 4.8.8 with that patch applied?
>>>> https://marc.info/?l=linux-mm&m=147423605024993
>>>
>>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
>>> 4.8 with that patch, yeah.
>>
>> So the good news is that it's been running for almost 5H and so far so good.
>
> And the better news is that the copy is still going strong, 4.4TB and
> going. So 4.8.8 is fixed with that one single patch as far as I'm
> concerned.
>
> So thanks for that, looks good to me to merge.

Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
already EOL AFAICS).

- send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
- alternatively a simpler (again 4.8-only) patch that just outright
prevents OOM for 0 < order < costly, as Michal already suggested.
- backport 10+ compaction patches to 4.8 stable
- something else?

Michal? Linus?

[1] https://marc.info/?l=linux-mm&m=147423605024993

> Marc
>

2016-11-22 16:25:49

by Michal Hocko

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue 22-11-16 17:14:02, Vlastimil Babka wrote:
> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> >>>> 4.9rc5 however seems to be doing better, and is still running after 18
> >>>> hours. However, I got a few page allocation failures as per below, but the
> >>>> system seems to recover.
> >>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> >>>> or is that good enough, and i should go back to 4.8.8 with that patch applied?
> >>>> https://marc.info/?l=linux-mm&m=147423605024993
> >>>
> >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> >>> 4.8 with that patch, yeah.
> >>
> >> So the good news is that it's been running for almost 5H and so far so good.
> >
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
> prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?
>
> Michal? Linus?

Dunno. To be honest I do not like [1] because it seriously tweaks the
retry logic. 10+ compaction patches to 4.8 seems too much for a stable
tree and quite risky as well. Considering that 4.9 works just much
better, is there any strong reason to do a 4.8-specific fix at all? Most
users reporting OOM regressions seemed to be ok with what 4.8 does
currently AFAIR. I hate that Marc is not falling into that category, but
is it really a problem for you to run with 4.9? If we have more users
seeing this regression then I would rather go with a simpler 4.8-only
"never trigger OOM for order > 0 && order < costly" patch, because that
would at least have deterministic behavior.

>
> [1] https://marc.info/?l=linux-mm&m=147423605024993

--
Michal Hocko
SUSE Labs

2016-11-22 16:37:55

by Greg Kroah-Hartman

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:
> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> >>>> 4.9rc5 however seems to be doing better, and is still running after 18
> >>>> hours. However, I got a few page allocation failures as per below, but the
> >>>> system seems to recover.
> >>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> >>>> or is that good enough, and i should go back to 4.8.8 with that patch applied?
> >>>> https://marc.info/?l=linux-mm&m=147423605024993
> >>>
> >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> >>> 4.8 with that patch, yeah.
> >>
> >> So the good news is that it's been running for almost 5H and so far so good.
> >
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
> prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?

Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is
released? :)

thanks,

greg k-h

2016-11-22 16:47:21

by Marc MERLIN

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 22, 2016 at 05:25:44PM +0100, Michal Hocko wrote:
> currently AFAIR. I hate that Marc is not falling into that category but
> is it really problem for you to run with 4.9? If we have more users

Don't do anything just on my account. I had a problem, and it's been
fixed in two different ways: 4.8 + patch, or 4.9rc5.

For me this was a 100% regression from 4.6: there was just no way I
could copy my data at all with 4.8. It not only failed, but killed all
the services on my machine until it randomly killed the shell that was
doing the copy.
Personally, I'll stick with 4.8 + this patch, and switch to 4.9 when
it's out (I'm a bit wary of RC kernels on a production server,
especially when I'm in the middle of trying to get my only good backup
to work again).

But at the same time, what I'm doing is probably not common (btrfs on
top of dmcrypt, on top of bcache, on top of swraid5, for both source and
destination), so I can't comment on whether the fix I just put on my 4.8
kernel does not cause other regressions or problems for other people.

Either way, I'm personally ok again now, so I thank you all for your
help, and will leave the hard decisions to you :)

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

2016-11-22 17:01:40

by Tejun Heo

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

Hello,

On Tue, Nov 22, 2016 at 04:47:49PM +0100, Vlastimil Babka wrote:
> Thanks. Makes me wonder whether we should e.g. add __GFP_NOWARN to
> GFP_NOWAIT globally at some point.

Yeah, that makes sense. The caller is explicitly saying that it's
okay to fail the allocation.

Thanks.

--
tejun

2016-11-22 19:39:08

by Linus Torvalds

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka <[email protected]> wrote:
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable.

I think that's the right thing to do. It's pretty small, and the
argument that it changes the oom logic too much is pretty bogus, I
think. The oom logic in 4.8 is simply broken. Let's get it fixed.
Changing it is the point.

Linus

2016-11-22 22:03:28

by Simon Kirby

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:

> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> >>>> 4.9rc5 however seems to be doing better, and is still running after 18
> >>>> hours. However, I got a few page allocation failures as per below, but the
> >>>> system seems to recover.
> >>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> >>>> or is that good enough, and i should go back to 4.8.8 with that patch applied?
> >>>> https://marc.info/?l=linux-mm&m=147423605024993
> >>>
> >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> >>> 4.8 with that patch, yeah.
> >>
> >> So the good news is that it's been running for almost 5H and so far so good.
> >
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
> prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?
>
> Michal? Linus?
>
> [1] https://marc.info/?l=linux-mm&m=147423605024993

Sorry for my molasses rate of feedback. I found a workaround, setting
vm/watermark_scale_factor to 500, and threw that in sysctl. This was on
the MythTV box that OOMs everything after about a day on 4.8 otherwise.

I've been running [1] for 9 days on it (4.8.4 + [1]) without issue, but
just realized I forgot to remove the watermark_scale_factor workaround.
I've restored that now, so I'll see if it becomes unhappy by tomorrow.
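For anyone wanting to reproduce that workaround, it amounts to a single sysctl. A persistent drop-in might look like the following (the file path is just an example):

```
# /etc/sysctl.d/local-oom-workaround.conf (example path)
# Raise the distance between zone watermarks to 5% of memory
# (default is 10, i.e. 0.1%), giving kswapd more headroom before
# allocations hit direct reclaim or the OOM killer.
vm.watermark_scale_factor = 500
```

It can be applied immediately with `sysctl -w vm.watermark_scale_factor=500`.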

I also threw up a few other things you had asked for (vmstat, zoneinfo
before and after the first OOM on 4.8.4): http://0x.ca/sim/ref/4.8.4/
(that was before booting into a rebuild with [1] applied)

Simon-

2016-11-22 22:13:59

by Linus Torvalds

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On Tue, Nov 22, 2016 at 8:48 AM, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, Nov 22, 2016 at 04:47:49PM +0100, Vlastimil Babka wrote:
> > Thanks. Makes me wonder whether we should e.g. add __GFP_NOWARN to
> > GFP_NOWAIT globally at some point.
>
> Yeah, that makes sense. The caller is explicitly saying that it's
> okay to fail the allocation.

I'm not so convinced about the "atomic automatically means you shouldn't warn".

You'd certainly _hope_ that atomic allocations either have fallbacks
or are harmless if they fail, but I'd still rather see that
__GFP_NOWARN just to make that very much explicit.

Because as it is, atomic allocations certainly get to dig deeper into
our memory reserves, but they most definitely can fail, and I
definitely see how some code has no fallback because it thinks that
the deeper reserves mean that it will succeed.

Linus

2016-11-23 06:34:30

by Michal Hocko

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue 22-11-16 11:38:47, Linus Torvalds wrote:
> On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka <[email protected]> wrote:
> >
> > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> > already EOL AFAICS).
> >
> > - send the patch [1] as 4.8-only stable.
>
> I think that's the right thing to do. It's pretty small, and the
> argument that it changes the oom logic too much is pretty bogus, I
> think. The oom logic in 4.8 is simply broken. Let's get it fixed.
> Changing it is the point.

The point I've tried to make is that it is not should_reclaim_retry
which is broken. It's an overly optimistic reliance on compaction
to do its work which led to all those issues. My previous fix
31e49bfda184 ("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") tried to cope with that by checking the order-0
watermark which has proven to help most users. Now it didn't cover
everybody, obviously. Rather than fiddling with fine-tuning of these
heuristics I think it would be safer to simply admit that high-order
OOM detection doesn't work in the 4.8 kernel and so not declare the OOM
killer for those requests at all. The risk of such a change is not big
because there usually are order-0 requests happening all the time so if
we are really OOM we would trigger the OOM eventually.

So I am proposing this for 4.8 stable tree instead
---
commit b2ccdcb731b666aa28f86483656c39c5e53828c7
Author: Michal Hocko <[email protected]>
Date: Wed Nov 23 07:26:30 2016 +0100

mm, oom: stop pre-mature high-order OOM killer invocations

31e49bfda184 ("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
killer invocation for high order requests. It seemed to work for most
users just fine but it is far from bullet proof and obviously not
sufficient for Marc who has reported pre-mature OOM killer invocations
with 4.8 based kernels. 4.9, with all the compaction improvements, seems
to be behaving much better, but that would be too intrusive to backport
to 4.8 stable kernels. Instead this patch simply never declares OOM for
!costly high order requests. We rely on order-0 requests to do that in
case we are really out of memory. Order-0 requests are much more common
and so a risk of a livelock without any way forward is highly unlikely.

Reported-by: Marc MERLIN <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..7401e996009a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
return false;

+#ifdef CONFIG_COMPACTION
+ /*
+ * This is a gross workaround to compensate for the lack of reliable compaction
+ * operation. We cannot simply go OOM with the current state of the compaction
+ * code because this can lead to premature OOM declaration.
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ return true;
+#endif
+
/*
* There are setups with compaction disabled which would prefer to loop
* inside the allocator rather than hit the oom killer prematurely.
--
Michal Hocko
SUSE Labs
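The retry decision the patch produces can be modeled in userspace C. This is a simplified sketch of the decision table with CONFIG_COMPACTION=y, not the kernel code; only the order checks from should_compact_retry() in mm/page_alloc.c are kept.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3   /* value used by the 4.8 kernel */

/*
 * With the stable workaround applied: order-0 and costly (order > 3)
 * requests keep the old behavior and return false, while !costly
 * high-order requests always retry and therefore never reach the OOM
 * killer from this path.
 */
static bool should_compact_retry(unsigned int order)
{
    if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
        return false;

    /* the gross workaround: never give up for 0 < order <= costly */
    return true;
}
```

As the changelog argues, order-0 requests still fall through to false here, so a genuine OOM is eventually declared by them.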

2016-11-23 06:59:09

by Hillf Danton

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Wednesday, November 23, 2016 2:34 PM Michal Hocko wrote:
> @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
> return false;
>
> +#ifdef CONFIG_COMPACTION
> + /*
> + * This is a gross workaround to compensate for the lack of reliable compaction
> + * operation. We cannot simply go OOM with the current state of the compaction
> + * code because this can lead to premature OOM declaration.
> + */
> + if (order <= PAGE_ALLOC_COSTLY_ORDER)

No need to check order once more.
Plus can we retry without CONFIG_COMPACTION enabled?

> + return true;
> +#endif
> +
> /*
> * There are setups with compaction disabled which would prefer to loop
> * inside the allocator rather than hit the oom killer prematurely.
> --
> Michal Hocko
> SUSE Labs
>

2016-11-23 07:00:06

by Michal Hocko

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Wed 23-11-16 14:53:12, Hillf Danton wrote:
> On Wednesday, November 23, 2016 2:34 PM Michal Hocko wrote:
> > @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> > if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
> > return false;
> >
> > +#ifdef CONFIG_COMPACTION
> > + /*
> > + * This is a gross workaround to compensate for the lack of reliable compaction
> > + * operation. We cannot simply go OOM with the current state of the compaction
> > + * code because this can lead to premature OOM declaration.
> > + */
> > + if (order <= PAGE_ALLOC_COSTLY_ORDER)
>
> No need to check order once more.

Yes, a simple return true would be sufficient, but I wanted the code to
be more obvious.

> Plus can we retry without CONFIG_COMPACTION enabled?

Yes, checking the order-0 watermark was the original implementation of
the high-order retry without compaction enabled. I do not remember any
reports for that, so I didn't want to touch that path.
--
Michal Hocko
SUSE Labs

2016-11-23 08:50:46

by Vlastimil Babka

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On 11/22/2016 11:13 PM, Linus Torvalds wrote:
> On Tue, Nov 22, 2016 at 8:48 AM, Tejun Heo <[email protected]> wrote:
>>
>> Hello,
>>
>> On Tue, Nov 22, 2016 at 04:47:49PM +0100, Vlastimil Babka wrote:
>>> Thanks. Makes me wonder whether we should e.g. add __GFP_NOWARN to
>>> GFP_NOWAIT globally at some point.
>>
>> Yeah, that makes sense. The caller is explicitly saying that it's
>> okay to fail the allocation.
>
> I'm not so convinced about the "atomic automatically means you shouldn't warn".

Right, but atomic allocations should be using GFP_ATOMIC, which allows
use of the atomic reserves. I meant here just GFP_NOWAIT, which does not
allow reserves, for allocations that are not in atomic context, but
still don't want to reclaim for performance or whatever reasons, and
have a suitable fallback. It's their choice to not spend any effort on
the allocation and thus they shouldn't spew warnings IMHO.

> You'd certainly _hope_ that atomic allocations either have fallbacks
> or are harmless if they fail, but I'd still rather see that
> __GFP_NOWARN just to make that very much explicit.

A global change to GFP_NOWAIT would of course mean that we should audit
its users (there don't seem to be many), whether they are using it
consciously and should not rather be using GFP_ATOMIC.

Vlastimil

> Because as it is, atomic allocations certainly get to dig deeper into
> our memory reserves, but they most definitely can fail, and I
> definitely see how some code has no fallback because it thinks that
> the deeper reserves mean that it will succeed.
>
> Linus
>

2016-11-23 09:18:41

by Vlastimil Babka

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On 11/23/2016 07:34 AM, Michal Hocko wrote:
> On Tue 22-11-16 11:38:47, Linus Torvalds wrote:
>> On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka <[email protected]> wrote:
>>>
>>> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
>>> already EOL AFAICS).
>>>
>>> - send the patch [1] as 4.8-only stable.
>>
>> I think that's the right thing to do. It's pretty small, and the
>> argument that it changes the oom logic too much is pretty bogus, I
>> think. The oom logic in 4.8 is simply broken. Let's get it fixed.
>> Changing it is the point.
>
> The point I've tried to make is that it is not should_reclaim_retry
> which is broken. It's an overly optimistic reliance on the compaction
> to do its work which led to all those issues. My previous fix
> 31e49bfda184 ("mm, oom: protect !costly allocations some more for
> !CONFIG_COMPACTION") tried to cope with that by checking the order-0
> watermark which has proven to help most users. Now it didn't cover
> everybody obviously. Rather than fiddling with fine tuning of these
> heuristics I think it would be safer to simply admit that high order
> OOM detection doesn't work in 4.8 kernel and so do not declare the OOM
> killer for those requests at all. The risk of such a change is not big
> because there usually are order-0 requests happening all the time so if
> we are really OOM we would trigger the OOM eventually.
>
> So I am proposing this for 4.8 stable tree instead
> ---
> commit b2ccdcb731b666aa28f86483656c39c5e53828c7
> Author: Michal Hocko <[email protected]>
> Date: Wed Nov 23 07:26:30 2016 +0100
>
> mm, oom: stop pre-mature high-order OOM killer invocations
>
> 31e49bfda184 ("mm, oom: protect !costly allocations some more for
> !CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
> killer invocation for high order requests. It seemed to work for most
> users just fine but it is far from bullet proof and obviously not
> sufficient for Marc who has reported pre-mature OOM killer invocations
> with 4.8 based kernels. 4.9, with all the compaction improvements, seems
> to be behaving much better but that would be too intrusive to backport
> to 4.8 stable kernels. Instead this patch simply never declares OOM for
> !costly high order requests. We rely on order-0 requests to do that in
> case we are really out of memory. Order-0 requests are much more common
> and so a risk of a livelock without any way forward is highly unlikely.
>
> Reported-by: Marc MERLIN <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

This should effectively restore the 4.6 logic, so I'm fine with it for
stable, if it passes testing.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a2214c64ed3c..7401e996009a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
> return false;
>
> +#ifdef CONFIG_COMPACTION
> + /*
> + * This is a gross workaround to compensate for the lack of reliable compaction
> + * operation. We cannot simply go OOM with the current state of the compaction
> + * code because this can lead to premature OOM declaration.
> + */
> + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> + return true;
> +#endif
> +
> /*
> * There are setups with compaction disabled which would prefer to loop
> * inside the allocator rather than hit the oom killer prematurely.
>

2016-11-28 07:23:27

by Michal Hocko

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

Marc, could you try this patch please? I think it should be pretty clear
it should help you but running it through your use case would be more
than welcome before I ask Greg to take this to the 4.8 stable tree.

Thanks!

On Wed 23-11-16 07:34:10, Michal Hocko wrote:
[...]
> commit b2ccdcb731b666aa28f86483656c39c5e53828c7
> Author: Michal Hocko <[email protected]>
> Date: Wed Nov 23 07:26:30 2016 +0100
>
> mm, oom: stop pre-mature high-order OOM killer invocations
>
> 31e49bfda184 ("mm, oom: protect !costly allocations some more for
> !CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
> killer invocation for high order requests. It seemed to work for most
> users just fine but it is far from bullet proof and obviously not
> sufficient for Marc who has reported pre-mature OOM killer invocations
> with 4.8 based kernels. 4.9, with all the compaction improvements, seems
> to be behaving much better but that would be too intrusive to backport
> to 4.8 stable kernels. Instead this patch simply never declares OOM for
> !costly high order requests. We rely on order-0 requests to do that in
> case we are really out of memory. Order-0 requests are much more common
> and so a risk of a livelock without any way forward is highly unlikely.
>
> Reported-by: Marc MERLIN <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a2214c64ed3c..7401e996009a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
> return false;
>
> +#ifdef CONFIG_COMPACTION
> + /*
> + * This is a gross workaround to compensate for the lack of reliable compaction
> + * operation. We cannot simply go OOM with the current state of the compaction
> + * code because this can lead to premature OOM declaration.
> + */
> + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> + return true;
> +#endif
> +
> /*
> * There are setups with compaction disabled which would prefer to loop
> * inside the allocator rather than hit the oom killer prematurely.
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs

2016-11-28 08:07:08

by Vlastimil Babka

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On 11/22/2016 10:46 PM, Simon Kirby wrote:
> On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:
>
>> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
>>> On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
>>>> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
>>>>>> 4.9rc5 however seems to be doing better, and is still running after 18
>>>>>> hours. However, I got a few page allocation failures as per below, but the
>>>>>> system seems to recover.
>>>>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
>>>>>> or is that good enough, and i should go back to 4.8.8 with that patch applied?
>>>>>> https://marc.info/?l=linux-mm&m=147423605024993
>>>>>
>>>>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
>>>>> 4.8 with that patch, yeah.
>>>>
>>>> So the good news is that it's been running for almost 5H and so far so good.
>>>
>>> And the better news is that the copy is still going strong, 4.4TB and
>>> going. So 4.8.8 is fixed with that one single patch as far as I'm
>>> concerned.
>>>
>>> So thanks for that, looks good to me to merge.
>>
>> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
>> already EOL AFAICS).
>>
>> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
>> - alternatively a simpler (again 4.8-only) patch that just outright
>> prevents OOM for 0 < order < costly, as Michal already suggested.
>> - backport 10+ compaction patches to 4.8 stable
>> - something else?
>>
>> Michal? Linus?
>>
>> [1] https://marc.info/?l=linux-mm&m=147423605024993
>
> Sorry for my molasses rate of feedback. I found a workaround, setting
> vm/watermark_scale_factor to 500, and threw that in sysctl. This was on
> the MythTV box that OOMs everything after about a day on 4.8 otherwise.
>
> I've been running [1] for 9 days on it (4.8.4 + [1]) without issue, but
> just realized I forgot to remove the watermark_scale_factor workaround.
> I've restored that now, so I'll see if it becomes unhappy by tomorrow.

Thanks for the testing. Could you now try Michal's stable candidate [1]
from this thread please?

[1] http://marc.info/?l=linux-mm&m=147988285831283&w=2

> I also threw up a few other things you had asked for (vmstat, zoneinfo
> before and after the first OOM on 4.8.4): http://0x.ca/sim/ref/4.8.4/
> (that was before booting into a rebuild with [1] applied)
>
> Simon-
>

2016-11-28 17:19:17

by Tejun Heo

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

Hello,

On Wed, Nov 23, 2016 at 09:50:12AM +0100, Vlastimil Babka wrote:
> > You'd certainly _hope_ that atomic allocations either have fallbacks
> > or are harmless if they fail, but I'd still rather see that
> > __GFP_NOWARN just to make that very much explicit.
>
> A global change to GFP_NOWAIT would of course mean that we should audit its
> users (there don't seem to be many), whether they are using it consciously
> and should not rather be using GFP_ATOMIC.

A while ago, I thought about something like, say, GFP_MAYBE, which is a
combination of NOWAIT and NOWARN, but couldn't really come up with
scenarios where one would want to use NOWAIT w/o NOWARN. If an
allocation is important enough to warn the user of its failure, it
better be dipping into the atomic reserve pool; otherwise, it doesn't
make sense to make noise.

Maybe we can come up with a better name which signifies that this is
likely to fail every now and then but I still think it'd be beneficial
to make it quiet by default. Linus, do you still think NOWARN should
be explicit?

Thanks.

--
tejun

2016-11-28 20:55:21

by Marc MERLIN

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> Marc, could you try this patch please? I think it should be pretty clear
> it should help you but running it through your use case would be more
> than welcome before I ask Greg to take this to the 4.8 stable tree.

This will take a little while; the whole copy took 5 days to finish, and
I'm a bit hesitant about blowing it away and starting over :)
Let me see if I can come up with maybe another disk array for another test.

For now, as a reminder, I'm running that attached patch, and it works fine
I'll report back as soon as I can.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/


Attachments:
(No filename) (847.00 B)
4.8.8-mem2.patch (1.00 kB)

2016-11-29 07:25:19

by Michal Hocko

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On Mon 28-11-16 12:19:07, Tejun Heo wrote:
> Hello,
>
> On Wed, Nov 23, 2016 at 09:50:12AM +0100, Vlastimil Babka wrote:
> > > You'd certainly _hope_ that atomic allocations either have fallbacks
> > > or are harmless if they fail, but I'd still rather see that
> > > __GFP_NOWARN just to make that very much explicit.
> >
> > A global change to GFP_NOWAIT would of course mean that we should audit its
> > users (there don't seem to be many), whether they are using it consciously
> > and should not rather be using GFP_ATOMIC.
>
> A while ago, I thought about something like, say, GFP_MAYBE which is
> combination of NOWAIT and NOWARN but couldn't really come up with
> scenarios where one would want to use NOWAIT w/o NOWARN. If an
> allocation is important enough to warn the user of its failure, it
> better be dipping into the atomic reserve pool; otherwise, it doesn't
> make sense to make noise.

I do not think we really need a new flag for that and fully agree that
GFP_NOWAIT warning about failure is rarely, if ever, useful.
Historically we didn't use to distinguish atomic (with access to
reserves) allocations from those which just do not want to trigger
reclaim or to sleep (aka optimistic allocation requests). But this
has changed so I guess we can really do the following
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f8041f9de31e..a53b5187b4da 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -246,7 +246,7 @@ struct vm_area_struct;
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
-#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
+#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM|__GFP_NOWARN)
#define GFP_NOIO (__GFP_RECLAIM)
#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \

this will not catch users who are doing gfp & ~__GFP_DIRECT_RECLAIM but
I would rather not make warn_alloc() even more cluttered with checks.
--
Michal Hocko
SUSE Labs

2016-11-29 15:56:06

by Marc MERLIN

[permalink] [raw]
Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> Marc, could you try this patch please? I think it should be pretty clear
> it should help you but running it through your use case would be more
> than welcome before I ask Greg to take this to the 4.8 stable tree.

I ran it overnight and copied 1.4TB with it before it failed because
there wasn't enough disk space on the other side, so I think it fixes
the problem too.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

2016-11-29 16:08:02

by Michal Hocko

[permalink] [raw]
Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue 29-11-16 07:55:37, Marc MERLIN wrote:
> On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> > Marc, could you try this patch please? I think it should be pretty clear
> > it should help you but running it through your use case would be more
> > than welcome before I ask Greg to take this to the 4.8 stable tree.
>
> I ran it overnight and copied 1.4TB with it before it failed because
> there wasn't enough disk space on the other side, so I think it fixes
> the problem too.

Can I add your Tested-by?

--
Michal Hocko
SUSE Labs

2016-11-29 16:16:04

by Marc MERLIN

[permalink] [raw]
Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> Marc, could you try this patch please? I think it should be pretty clear
> it should help you but running it through your use case would be more
> than welcome before I ask Greg to take this to the 4.8 stable tree.
>
> Thanks!
>
> On Wed 23-11-16 07:34:10, Michal Hocko wrote:
> [...]
> > commit b2ccdcb731b666aa28f86483656c39c5e53828c7
> > Author: Michal Hocko <[email protected]>
> > Date: Wed Nov 23 07:26:30 2016 +0100
> >
> > mm, oom: stop pre-mature high-order OOM killer invocations
> >
> > 31e49bfda184 ("mm, oom: protect !costly allocations some more for
> > !CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
> > killer invocation for high order requests. It seemed to work for most
> > users just fine but it is far from bullet proof and obviously not
> > sufficient for Marc who has reported pre-mature OOM killer invocations
> > with 4.8 based kernels. 4.9 with all the compaction improvements seems
> > to be behaving much better but that would be too intrusive to backport
> > to 4.8 stable kernels. Instead this patch simply never declares OOM for
> > !costly high order requests. We rely on order-0 requests to do that in
> > case we are really out of memory. Order-0 requests are much more common
> > and so a risk of a livelock without any way forward is highly unlikely.
> >
> > Reported-by: Marc MERLIN <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>

Tested-by: Marc MERLIN <[email protected]>

Marc

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a2214c64ed3c..7401e996009a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> > if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
> > return false;
> >
> > +#ifdef CONFIG_COMPACTION
> > + /*
> > + * This is a gross workaround to compensate a lack of reliable compaction
> > + * operation. We cannot simply go OOM with the current state of the compaction
> > + * code because this can lead to pre mature OOM declaration.
> > + */
> > + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > + return true;
> > +#endif
> > +
> > /*
> > * There are setups with compaction disabled which would prefer to loop
> > * inside the allocator rather than hit the oom killer prematurely.
> > --
> > Michal Hocko
> > SUSE Labs
>
> --
> Michal Hocko
> SUSE Labs
>

--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

2016-11-29 16:25:27

by Michal Hocko

[permalink] [raw]
Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue 22-11-16 17:38:01, Greg KH wrote:
> On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:
> > On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> > >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> > >>>> 4.9rc5 however seems to be doing better, and is still running after 18
> > >>>> hours. However, I got a few page allocation failures as per below, but the
> > >>>> system seems to recover.
> > >>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> > >>>> or is that good enough, and i should go back to 4.8.8 with that patch applied?
> > >>>> https://marc.info/?l=linux-mm&m=147423605024993
> > >>>
> > >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> > >>> 4.8 with that patch, yeah.
> > >>
> > >> So the good news is that it's been running for almost 5H and so far so good.
> > >
> > > And the better news is that the copy is still going strong, 4.4TB and
> > > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > > concerned.
> > >
> > > So thanks for that, looks good to me to merge.
> >
> > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> > already EOL AFAICS).
> >
> > - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> > - alternatively a simpler (againm 4.8-only) patch that just outright
> > prevents OOM for 0 < order < costly, as Michal already suggested.
> > - backport 10+ compaction patches to 4.8 stable
> > - something else?
>
> Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is
> released? :)

OK, so can we push this through to 4.8 before EOL and make sure there
won't be any additional pre-mature high order OOM reports? The patch
should be simple enough and safe for the stable tree. There is no
upstream commit because 4.9 is fixed in a different way which would be
way too intrusive for the stable backport.
---
From 02306e8d593fa8a48d620e0c9d63a934ca8366d8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 23 Nov 2016 07:26:30 +0100
Subject: [PATCH] mm, oom: stop pre-mature high-order OOM killer invocations

31e49bfda184 ("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
killer invocation for high order requests. It seemed to work for most
users just fine but it is far from bullet proof and obviously not
sufficient for Marc who has reported pre-mature OOM killer invocations
with 4.8 based kernels. 4.9 with all the compaction improvements seems
to be behaving much better but that would be too intrusive to backport
to 4.8 stable kernels. Instead this patch simply never declares OOM for
!costly high order requests. We rely on order-0 requests to do that in
case we are really out of memory. Order-0 requests are much more common
and so a risk of a livelock without any way forward is highly unlikely.

Reported-by: Marc MERLIN <[email protected]>
Tested-by: Marc MERLIN <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..7401e996009a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
return false;

+#ifdef CONFIG_COMPACTION
+ /*
+ * This is a gross workaround to compensate a lack of reliable compaction
+ * operation. We cannot simply go OOM with the current state of the compaction
+ * code because this can lead to pre mature OOM declaration.
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ return true;
+#endif
+
/*
* There are setups with compaction disabled which would prefer to loop
* inside the allocator rather than hit the oom killer prematurely.
--
2.10.2

--
Michal Hocko
SUSE Labs

2016-11-29 16:34:23

by Marc MERLIN

[permalink] [raw]
Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 29, 2016 at 05:07:51PM +0100, Michal Hocko wrote:
> On Tue 29-11-16 07:55:37, Marc MERLIN wrote:
> > On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> > > Marc, could you try this patch please? I think it should be pretty clear
> > > it should help you but running it through your use case would be more
> > > than welcome before I ask Greg to take this to the 4.8 stable tree.
> >
> > I ran it overnight and copied 1.4TB with it before it failed because
> > there wasn't enough disk space on the other side, so I think it fixes
> > the problem too.
>
> Can I add your Tested-by?

Done.

Now, probably unrelated, but hard to be sure, doing those big copies
causes massive hangs on my system. I hit a few of the 120s hangs,
but more generally lots of things hang, including shells, my DNS server,
monitoring reading from USB and timing out, and so forth.
Examples below.
I have a hard time telling what is at fault, but is there a chance it
might be memory allocation pressure?
I already have a preempt kernel, so I can't make it more preempt than
that.
Now, to be fair, this is not a new problem, it's just varying degrees of
bad and usually only happens when I do a lot of I/O with btrfs.
That said, btrfs may very well just be suffering from memory allocation
issues and hanging as a result, with everything else on my system also
hanging for similar reasons until the memory pressure goes away with the
copy or scrub are finished.

What do you think?

[28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28034.975471] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28035.025429] btrfs D ffff91154d33fc70 0 5618 5372 0x00000080
[28035.047717] ffff91154d33fc70 0000000000200246 ffff911842f880c0 ffff9115a4cf01c0
[28035.071020] ffff91154d33fc58 ffff91154d340000 ffff91165493bca0 ffff9115623773f0
[28035.094252] 0000000000001000 0000000000000001 ffff91154d33fc88 ffffffffb86cf1a6
[28035.117538] Call Trace:
[28035.125791] [<ffffffffb86cf1a6>] schedule+0x8b/0xa3
[28035.141550] [<ffffffffb82bd18e>] btrfs_start_ordered_extent+0xce/0x122
[28035.162457] [<ffffffffb809af6c>] ? wake_up_atomic_t+0x2c/0x2c
[28035.180891] [<ffffffffb82bd434>] btrfs_wait_ordered_range+0xa9/0x10d
[28035.201723] [<ffffffffb82aec04>] btrfs_truncate+0x40/0x24b
[28035.219269] [<ffffffffb82af437>] btrfs_setattr+0x1da/0x2d7
[28035.237032] [<ffffffffb81c7507>] notify_change+0x252/0x39c
[28035.254566] [<ffffffffb81ad35b>] do_truncate+0x81/0xb4
[28035.271057] [<ffffffffb81ad467>] vfs_truncate+0xd9/0xf9
[28035.287782] [<ffffffffb81ad4ea>] do_sys_truncate+0x63/0xa7

I get other hangs like:

[10338.968912] perf: interrupt took too long (3927 > 3917), lowering kernel.perf_event_max_sample_rate to 50750

[12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb stopped: -32

[17761.122238] usb 4-1.4: USB disconnect, device number 39
[17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 rq 6 len 1024 ret -108
[17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
[17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd

[24130.574425] hpet1: lost 2306 rtc interrupts
[24156.034950] hpet1: lost 1628 rtc interrupts
[24173.314738] hpet1: lost 1104 rtc interrupts
[24180.129950] hpet1: lost 436 rtc interrupts
[24257.557955] hpet1: lost 4954 rtc interrupts
[24267.522656] hpet1: lost 637 rtc interrupts

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

2016-11-29 16:38:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On Tue, Nov 29, 2016 at 08:25:07AM +0100, Michal Hocko wrote:
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -246,7 +246,7 @@ struct vm_area_struct;
> #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
> -#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
> +#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM|__GFP_NOWARN)
> #define GFP_NOIO (__GFP_RECLAIM)
> #define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
> #define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
>
> this will not catch users who are doing gfp & ~__GFP_DIRECT_RECLAIM but
> I would rather not make warn_alloc() even more cluttered with checks.

Yeah, FWIW, looks good to me.

--
tejun

2016-11-29 16:43:09

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Patch "mm, oom: stop pre-mature high-order OOM killer invocations" has been added to the 4.8-stable tree


This is a note to let you know that I've just added the patch titled

mm, oom: stop pre-mature high-order OOM killer invocations

to the 4.8-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
mm-oom-stop-pre-mature-high-order-oom-killer-invocations.patch
and it can be found in the queue-4.8 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <[email protected]> know about it.


From [email protected] Tue Nov 29 17:42:17 2016
From: Michal Hocko <[email protected]>
Date: Tue, 29 Nov 2016 17:25:15 +0100
Subject: mm, oom: stop pre-mature high-order OOM killer invocations
To: Greg Kroah-Hartman <[email protected]>, Stable tree <[email protected]>
Cc: Vlastimil Babka <[email protected]>, Marc MERLIN <[email protected]>, [email protected], Linus Torvalds <[email protected]>, LKML <[email protected]>, Joonsoo Kim <[email protected]>, Tejun Heo <[email protected]>
Message-ID: <[email protected]>
Content-Disposition: inline

From: Michal Hocko <[email protected]>

31e49bfda184 ("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
killer invocation for high order requests. It seemed to work for most
users just fine but it is far from bullet proof and obviously not
sufficient for Marc who has reported pre-mature OOM killer invocations
with 4.8 based kernels. 4.9 with all the compaction improvements seems
to be behaving much better but that would be too intrusive to backport
to 4.8 stable kernels. Instead this patch simply never declares OOM for
!costly high order requests. We rely on order-0 requests to do that in
case we are really out of memory. Order-0 requests are much more common
and so a risk of a livelock without any way forward is highly unlikely.

Reported-by: Marc MERLIN <[email protected]>
Tested-by: Marc MERLIN <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 10 ++++++++++
1 file changed, 10 insertions(+)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_contex
if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
return false;

+#ifdef CONFIG_COMPACTION
+ /*
+ * This is a gross workaround to compensate a lack of reliable compaction
+ * operation. We cannot simply go OOM with the current state of the compaction
+ * code because this can lead to pre mature OOM declaration.
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ return true;
+#endif
+
/*
* There are setups with compaction disabled which would prefer to loop
* inside the allocator rather than hit the oom killer prematurely.


Patches currently in stable-queue which might be from [email protected] are

queue-4.8/mm-oom-stop-pre-mature-high-order-oom-killer-invocations.patch

2016-11-29 16:43:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 29, 2016 at 05:25:15PM +0100, Michal Hocko wrote:
> On Tue 22-11-16 17:38:01, Greg KH wrote:
> > On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:
> > > On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > > > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> > > >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> > > >>>> 4.9rc5 however seems to be doing better, and is still running after 18
> > > >>>> hours. However, I got a few page allocation failures as per below, but the
> > > >>>> system seems to recover.
> > > >>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> > > >>>> or is that good enough, and i should go back to 4.8.8 with that patch applied?
> > > >>>> https://marc.info/?l=linux-mm&m=147423605024993
> > > >>>
> > > >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> > > >>> 4.8 with that patch, yeah.
> > > >>
> > > >> So the good news is that it's been running for almost 5H and so far so good.
> > > >
> > > > And the better news is that the copy is still going strong, 4.4TB and
> > > > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > > > concerned.
> > > >
> > > > So thanks for that, looks good to me to merge.
> > >
> > > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> > > already EOL AFAICS).
> > >
> > > - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> > > - alternatively a simpler (again 4.8-only) patch that just outright
> > > prevents OOM for 0 < order < costly, as Michal already suggested.
> > > - backport 10+ compaction patches to 4.8 stable
> > > - something else?
> >
> > Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is
> > released? :)
>
> OK, so can we push this through to 4.8 before EOL and make sure there
> won't be any additional pre-mature high order OOM reports? The patch
> should be simple enough and safe for the stable tree. There is no
> upstream commit because 4.9 is fixed in a different way which would be
> way too intrusive for the stable backport.

Now queued up, thanks!

greg k-h

2016-11-29 16:57:24

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On 11/29/2016 05:38 PM, Tejun Heo wrote:
> On Tue, Nov 29, 2016 at 08:25:07AM +0100, Michal Hocko wrote:
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -246,7 +246,7 @@ struct vm_area_struct;
>> #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
>> #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>> #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
>> -#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
>> +#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM|__GFP_NOWARN)
>> #define GFP_NOIO (__GFP_RECLAIM)
>> #define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
>> #define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
>>
>> this will not catch users who are doing gfp & ~__GFP_DIRECT_RECLAIM but
>> I would rather not make warn_alloc() even more cluttered with checks.
>
> Yeah, FWIW, looks good to me.

Me too. Just don't forget to update the comment describing GFP_NOWAIT, and to
check the existing users: whether a now-duplicate __GFP_NOWARN can be removed,
and whether they really want to be doing GFP_NOWAIT and not GFP_ATOMIC.

Also, dunno what about Tejun's earlier patch.

2016-11-29 17:07:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN <[email protected]> wrote:
> Now, to be fair, this is not a new problem, it's just varying degrees of
> bad and usually only happens when I do a lot of I/O with btrfs.

One situation where I've seen something like this happen is

(a) lots and lots of dirty data queued up
(b) horribly slow storage
(c) filesystem that ends up serializing on writeback under certain
circumstances

The usual case for (b) in the modern world is big SSD's that have bad
worst-case behavior (ie they may do gbps speeds when doing well, and
then they come to a screeching halt when their buffers fill up and
they have to do rewrites, and their gbps throughput drops to mbps or
lower).

Generally you only find that kind of really nasty SSD in the USB stick
world these days.

The usual case for (c) is "fsync" or similar - often on a totally
unrelated file - which then ends up waiting for everything else to
flush too. Looks like btrfs_start_ordered_extent() does something kind
of like that, where it waits for data to be flushed.

The usual *fix* for this is to just not get into situation (a).

Sadly, our defaults for "how much dirty data do we allow" are somewhat
buggered. The global defaults are in "percent of memory", and are
generally _much_ too high for big-memory machines:

[torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio
20
[torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio
10

says that it only starts really throttling writes when you hit 20% of
all memory used. You don't say how much memory you have in that
machine, but if it's the same one you talked about earlier, it was
24GB. So you can have 4GB of dirty data waiting to be flushed out.

And we *try* to do this per-device backing-dev congestion thing to
make things work better, but it generally seems to not work very well.
Possibly because of inconsistent write speeds (ie _sometimes_ the SSD
does really well, and we want to open up, and then it shuts down).

One thing you can try is to just make the global limits much lower. As in

echo 2 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio

(if you want to go lower than 1%, you'll have to use the
"dirty_*ratio_bytes" byte limits instead of percentage limits).

Obviously you'll need to be root for this, and equally obviously it's
really a failure of the kernel. I'd *love* to get something like this
right automatically, but sadly it depends so much on memory size,
load, disk subsystem, etc etc that I despair at it.

On x86-32 we "fixed" this long ago by just saying "high memory is not
dirtyable", so you were always limited to a maximum of 10/20% of 1GB,
rather than the full memory range. It worked better, but it's a sad
kind of fix.

(See commit dc6e29da9162: "Fix balance_dirty_page() calculations with
CONFIG_HIGHMEM")

Linus

2016-11-29 17:13:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On Tue 29-11-16 17:57:08, Vlastimil Babka wrote:
> On 11/29/2016 05:38 PM, Tejun Heo wrote:
> > On Tue, Nov 29, 2016 at 08:25:07AM +0100, Michal Hocko wrote:
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -246,7 +246,7 @@ struct vm_area_struct;
> > > #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> > > #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> > > #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
> > > -#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
> > > +#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM|__GFP_NOWARN)
> > > #define GFP_NOIO (__GFP_RECLAIM)
> > > #define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
> > > #define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
> > >
> > > this will not catch users who are doing gfp & ~__GFP_DIRECT_RECLAIM but
> > > I would rather not make warn_alloc() even more cluttered with checks.
> >
> > Yeah, FWIW, looks good to me.
>
> Me too. Just don't forget to update the comment describing GFP_NOWAIT and
> check the existing users if duplicate __GFP_NOWARN can be removed, and if
> they really want to be doing GFP_NOWAIT and not GFP_ATOMIC.

How does this look?
---
From c6635f7fedec1fc475da0a4be32ea360560cde18 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 29 Nov 2016 18:04:00 +0100
Subject: [PATCH] mm: do not warn about allocation failures for GFP_NOWAIT

Historically we didn't have a distinction between atomic allocation
requests and those which just do not want to sleep for other (e.g.
performance optimization reasons). After d0164adc89f6 ("mm, page_alloc:
distinguish between being unable to sleep, unwilling to sleep and
avoiding waking kswapd") this distinction is clear.

Nevertheless we still have quite some GFP_NOWAIT requests without
__GFP_NOWARN, so the system log will contain scary and rarely useful
allocation failure splats. Those allocations are expected to fail
under memory pressure (especially when kswapd doesn't catch up
with the load). GFP_ATOMIC is different in this regard because it
allows access to part of the memory reserves, which should make it much
less likely to fail, and actual reports could help to tune the system
better - e.g. by increasing the amount of memory reserves for better
performance. This is not true for GFP_NOWAIT though.

This patch simply makes __GFP_NOWARN implicit for all GFP_NOWAIT requests
to silence them all. The now-redundant explicit __GFP_NOWARN was removed
from existing users.

Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
arch/arm/include/asm/tlb.h | 2 +-
arch/ia64/include/asm/tlb.h | 2 +-
arch/s390/mm/pgalloc.c | 2 +-
drivers/dma-buf/reservation.c | 3 +--
drivers/md/bcache/bset.c | 3 +--
drivers/md/bcache/btree.c | 2 +-
fs/fs-writeback.c | 2 +-
include/linux/gfp.h | 5 +++--
kernel/events/uprobes.c | 2 +-
mm/memory.c | 4 ++--
mm/rmap.c | 2 +-
net/ipv4/tcp_cdg.c | 2 +-
net/sunrpc/sched.c | 2 +-
net/sunrpc/xprt.c | 2 +-
net/sunrpc/xprtrdma/transport.c | 2 +-
virt/kvm/async_pf.c | 2 +-
16 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 1e25cd80589e..5b173836df23 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -117,7 +117,7 @@ static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)

static inline void __tlb_alloc_page(struct mmu_gather *tlb)
{
- unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
+ unsigned long addr = __get_free_pages(GFP_NOWAIT, 0);

if (addr) {
tlb->pages = (void *)addr;
diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index 77e541cf0e5d..7dfabd37c993 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -158,7 +158,7 @@ ia64_tlb_flush_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long e

static inline void __tlb_alloc_page(struct mmu_gather *tlb)
{
- unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
+ unsigned long addr = __get_free_pages(GFP_NOWAIT, 0);

if (addr) {
tlb->pages = (void *)addr;
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 995f78532cc2..1e5adc76f7dd 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -340,7 +340,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
tlb->mm->context.flush_mm = 1;
if (*batch == NULL) {
*batch = (struct mmu_table_batch *)
- __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+ __get_free_page(GFP_NOWAIT);
if (*batch == NULL) {
__tlb_flush_mm_lazy(tlb->mm);
tlb_remove_table_one(table);
diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index 9566a62ad8e3..2715258ef1db 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -298,8 +298,7 @@ int reservation_object_get_fences_rcu(struct reservation_object *obj,
struct fence **nshared;
size_t sz = sizeof(*shared) * fobj->shared_max;

- nshared = krealloc(shared, sz,
- GFP_NOWAIT | __GFP_NOWARN);
+ nshared = krealloc(shared, sz, GFP_NOWAIT);
if (!nshared) {
rcu_read_unlock();
nshared = krealloc(shared, sz, GFP_KERNEL);
diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 646fe85261c1..0a9b6a91f5b9 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -1182,8 +1182,7 @@ static void __btree_sort(struct btree_keys *b, struct btree_iter *iter,
{
uint64_t start_time;
bool used_mempool = false;
- struct bset *out = (void *) __get_free_pages(__GFP_NOWARN|GFP_NOWAIT,
- order);
+ struct bset *out = (void *) __get_free_pages(GFP_NOWAIT, order);
if (!out) {
struct page *outp;

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 76f7534d1dd1..54355099dd14 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -419,7 +419,7 @@ static void do_btree_node_write(struct btree *b)
SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) +
bset_sector_offset(&b->keys, i));

- if (!bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
+ if (!bio_alloc_pages(b->bio, GFP_NOWAIT)) {
int j;
struct bio_vec *bv;
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05713a5da083..ae46eebd56be 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -932,7 +932,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
* wakeup the thread for old dirty data writeback
*/
work = kzalloc(sizeof(*work),
- GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ GFP_NOWAIT | __GFP_NOMEMALLOC);
if (!work) {
trace_writeback_nowork(wb);
wb_wakeup(wb);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f8041f9de31e..2579a9abbc38 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -205,7 +205,8 @@ struct vm_area_struct;
* accounted to kmemcg.
*
* GFP_NOWAIT is for kernel allocations that should not stall for direct
- * reclaim, start physical IO or use any filesystem callback.
+ * reclaim, start physical IO or use any filesystem callback. These are
+ * quite likely to fail under memory pressure.
*
* GFP_NOIO will use direct reclaim to discard clean pages or slab pages
* that do not require the starting of any physical IO.
@@ -246,7 +247,7 @@ struct vm_area_struct;
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
-#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
+#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM | __GFP_NOWARN)
#define GFP_NOIO (__GFP_RECLAIM)
#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 19417719fde7..3013df303461 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -732,7 +732,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
* reclaim. This is optimistic, no harm done if it fails.
*/
prev = kmalloc(sizeof(struct map_info),
- GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ GFP_NOWAIT | __GFP_NOMEMALLOC);
if (prev)
prev->next = NULL;
}
diff --git a/mm/memory.c b/mm/memory.c
index 840adc6e05d6..13fa563e90e1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -197,7 +197,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
return false;

- batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
+ batch = (void *)__get_free_pages(GFP_NOWAIT, 0);
if (!batch)
return false;

@@ -383,7 +383,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
}

if (*batch == NULL) {
- *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+ *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT);
if (*batch == NULL) {
tlb_remove_table_one(table);
return;
diff --git a/mm/rmap.c b/mm/rmap.c
index 1ef36404e7b2..004abc7c2cff 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -263,7 +263,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
struct anon_vma *anon_vma;

- avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
+ avc = anon_vma_chain_alloc(GFP_NOWAIT);
if (unlikely(!avc)) {
unlock_anon_vma_root(root);
root = NULL;
diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 03725b294286..e2621acc54e8 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -385,7 +385,7 @@ static void tcp_cdg_init(struct sock *sk)
/* We silently fall back to window = 1 if allocation fails. */
if (window > 1)
ca->gradients = kcalloc(window, sizeof(ca->gradients[0]),
- GFP_NOWAIT | __GFP_NOWARN);
+ GFP_NOWAIT);
ca->rtt_seq = tp->snd_nxt;
ca->shadow_wnd = tp->snd_cwnd;
}
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9ae588511aaf..791fd3fe4995 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -871,7 +871,7 @@ void *rpc_malloc(struct rpc_task *task, size_t size)
gfp_t gfp = GFP_NOIO | __GFP_NOWARN;

if (RPC_IS_SWAPPER(task))
- gfp = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+ gfp = __GFP_MEMALLOC | GFP_NOWAIT;

size += sizeof(struct rpc_buffer);
if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index ea244b29138b..3a4e13282f4e 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1081,7 +1081,7 @@ void xprt_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
list_del(&req->rq_list);
goto out_init_req;
}
- req = xprt_dynamic_alloc_slot(xprt, GFP_NOWAIT|__GFP_NOWARN);
+ req = xprt_dynamic_alloc_slot(xprt, GFP_NOWAIT);
if (!IS_ERR(req))
goto out_init_req;
switch (PTR_ERR(req)) {
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 81f0e879f019..c22ed3b7bf61 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -502,7 +502,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)

flags = RPCRDMA_DEF_GFP;
if (RPC_IS_SWAPPER(task))
- flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+ flags = __GFP_MEMALLOC | GFP_NOWAIT;

if (req->rl_rdmabuf == NULL)
goto out_rdmabuf;
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 8035cc1eb955..99209775818f 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -179,7 +179,7 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
* do alloc nowait since if we are going to sleep anyway we
* may as well sleep faulting in page
*/
- work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT | __GFP_NOWARN);
+ work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
if (!work)
return 0;

--
2.10.2
--
Michal Hocko
SUSE Labs

2016-11-29 17:18:37

by Linus Torvalds

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On Tue, Nov 29, 2016 at 9:13 AM, Michal Hocko <[email protected]> wrote:
> How does this look like?

No.

I *really* want people to write out that "I am ok with the allocation failing".

It's not an "inconvenience". It's a sign that you are competent and
that you know it will fail, and that you can handle it.

If you don't show that you know that, we warn about it.

And no, "GFP_NOWAIT" does *not* mean "I have a good fallback".

Linus

2016-11-29 17:29:04

by Michal Hocko

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On Tue 29-11-16 09:17:37, Linus Torvalds wrote:
> On Tue, Nov 29, 2016 at 9:13 AM, Michal Hocko <[email protected]> wrote:
> > How does this look like?
>
> No.
>
> I *really* want people to write out that "I am ok with the allocation failing".
>
> It's not an "inconvenience". It's a sign that you are competent and
> that you know it will fail, and that you can handle it.
>
> If you don't show that you know that, we warn about it.

How does this warning help those who are watching the logs? What are
they supposed to do about it? Unlike GFP_ATOMIC there is no tuning you
can possibly do.

From my experience people tend to report those and worry about them
(quite often confusing them with the real OOM) and we usually only can
explain that this is nothing to worry about... And so then we sprinkle
GFP_NOWARN all over the place as we hit those. Is this really what we
want?

> And no, "GFP_NOWAIT" does *not* mean "I have a good fallback".

I am confused, how can anybody _rely_ on GFP_NOWAIT to succeed?

--
Michal Hocko
SUSE Labs

2016-11-29 17:40:36

by Marc MERLIN

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

Thanks for the reply and suggestions.

On Tue, Nov 29, 2016 at 09:07:03AM -0800, Linus Torvalds wrote:
> On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN <[email protected]> wrote:
> > Now, to be fair, this is not a new problem, it's just varying degrees of
> > bad and usually only happens when I do a lot of I/O with btrfs.
>
> One situation where I've seen something like this happen is
>
> (a) lots and lots of dirty data queued up
> (b) horribly slow storage

In my case, it is a 5x 4TB HDD with
software raid 5 < bcache < dmcrypt < btrfs
bcache is currently half disabled (as in I removed the actual cache) or
too many bcache requests pile up, and the kernel dies when too many
workqueues have piled up.
I'm just kind of worried that since I'm going through 4 subsystems
before my data can hit disk, that's a lot of memory allocations and
places where data can accumulate and cause bottlenecks if the next
subsystem isn't as fast.

But this shouldn't be "horribly slow", should it? (it does copy a few
terabytes per day, not fast, but not horrible, about 30MB/s or so)

> Sadly, our defaults for "how much dirty data do we allow" are somewhat
> buggered. The global defaults are in "percent of memory", and are
> generally _much_ too high for big-memory machines:
>
> [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio
> 20
> [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio
> 10

I can confirm I have the same.

> says that it only starts really throttling writes when you hit 20% of
> all memory used. You don't say how much memory you have in that
> machine, but if it's the same one you talked about earlier, it was
> 24GB. So you can have 4GB of dirty data waiting to be flushed out.

Correct, 24GB and 4GB.

> And we *try* to do this per-device backing-dev congestion thing to
> make things work better, but it generally seems to not work very well.
> Possibly because of inconsistent write speeds (ie _sometimes_ the SSD
> does really well, and we want to open up, and then it shuts down).
>
> One thing you can try is to just make the global limits much lower. As in
>
> echo 2 > /proc/sys/vm/dirty_ratio
> echo 1 > /proc/sys/vm/dirty_background_ratio

I will give that a shot, thank you.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

2016-11-29 17:48:45

by Linus Torvalds

Subject: Re: [PATCH] block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg

On Tue, Nov 29, 2016 at 9:28 AM, Michal Hocko <[email protected]> wrote:
> How does this warning help those who are watching the logs? What are
> they supposed to do about it? Unlike GFP_ATOMIC there is no tuning you
> can possibly do.

You can report it and it will get fixed.

It's not about tuning. It's about people like Tejun who made changes
and didn't do them right.

In other words, exactly the patch that this whole thread started with.

Except that because of the idiotic arguments about the *obvious*
patch, the patch gets delayed and not applied.

The whole __GFP_NOWARN thing is not some kind of newfangled thing that
suddenly became a problem. It's been there for decades. Why are you
arguing for stupidly removing it now?

> I am confused, how can anybody _rely_ on GFP_NOWAIT to succeed?

You can't (except perhaps during bootup).

BUT YOU HAVE TO HAVE A FALLBACK, AND SHOW THAT YOU ARE *AWARE* THAT
YOU CAN'T RELY ON IT.

Christ. What's so hard to understand about this?

And no, GFP_NOWAIT is not special. Higher orders have the exact same
issue. And they too need that __GFP_NOWARN to show that "yes, I know,
and yes, I have a fallback strategy".

Because that warning shows real bugs. Seriously. We had fix for this
pending for 4.10 already (nfsd just blithely assuming it can do big
allocations).

So stop the idiotic arguments. The whole point is that lots of people
don't think about allocations failing (and NOWAIT and friends do not
change that ONE WHIT), and __GFP_NOWARN is there exactly to show that
you thought about them.

The warning _has_ been useful. We're not hiding it by default, because
that makes the whole warning pointless.

Really.

Linus

2016-11-29 18:01:26

by Linus Torvalds

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 29, 2016 at 9:40 AM, Marc MERLIN <[email protected]> wrote:
>
> In my case, it is a 5x 4TB HDD with
> software raid 5 < bcache < dmcrypt < btrfs

It doesn't sound like the nasty situations I have seen (particularly
with large USB flash storage - often high momentary speed for
benchmarks, but slows down to a crawl after you've written a bit to
it, and doesn't have the smart garbage collection that modern "real"
SSDs have).

But while it doesn't sound like that nasty case, RAID5 will certainly
not help your write speed, and with spinning rust that potentially up
to 4GB (in fact, almost 5GB) of dirty pending data is going to take a
long time to write out if it's not all nice and contiguous (which it
won't be).

And btrfs might be weak on that case - I remember complaining about
fsync stuttering all IO a few years ago, exactly because it would
force-flush everything else too (ie you were doing non-synchronous
writes in one session, and then the browser did a "fsync" on the small
writes it did to the mysql database, and suddenly the browser paused
for ten seconds or more, because the fsync wasn't just waiting for the
small database update, but for _everythinig_ to be written back).

Your backtrace isn't for fsync, but it looks superficially similar:
"wait for write data to flush".

Linus

2016-11-29 23:01:46

by Marc MERLIN

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 29, 2016 at 09:40:19AM -0800, Marc MERLIN wrote:
> Thanks for the reply and suggestions.
>
> On Tue, Nov 29, 2016 at 09:07:03AM -0800, Linus Torvalds wrote:
> > On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN <[email protected]> wrote:
> > > Now, to be fair, this is not a new problem, it's just varying degrees of
> > > bad and usually only happens when I do a lot of I/O with btrfs.
> >
> > One situation where I've seen something like this happen is
> >
> > (a) lots and lots of dirty data queued up
> > (b) horribly slow storage
>
> In my case, it is a 5x 4TB HDD with
> software raid 5 < bcache < dmcrypt < btrfs
> bcache is currently half disabled (as in I removed the actual cache) or
> too many bcache requests pile up, and the kernel dies when too many
> workqueues have piled up.
> I'm just kind of worried that since I'm going through 4 subsystems
> before my data can hit disk, that's a lot of memory allocations and
> places where data can accumulate and cause bottlenecks if the next
> subsystem isn't as fast.
>
> But this shouldn't be "horribly slow", should it? (it does copy a few
> terabytes per day, not fast, but not horrible, about 30MB/s or so)
>
> > Sadly, our defaults for "how much dirty data do we allow" are somewhat
> > buggered. The global defaults are in "percent of memory", and are
> > generally _much_ too high for big-memory machines:
> >
> > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio
> > 20
> > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio
> > 10
>
> I can confirm I have the same.
>
> > says that it only starts really throttling writes when you hit 20% of
> > all memory used. You don't say how much memory you have in that
> > machine, but if it's the same one you talked about earlier, it was
> > 24GB. So you can have 4GB of dirty data waiting to be flushed out.
>
> Correct, 24GB and 4GB.
>
> > And we *try* to do this per-device backing-dev congestion thing to
> > make things work better, but it generally seems to not work very well.
> > Possibly because of inconsistent write speeds (ie _sometimes_ the SSD
> > does really well, and we want to open up, and then it shuts down).
> >
> > One thing you can try is to just make the global limits much lower. As in
> >
> > echo 2 > /proc/sys/vm/dirty_ratio
> > echo 1 > /proc/sys/vm/dirty_background_ratio
>
> I will give that a shot, thank you.

And, after 5H of copying, not a single hang, or USB disconnect, or anything.
Obviously this seems to point to other problems in the code, and I have no
idea which layer is a culprit here, but reducing the buffers absolutely
helped a lot.

Thanks much,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

2016-11-30 13:58:44

by Tetsuo Handa

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On 2016/11/30 8:01, Marc MERLIN wrote:
> And, after 5H of copying, not a single hang, or USB disconnect, or anything.
> Obviously this seems to point to other problems in the code, and I have no
> idea which layer is a culprit here, but reducing the buffers absolutely
> helped a lot.

Maybe you can try commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too long")
or http://lkml.kernel.org/r/1478416501-10104-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
for finding the culprit.

2016-11-30 17:47:30

by Marc MERLIN

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Tue, Nov 29, 2016 at 10:01:10AM -0800, Linus Torvalds wrote:
> On Tue, Nov 29, 2016 at 9:40 AM, Marc MERLIN <[email protected]> wrote:
> >
> > In my case, it is a 5x 4TB HDD with
> > software raid 5 < bcache < dmcrypt < btrfs
>
> It doesn't sound like the nasty situations I have seen (particularly
> with large USB flash storage - often high momentary speed for
> benchmarks, but slows down to a crawl after you've written a bit to
> it, and doesn't have the smart garbage collection that modern "real"
> SSDs have).

I gave it a thought again, I think it is exactly the nasty situation you
described.
bcache takes I/O quickly while sending to SSD cache. SSD fills up, now
bcache can't handle IO as quickly and has to hang until the SSD has been
flushed to spinning rust drives.
This actually is exactly the same as filling up the cache on a USB key
and now you're waiting for slow writes to flash, is it not?

With your dirty ratio workaround, I was able to re-enable bcache and
have it not fall over, but only barely. I recorded over a hundred
workqueues in flight during the copy at some point (just not enough
to actually kill the kernel this time).

I've started a bcache followp on this here:
http://marc.info/?l=linux-bcache&m=148052441423532&w=2
http://marc.info/?l=linux-bcache&m=148052620524162&w=2

This message shows the huge pileup of workqueues in bcache
just before the kernel dies with
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
task: ffff9ee0c2fa4180 task.stack: ffff9ee0c2fa8000
RIP: 0010:[<ffffffffbb57a128>] [<ffffffffbb57a128>] cpuidle_enter_state+0x119/0x171
RSP: 0000:ffff9ee0c2fabea0 EFLAGS: 00000246
RAX: ffff9ee0de3d90c0 RBX: 0000000000000004 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 0000000000000007 RDI: 0000000000000000
RBP: ffff9ee0c2fabed0 R08: 0000000000000f92 R09: 0000000000000f42
R10: ffff9ee0c2fabe50 R11: 071c71c71c71c71c R12: ffffe047bfdcb200
R13: 00000af626899577 R14: 0000000000000004 R15: 00000af6264cc557
FS: 0000000000000000(0000) GS:ffff9ee0de3c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000898b000 CR3: 000000045cc06000 CR4: 00000000001406e0
Stack:
0000000000000f40 ffffe047bfdcb200 ffffffffbbccc060 ffff9ee0c2fac000
ffff9ee0c2fa8000 ffff9ee0c2fac000 ffff9ee0c2fabee0 ffffffffbb57a1ac
ffff9ee0c2fabf30 ffffffffbb09238d ffff9ee0c2fa8000 0000000700000004
Call Trace:
[<ffffffffbb57a1ac>] cpuidle_enter+0x17/0x19
[<ffffffffbb09238d>] cpu_startup_entry+0x210/0x28b
[<ffffffffbb03de22>] start_secondary+0x13e/0x140
Code: 00 00 00 48 c7 c7 cd ae b2 bb c6 05 4b 8e 7a 00 01 e8 17 6c ae ff fa 66 0f 1f 44 00 00 31 ff e8 75 60 b4
44 00 00 <4c> 89 e8 b9 e8 03 00 00 4c 29 f8 48 99 48 f7 f9 ba ff ff ff 7f
Kernel panic - not syncing: Hard LOCKUP

A full traceback showing the pileup of requests is here:
http://marc.info/?l=linux-bcache&m=147949497808483&w=2

and there:
http://pastebin.com/rJ5RKUVm
(2 different ones but mostly the same result)

We can probably follow up on the bcache thread I Cc'ed you on since I'm
not sure if the fault here lies with bcache or the VM subsystem anymore.

Thanks.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

2016-11-30 18:15:28

by Linus Torvalds

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Wed, Nov 30, 2016 at 9:47 AM, Marc MERLIN <[email protected]> wrote:
>
> I gave it a thought again, I think it is exactly the nasty situation you
> described.
> bcache takes I/O quickly while sending to SSD cache. SSD fills up, now
> bcache can't handle IO as quickly and has to hang until the SSD has been
> flushed to spinning rust drives.
> This actually is exactly the same as filling up the cache on a USB key
> and now you're waiting for slow writes to flash, is it not?

It does sound like you might hit exactly the same kind of situation, yes.

And the fact that you have dmcrypt running too just makes things pile
up more. All those IO's end up slowed down by the scheduling too.

Anyway, none of this seems new per se. I'm adding Kent and Jens to the
cc (Tejun already was), in the hope that maybe they have some idea how
to control the nasty worst-case behavior wrt workqueue lockup (it's
not really a "lockup", it looks like it's just hundreds of workqueues
all waiting for IO to complete and much too deep IO queues).

I think it's the traditional "throughput is much easier to measure and
improve" situation, where making queues big help some throughput
situation, but ends up causing chaos when things go south.

And I think your NMI watchdog then turns the "system is no longer
responsive" into an actual kernel panic.

> With your dirty ratio workaround, I was able to re-enable bcache and
> have it not fall over, but only barely. I recorded over a hundred
> workqueues in flight during the copy at some point (just not enough
> to actually kill the kernel this time).
>
> I've started a bcache followp on this here:
> http://marc.info/?l=linux-bcache&m=148052441423532&w=2
> http://marc.info/?l=linux-bcache&m=148052620524162&w=2
>
> A full traceback showing the pileup of requests is here:
> http://marc.info/?l=linux-bcache&m=147949497808483&w=2
>
> and there:
> http://pastebin.com/rJ5RKUVm
> (2 different ones but mostly the same result)

Tejun/Kent - any way to just limit the workqueue depth for bcache?
Because that really isn't helping, and things *will* time out and
cause those problems when you have hundreds of IO's queued on a disk
that likely has a write iops around ~100..

And I really wonder if we should do the "big hammer" approach to the
dirty limits on non-HIGHMEM machines too (approximate the
"vm_highmem_is_dirtyable" by just limiting global_dirtyable_memory()
to 1 GB).

That would make the default dirty limits be 100/200MB (for soft/hard
throttling), which really is much more reasonable than gigabytes and
gigabytes of dirty data.

Of course, no way do we do that during rc7..

Linus


Attachments:
patch.diff (956.00 B)

2016-11-30 18:22:01

by Marc MERLIN

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote:
> Anyway, none of this seems new per se. I'm adding Kent and Jens to the
> cc (Tejun already was), in the hope that maybe they have some idea how
> to control the nasty worst-case behavior wrt workqueue lockup (it's
> not really a "lockup", it looks like it's just hundreds of workqueues
> all waiting for IO to complete and much too deep IO queues).

I'll take your word for it, all I got in the end was
Kernel panic - not syncing: Hard LOCKUP
and the system stone dead when I woke up hours later.

> And I think your NMI watchdog then turns the "system is no longer
> responsive" into an actual kernel panic.

Ah, I see.

Thanks for the reply, and sorry for bringing in that separate thread
from the btrfs mailing list, which effectively was a suggestion similar
to what you're saying here too.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

2016-11-30 18:28:18

by Jens Axboe

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On 11/30/2016 11:14 AM, Linus Torvalds wrote:
> On Wed, Nov 30, 2016 at 9:47 AM, Marc MERLIN <[email protected]> wrote:
>>
>> I gave it a thought again, I think it is exactly the nasty situation you
>> described.
>> bcache takes I/O quickly while sending to SSD cache. SSD fills up, now
>> bcache can't handle IO as quickly and has to hang until the SSD has been
>> flushed to spinning rust drives.
>> This actually is exactly the same as filling up the cache on a USB key
>> and now you're waiting for slow writes to flash, is it not?
>
> It does sound like you might hit exactly the same kind of situation, yes.
>
> And the fact that you have dmcrypt running too just makes things pile
> up more. All those IO's end up slowed down by the scheduling too.
>
> Anyway, none of this seems new per se. I'm adding Kent and Jens to the
> cc (Tejun already was), in the hope that maybe they have some idea how
> to control the nasty worst-case behavior wrt workqueue lockup (it's
> not really a "lockup", it looks like it's just hundreds of workqueues
> all waiting for IO to complete and much too deep IO queues).

Honestly, the easiest would be to wire it up to the blk-wbt stuff that
is queued up for 4.10, which attempts to limit the queue depths to
something reasonable instead of letting them run amok. This is largely
(exclusively, almost) a problem with buffered writeback.

On devices utilizing the stacked interface, they never get any depth
throttling. Obviously it's worse if each IO ends up queueing work, but
it's a big problem even if they do not.

> I think it's the traditional "throughput is much easier to measure and
> improve" situation, where making queues big help some throughput
> situation, but ends up causing chaos when things go south.

Yes, and the longer queues never buy you anything, but they end up
causing tons of problems at the other end of the spectrum.

Still makes sense to limit dirty memory for highmem, though.

--
Jens Axboe

2016-11-30 20:30:21

by Tejun Heo

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

Hello,

On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote:
> Tejun/Kent - any way to just limit the workqueue depth for bcache?
> Because that really isn't helping, and things *will* time out and
> cause those problems when you have hundreds of IO's queued on a disk
> that likely has a write iops around ~100..

Yeah, easily. I'm assuming it's gonna be the bcache_wq allocated in
bcache_init(). It's currently using 0 as @max_active and it can be
set to any arbitrary number. It'd be a very crude way to control
what looks like a buffer bloat with IOs tho. We can make it a bit
more granular by splitting workqueues per bcache instance / purpose
but for the long term the right solution seems to be hooking into
writeback throttling mechanism that block layer just grew recently.

Thanks.

--
tejun

2016-12-01 13:50:21

by Kent Overstreet

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Wed, Nov 30, 2016 at 03:30:11PM -0500, Tejun Heo wrote:
> Hello,
>
> On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote:
> > Tejun/Kent - any way to just limit the workqueue depth for bcache?
> > Because that really isn't helping, and things *will* time out and
> > cause those problems when you have hundreds of IO's queued on a disk
> > that likely has a write iops around ~100..
>
> Yeah, easily. I'm assuming it's gonna be the bcache_wq allocated in
> bcache_init(). It's currently using 0 as @max_active and it can be
> set to any arbitrary number. It'd be a very crude way to control
> what looks like a buffer bloat with IOs tho. We can make it a bit
> more granular by splitting workqueues per bcache instance / purpose
> but for the long term the right solution seems to be hooking into
> writeback throttling mechanism that block layer just grew recently.

Agreed that the writeback code is the right place to do it. Within bcache we
can't really do anything smarter than just throw a hard limit on the number of
outstanding IOs and enforce it by blocking in generic_make_request(), and the
bcache code is the wrong place to do that - we don't know what the limit should
be there, and all the IOs look the same at that point so you'd probably still
end up with writeback starving everything else.

I could futz with the workqueue stuff, but that'd likely as not break some other
workload - I've spent enough time as it is fighting with workqueue concurrency
stuff in the past. My preference would be to just try and get Jens's stuff in.

That said, I'm not sure how I feel about Jens's exact approach... it seems to me
that this can really just live within the writeback code, I don't know why it
should involve the block layer at all. plus, if I understand correctly his code
has the effect of blocking in generic_make_request() to throttle, which means
due to the way the writeback code is structured we'll be blocking with page
locks held. I did my own thing in bcachefs, same idea but throttling in
writepages... it's dumb and simple but it's worked exceedingly well, as far as
actual usability and responsiveness:

https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/fs-io.c?h=bcache-dev&id=acf766b2dd33b076fdce66c86363a3e26a9b70cf#n1002

that said - any kind of throttling for writeback will be a million times better
than the current situation...

2016-12-01 18:16:45

by Linus Torvalds

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Thu, Dec 1, 2016 at 5:50 AM, Kent Overstreet
<[email protected]> wrote:
>
> That said, I'm not sure how I feel about Jens's exact approach... it seems to me
> that this can really just live within the writeback code, I don't know why it
> should involve the block layer at all. plus, if I understand correctly his code
> has the effect of blocking in generic_make_request() to throttle, which means
> due to the way the writeback code is structured we'll be blocking with page
> locks held.

Yeah, I do *not* believe that throttling at the block layer is at all
the right thing to do.

I do think that the block layer needs to throttle, but it needs to be
seen as a "last resort" kind of thing, where the block layer just
needs to limit how much it will have pending. But it should be seen as
a failure mode, not as a write balancing issue.

Because the real throttling absolutely needs to happen when things are
marked dirty, because no block layer throttling will ever fix the
situation where you just have too much memory dirtied that you cannot
free because it will take a minute to write out.

So throttling at a VM level is sane. Throttling at a block layer level is not.

Linus

2016-12-01 18:30:57

by Jens Axboe

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On 12/01/2016 11:16 AM, Linus Torvalds wrote:
> On Thu, Dec 1, 2016 at 5:50 AM, Kent Overstreet
> <[email protected]> wrote:
>>
>> That said, I'm not sure how I feel about Jens's exact approach... it seems to me
>> that this can really just live within the writeback code, I don't know why it
>> should involve the block layer at all. plus, if I understand correctly his code
>> has the effect of blocking in generic_make_request() to throttle, which means
>> due to the way the writeback code is structured we'll be blocking with page
>> locks held.
>
> Yeah, I do *not* believe that throttling at the block layer is at all
> the right thing to do.
>
> I do think that the block layer needs to throttle, but it needs to be
> seen as a "last resort" kind of thing, where the block layer just
> needs to limit how much it will have pending. But it should be seen as
> a failure mode, not as a write balancing issue.
>
> Because the real throttling absolutely needs to happen when things are
> marked dirty, because no block layer throttling will ever fix the
> situation where you just have too much memory dirtied that you cannot
> free because it will take a minute to write out.
>
> So throttling at a VM level is sane. Throttling at a block layer level is not.

It's two different kinds of throttling. The vm absolutely should
throttle at dirty time, to avoid having insane amounts of memory dirty.
On the block layer side, throttling is about avoiding the device queues
being too long. It's very similar to the buffer bloating on the
networking side. The block layer throttling is not a fix for the vm
allowing too much memory to be dirty and causing issues, it's about
keeping the device response latencies in check.

--
Jens Axboe

2016-12-01 18:37:13

by Linus Torvalds

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On Thu, Dec 1, 2016 at 10:30 AM, Jens Axboe <[email protected]> wrote:
>
> It's two different kinds of throttling. The vm absolutely should
> throttle at dirty time, to avoid having insane amounts of memory dirty.
> On the block layer side, throttling is about avoid the device queues
> being too long. It's very similar to the buffer bloating on the
> networking side. The block layer throttling is not a fix for the vm
> allowing too much memory to be dirty and causing issues, it's about
> keeping the device response latencies in check.

Sure. But if we really do just end up blocking in the block layer (in
situations where we didn't used to), that may be a bad thing. It might
be better to feed that information back to the VM instead,
particularly for writes, where the VM layer already tries to ratelimit
the writes.

And frankly, it's almost purely writes that matter. There just aren't
a lot of ways to get that many parallel reads in real life.

I haven't looked at your patches, so maybe you already do this.

Linus

2016-12-01 18:46:25

by Jens Axboe

Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

On 12/01/2016 11:37 AM, Linus Torvalds wrote:
> On Thu, Dec 1, 2016 at 10:30 AM, Jens Axboe <[email protected]> wrote:
>>
>> It's two different kinds of throttling. The vm absolutely should
>> throttle at dirty time, to avoid having insane amounts of memory dirty.
>> On the block layer side, throttling is about avoiding the device queues
>> being too long. It's very similar to the buffer bloating on the
>> networking side. The block layer throttling is not a fix for the vm
>> allowing too much memory to be dirty and causing issues, it's about
>> keeping the device response latencies in check.
>
> Sure. But if we really do just end up blocking in the block layer (in
> situations where we didn't used to), that may be a bad thing. It might
> be better to feed that information back to the VM instead,
> particularly for writes, where the VM layer already tries to ratelimit
> the writes.

It's not a new blocking point, it's the same blocking point that we
always end up in, if we run out of requests. The problem with bcache and
other stacked drivers is that they don't have a request pool, so they
never really need to block there.

> And frankly, it's almost purely writes that matter. There just aren't
> a lot of ways to get that many parallel reads in real life.

Exactly, it's almost exclusively a buffered write problem, as I wrote in
the initial reply. Most other things tend to throttle nicely on their
own.

> I haven't looked at your patches, so maybe you already do this.

It's currently not fed back, but that would be pretty trivial to do. The
mechanism we have for that (queue congestion) is a bit of a mess,
though, so it would need to be revamped a bit.

--
Jens Axboe