FYI, we noticed the below changes on
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
commit d0164adc89f6bb374d304ffcc375c6d2652fe67d ("mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd")
Note: the testing machine is a virtual machine with only 1G memory.
=========================================================================================
compiler/disk/filesize/fs/iterations/kconfig/nr_directories/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
gcc-4.9/1HDD/16MB/xfs/1x/x86_64-rhel/16d/256fpd/32t/debian-x86_64-2015-02-07.cgz/fsyncBeforeClose/vm-vp-1G/60G/fsmark
commit:
016c13daa5c9e4827eca703e2f0621c131f2cca3
d0164adc89f6bb374d304ffcc375c6d2652fe67d
016c13daa5c9e482 d0164adc89f6bb374d304ffcc3
---------------- --------------------------
fail:runs %reproduction fail:runs
| | |
:4 100% 4:4 last_state.fsmark.exit_code.143
:4 50% 2:4 last_state.is_incomplete_run
:4 100% 4:4 dmesg.Mem-Info
:4 100% 4:4 dmesg.page_allocation_failure:order:#,mode
:4 100% 4:4 dmesg.warn_alloc_failed+0x
6327 23% -93.5% 409.00 80% proc-vmstat.allocstall
173495 58% -100.0% 0.00 -1% proc-vmstat.compact_free_scanned
4394 59% -100.0% 0.00 -1% proc-vmstat.compact_isolated
10055 44% -100.0% 0.00 -1% proc-vmstat.compact_migrate_scanned
443.25 13% -99.7% 1.50 100% proc-vmstat.kswapd_high_wmark_hit_quickly
28950 12% -91.4% 2502 81% proc-vmstat.kswapd_low_wmark_hit_quickly
15704144 0% -91.1% 1402050 73% proc-vmstat.nr_dirtied
12851 0% +26.3% 16235 18% proc-vmstat.nr_dirty_background_threshold
25704 0% +26.3% 32471 18% proc-vmstat.nr_dirty_threshold
2882 0% +1130.5% 35463 84% proc-vmstat.nr_free_pages
15693749 0% -91.3% 1365065 75% proc-vmstat.nr_written
16289593 0% -91.0% 1464689 72% proc-vmstat.numa_hit
16289593 0% -91.0% 1464689 72% proc-vmstat.numa_local
30453 12% -91.6% 2552 81% proc-vmstat.pageoutrun
16316641 0% -91.0% 1468330 72% proc-vmstat.pgalloc_dma32
642889 5% -90.5% 61326 56% proc-vmstat.pgfault
16218859 0% -91.6% 1355797 78% proc-vmstat.pgfree
69.25 68% -100.0% 0.00 -1% proc-vmstat.pgmigrate_fail
2066 58% -100.0% 0.00 -1% proc-vmstat.pgmigrate_success
62849613 0% -91.2% 5512004 74% proc-vmstat.pgpgout
417966 16% -80.1% 82999 36% proc-vmstat.pgscan_direct_dma32
15259915 0% -91.5% 1303209 76% proc-vmstat.pgscan_kswapd_dma32
360298 23% -93.5% 23325 82% proc-vmstat.pgsteal_direct_dma32
15224912 0% -91.7% 1270706 79% proc-vmstat.pgsteal_kswapd_dma32
236736 0% -96.1% 9216 100% proc-vmstat.slabs_scanned
108153 0% -98.0% 2154 100% proc-vmstat.workingset_nodereclaim
vm-vp-1G: qemu-system-x86_64 -enable-kvm -cpu Nehalem
Memory: 1G
To reproduce:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Ying Huang
On Thu, Nov 26, 2015 at 08:56:12AM +0800, kernel test robot wrote:
> FYI, we noticed the below changes on
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit d0164adc89f6bb374d304ffcc375c6d2652fe67d ("mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd")
>
> Note: the testing machine is a virtual machine with only 1G memory.
>
I'm not actually seeing any problem here. Is this a positive report or
am I missing something obvious?
--
Mel Gorman
SUSE Labs
On 11/26/2015 08:25 AM, Mel Gorman wrote:
> On Thu, Nov 26, 2015 at 08:56:12AM +0800, kernel test robot wrote:
>> FYI, we noticed the below changes on
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> commit d0164adc89f6bb374d304ffcc375c6d2652fe67d ("mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd")
>>
>> Note: the testing machine is a virtual machine with only 1G memory.
>>
>
> I'm not actually seeing any problem here. Is this a positive report or
> am I missing something obvious?
I've gotten several reports that could be either
positive or negative, but where I am not quite
sure how to interpret the results.
The tool seems to CC the maintainers of the code
that was changed, so I am hoping they will pipe
up when they see a problem.
Of course, that doesn't help in this case :)
Hi, Mel,
Mel Gorman <[email protected]> writes:
> On Thu, Nov 26, 2015 at 08:56:12AM +0800, kernel test robot wrote:
>> FYI, we noticed the below changes on
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> commit d0164adc89f6bb374d304ffcc375c6d2652fe67d ("mm, page_alloc:
>> distinguish between being unable to sleep, unwilling to sleep and
>> avoiding waking kswapd")
>>
>> Note: the testing machine is a virtual machine with only 1G memory.
>>
>
> I'm not actually seeing any problem here. Is this a positive report or
> am I missing something obvious?
Sorry, the email subject is generated automatically and I forgot to
change it to something meaningful before sending it out. From the testing
results, we found that the commit increases the probability of OOM from
0% to 100% on this machine with its small memory. I also added the
proc-vmstat data to help diagnose it.
Best Regards,
Huang, Ying
On Fri, Nov 27, 2015 at 09:14:52AM +0800, Huang, Ying wrote:
> Hi, Mel,
>
> Mel Gorman <[email protected]> writes:
>
> > On Thu, Nov 26, 2015 at 08:56:12AM +0800, kernel test robot wrote:
> >> FYI, we noticed the below changes on
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >> commit d0164adc89f6bb374d304ffcc375c6d2652fe67d ("mm, page_alloc:
> >> distinguish between being unable to sleep, unwilling to sleep and
> >> avoiding waking kswapd")
> >>
> >> Note: the testing machine is a virtual machine with only 1G memory.
> >>
> >
> > I'm not actually seeing any problem here. Is this a positive report or
> > am I missing something obvious?
>
> Sorry, the email subject is generated automatically and I forgot to
> change it to something meaningful before sending it out. From the testing
> results, we found that the commit increases the probability of OOM from
> 0% to 100% on this machine with its small memory. I also added the
> proc-vmstat data to help diagnose it.
>
There is no reference to an OOM possibility in the email that I can see. Can
you give examples of the OOM messages that show the problem sites? It was
suspected that there may be some callers that were accidentally depending
on access to emergency reserves. If so, either they need to be fixed (if
the case is extremely rare) or a small reserve will have to be created
for callers that are not high priority but still cannot reclaim.
Note that I'm travelling a lot over the next two weeks so I'll be slow to
respond but I will get to it.
--
Mel Gorman
SUSE Labs
Mel Gorman <[email protected]> writes:
> On Fri, Nov 27, 2015 at 09:14:52AM +0800, Huang, Ying wrote:
>> Hi, Mel,
>>
>> Mel Gorman <[email protected]> writes:
>>
>> > On Thu, Nov 26, 2015 at 08:56:12AM +0800, kernel test robot wrote:
>> >> FYI, we noticed the below changes on
>> >>
>> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> >> commit d0164adc89f6bb374d304ffcc375c6d2652fe67d ("mm, page_alloc:
>> >> distinguish between being unable to sleep, unwilling to sleep and
>> >> avoiding waking kswapd")
>> >>
>> >> Note: the testing machine is a virtual machine with only 1G memory.
>> >>
>> >
>> > I'm not actually seeing any problem here. Is this a positive report or
>> > am I missing something obvious?
>>
>> Sorry, the email subject is generated automatically and I forgot to
>> change it to something meaningful before sending it out. From the testing
>> results, we found that the commit increases the probability of OOM from
>> 0% to 100% on this machine with its small memory. I also added the
>> proc-vmstat data to help diagnose it.
>>
>
> There is no reference to an OOM possibility in the email that I can see. Can
> you give examples of the OOM messages that show the problem sites? It was
> suspected that there may be some callers that were accidentally depending
> on access to emergency reserves. If so, either they need to be fixed (if
> the case is extremely rare) or a small reserve will have to be created
> for callers that are not high priority but still cannot reclaim.
>
> Note that I'm travelling a lot over the next two weeks so I'll be slow to
> respond but I will get to it.
Here is the kernel log; the full dmesg is attached too. The OOM
occurs during fsmark testing.
Best Regards,
Huang, Ying
[ 31.453514] kworker/u4:0: page allocation failure: order:0, mode:0x2200000
[ 31.463570] CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
[ 31.466115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 31.477146] Workqueue: writeback wb_workfn (flush-253:0)
[ 31.481450] 0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
[ 31.492582] ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
[ 31.507631] ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000
[ 31.510568] Call Trace:
[ 31.511828] [<ffffffff8140a142>] dump_stack+0x4b/0x69
[ 31.513391] [<ffffffff8117117b>] warn_alloc_failed+0xdb/0x140
[ 31.523163] [<ffffffff81174ec4>] __alloc_pages_nodemask+0x874/0xa60
[ 31.524949] [<ffffffff811bcb62>] alloc_pages_current+0x92/0x120
[ 31.526659] [<ffffffff811c73e4>] new_slab+0x3d4/0x480
[ 31.536134] [<ffffffff811c7c36>] __slab_alloc+0x376/0x470
[ 31.537541] [<ffffffff814e0ced>] ? alloc_indirect+0x1d/0x50
[ 31.543268] [<ffffffff81338221>] ? xfs_submit_ioend_bio+0x31/0x40
[ 31.545104] [<ffffffff814e0ced>] ? alloc_indirect+0x1d/0x50
[ 31.546982] [<ffffffff811c8e8d>] __kmalloc+0x20d/0x260
[ 31.548334] [<ffffffff814e0ced>] alloc_indirect+0x1d/0x50
[ 31.549805] [<ffffffff814e0fec>] virtqueue_add_sgs+0x2cc/0x3a0
[ 31.555396] [<ffffffff81573a30>] __virtblk_add_req+0xb0/0x1f0
[ 31.556846] [<ffffffff8117a121>] ? pagevec_lookup_tag+0x21/0x30
[ 31.558318] [<ffffffff813e5d72>] ? blk_rq_map_sg+0x1e2/0x4f0
[ 31.563880] [<ffffffff81573c82>] virtio_queue_rq+0x112/0x280
[ 31.565307] [<ffffffff813e9de7>] __blk_mq_run_hw_queue+0x1d7/0x370
[ 31.571005] [<ffffffff813e9bef>] blk_mq_run_hw_queue+0x9f/0xc0
[ 31.572472] [<ffffffff813eb10a>] blk_mq_insert_requests+0xfa/0x1a0
[ 31.573982] [<ffffffff813ebdb3>] blk_mq_flush_plug_list+0x123/0x140
[ 31.583686] [<ffffffff813e1777>] blk_flush_plug_list+0xa7/0x200
[ 31.585138] [<ffffffff813e1c49>] blk_finish_plug+0x29/0x40
[ 31.586542] [<ffffffff81215f85>] wb_writeback+0x185/0x2c0
[ 31.592429] [<ffffffff812166a5>] wb_workfn+0xf5/0x390
[ 31.594037] [<ffffffff81091297>] process_one_work+0x157/0x420
[ 31.599804] [<ffffffff81091ef9>] worker_thread+0x69/0x4a0
[ 31.601484] [<ffffffff81091e90>] ? rescuer_thread+0x380/0x380
[ 31.611368] [<ffffffff8109746f>] kthread+0xef/0x110
[ 31.612953] [<ffffffff81097380>] ? kthread_park+0x60/0x60
[ 31.619418] [<ffffffff818bce8f>] ret_from_fork+0x3f/0x70
[ 31.621221] [<ffffffff81097380>] ? kthread_park+0x60/0x60
[ 31.635226] Mem-Info:
[ 31.636569] active_anon:4942 inactive_anon:1643 isolated_anon:0
[ 31.636569] active_file:23196 inactive_file:110131 isolated_file:251
[ 31.636569] unevictable:92329 dirty:2865 writeback:1925 unstable:0
[ 31.636569] slab_reclaimable:10588 slab_unreclaimable:3390
[ 31.636569] mapped:2848 shmem:1687 pagetables:876 bounce:0
[ 31.636569] free:1932 free_pcp:218 free_cma:0
[ 31.667096] Node 0 DMA free:3948kB min:60kB low:72kB high:88kB active_anon:264kB inactive_anon:128kB active_file:1544kB inactive_file:5296kB unevictable:3136kB isolated(anon):0kB isolated(file):236kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:440kB shmem:128kB slab_reclaimable:588kB slab_unreclaimable:304kB kernel_stack:112kB pagetables:80kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3376 all_unreclaimable? no
[ 31.708140] lowmem_reserve[]: 0 972 972 972
[ 31.710104] Node 0 DMA32 free:3780kB min:3824kB low:4780kB high:5736kB active_anon:19504kB inactive_anon:6444kB active_file:91240kB inactive_file:435228kB unevictable:366180kB isolated(anon):0kB isolated(file):768kB present:1032064kB managed:997532kB mlocked:0kB dirty:11460kB writeback:7700kB mapped:10952kB shmem:6620kB slab_reclaimable:41764kB slab_unreclaimable:13256kB kernel_stack:2752kB pagetables:3424kB unstable:0kB bounce:0kB free_pcp:872kB local_pcp:232kB free_cma:0kB writeback_tmp:0kB pages_scanned:140404 all_unreclaimable? no
[ 31.743737] lowmem_reserve[]: 0 0 0 0
[ 31.745320] Node 0 DMA: 7*4kB (UME) 2*8kB (UM) 2*16kB (ME) 1*32kB (E) 0*64kB 2*128kB (ME) 2*256kB (ME) 2*512kB (UM) 2*1024kB (ME) 0*2048kB 0*4096kB = 3948kB
[ 31.757513] Node 0 DMA32: 1*4kB (U) 0*8kB 4*16kB (UME) 3*32kB (UE) 3*64kB (UM) 1*128kB (U) 1*256kB (U) 0*512kB 3*1024kB (UME) 0*2048kB 0*4096kB = 3812kB
[ 31.766470] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 31.772953] 227608 total pagecache pages
[ 31.774127] 0 pages in swap cache
[ 31.775428] Swap cache stats: add 0, delete 0, find 0/0
[ 31.776785] Free swap = 0kB
[ 31.777799] Total swap = 0kB
[ 31.779569] 262014 pages RAM
[ 31.780584] 0 pages HighMem/MovableOnly
[ 31.781744] 8654 pages reserved
[ 31.790944] 0 pages hwpoisoned
[ 31.792008] SLUB: Unable to allocate memory on node -1 (gfp=0x2080000)
[ 31.793537] cache: kmalloc-128, object size: 128, buffer size: 128, default order: 0, min order: 0
[ 31.796088] node 0: slabs: 27, objs: 864, free: 0
[Let's CC Will - see the question at the end of the email please]
This seems to be similar to the allocation failure reported at
http://lkml.kernel.org/r/87oafjpnb1.fsf%40yhuang-dev.intel.com
where I failed to see the important point; more on that below.
On Mon 30-11-15 10:14:24, Huang, Ying wrote:
> Mel Gorman <[email protected]> writes:
>
> > On Fri, Nov 27, 2015 at 09:14:52AM +0800, Huang, Ying wrote:
> >> Hi, Mel,
> >>
> >> Mel Gorman <[email protected]> writes:
> >>
> >> > On Thu, Nov 26, 2015 at 08:56:12AM +0800, kernel test robot wrote:
> >> >> FYI, we noticed the below changes on
> >> >>
> >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >> >> commit d0164adc89f6bb374d304ffcc375c6d2652fe67d ("mm, page_alloc:
> >> >> distinguish between being unable to sleep, unwilling to sleep and
> >> >> avoiding waking kswapd")
> >> >>
> >> >> Note: the testing machine is a virtual machine with only 1G memory.
> >> >>
> >> >
> >> > I'm not actually seeing any problem here. Is this a positive report or
> >> > am I missing something obvious?
> >>
> >> Sorry, the email subject is generated automatically and I forgot to
> >> change it to something meaningful before sending it out. From the testing
> >> results, we found that the commit increases the probability of OOM from
> >> 0% to 100% on this machine with its small memory. I also added the
> >> proc-vmstat data to help diagnose it.
> >>
> >
> > There is no reference to an OOM possibility in the email that I can see. Can
> > you give examples of the OOM messages that show the problem sites? It was
> > suspected that there may be some callers that were accidentally depending
> > on access to emergency reserves. If so, either they need to be fixed (if
> > the case is extremely rare) or a small reserve will have to be created
> > for callers that are not high priority but still cannot reclaim.
__virtblk_add_req calls
  virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC)
    alloc_indirect(gfp)
      gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH)
So this is a true __GFP_ATOMIC allocation; we just drop __GFP_HIGH so it
doesn't get access to more reserves. It still does ALLOC_HARDER. So I think
the real issue is somewhere else: something that should have triggered
kswapd no longer does. I tried to find that offender last time but didn't
manage to find one.
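To make the masking concrete, here is a minimal userspace sketch of the
flag arithmetic (the bit values below are illustrative stand-ins, not the
kernel's real ones; only set membership matters):

#include <stdio.h>

/* stand-in values; after d0164adc89f6, GFP_ATOMIC is
 * __GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM */
#define __GFP_HIGHMEM        0x01u
#define __GFP_HIGH           0x02u
#define __GFP_ATOMIC         0x04u
#define __GFP_KSWAPD_RECLAIM 0x08u
#define GFP_ATOMIC (__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)

int main(void)
{
	/* what alloc_indirect() does to the caller's GFP_ATOMIC */
	unsigned int gfp = GFP_ATOMIC & ~(__GFP_HIGHMEM | __GFP_HIGH);

	/* __GFP_ATOMIC survives, so ALLOC_HARDER still applies and
	 * kswapd is still woken; only __GFP_HIGH reserve access is lost */
	printf("atomic=%d kswapd=%d high=%d\n",
	       !!(gfp & __GFP_ATOMIC),
	       !!(gfp & __GFP_KSWAPD_RECLAIM),
	       !!(gfp & __GFP_HIGH));
	return 0;
}

This prints atomic=1 kswapd=1 high=0, which is the point above.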
Btw. I completely fail to see why b92b1b89a33c ("virtio: force vring
descriptors to be allocated from lowmem") had to clear __GFP_HIGH. Will,
do you remember why you dropped that flag as well?
Also, I cannot find any user of alloc_indirect that would pass
__GFP_HIGHMEM. All of them use either GFP_KERNEL or GFP_ATOMIC. So
either I am missing something or this masking is not really needed. Maybe
the situation was different back in 2012.
--
Michal Hocko
SUSE Labs
Hi Michal,
On Mon, Nov 30, 2015 at 02:02:00PM +0100, Michal Hocko wrote:
> [Let's CC Will - see the question at the end of the email please]
[...]
> > > There is no reference to an OOM possibility in the email that I can see. Can
> > > you give examples of the OOM messages that show the problem sites? It was
> > > suspected that there may be some callers that were accidentally depending
> > > on access to emergency reserves. If so, either they need to be fixed (if
> > > the case is extremely rare) or a small reserve will have to be created
> > > for callers that are not high priority but still cannot reclaim.
>
> __virtblk_add_req calls
>   virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC)
>     alloc_indirect(gfp)
>       gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH)
>
> So this is a true __GFP_ATOMIC allocation; we just drop __GFP_HIGH so it
> doesn't get access to more reserves. It still does ALLOC_HARDER. So I think
> the real issue is somewhere else: something that should have triggered
> kswapd no longer does. I tried to find that offender last time but didn't
> manage to find one.
>
> Btw. I completely fail to see why b92b1b89a33c ("virtio: force vring
> descriptors to be allocated from lowmem") had to clear __GFP_HIGH. Will,
> do you remember why you dropped that flag as well?
Right, that looks unnecessary, but it could be that we were masking a
bug somewhere else.
> Also, I cannot find any user of alloc_indirect that would pass
> __GFP_HIGHMEM. All of them use either GFP_KERNEL or GFP_ATOMIC. So
> either I am missing something or this masking is not really needed. Maybe
> the situation was different back in 2012.
I tried to revisit the thread leading to that patch, but it doesn't make
a whole lot of sense:
https://lkml.org/lkml/2012/10/17/143
I certainly remember debugging the failure (i.e. it wasn't theoretical),
and we were ending up with highmem addresses being passed in the virtio
ring (due to the zero-copy stuff in 9p) and also for the descriptors
themselves. The discussion at the time makes it sound like GFP_ATOMIC
was giving us those...
Will
On Tue 01-12-15 12:23:41, Will Deacon wrote:
> Hi Michal,
>
> On Mon, Nov 30, 2015 at 02:02:00PM +0100, Michal Hocko wrote:
> > [Let's CC Will - see the question at the end of the email please]
>
> [...]
>
> > > > There is no reference to an OOM possibility in the email that I can see. Can
> > > > you give examples of the OOM messages that show the problem sites? It was
> > > > suspected that there may be some callers that were accidentally depending
> > > > on access to emergency reserves. If so, either they need to be fixed (if
> > > > the case is extremely rare) or a small reserve will have to be created
> > > > for callers that are not high priority but still cannot reclaim.
> >
> > __virtblk_add_req calls
> >   virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC)
> >     alloc_indirect(gfp)
> >       gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH)
> >
> > So this is a true __GFP_ATOMIC allocation; we just drop __GFP_HIGH so it
> > doesn't get access to more reserves. It still does ALLOC_HARDER. So I think
> > the real issue is somewhere else: something that should have triggered
> > kswapd no longer does. I tried to find that offender last time but didn't
> > manage to find one.
> >
> > Btw. I completely fail to see why b92b1b89a33c ("virtio: force vring
> > descriptors to be allocated from lowmem") had to clear __GFP_HIGH. Will,
> > do you remember why you dropped that flag as well?
>
> Right, that looks unnecessary, but it could be that we were masking a
> bug somewhere else.
OK, I will send a patch to remove __GFP_HIGH because it is clearly
misleading and doesn't have anything to do with the highmem zone.
> > Also, I cannot find any user of alloc_indirect that would pass
> > __GFP_HIGHMEM. All of them use either GFP_KERNEL or GFP_ATOMIC. So
> > either I am missing something or this masking is not really needed. Maybe
> > the situation was different back in 2012.
>
> I tried to revisit the thread leading to that patch, but it doesn't make
> a whole lot of sense:
>
> https://lkml.org/lkml/2012/10/17/143
>
> I certainly remember debugging the failure (i.e. it wasn't theoretical),
> and we were ending up with highmem addresses being passed in the virtio
> ring (due to the zero-copy stuff in 9p) and also for the descriptors
> themselves. The discussion at the time makes it sound like GFP_ATOMIC
> was giving us those...
Hmm, unless I am missing something, neither GFP_ATOMIC nor GFP_KERNEL can
fall back to the highmem zone - see GFP_ZONE_TABLE. Maybe the highmem
pointer got there from a different path than alloc_indirect?
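As a quick illustration of the zone selection (a simplified userspace
sketch; the real gfp_zone() consults GFP_ZONE_TABLE and also handles
DMA/DMA32/MOVABLE, and the flag values here are stand-ins):

#include <stdio.h>

#define __GFP_HIGHMEM 0x01u
#define GFP_KERNEL    0x10u  /* sets no zone modifier bits */
#define GFP_ATOMIC    0x20u  /* sets no zone modifier bits */

/* simplified gfp_zone(): highmem is only selected when the caller
 * explicitly passes __GFP_HIGHMEM */
static const char *zone_for(unsigned int gfp)
{
	return (gfp & __GFP_HIGHMEM) ? "ZONE_HIGHMEM" : "ZONE_NORMAL";
}

int main(void)
{
	printf("GFP_KERNEL               -> %s\n", zone_for(GFP_KERNEL));
	printf("GFP_ATOMIC               -> %s\n", zone_for(GFP_ATOMIC));
	printf("GFP_KERNEL|__GFP_HIGHMEM -> %s\n",
	       zone_for(GFP_KERNEL | __GFP_HIGHMEM));
	return 0;
}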
--
Michal Hocko
SUSE Labs
On Mon, Nov 30, 2015 at 10:14:24AM +0800, Huang, Ying wrote:
> > There is no reference to an OOM possibility in the email that I can see. Can
> > you give examples of the OOM messages that show the problem sites? It was
> > suspected that there may be some callers that were accidentally depending
> > on access to emergency reserves. If so, either they need to be fixed (if
> > the case is extremely rare) or a small reserve will have to be created
> > for callers that are not high priority but still cannot reclaim.
> >
> > Note that I'm travelling a lot over the next two weeks so I'll be slow to
> > respond but I will get to it.
>
> Here is the kernel log; the full dmesg is attached too. The OOM
> occurs during fsmark testing.
>
> Best Regards,
> Huang, Ying
>
> [ 31.453514] kworker/u4:0: page allocation failure: order:0, mode:0x2200000
> [ 31.463570] CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
> [ 31.466115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [ 31.477146] Workqueue: writeback wb_workfn (flush-253:0)
> [ 31.481450] 0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
> [ 31.492582] ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
> [ 31.507631] ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000
This is an allocation failure, not a triggering of the OOM killer, so the
severity is reduced, but it still looks like a bug in the driver. Looking
at the history and the discussion, it appears to me that __GFP_HIGH was
cleared from the allocation site by accident. I strongly suspect that Will
Deacon thought __GFP_HIGH was related to highmem instead of being related
to high priority. Will, can you review the following patch please? Ying,
can you test please?
---8<---
virtio: allow vring descriptor allocations to use high-priority reserves
Commit b92b1b89a33c ("virtio: force vring descriptors to be allocated
from lowmem") prevented the inappropriate use of highmem pages but it
also masked out __GFP_HIGH. __GFP_HIGH is used for GFP_ATOMIC allocation
requests to grant access to a small emergency reserve. It's intended for
use by callers that have no alternative.
Ying Huang reported the following page allocation failure warning after
commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to
sleep, unwilling to sleep and avoiding waking kswapd")
kworker/u4:0: page allocation failure: order:0, mode:0x2200000
CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
Workqueue: writeback wb_workfn (flush-253:0)
0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000
Call Trace:
[<ffffffff8140a142>] dump_stack+0x4b/0x69
[<ffffffff8117117b>] warn_alloc_failed+0xdb/0x140
[<ffffffff81174ec4>] __alloc_pages_nodemask+0x874/0xa60
[<ffffffff811bcb62>] alloc_pages_current+0x92/0x120
[<ffffffff811c73e4>] new_slab+0x3d4/0x480
[<ffffffff811c7c36>] __slab_alloc+0x376/0x470
[<ffffffff814e0ced>] ? alloc_indirect+0x1d/0x50
[<ffffffff81338221>] ? xfs_submit_ioend_bio+0x31/0x40
[<ffffffff814e0ced>] ? alloc_indirect+0x1d/0x50
[<ffffffff811c8e8d>] __kmalloc+0x20d/0x260
[<ffffffff814e0ced>] alloc_indirect+0x1d/0x50
[<ffffffff814e0fec>] virtqueue_add_sgs+0x2cc/0x3a0
[<ffffffff81573a30>] __virtblk_add_req+0xb0/0x1f0
[<ffffffff8117a121>] ? pagevec_lookup_tag+0x21/0x30
[<ffffffff813e5d72>] ? blk_rq_map_sg+0x1e2/0x4f0
[<ffffffff81573c82>] virtio_queue_rq+0x112/0x280
[<ffffffff813e9de7>] __blk_mq_run_hw_queue+0x1d7/0x370
[<ffffffff813e9bef>] blk_mq_run_hw_queue+0x9f/0xc0
[<ffffffff813eb10a>] blk_mq_insert_requests+0xfa/0x1a0
[<ffffffff813ebdb3>] blk_mq_flush_plug_list+0x123/0x140
[<ffffffff813e1777>] blk_flush_plug_list+0xa7/0x200
[<ffffffff813e1c49>] blk_finish_plug+0x29/0x40
[<ffffffff81215f85>] wb_writeback+0x185/0x2c0
[<ffffffff812166a5>] wb_workfn+0xf5/0x390
[<ffffffff81091297>] process_one_work+0x157/0x420
[<ffffffff81091ef9>] worker_thread+0x69/0x4a0
[<ffffffff81091e90>] ? rescuer_thread+0x380/0x380
[<ffffffff8109746f>] kthread+0xef/0x110
[<ffffffff81097380>] ? kthread_park+0x60/0x60
[<ffffffff818bce8f>] ret_from_fork+0x3f/0x70
[<ffffffff81097380>] ? kthread_park+0x60/0x60
Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to
sleep, unwilling to sleep and avoiding waking kswapd") is stricter about
reserves. It distinguishes between callers that are high-priority with
access to emergency reserves and callers that simply do not want to sleep
and have recovery options. The reported allocation failure is from a truly
atomic site with no recovery options, one that appears to have had __GFP_HIGH
cleared by mistake for reasons unrelated to highmem. This patch restores the
flag.
Signed-off-by: Mel Gorman <[email protected]>
---
drivers/virtio/virtio_ring.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..f9e119e6df18 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -107,9 +107,10 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
/*
* We require lowmem mappings for the descriptors because
* otherwise virt_to_phys will give us bogus addresses in the
- * virtqueue.
+ * virtqueue. Access to high-priority reserves is preserved
+ * if originally requested by GFP_ATOMIC.
*/
- gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH);
+ gfp &= ~__GFP_HIGHMEM;
desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
if (!desc)
On Tue, Dec 01, 2015 at 03:04:31PM +0100, Michal Hocko wrote:
> On Tue 01-12-15 12:23:41, Will Deacon wrote:
> > Hi Michal,
> >
> > On Mon, Nov 30, 2015 at 02:02:00PM +0100, Michal Hocko wrote:
> > > [Let's CC Will - see the question at the end of the email please]
> >
> > [...]
> >
> > > > > There is no reference to an OOM possibility in the email that I can see. Can
> > > > > you give examples of the OOM messages that show the problem sites? It was
> > > > > suspected that there may be some callers that were accidentally depending
> > > > > on access to emergency reserves. If so, either they need to be fixed (if
> > > > > the case is extremely rare) or a small reserve will have to be created
> > > > > for callers that are not high priority but still cannot reclaim.
> > >
> > > __virtblk_add_req calls
> > >   virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC)
> > >     alloc_indirect(gfp)
> > >       gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH)
> > >
> > > So this is a true __GFP_ATOMIC allocation; we just drop __GFP_HIGH so it
> > > doesn't get access to more reserves. It still does ALLOC_HARDER. So I think
> > > the real issue is somewhere else: something that should have triggered
> > > kswapd no longer does. I tried to find that offender last time but didn't
> > > manage to find one.
> > >
> > > Btw. I completely fail to see why b92b1b89a33c ("virtio: force vring
> > > descriptors to be allocated from lowmem") had to clear __GFP_HIGH. Will,
> > > do you remember why you dropped that flag as well?
> >
> > Right, that looks unnecessary, but it could be that we were masking a
> > bug somewhere else.
>
> OK, I will send a patch to remove __GFP_HIGH because it is clearly
> misleading and doesn't have anything to do with the highmem zone.
>
Thanks for looking into this. I just sent a patch that includes a changelog
explaining why this bug triggers now and would have been hidden before.
> > > Also, I cannot find any user of alloc_indirect that would pass
> > > __GFP_HIGHMEM. All of them use either GFP_KERNEL or GFP_ATOMIC. So
> > > either I am missing something or this masking is not really needed. Maybe
> > > the situation was different back in 2012.
> >
> > I tried to revisit the thread leading to that patch, but it doesn't make
> > a whole lot of sense:
> >
> > https://lkml.org/lkml/2012/10/17/143
> >
> > I certainly remember debugging the failure (i.e. it wasn't theoretical),
> > and we were ending up with highmem addresses being passed in the virtio
> > ring (due to the zero-copy stuff in 9p) and also for the descriptors
> > themselves. The discussion at the time makes it sound like GFP_ATOMIC
> > was giving us those...
>
> Hmm, unless I am missing something, neither GFP_ATOMIC nor GFP_KERNEL can
> fall back to the highmem zone - see GFP_ZONE_TABLE. Maybe the highmem
> pointer got there from a different path than alloc_indirect?
GFP_ATOMIC should never return highmem addresses. I checked briefly but
did not spot where 9p is getting highmem pages from, but it wasn't via
GFP_ATOMIC.
--
Mel Gorman
SUSE Labs
On Wed 02-12-15 11:00:09, Mel Gorman wrote:
> On Mon, Nov 30, 2015 at 10:14:24AM +0800, Huang, Ying wrote:
> > > There is no reference to an OOM possibility in the email that I can see. Can
> > > you give examples of the OOM messages that show the problem sites? It was
> > > suspected that there may be some callers that were accidentally depending
> > > on access to emergency reserves. If so, either they need to be fixed (if
> > > the case is extremely rare) or a small reserve will have to be created
> > > for callers that are not high priority but still cannot reclaim.
> > >
> > > Note that I'm travelling a lot over the next two weeks so I'll be slow to
> > > respond but I will get to it.
> >
> > Here is the kernel log; the full dmesg is attached too. The OOM
> > occurs during fsmark testing.
> >
> > Best Regards,
> > Huang, Ying
> >
> > [ 31.453514] kworker/u4:0: page allocation failure: order:0, mode:0x2200000
> > [ 31.463570] CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
> > [ 31.466115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > [ 31.477146] Workqueue: writeback wb_workfn (flush-253:0)
> > [ 31.481450] 0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
> > [ 31.492582] ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
> > [ 31.507631] ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000
>
> This is an allocation failure, not a triggering of the OOM killer, so the
> severity is reduced, but it still looks like a bug in the driver. Looking
> at the history and the discussion, it appears to me that __GFP_HIGH was
> cleared from the allocation site by accident. I strongly suspect that Will
> Deacon thought __GFP_HIGH was related to highmem instead of being related
> to high priority. Will, can you review the following patch please? Ying,
> can you test please?
I have posted basically the same patch
http://lkml.kernel.org/r/[email protected]
I didn't mention this allocation failure because I am not sure it is
really related.
> ---8<---
> virtio: allow vring descriptor allocations to use high-priority reserves
>
> Commit b92b1b89a33c ("virtio: force vring descriptors to be allocated
> from lowmem") prevented the inappropriate use of highmem pages but it
> also masked out __GFP_HIGH. __GFP_HIGH is used for GFP_ATOMIC allocation
> requests to grant access to a small emergency reserve. It's intended for
> use by callers that have no alternative.
>
> Ying Huang reported the following page allocation failure warning after
> commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to
> sleep, unwilling to sleep and avoiding waking kswapd")
>
> kworker/u4:0: page allocation failure: order:0, mode:0x2200000
> CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> Workqueue: writeback wb_workfn (flush-253:0)
> 0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
> ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
> ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000
> Call Trace:
> [<ffffffff8140a142>] dump_stack+0x4b/0x69
> [<ffffffff8117117b>] warn_alloc_failed+0xdb/0x140
> [<ffffffff81174ec4>] __alloc_pages_nodemask+0x874/0xa60
> [<ffffffff811bcb62>] alloc_pages_current+0x92/0x120
> [<ffffffff811c73e4>] new_slab+0x3d4/0x480
> [<ffffffff811c7c36>] __slab_alloc+0x376/0x470
> [<ffffffff814e0ced>] ? alloc_indirect+0x1d/0x50
> [<ffffffff81338221>] ? xfs_submit_ioend_bio+0x31/0x40
> [<ffffffff814e0ced>] ? alloc_indirect+0x1d/0x50
> [<ffffffff811c8e8d>] __kmalloc+0x20d/0x260
> [<ffffffff814e0ced>] alloc_indirect+0x1d/0x50
> [<ffffffff814e0fec>] virtqueue_add_sgs+0x2cc/0x3a0
> [<ffffffff81573a30>] __virtblk_add_req+0xb0/0x1f0
> [<ffffffff8117a121>] ? pagevec_lookup_tag+0x21/0x30
> [<ffffffff813e5d72>] ? blk_rq_map_sg+0x1e2/0x4f0
> [<ffffffff81573c82>] virtio_queue_rq+0x112/0x280
> [<ffffffff813e9de7>] __blk_mq_run_hw_queue+0x1d7/0x370
> [<ffffffff813e9bef>] blk_mq_run_hw_queue+0x9f/0xc0
> [<ffffffff813eb10a>] blk_mq_insert_requests+0xfa/0x1a0
> [<ffffffff813ebdb3>] blk_mq_flush_plug_list+0x123/0x140
> [<ffffffff813e1777>] blk_flush_plug_list+0xa7/0x200
> [<ffffffff813e1c49>] blk_finish_plug+0x29/0x40
> [<ffffffff81215f85>] wb_writeback+0x185/0x2c0
> [<ffffffff812166a5>] wb_workfn+0xf5/0x390
> [<ffffffff81091297>] process_one_work+0x157/0x420
> [<ffffffff81091ef9>] worker_thread+0x69/0x4a0
> [<ffffffff81091e90>] ? rescuer_thread+0x380/0x380
> [<ffffffff8109746f>] kthread+0xef/0x110
> [<ffffffff81097380>] ? kthread_park+0x60/0x60
> [<ffffffff818bce8f>] ret_from_fork+0x3f/0x70
> [<ffffffff81097380>] ? kthread_park+0x60/0x60
>
> Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to
> sleep, unwilling to sleep and avoiding waking kswapd") is stricter about
> reserves. It distinguishes between callers that are high-priority with
> access to emergency reserves and callers that simply do not want to sleep
> and have recovery options. The reported allocation failure is from a truly
> atomic site with no recovery options, one that appears to have had __GFP_HIGH
> cleared by mistake for reasons unrelated to highmem. This patch restores the
> flag.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> drivers/virtio/virtio_ring.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 096b857e7b75..f9e119e6df18 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -107,9 +107,10 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> /*
> * We require lowmem mappings for the descriptors because
> * otherwise virt_to_phys will give us bogus addresses in the
> - * virtqueue.
> + * virtqueue. Access to high-priority reserves is preserved
> + * if originally requested by GFP_ATOMIC.
> */
> - gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH);
> + gfp &= ~__GFP_HIGHMEM;
>
> desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
> if (!desc)
--
Michal Hocko
SUSE Labs
On Wed, Dec 02, 2015 at 01:00:46PM +0100, Michal Hocko wrote:
> On Wed 02-12-15 11:00:09, Mel Gorman wrote:
> > On Mon, Nov 30, 2015 at 10:14:24AM +0800, Huang, Ying wrote:
> > > > There is no reference to an OOM possibility in the email that I can see. Can
> > > > you give examples of the OOM messages that show the problem sites? It was
> > > > suspected that there may be some callers that were accidentally depending
> > > > on access to emergency reserves. If so, either they need to be fixed (if
> > > > the case is extremely rare) or a small reserve will have to be created
> > > > for callers that are not high priority but still cannot reclaim.
> > > >
> > > > Note that I'm travelling a lot over the next two weeks so I'll be slow to
> > > > respond but I will get to it.
> > >
> > > Here is the kernel log; the full dmesg is attached too. The OOM
> > > occurs during fsmark testing.
> > >
> > > Best Regards,
> > > Huang, Ying
> > >
> > > [ 31.453514] kworker/u4:0: page allocation failure: order:0, mode:0x2200000
> > > [ 31.463570] CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
> > > [ 31.466115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > > [ 31.477146] Workqueue: writeback wb_workfn (flush-253:0)
> > > [ 31.481450] 0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
> > > [ 31.492582] ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
> > > [ 31.507631] ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000
> >
> > This is an allocation failure, not a triggering of the OOM killer, so the
> > severity is reduced, but it still looks like a bug in the driver. Looking
> > at the history and the discussion, it appears to me that __GFP_HIGH was
> > cleared from the allocation site by accident. I strongly suspect that Will
> > Deacon thought __GFP_HIGH was related to highmem instead of being related
> > to high priority. Will, can you review the following patch please? Ying,
> > can you test please?
>
> I have posted basically the same patch
> http://lkml.kernel.org/r/[email protected]
>
Sorry. I missed that while playing catch-up and I wasn't on the cc. I'll
drop this patch now. Thanks for catching it.
> I didn't mention this allocation failure because I am not sure it is
> really related.
>
I'm fairly sure it is. The failure is at an allocation site that cannot
sleep but did not specify __GFP_HIGH. Such callers are normally expected
to be able to recover gracefully and probably should specify __GFP_NOWARN.
kswapd would have woken up as normal but the free pages were below the
min watermark, so there was a brief failure.
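For illustration, an opportunistic caller of that kind would usually be
written along these lines (a hedged sketch only; the helper and its
placement are hypothetical, not the actual virtio code):

/* try an atomic-context allocation that tolerates failure:
 * GFP_NOWAIT wakes kswapd but never dips into the reserves, and
 * __GFP_NOWARN suppresses the allocation-failure splat because the
 * caller can fall back to the ring's direct descriptors */
static struct vring_desc *try_alloc_desc(unsigned int total_sg)
{
	return kmalloc_array(total_sg, sizeof(struct vring_desc),
			     GFP_NOWAIT | __GFP_NOWARN);
}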
--
Mel Gorman
SUSE Labs
On Wed 02-12-15 14:08:52, Mel Gorman wrote:
> On Wed, Dec 02, 2015 at 01:00:46PM +0100, Michal Hocko wrote:
> > On Wed 02-12-15 11:00:09, Mel Gorman wrote:
> > > On Mon, Nov 30, 2015 at 10:14:24AM +0800, Huang, Ying wrote:
> > > > > There is no reference to an OOM possibility in the email that I can see. Can
> > > > > you give examples of the OOM messages that show the problem sites? It was
> > > > > suspected that there may be some callers that were accidentally depending
> > > > > on access to emergency reserves. If so, either they need to be fixed (if
> > > > > the case is extremely rare) or a small reserve will have to be created
> > > > > for callers that are not high priority but still cannot reclaim.
> > > > >
> > > > > Note that I'm travelling a lot over the next two weeks so I'll be slow to
> > > > > respond but I will get to it.
> > > >
> > > > Here is the kernel log; the full dmesg is attached too. The OOM
> > > > occurs during fsmark testing.
> > > >
> > > > Best Regards,
> > > > Huang, Ying
> > > >
> > > > [ 31.453514] kworker/u4:0: page allocation failure: order:0, mode:0x2200000
> > > > [ 31.463570] CPU: 0 PID: 6 Comm: kworker/u4:0 Not tainted 4.3.0-08056-gd0164ad #1
> > > > [ 31.466115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > > > [ 31.477146] Workqueue: writeback wb_workfn (flush-253:0)
> > > > [ 31.481450] 0000000000000000 ffff880035ac75e8 ffffffff8140a142 0000000002200000
> > > > [ 31.492582] ffff880035ac7670 ffffffff8117117b ffff880037586b28 ffff880000000040
> > > > [ 31.507631] ffff88003523b270 0000000000000040 ffff880035abc800 ffffffff00000000
> > >
> > > This is an allocation failure, not a triggering of the OOM killer, so the
> > > severity is reduced, but it still looks like a bug in the driver. Looking
> > > at the history and the discussion, it appears to me that __GFP_HIGH was
> > > cleared from the allocation site by accident. I strongly suspect that Will
> > > Deacon thought __GFP_HIGH was related to highmem instead of being related
> > > to high priority. Will, can you review the following patch please? Ying,
> > > can you test please?
> >
> > I have posted basically the same patch
> > http://lkml.kernel.org/r/[email protected]
> >
>
> Sorry. I missed that while playing catch-up and I wasn't on the cc. I'll
> drop this patch now. Thanks for catching it.
My bad. I should have CCed you. But I considered this merely a cleanup
so I didn't want to swamp you with another email.
> > I didn't mention this allocation failure because I am not sure it is
> > really related.
> >
>
> I'm fairly sure it is. The failure is at an allocation site that cannot
> sleep but did not specify __GFP_HIGH.
Yeah, but this was the case even before your patch. As the caller used
GFP_ATOMIC, it got __GFP_ATOMIC after your patch, so it still managed to
do ALLOC_HARDER. I would agree if this were an explicit GFP_NOWAIT.
Unless I am missing something, your patch hasn't changed the behavior for
this particular allocation.
> Such callers are normally expected
> to be able to recover gracefully and probably should specify __GFP_NOWARN.
> kswapd would have woken up as normal but the free pages were below the
> min watermark, so there was a brief failure.
--
Michal Hocko
SUSE Labs
On Wed, Dec 02, 2015 at 03:15:29PM +0100, Michal Hocko wrote:
> > > I didn't mention this allocation failure because I am not sure it is
> > > really related.
> > >
> >
> > I'm fairly sure it is. The failure is at an allocation site that cannot
> > sleep but did not specify __GFP_HIGH.
>
> Yeah, but this was the case even before your patch. As the caller used
> GFP_ATOMIC, it got __GFP_ATOMIC after your patch, so it still managed to
> do ALLOC_HARDER. I would agree if this were an explicit GFP_NOWAIT.
> Unless I am missing something, your patch hasn't changed the behavior for
> this particular allocation.
>
You're right. I think it's this hunk that is the problem.
@@ -1186,7 +1186,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
ctx = blk_mq_get_ctx(q);
hctx = q->mq_ops->map_queue(q, ctx->cpu);
blk_mq_set_alloc_data(&alloc_data, q,
- __GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
+ __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, rw);
ctx = alloc_data.ctx;
hctx = alloc_data.hctx;
At this point in the series, this specific path no longer wakes kswapd when it
should. A series of allocations there could hit the watermarks and never wake
kswapd, and then be followed by an atomic allocation failure that woke kswapd.
This bug gets fixed later by the commit 71baba4b92dc ("mm, page_alloc:
rename __GFP_WAIT to __GFP_RECLAIM") so it's not a bug in the current
kernel. However, it happens to break bisection and would be caught if each
individual commit was tested.
Your __GFP_HIGH patch is still fine although not the direct fix for this
specific problem. Commit 71baba4b92dc is.
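To make the difference visible, here is a small userspace sketch of the
three states of that hunk (stand-in bit values again; __GFP_WAIT at the
bisection point meant direct reclaim only, and 71baba4b92dc replaced it
with __GFP_RECLAIM, which also wakes kswapd):

#include <stdio.h>

#define __GFP_HIGH           0x01u
#define __GFP_ATOMIC         0x02u
#define __GFP_DIRECT_RECLAIM 0x04u
#define __GFP_KSWAPD_RECLAIM 0x08u
#define __GFP_WAIT    __GFP_DIRECT_RECLAIM
#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
#define GFP_ATOMIC    (__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)

int main(void)
{
	/* if the hunk had kept GFP_ATOMIC under the new flag scheme */
	unsigned int kept   = __GFP_WAIT | GFP_ATOMIC;
	/* what the hunk actually used at d0164adc89f6 */
	unsigned int bisect = __GFP_WAIT | __GFP_HIGH;
	/* after 71baba4b92dc renamed __GFP_WAIT to __GFP_RECLAIM */
	unsigned int fixed  = __GFP_RECLAIM | __GFP_HIGH;

	printf("wakes kswapd? kept=%d at-bisect=%d after-fix=%d\n",
	       !!(kept & __GFP_KSWAPD_RECLAIM),
	       !!(bisect & __GFP_KSWAPD_RECLAIM),
	       !!(fixed & __GFP_KSWAPD_RECLAIM));
	return 0;
}

It prints kept=1 at-bisect=0 after-fix=1: only the tree at the bisection
point fails to wake kswapd on this path.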
Ying, do the page allocation failure messages happen when the whole
series is applied? i.e. is 4.4-rc3 ok?
--
Mel Gorman
SUSE Labs
Mel Gorman <[email protected]> writes:
> On Wed, Dec 02, 2015 at 03:15:29PM +0100, Michal Hocko wrote:
>> > > I didn't mention this allocation failure because I am not sure it is
>> > > really related.
>> > >
>> >
>> > I'm fairly sure it is. The failure is at an allocation site that cannot
>> > sleep but did not specify __GFP_HIGH.
>>
>> Yeah, but this was the case even before your patch. As the caller used
>> GFP_ATOMIC, it got __GFP_ATOMIC after your patch, so it still managed to
>> do ALLOC_HARDER. I would agree if this were an explicit GFP_NOWAIT.
>> Unless I am missing something, your patch hasn't changed the behavior for
>> this particular allocation.
>>
>
> You're right. I think it's this hunk that is the problem.
>
> @@ -1186,7 +1186,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
> ctx = blk_mq_get_ctx(q);
> hctx = q->mq_ops->map_queue(q, ctx->cpu);
> blk_mq_set_alloc_data(&alloc_data, q,
> - __GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
> + __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
> rq = __blk_mq_alloc_request(&alloc_data, rw);
> ctx = alloc_data.ctx;
> hctx = alloc_data.hctx;
>
> At this point in the series, this specific path no longer wakes kswapd when it
> should. A series of allocations there could hit the watermarks and never wake
> kswapd, and then be followed by an atomic allocation failure that woke kswapd.
>
> This bug gets fixed later by the commit 71baba4b92dc ("mm, page_alloc:
> rename __GFP_WAIT to __GFP_RECLAIM") so it's not a bug in the current
> kernel. However, it happens to break bisection and would be caught if each
> individual commit was tested.
>
> Your __GFP_HIGH patch is still fine although not the direct fix for this
> specific problem. Commit 71baba4b92dc is.
>
> Ying, do the page allocation failure messages happen when the whole
> series is applied? i.e. is 4.4-rc3 ok?
There are allocation errors for 4.4-rc3 too. dmesg is attached.
[ 54.970840] kworker/u4:0: page allocation failure: order:0, mode:0x2204000
[ 54.973161] CPU: 1 PID: 6 Comm: kworker/u4:0 Not tainted 4.4.0-rc3 #1
[ 54.975179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 54.980562] Workqueue: writeback wb_workfn (flush-253:0)
[ 54.982289] 0000000000000000 ffff880035ac7098 ffffffff81410ad2 0000000002204000
[ 54.985438] ffff880035ac7120 ffffffff811724cb ffff880037586b50 0000000000000040
[ 54.988746] 0000000000000000 0000000000000040 ffffffff00ac70e0 ffffffff00000000
[ 54.991669] Call Trace:
[ 54.992754] [<ffffffff81410ad2>] dump_stack+0x4b/0x69
[ 54.994433] [<ffffffff811724cb>] warn_alloc_failed+0xdb/0x140
[ 54.996134] [<ffffffff811762b9>] __alloc_pages_nodemask+0x949/0xb40
[ 54.997846] [<ffffffff811be2c2>] alloc_pages_current+0x92/0x120
[ 54.999494] [<ffffffff811c8bfc>] new_slab+0x3dc/0x480
[ 55.001007] [<ffffffff811c9421>] ___slab_alloc+0x351/0x440
[ 55.002588] [<ffffffff814e7fcd>] ? alloc_indirect+0x1d/0x50
[ 55.004268] [<ffffffff8132270f>] ? xfs_btree_insrec+0x47f/0x700
[ 55.005926] [<ffffffff813206e9>] ? xfs_btree_delrec+0x5a9/0x1380
[ 55.007592] [<ffffffff814e7fcd>] ? alloc_indirect+0x1d/0x50
[ 55.009281] [<ffffffff811c9530>] __slab_alloc+0x20/0x40
[ 55.010827] [<ffffffff811ca03d>] __kmalloc+0x20d/0x260
[ 55.012367] [<ffffffff814e7fcd>] alloc_indirect+0x1d/0x50
[ 55.014040] [<ffffffff814e82cc>] virtqueue_add_sgs+0x2cc/0x3a0
[ 55.015683] [<ffffffff8157b240>] __virtblk_add_req+0xb0/0x1f0
[ 55.017300] [<ffffffff81307944>] ? xfs_alloc_update_counters+0x44/0x50
[ 55.019759] [<ffffffff813ec524>] ? blk_rq_map_sg+0x1f4/0x510
[ 55.021371] [<ffffffff8157b492>] virtio_queue_rq+0x112/0x280
[ 55.022978] [<ffffffff813f0647>] __blk_mq_run_hw_queue+0x1d7/0x370
[ 55.024674] [<ffffffff813f044f>] blk_mq_run_hw_queue+0x9f/0xc0
[ 55.026307] [<ffffffff813f196a>] blk_mq_insert_requests+0xfa/0x1a0
[ 55.028003] [<ffffffff813f26b3>] blk_mq_flush_plug_list+0x123/0x140
[ 55.029710] [<ffffffff813e7c77>] blk_flush_plug_list+0xa7/0x200
[ 55.031363] [<ffffffff813f1bec>] blk_sq_make_request+0x1dc/0x3b0
[ 55.033025] [<ffffffff813e5b0c>] ? generic_make_request_checks+0x24c/0x530
[ 55.034829] [<ffffffff813e6341>] generic_make_request+0xf1/0x1e0
[ 55.036508] [<ffffffff813e6497>] submit_bio+0x67/0x150
[ 55.038034] [<ffffffff8133cd61>] xfs_submit_ioend_bio+0x31/0x40
[ 55.039797] [<ffffffff8133ce51>] xfs_submit_ioend+0xe1/0x120
[ 55.041410] [<ffffffff8133e183>] xfs_vm_writepage+0x2a3/0x630
[ 55.043038] [<ffffffff811771a3>] __writepage+0x13/0x30
[ 55.044568] [<ffffffff811794bc>] write_cache_pages+0x22c/0x4c0
[ 55.046214] [<ffffffff81177190>] ? wb_position_ratio+0x1f0/0x1f0
[ 55.047888] [<ffffffff81179793>] generic_writepages+0x43/0x60
[ 55.049520] [<ffffffff8133d54e>] xfs_vm_writepages+0x4e/0x60
[ 55.051130] [<ffffffff8117a49e>] do_writepages+0x1e/0x30
[ 55.052694] [<ffffffff81216b05>] __writeback_single_inode+0x45/0x2e0
[ 55.054410] [<ffffffff812172e1>] writeback_sb_inodes+0x271/0x4d0
[ 55.056075] [<ffffffff812175c9>] __writeback_inodes_wb+0x89/0xc0
[ 55.057749] [<ffffffff812178ea>] wb_writeback+0x23a/0x2c0
[ 55.059312] [<ffffffff8121806f>] wb_workfn+0x20f/0x390
[ 55.060846] [<ffffffff81091257>] process_one_work+0x157/0x420
[ 55.062466] [<ffffffff81091eb9>] worker_thread+0x69/0x4a0
[ 55.064030] [<ffffffff81091e50>] ? rescuer_thread+0x380/0x380
[ 55.065655] [<ffffffff8109741f>] kthread+0xef/0x110
[ 55.067141] [<ffffffff81097330>] ? kthread_park+0x60/0x60
[ 55.068713] [<ffffffff818c5f8f>] ret_from_fork+0x3f/0x70
[ 55.070255] [<ffffffff81097330>] ? kthread_park+0x60/0x60
[ 55.071830] Mem-Info:
[ 55.072889] active_anon:5024 inactive_anon:1642 isolated_anon:0
[ 55.072889] active_file:22984 inactive_file:106547 isolated_file:25
[ 55.072889] unevictable:93684 dirty:7732 writeback:13026 unstable:0
[ 55.072889] slab_reclaimable:13082 slab_unreclaimable:3604
[ 55.072889] mapped:2866 shmem:1687 pagetables:876 bounce:0
[ 55.072889] free:1931 free_pcp:45 free_cma:0
[ 55.081979] Node 0 DMA free:3948kB min:60kB low:72kB high:88kB active_anon:328kB inactive_anon:124kB active_file:1680kB inactive_file:1704kB unevictable:6936kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:52kB writeback:76kB mapped:440kB shmem:124kB slab_reclaimable:472kB slab_unreclaimable:232kB kernel_stack:80kB pagetables:84kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 55.093512] lowmem_reserve[]: 0 972 972 972
[ 55.095362] Node 0 DMA32 free:3776kB min:3824kB low:4780kB high:5736kB active_anon:19768kB inactive_anon:6444kB active_file:90256kB inactive_file:424484kB unevictable:367800kB isolated(anon):0kB isolated(file):100kB present:1032064kB managed:997492kB mlocked:0kB dirty:30876kB writeback:52028kB mapped:11024kB shmem:6624kB slab_reclaimable:51856kB slab_unreclaimable:14184kB kernel_stack:2864kB pagetables:3420kB unstable:0kB bounce:0kB free_pcp:180kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:100 all_unreclaimable? no
[ 55.107443] lowmem_reserve[]: 0 0 0 0
[ 55.109213] Node 0 DMA: 4*4kB (U) 3*8kB (UE) 2*16kB (UM) 1*32kB (E) 0*64kB 2*128kB (UM) 2*256kB (UM) 2*512kB (ME) 2*1024kB (ME) 0*2048kB 0*4096kB = 3944kB
[ 55.114800] Node 0 DMA32: 90*4kB (UE) 13*8kB (UE) 10*16kB (UME) 7*32kB (UME) 46*64kB (ME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3792kB
[ 55.120027] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 55.122668] 224906 total pagecache pages
[ 55.123990] 0 pages in swap cache
[ 55.125207] Swap cache stats: add 0, delete 0, find 0/0
[ 55.126738] Free swap = 0kB
[ 55.127890] Total swap = 0kB
[ 55.129039] 262014 pages RAM
[ 55.130190] 0 pages HighMem/MovableOnly
[ 55.131495] 8664 pages reserved
[ 55.132690] 0 pages hwpoisoned
[ 55.133868] SLUB: Unable to allocate memory on node -1 (gfp=0x2080000)
[ 55.135607] cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 3, min order: 0
[ 55.138454] node 0: slabs: 22, objs: 352, free: 0
Best Regards,
Huang, Ying
On Thu, Dec 03, 2015 at 04:46:53PM +0800, Huang, Ying wrote:
> Mel Gorman <[email protected]> writes:
>
> > On Wed, Dec 02, 2015 at 03:15:29PM +0100, Michal Hocko wrote:
> >> > > I didn't mention this allocation failure because I am not sure it is
> >> > > really related.
> >> > >
> >> >
> >> > I'm fairly sure it is. The failure is at an allocation site that cannot
> >> > sleep but did not specify __GFP_HIGH.
> >>
> >> Yeah, but this was the case even before your patch. As the caller used
> >> GFP_ATOMIC, it got __GFP_ATOMIC after your patch, so it still managed to
> >> do ALLOC_HARDER. I would agree if this were an explicit GFP_NOWAIT.
> >> Unless I am missing something, your patch hasn't changed the behavior for
> >> this particular allocation.
> >>
> >
> > You're right. I think it's this hunk that is the problem.
> >
> > @@ -1186,7 +1186,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
> > ctx = blk_mq_get_ctx(q);
> > hctx = q->mq_ops->map_queue(q, ctx->cpu);
> > blk_mq_set_alloc_data(&alloc_data, q,
> > - __GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
> > + __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
> > rq = __blk_mq_alloc_request(&alloc_data, rw);
> > ctx = alloc_data.ctx;
> > hctx = alloc_data.hctx;
> >
> > At this point in the series, this specific path no longer wakes kswapd when it
> > should. A series of allocations there could hit the watermarks and never wake
> > kswapd, and then be followed by an atomic allocation failure that woke kswapd.
> >
> > This bug gets fixed later by the commit 71baba4b92dc ("mm, page_alloc:
> > rename __GFP_WAIT to __GFP_RECLAIM") so it's not a bug in the current
> > kernel. However, it happens to break bisection and would be caught if each
> > individual commit was tested.
> >
> > Your __GFP_HIGH patch is still fine although not the direct fix for this
> > specific problem. Commit 71baba4b92dc is.
> >
> > Ying, do the page allocation failure messages happen when the whole
> > series is applied? i.e. is 4.4-rc3 ok?
>
> There are allocation errors for 4.4-rc3 too. dmesg is attached.
>
What is the result with the __GFP_HIGH patch applied, giving it access
to reserves?
--
Mel Gorman
SUSE Labs
Mel Gorman <[email protected]> writes:
> On Thu, Dec 03, 2015 at 04:46:53PM +0800, Huang, Ying wrote:
>> Mel Gorman <[email protected]> writes:
>>
>> > On Wed, Dec 02, 2015 at 03:15:29PM +0100, Michal Hocko wrote:
>> >> > > I didn't mention this allocation failure because I am not sure it is
>> >> > > really related.
>> >> > >
>> >> >
>> >> > I'm fairly sure it is. The failure is at an allocation site that cannot
>> >> > sleep but did not specify __GFP_HIGH.
>> >>
>> >> Yeah, but this was the case even before your patch. As the caller used
>> >> GFP_ATOMIC, it got __GFP_ATOMIC after your patch, so it still managed to
>> >> do ALLOC_HARDER. I would agree if this were an explicit GFP_NOWAIT.
>> >> Unless I am missing something, your patch hasn't changed the behavior for
>> >> this particular allocation.
>> >>
>> >
>> > You're right. I think it's this hunk that is the problem.
>> >
>> > @@ -1186,7 +1186,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
>> > ctx = blk_mq_get_ctx(q);
>> > hctx = q->mq_ops->map_queue(q, ctx->cpu);
>> > blk_mq_set_alloc_data(&alloc_data, q,
>> > - __GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
>> > + __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
>> > rq = __blk_mq_alloc_request(&alloc_data, rw);
>> > ctx = alloc_data.ctx;
>> > hctx = alloc_data.hctx;
>> >
>> > At this point in the series, this specific path no longer wakes kswapd when it
>> > should. A series of allocations there could hit the watermarks and never wake
>> > kswapd, and then be followed by an atomic allocation failure that woke kswapd.
>> >
>> > This bug gets fixed later by the commit 71baba4b92dc ("mm, page_alloc:
>> > rename __GFP_WAIT to __GFP_RECLAIM") so it's not a bug in the current
>> > kernel. However, it happens to break bisection and would be caught if each
>> > individual commit was tested.
>> >
>> > Your __GFP_HIGH patch is still fine although not the direct fix for this
>> > specific problem. Commit 71baba4b92dc is.
>> >
>> > Ying, do the page allocation failure messages happen when the whole
>> > series is applied? i.e. is 4.4-rc3 ok?
>>
>> There are allocation errors for 4.4-rc3 too. dmesg is attached.
>>
>
> What is the result with the __GFP_HIGH patch applied, giving it access
> to reserves?
I applied Michal's patch on v4.4-rc3 and tested again; now there is no
page allocation failure.
Best Regards,
Huang, Ying
On Fri 04-12-15 09:53:35, Huang, Ying wrote:
> Mel Gorman <[email protected]> writes:
[...]
> > What is the result with the __GFP_HIGH patch applied, giving it access
> > to reserves?
>
> I applied Michal's patch on v4.4-rc3 and tested again; now there is no
> page allocation failure.
I still think this is just a coincidence and __GFP_HIGH just papers
over it (it still makes sense for other reasons though). It would be
better imho to log all allocations which do not include
___GFP_KSWAPD_RECLAIM. This might help us pinpoint a caller that was
previously waking kswapd up but doesn't do so anymore. There would have
to be a very specific pattern to trigger this because most allocations
would simply wake kswapd up. I just worry that any debugging would
simply alter the timing and/or allocation pattern and we would see
different results, so I am not sure this is worth spending a lot of
time on.
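Something along these lines is what I have in mind (a sketch only; the
hook and its call site are hypothetical, not an existing kernel
interface):

/* hypothetically called early in __alloc_pages_nodemask() */
static inline void debug_missing_kswapd_wakeup(gfp_t gfp_mask)
{
	/* WARN_ONCE keeps the log readable and gives one stack trace;
	 * a ratelimited variant would catch more distinct callers */
	if (!(gfp_mask & __GFP_KSWAPD_RECLAIM))
		WARN_ONCE(1, "allocation without kswapd wakeup: gfp=%#x\n",
			  (unsigned int)gfp_mask);
}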
--
Michal Hocko
SUSE Labs