2012-06-13 12:39:40

by Fengguang Wu

Subject: xfs ip->i_lock: inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage

Hi Christoph, Dave,

I got this lockdep warning on XFS when running the xfs tests:

[ 704.832019] =================================
[ 704.832019] [ INFO: inconsistent lock state ]
[ 704.832019] 3.5.0-rc1+ #8 Tainted: G W
[ 704.832019] ---------------------------------
[ 704.832019] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[ 704.832019] fsstress/11619 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 704.832019] (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff8143953d>] xfs_ilock_nowait+0xd7/0x1d0
[ 704.832019] {IN-RECLAIM_FS-W} state was registered at:
[ 704.832019] [<ffffffff810e30a2>] mark_irqflags+0x12d/0x13e
[ 704.832019] [<ffffffff810e32f6>] __lock_acquire+0x243/0x3f9
[ 704.832019] [<ffffffff810e3a1c>] lock_acquire+0x112/0x13d
[ 704.832019] [<ffffffff810b8931>] down_write_nested+0x54/0x8b
[ 704.832019] [<ffffffff81438fab>] xfs_ilock+0xd8/0x17d
[ 704.832019] [<ffffffff814431b8>] xfs_reclaim_inode+0x4a/0x2cb
[ 704.832019] [<ffffffff814435ee>] xfs_reclaim_inodes_ag+0x1b5/0x28e
[ 704.832019] [<ffffffff814437d7>] xfs_reclaim_inodes_nr+0x33/0x3a
[ 704.832019] [<ffffffff8144050e>] xfs_fs_free_cached_objects+0x15/0x17
[ 704.832019] [<ffffffff81196076>] prune_super+0x103/0x154
[ 704.832019] [<ffffffff81152fa7>] shrink_slab+0x1ec/0x316
[ 704.832019] [<ffffffff8115574f>] balance_pgdat+0x308/0x618
[ 704.832019] [<ffffffff81155c22>] kswapd+0x1c3/0x1dc
[ 704.832019] [<ffffffff810b3f77>] kthread+0xaf/0xb7
[ 704.832019] [<ffffffff82f480b4>] kernel_thread_helper+0x4/0x10
[ 704.832019] irq event stamp: 105253
[ 704.832019] hardirqs last enabled at (105253): [<ffffffff8114b693>] get_page_from_freelist+0x403/0x4e1
[ 704.832019] hardirqs last disabled at (105252): [<ffffffff8114b55d>] get_page_from_freelist+0x2cd/0x4e1
[ 704.832019] softirqs last enabled at (104506): [<ffffffff81099e7e>] __do_softirq+0x239/0x24f
[ 704.832019] softirqs last disabled at (104451): [<ffffffff82f481ac>] call_softirq+0x1c/0x30
[ 704.832019]
[ 704.832019] other info that might help us debug this:
[ 704.832019] Possible unsafe locking scenario:
[ 704.832019]
[ 704.832019] CPU0
[ 704.832019] ----
[ 704.832019] lock(&(&ip->i_lock)->mr_lock);
[ 704.832019] <Interrupt>
[ 704.832019] lock(&(&ip->i_lock)->mr_lock);
[ 704.832019]
[ 704.832019] *** DEADLOCK ***
[ 704.832019]
[ 704.832019] 3 locks held by fsstress/11619:
[ 704.832019] #0: (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffff8119eadc>] kern_path_create+0x7d/0x11e
[ 704.832019] #1: (&(&ip->i_lock)->mr_lock/1){+.+.+.}, at: [<ffffffff81438fab>] xfs_ilock+0xd8/0x17d
[ 704.832019] #2: (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff8143953d>] xfs_ilock_nowait+0xd7/0x1d0
[ 704.832019]
[ 704.832019] stack backtrace:
[ 704.832019] Pid: 11619, comm: fsstress Tainted: G W 3.5.0-rc1+ #8
[ 704.832019] Call Trace:
[ 704.832019] [<ffffffff82e92243>] print_usage_bug+0x1f5/0x206
[ 704.832019] [<ffffffff810e2220>] ? check_usage_forwards+0xa6/0xa6
[ 704.832019] [<ffffffff82e922c3>] mark_lock_irq+0x6f/0x120
[ 704.832019] [<ffffffff810e2f02>] mark_lock+0xaf/0x122
[ 704.832019] [<ffffffff810e3d4e>] mark_held_locks+0x6d/0x95
[ 704.832019] [<ffffffff810c5cd1>] ? local_clock+0x36/0x4d
[ 704.832019] [<ffffffff810e3de3>] __lockdep_trace_alloc+0x6d/0x6f
[ 704.832019] [<ffffffff810e42e7>] lockdep_trace_alloc+0x3d/0x57
[ 704.832019] [<ffffffff811837c8>] kmem_cache_alloc_node_trace+0x47/0x1b4
[ 704.832019] [<ffffffff810e377d>] ? lock_release_nested+0x9f/0xa6
[ 704.832019] [<ffffffff81431650>] ? _xfs_buf_find+0xaa/0x302
[ 704.832019] [<ffffffff811710a2>] ? new_vmap_block.constprop.18+0x3a/0x1de
[ 704.832019] [<ffffffff811710a2>] new_vmap_block.constprop.18+0x3a/0x1de
[ 704.832019] [<ffffffff8117144a>] vb_alloc.constprop.16+0x204/0x225
[ 704.832019] [<ffffffff8117149d>] vm_map_ram+0x32/0xaa
[ 704.832019] [<ffffffff81430c95>] _xfs_buf_map_pages+0xb3/0xf5
[ 704.832019] [<ffffffff81431a6a>] xfs_buf_get+0xd3/0x1ac
[ 704.832019] [<ffffffff81492dd9>] xfs_trans_get_buf+0x180/0x244
[ 704.832019] [<ffffffff8146947a>] xfs_da_do_buf+0x2a0/0x5cc
[ 704.832019] [<ffffffff81469826>] xfs_da_get_buf+0x21/0x23
[ 704.832019] [<ffffffff8146f894>] xfs_dir2_data_init+0x44/0xf9
[ 704.832019] [<ffffffff8146e94f>] xfs_dir2_sf_to_block+0x1ef/0x5d8
[ 704.832019] [<ffffffff81475a0e>] ? xfs_dir2_sfe_get_ino+0x1a/0x1c
[ 704.832019] [<ffffffff81475ed1>] ? xfs_dir2_sf_check.isra.18+0xc2/0x14e
[ 704.832019] [<ffffffff81476d37>] ? xfs_dir2_sf_lookup+0x26f/0x27e
[ 704.832019] [<ffffffff81476f7f>] xfs_dir2_sf_addname+0x239/0x2c0
[ 704.832019] [<ffffffff8146cfb6>] xfs_dir_createname+0x118/0x177
[ 704.832019] [<ffffffff81445eec>] xfs_create+0x3c6/0x594
[ 704.832019] [<ffffffff8143db9e>] xfs_vn_mknod+0xd8/0x165
[ 704.832019] [<ffffffff8119f02d>] vfs_mknod+0xa3/0xc5
[ 704.832019] [<ffffffff8119ebca>] ? user_path_create+0x4d/0x58
[ 704.832019] [<ffffffff811a05b6>] sys_mknodat+0x16b/0x1bb
[ 704.832019] [<ffffffff816f521e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 704.832019] [<ffffffff811a0623>] sys_mknod+0x1d/0x1f
[ 704.832019] [<ffffffff82f46c69>] system_call_fastpath+0x16/0x1b

Thanks,
Fengguang


2012-06-14 01:20:44

by Dave Chinner

Subject: Re: xfs ip->i_lock: inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage

On Wed, Jun 13, 2012 at 08:39:32PM +0800, Fengguang Wu wrote:
> Hi Christoph, Dave,
>
> I got this lockdep warning on XFS when running the xfs tests:
>
> [ 704.832019] =================================
> [ 704.832019] [ INFO: inconsistent lock state ]
> [ 704.832019] 3.5.0-rc1+ #8 Tainted: G W
> [ 704.832019] ---------------------------------
> [ 704.832019] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
> [ 704.832019] fsstress/11619 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 704.832019] (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff8143953d>] xfs_ilock_nowait+0xd7/0x1d0
> [ 704.832019] {IN-RECLAIM_FS-W} state was registered at:
> [ 704.832019] [<ffffffff810e30a2>] mark_irqflags+0x12d/0x13e
> [ 704.832019] [<ffffffff810e32f6>] __lock_acquire+0x243/0x3f9
> [ 704.832019] [<ffffffff810e3a1c>] lock_acquire+0x112/0x13d
> [ 704.832019] [<ffffffff810b8931>] down_write_nested+0x54/0x8b
> [ 704.832019] [<ffffffff81438fab>] xfs_ilock+0xd8/0x17d
> [ 704.832019] [<ffffffff814431b8>] xfs_reclaim_inode+0x4a/0x2cb
> [ 704.832019] [<ffffffff814435ee>] xfs_reclaim_inodes_ag+0x1b5/0x28e
> [ 704.832019] [<ffffffff814437d7>] xfs_reclaim_inodes_nr+0x33/0x3a
> [ 704.832019] [<ffffffff8144050e>] xfs_fs_free_cached_objects+0x15/0x17
> [ 704.832019] [<ffffffff81196076>] prune_super+0x103/0x154
> [ 704.832019] [<ffffffff81152fa7>] shrink_slab+0x1ec/0x316
> [ 704.832019] [<ffffffff8115574f>] balance_pgdat+0x308/0x618
> [ 704.832019] [<ffffffff81155c22>] kswapd+0x1c3/0x1dc
> [ 704.832019] [<ffffffff810b3f77>] kthread+0xaf/0xb7
> [ 704.832019] [<ffffffff82f480b4>] kernel_thread_helper+0x4/0x10

......
> [ 704.832019] stack backtrace:
> [ 704.832019] Pid: 11619, comm: fsstress Tainted: G W 3.5.0-rc1+ #8
> [ 704.832019] Call Trace:
> [ 704.832019] [<ffffffff82e92243>] print_usage_bug+0x1f5/0x206
> [ 704.832019] [<ffffffff810e2220>] ? check_usage_forwards+0xa6/0xa6
> [ 704.832019] [<ffffffff82e922c3>] mark_lock_irq+0x6f/0x120
> [ 704.832019] [<ffffffff810e2f02>] mark_lock+0xaf/0x122
> [ 704.832019] [<ffffffff810e3d4e>] mark_held_locks+0x6d/0x95
> [ 704.832019] [<ffffffff810c5cd1>] ? local_clock+0x36/0x4d
> [ 704.832019] [<ffffffff810e3de3>] __lockdep_trace_alloc+0x6d/0x6f
> [ 704.832019] [<ffffffff810e42e7>] lockdep_trace_alloc+0x3d/0x57
> [ 704.832019] [<ffffffff811837c8>] kmem_cache_alloc_node_trace+0x47/0x1b4
> [ 704.832019] [<ffffffff810e377d>] ? lock_release_nested+0x9f/0xa6
> [ 704.832019] [<ffffffff81431650>] ? _xfs_buf_find+0xaa/0x302
> [ 704.832019] [<ffffffff811710a2>] ? new_vmap_block.constprop.18+0x3a/0x1de
> [ 704.832019] [<ffffffff811710a2>] new_vmap_block.constprop.18+0x3a/0x1de
> [ 704.832019] [<ffffffff8117144a>] vb_alloc.constprop.16+0x204/0x225
> [ 704.832019] [<ffffffff8117149d>] vm_map_ram+0x32/0xaa
> [ 704.832019] [<ffffffff81430c95>] _xfs_buf_map_pages+0xb3/0xf5
> [ 704.832019] [<ffffffff81431a6a>] xfs_buf_get+0xd3/0x1ac
> [ 704.832019] [<ffffffff81492dd9>] xfs_trans_get_buf+0x180/0x244
> [ 704.832019] [<ffffffff8146947a>] xfs_da_do_buf+0x2a0/0x5cc
> [ 704.832019] [<ffffffff81469826>] xfs_da_get_buf+0x21/0x23
> [ 704.832019] [<ffffffff8146f894>] xfs_dir2_data_init+0x44/0xf9
> [ 704.832019] [<ffffffff8146e94f>] xfs_dir2_sf_to_block+0x1ef/0x5d8

Bug in vm_map_ram - it does an unconditional GFP_KERNEL allocation
here, and we are in a GFP_NOFS context. We can't pass a gfp_mask to
vm_map_ram(), so until vm_map_ram() grows that ability, we can't fix it...
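
For reference, here is an annotated sketch of the path from the trace above.
The callee names come from the trace and from mm/vmalloc.c of this era, so
treat it as a rough picture rather than the literal source:

  /*
   *  _xfs_buf_map_pages()                   <- ip->i_lock already held
   *    vm_map_ram(pages, n, -1, PAGE_KERNEL)
   *      vb_alloc(size, GFP_KERNEL)         <- gfp hardcoded, not GFP_NOFS
   *        new_vmap_block(GFP_KERNEL)
   *          kmalloc_node(..., GFP_KERNEL)  <- reclaim-capable allocation
   *            ... direct reclaim ...
   *              shrink_slab()
   *                xfs_reclaim_inodes_nr()
   *                  xfs_reclaim_inode()
   *                    xfs_ilock()          <- same lock class as the one
   *                                            held above, hence lockdep's
   *                                            RECLAIM_FS-ON-W complaint
   */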

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-14 01:22:21

by Dave Chinner

Subject: Re: xfs ip->i_lock: inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage

On Wed, Jun 13, 2012 at 08:39:32PM +0800, Fengguang Wu wrote:
> Hi Christoph, Dave,
>
> I got this lockdep warning on XFS when running the xfs tests:

BTW, Fengguang, can you report XFS things to the [email protected]
list rather than LKML? I only noticed this because of the CC, but I
pay much closer attention to the XFS list than I do LKML...

Thanks,

Dave.
--
Dave Chinner
[email protected]

2012-06-14 01:30:05

by Fengguang Wu

Subject: Re: xfs ip->i_lock: inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage

On Thu, Jun 14, 2012 at 11:22:09AM +1000, Dave Chinner wrote:
> On Wed, Jun 13, 2012 at 08:39:32PM +0800, Fengguang Wu wrote:
> > Hi Christoph, Dave,
> >
> > I got this lockdep warning on XFS when running the xfs tests:
>
> BTW, Fengguang, can you report XFS things to the [email protected]
> list rather than LKML? I only noticed this because of the CC, but I
> pay much closer attention to the XFS list than I do LKML...

Sure, I'll CC the xfs list in the future.

Thanks,
Fengguang

2012-06-14 01:49:10

by Fengguang Wu

Subject: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

On Thu, Jun 14, 2012 at 11:20:26AM +1000, Dave Chinner wrote:
> On Wed, Jun 13, 2012 at 08:39:32PM +0800, Fengguang Wu wrote:
> > Hi Christoph, Dave,
> >
> > I got this lockdep warning on XFS when running the xfs tests:
> >
> > [ 704.832019] =================================
> > [ 704.832019] [ INFO: inconsistent lock state ]
> > [ 704.832019] 3.5.0-rc1+ #8 Tainted: G W
> > [ 704.832019] ---------------------------------
> > [ 704.832019] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
> > [ 704.832019] fsstress/11619 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > [ 704.832019] (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff8143953d>] xfs_ilock_nowait+0xd7/0x1d0
> > [ 704.832019] {IN-RECLAIM_FS-W} state was registered at:
> > [ 704.832019] [<ffffffff810e30a2>] mark_irqflags+0x12d/0x13e
> > [ 704.832019] [<ffffffff810e32f6>] __lock_acquire+0x243/0x3f9
> > [ 704.832019] [<ffffffff810e3a1c>] lock_acquire+0x112/0x13d
> > [ 704.832019] [<ffffffff810b8931>] down_write_nested+0x54/0x8b
> > [ 704.832019] [<ffffffff81438fab>] xfs_ilock+0xd8/0x17d
> > [ 704.832019] [<ffffffff814431b8>] xfs_reclaim_inode+0x4a/0x2cb
> > [ 704.832019] [<ffffffff814435ee>] xfs_reclaim_inodes_ag+0x1b5/0x28e
> > [ 704.832019] [<ffffffff814437d7>] xfs_reclaim_inodes_nr+0x33/0x3a
> > [ 704.832019] [<ffffffff8144050e>] xfs_fs_free_cached_objects+0x15/0x17
> > [ 704.832019] [<ffffffff81196076>] prune_super+0x103/0x154
> > [ 704.832019] [<ffffffff81152fa7>] shrink_slab+0x1ec/0x316
> > [ 704.832019] [<ffffffff8115574f>] balance_pgdat+0x308/0x618
> > [ 704.832019] [<ffffffff81155c22>] kswapd+0x1c3/0x1dc
> > [ 704.832019] [<ffffffff810b3f77>] kthread+0xaf/0xb7
> > [ 704.832019] [<ffffffff82f480b4>] kernel_thread_helper+0x4/0x10
>
> ......
> > [ 704.832019] stack backtrace:
> > [ 704.832019] Pid: 11619, comm: fsstress Tainted: G W 3.5.0-rc1+ #8
> > [ 704.832019] Call Trace:
> > [ 704.832019] [<ffffffff82e92243>] print_usage_bug+0x1f5/0x206
> > [ 704.832019] [<ffffffff810e2220>] ? check_usage_forwards+0xa6/0xa6
> > [ 704.832019] [<ffffffff82e922c3>] mark_lock_irq+0x6f/0x120
> > [ 704.832019] [<ffffffff810e2f02>] mark_lock+0xaf/0x122
> > [ 704.832019] [<ffffffff810e3d4e>] mark_held_locks+0x6d/0x95
> > [ 704.832019] [<ffffffff810c5cd1>] ? local_clock+0x36/0x4d
> > [ 704.832019] [<ffffffff810e3de3>] __lockdep_trace_alloc+0x6d/0x6f
> > [ 704.832019] [<ffffffff810e42e7>] lockdep_trace_alloc+0x3d/0x57
> > [ 704.832019] [<ffffffff811837c8>] kmem_cache_alloc_node_trace+0x47/0x1b4
> > [ 704.832019] [<ffffffff810e377d>] ? lock_release_nested+0x9f/0xa6
> > [ 704.832019] [<ffffffff81431650>] ? _xfs_buf_find+0xaa/0x302
> > [ 704.832019] [<ffffffff811710a2>] ? new_vmap_block.constprop.18+0x3a/0x1de
> > [ 704.832019] [<ffffffff811710a2>] new_vmap_block.constprop.18+0x3a/0x1de
> > [ 704.832019] [<ffffffff8117144a>] vb_alloc.constprop.16+0x204/0x225
> > [ 704.832019] [<ffffffff8117149d>] vm_map_ram+0x32/0xaa
> > [ 704.832019] [<ffffffff81430c95>] _xfs_buf_map_pages+0xb3/0xf5
> > [ 704.832019] [<ffffffff81431a6a>] xfs_buf_get+0xd3/0x1ac
> > [ 704.832019] [<ffffffff81492dd9>] xfs_trans_get_buf+0x180/0x244
> > [ 704.832019] [<ffffffff8146947a>] xfs_da_do_buf+0x2a0/0x5cc
> > [ 704.832019] [<ffffffff81469826>] xfs_da_get_buf+0x21/0x23
> > [ 704.832019] [<ffffffff8146f894>] xfs_dir2_data_init+0x44/0xf9
> > [ 704.832019] [<ffffffff8146e94f>] xfs_dir2_sf_to_block+0x1ef/0x5d8
>
> Bug in vm_map_ram - it does an unconditional GFP_KERNEL allocation
> here, and we are in a GFP_NOFS context. We can't pass a gfp_mask to
> vm_map_ram(), so until vm_map_ram() grows that we can't fix it...

This trivial patch should fix it.

The only behavior change is the XFS part:

@@ -406,7 +406,7 @@ _xfs_buf_map_pages(

do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
- -1, PAGE_KERNEL);
+ -1, GFP_NOFS, PAGE_KERNEL);
if (bp->b_addr)
break;
vm_unmap_aliases();

Does that look fine to you?

Thanks,
Fengguang
---

From 7301975d3211da2ce07723c294cf3260229fe84b Mon Sep 17 00:00:00 2001
From: Fengguang Wu <[email protected]>
Date: Thu, 14 Jun 2012 09:38:33 +0800
Subject: [PATCH] mm: add @gfp_mask parameter to vm_map_ram()

XFS needs GFP_NOFS allocation.

Signed-off-by: Fengguang Wu <[email protected]>
---
drivers/firewire/ohci.c | 2 +-
drivers/media/video/videobuf2-dma-sg.c | 1 +
drivers/media/video/videobuf2-vmalloc.c | 2 +-
fs/xfs/xfs_buf.c | 2 +-
include/linux/vmalloc.h | 2 +-
mm/nommu.c | 3 ++-
mm/vmalloc.c | 7 ++++---
7 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/firewire/ohci.c b/drivers/firewire/ohci.c
index c1af05e..b53f5a9 100644
--- a/drivers/firewire/ohci.c
+++ b/drivers/firewire/ohci.c
@@ -1000,7 +1000,7 @@ static int ar_context_init(struct ar_context *ctx, struct fw_ohci *ohci,
for (i = 0; i < AR_WRAPAROUND_PAGES; i++)
pages[AR_BUFFERS + i] = ctx->pages[i];
ctx->buffer = vm_map_ram(pages, AR_BUFFERS + AR_WRAPAROUND_PAGES,
- -1, PAGE_KERNEL);
+ -1, GFP_KERNEL, PAGE_KERNEL);
if (!ctx->buffer)
goto out_of_memory;

diff --git a/drivers/media/video/videobuf2-dma-sg.c b/drivers/media/video/videobuf2-dma-sg.c
index 25c3b36..d087f52 100644
--- a/drivers/media/video/videobuf2-dma-sg.c
+++ b/drivers/media/video/videobuf2-dma-sg.c
@@ -209,6 +209,7 @@ static void *vb2_dma_sg_vaddr(void *buf_priv)
buf->vaddr = vm_map_ram(buf->pages,
buf->sg_desc.num_pages,
-1,
+ GFP_KERNEL,
PAGE_KERNEL);

/* add offset in case userptr is not page-aligned */
diff --git a/drivers/media/video/videobuf2-vmalloc.c b/drivers/media/video/videobuf2-vmalloc.c
index 6b5ca6c..548ebda 100644
--- a/drivers/media/video/videobuf2-vmalloc.c
+++ b/drivers/media/video/videobuf2-vmalloc.c
@@ -111,7 +111,7 @@ static void *vb2_vmalloc_get_userptr(void *alloc_ctx, unsigned long vaddr,
goto fail_get_user_pages;

buf->vaddr = vm_map_ram(buf->pages, buf->n_pages, -1,
- PAGE_KERNEL);
+ GFP_KERNEL, PAGE_KERNEL);
if (!buf->vaddr)
goto fail_get_user_pages;
}
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 172d3cc..b3e7289 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -406,7 +406,7 @@ _xfs_buf_map_pages(

do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
- -1, PAGE_KERNEL);
+ -1, GFP_NOFS, PAGE_KERNEL);
if (bp->b_addr)
break;
vm_unmap_aliases();
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index dcdfc2b..c811763 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -40,7 +40,7 @@ struct vm_struct {
*/
extern void vm_unmap_ram(const void *mem, unsigned int count);
extern void *vm_map_ram(struct page **pages, unsigned int count,
- int node, pgprot_t prot);
+ int node, gfp_t gfp_mask, pgprot_t prot);
extern void vm_unmap_aliases(void);

#ifdef CONFIG_MMU
diff --git a/mm/nommu.c b/mm/nommu.c
index d4b0c10..2fb4ec1 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -416,7 +416,8 @@ void vunmap(const void *addr)
}
EXPORT_SYMBOL(vunmap);

-void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
+void *vm_map_ram(struct page **pages, unsigned int count,
+ int node, gfp_t gfp_mask, pgprot_t prot)
{
BUG();
return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2aad499..3f736d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1088,21 +1088,22 @@ EXPORT_SYMBOL(vm_unmap_ram);
*
* Returns: a pointer to the address that has been mapped, or %NULL on failure
*/
-void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
+void *vm_map_ram(struct page **pages, unsigned int count,
+ int node, gfp_t gfp_mask, pgprot_t prot)
{
unsigned long size = count << PAGE_SHIFT;
unsigned long addr;
void *mem;

if (likely(count <= VMAP_MAX_ALLOC)) {
- mem = vb_alloc(size, GFP_KERNEL);
+ mem = vb_alloc(size, gfp_mask);
if (IS_ERR(mem))
return NULL;
addr = (unsigned long)mem;
} else {
struct vmap_area *va;
va = alloc_vmap_area(size, PAGE_SIZE,
- VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+ VMALLOC_START, VMALLOC_END, node, gfp_mask);
if (IS_ERR(va))
return NULL;

--
1.7.10

2012-06-14 02:07:57

by Minchan Kim

Subject: Re: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

Hi Fengguang,

On 06/14/2012 10:49 AM, Fengguang Wu wrote:

> On Thu, Jun 14, 2012 at 11:20:26AM +1000, Dave Chinner wrote:
>> On Wed, Jun 13, 2012 at 08:39:32PM +0800, Fengguang Wu wrote:
>>> Hi Christoph, Dave,
>>>
>>> I got this lockdep warning on XFS when running the xfs tests:
>>>
>>> [ 704.832019] =================================
>>> [ 704.832019] [ INFO: inconsistent lock state ]
>>> [ 704.832019] 3.5.0-rc1+ #8 Tainted: G W
>>> [ 704.832019] ---------------------------------
>>> [ 704.832019] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
>>> [ 704.832019] fsstress/11619 [HC0[0]:SC0[0]:HE1:SE1] takes:
>>> [ 704.832019] (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff8143953d>] xfs_ilock_nowait+0xd7/0x1d0
>>> [ 704.832019] {IN-RECLAIM_FS-W} state was registered at:
>>> [ 704.832019] [<ffffffff810e30a2>] mark_irqflags+0x12d/0x13e
>>> [ 704.832019] [<ffffffff810e32f6>] __lock_acquire+0x243/0x3f9
>>> [ 704.832019] [<ffffffff810e3a1c>] lock_acquire+0x112/0x13d
>>> [ 704.832019] [<ffffffff810b8931>] down_write_nested+0x54/0x8b
>>> [ 704.832019] [<ffffffff81438fab>] xfs_ilock+0xd8/0x17d
>>> [ 704.832019] [<ffffffff814431b8>] xfs_reclaim_inode+0x4a/0x2cb
>>> [ 704.832019] [<ffffffff814435ee>] xfs_reclaim_inodes_ag+0x1b5/0x28e
>>> [ 704.832019] [<ffffffff814437d7>] xfs_reclaim_inodes_nr+0x33/0x3a
>>> [ 704.832019] [<ffffffff8144050e>] xfs_fs_free_cached_objects+0x15/0x17
>>> [ 704.832019] [<ffffffff81196076>] prune_super+0x103/0x154
>>> [ 704.832019] [<ffffffff81152fa7>] shrink_slab+0x1ec/0x316
>>> [ 704.832019] [<ffffffff8115574f>] balance_pgdat+0x308/0x618
>>> [ 704.832019] [<ffffffff81155c22>] kswapd+0x1c3/0x1dc
>>> [ 704.832019] [<ffffffff810b3f77>] kthread+0xaf/0xb7
>>> [ 704.832019] [<ffffffff82f480b4>] kernel_thread_helper+0x4/0x10
>>
>> ......
>>> [ 704.832019] stack backtrace:
>>> [ 704.832019] Pid: 11619, comm: fsstress Tainted: G W 3.5.0-rc1+ #8
>>> [ 704.832019] Call Trace:
>>> [ 704.832019] [<ffffffff82e92243>] print_usage_bug+0x1f5/0x206
>>> [ 704.832019] [<ffffffff810e2220>] ? check_usage_forwards+0xa6/0xa6
>>> [ 704.832019] [<ffffffff82e922c3>] mark_lock_irq+0x6f/0x120
>>> [ 704.832019] [<ffffffff810e2f02>] mark_lock+0xaf/0x122
>>> [ 704.832019] [<ffffffff810e3d4e>] mark_held_locks+0x6d/0x95
>>> [ 704.832019] [<ffffffff810c5cd1>] ? local_clock+0x36/0x4d
>>> [ 704.832019] [<ffffffff810e3de3>] __lockdep_trace_alloc+0x6d/0x6f
>>> [ 704.832019] [<ffffffff810e42e7>] lockdep_trace_alloc+0x3d/0x57
>>> [ 704.832019] [<ffffffff811837c8>] kmem_cache_alloc_node_trace+0x47/0x1b4
>>> [ 704.832019] [<ffffffff810e377d>] ? lock_release_nested+0x9f/0xa6
>>> [ 704.832019] [<ffffffff81431650>] ? _xfs_buf_find+0xaa/0x302
>>> [ 704.832019] [<ffffffff811710a2>] ? new_vmap_block.constprop.18+0x3a/0x1de
>>> [ 704.832019] [<ffffffff811710a2>] new_vmap_block.constprop.18+0x3a/0x1de
>>> [ 704.832019] [<ffffffff8117144a>] vb_alloc.constprop.16+0x204/0x225
>>> [ 704.832019] [<ffffffff8117149d>] vm_map_ram+0x32/0xaa
>>> [ 704.832019] [<ffffffff81430c95>] _xfs_buf_map_pages+0xb3/0xf5
>>> [ 704.832019] [<ffffffff81431a6a>] xfs_buf_get+0xd3/0x1ac
>>> [ 704.832019] [<ffffffff81492dd9>] xfs_trans_get_buf+0x180/0x244
>>> [ 704.832019] [<ffffffff8146947a>] xfs_da_do_buf+0x2a0/0x5cc
>>> [ 704.832019] [<ffffffff81469826>] xfs_da_get_buf+0x21/0x23
>>> [ 704.832019] [<ffffffff8146f894>] xfs_dir2_data_init+0x44/0xf9
>>> [ 704.832019] [<ffffffff8146e94f>] xfs_dir2_sf_to_block+0x1ef/0x5d8
>>
>> Bug in vm_map_ram - it does an unconditional GFP_KERNEL allocation
>> here, and we are in a GFP_NOFS context. We can't pass a gfp_mask to
>> vm_map_ram(), so until vm_map_ram() grows that we can't fix it...
>
> This trivial patch should fix it.
>
> The only behavior change is the XFS part:
>
> @@ -406,7 +406,7 @@ _xfs_buf_map_pages(
>
> do {
> bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> - -1, PAGE_KERNEL);
> + -1, GFP_NOFS, PAGE_KERNEL);
> if (bp->b_addr)
> break;
> vm_unmap_aliases();
>
> Does that look fine to you?

>
> Thanks,

> Fengguang
> ---
>
> From 7301975d3211da2ce07723c294cf3260229fe84b Mon Sep 17 00:00:00 2001
> From: Fengguang Wu <[email protected]>
> Date: Thu, 14 Jun 2012 09:38:33 +0800
> Subject: [PATCH] mm: add @gfp_mask parameter to vm_map_ram()
>
> XFS needs GFP_NOFS allocation.
>
> Signed-off-by: Fengguang Wu <[email protected]>
> ---
> drivers/firewire/ohci.c | 2 +-
> drivers/media/video/videobuf2-dma-sg.c | 1 +
> drivers/media/video/videobuf2-vmalloc.c | 2 +-
> fs/xfs/xfs_buf.c | 2 +-
> include/linux/vmalloc.h | 2 +-
> mm/nommu.c | 3 ++-
> mm/vmalloc.c | 7 ++++---
> 7 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/firewire/ohci.c b/drivers/firewire/ohci.c
> index c1af05e..b53f5a9 100644
> --- a/drivers/firewire/ohci.c
> +++ b/drivers/firewire/ohci.c
> @@ -1000,7 +1000,7 @@ static int ar_context_init(struct ar_context *ctx, struct fw_ohci *ohci,
> for (i = 0; i < AR_WRAPAROUND_PAGES; i++)
> pages[AR_BUFFERS + i] = ctx->pages[i];
> ctx->buffer = vm_map_ram(pages, AR_BUFFERS + AR_WRAPAROUND_PAGES,
> - -1, PAGE_KERNEL);
> + -1, GFP_KERNEL, PAGE_KERNEL);
> if (!ctx->buffer)
> goto out_of_memory;
>
> diff --git a/drivers/media/video/videobuf2-dma-sg.c b/drivers/media/video/videobuf2-dma-sg.c
> index 25c3b36..d087f52 100644
> --- a/drivers/media/video/videobuf2-dma-sg.c
> +++ b/drivers/media/video/videobuf2-dma-sg.c
> @@ -209,6 +209,7 @@ static void *vb2_dma_sg_vaddr(void *buf_priv)
> buf->vaddr = vm_map_ram(buf->pages,
> buf->sg_desc.num_pages,
> -1,
> + GFP_KERNEL,
> PAGE_KERNEL);
>
> /* add offset in case userptr is not page-aligned */
> diff --git a/drivers/media/video/videobuf2-vmalloc.c b/drivers/media/video/videobuf2-vmalloc.c
> index 6b5ca6c..548ebda 100644
> --- a/drivers/media/video/videobuf2-vmalloc.c
> +++ b/drivers/media/video/videobuf2-vmalloc.c
> @@ -111,7 +111,7 @@ static void *vb2_vmalloc_get_userptr(void *alloc_ctx, unsigned long vaddr,
> goto fail_get_user_pages;
>
> buf->vaddr = vm_map_ram(buf->pages, buf->n_pages, -1,
> - PAGE_KERNEL);
> + GFP_KERNEL, PAGE_KERNEL);
> if (!buf->vaddr)
> goto fail_get_user_pages;
> }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 172d3cc..b3e7289 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -406,7 +406,7 @@ _xfs_buf_map_pages(
>
> do {
> bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> - -1, PAGE_KERNEL);
> + -1, GFP_NOFS, PAGE_KERNEL);
> if (bp->b_addr)
> break;
> vm_unmap_aliases();
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index dcdfc2b..c811763 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -40,7 +40,7 @@ struct vm_struct {
> */
> extern void vm_unmap_ram(const void *mem, unsigned int count);
> extern void *vm_map_ram(struct page **pages, unsigned int count,
> - int node, pgprot_t prot);
> + int node, gfp_t gfp_mask, pgprot_t prot);
> extern void vm_unmap_aliases(void);
>
> #ifdef CONFIG_MMU
> diff --git a/mm/nommu.c b/mm/nommu.c
> index d4b0c10..2fb4ec1 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -416,7 +416,8 @@ void vunmap(const void *addr)
> }
> EXPORT_SYMBOL(vunmap);
>
> -void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
> +void *vm_map_ram(struct page **pages, unsigned int count,
> + int node, gfp_t gfp_mask, pgprot_t prot)
> {
> BUG();
> return NULL;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2aad499..3f736d1 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1088,21 +1088,22 @@ EXPORT_SYMBOL(vm_unmap_ram);
> *
> * Returns: a pointer to the address that has been mapped, or %NULL on failure
> */
> -void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
> +void *vm_map_ram(struct page **pages, unsigned int count,
> + int node, gfp_t gfp_mask, pgprot_t prot)
> {
> unsigned long size = count << PAGE_SHIFT;
> unsigned long addr;
> void *mem;
>
> if (likely(count <= VMAP_MAX_ALLOC)) {
> - mem = vb_alloc(size, GFP_KERNEL);
> + mem = vb_alloc(size, gfp_mask);
> if (IS_ERR(mem))
> return NULL;
> addr = (unsigned long)mem;
> } else {
> struct vmap_area *va;
> va = alloc_vmap_area(size, PAGE_SIZE,
> - VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> + VMALLOC_START, VMALLOC_END, node, gfp_mask);
> if (IS_ERR(va))
> return NULL;
>


It shouldn't work, because vmap_page_range can still allocate with GFP_KERNEL via pud_alloc in vmap_pud_range (see the sketch after the links).
I tried to fix that in [1], but other mm guys wanted to add a WARNING instead [2], so let's avoid passing gfp context.

[1] https://lkml.org/lkml/2012/4/23/77
[2] https://lkml.org/lkml/2012/5/2/340
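
Roughly, the mapping side looks like this (a simplified sketch of the
generic mm code; arch details vary), which is why plumbing a gfp_mask into
vm_map_ram() alone doesn't help:

  /*
   *  vm_map_ram(..., gfp_mask, ...)
   *    vmap_page_range(addr, addr + size, prot, pages)
   *      vmap_pud_range()
   *        pud_alloc(&init_mm, ...)     GFP_KERNEL inside __pud_alloc()
   *        vmap_pmd_range()
   *          pmd_alloc(&init_mm, ...)   GFP_KERNEL inside __pmd_alloc()
   *          vmap_pte_range()
   *            pte_alloc_kernel()       arch pte_alloc_one_kernel(),
   *                                     typically GFP_KERNEL as well
   */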

Thanks.
--
Kind regards,
Minchan Kim

2012-06-14 02:15:54

by Dave Chinner

Subject: Re: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

On Thu, Jun 14, 2012 at 09:49:02AM +0800, Fengguang Wu wrote:
> On Thu, Jun 14, 2012 at 11:20:26AM +1000, Dave Chinner wrote:
> >
> > Bug in vm_map_ram - it does an unconditional GFP_KERNEL allocation
> > here, and we are in a GFP_NOFS context. We can't pass a gfp_mask to
> > vm_map_ram(), so until vm_map_ram() grows that we can't fix it...
>
> This trivial patch should fix it.
>
> The only behavior change is the XFS part:
>
> @@ -406,7 +406,7 @@ _xfs_buf_map_pages(
>
> do {
> bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> - -1, PAGE_KERNEL);
> + -1, GFP_NOFS, PAGE_KERNEL);

This function isn't always called in GFP_NOFS context - readahead
uses different memory allocation semantics (no retry, no warn), so
there are flags that tell it what to do. i.e.

- -1, PAGE_KERNEL);
+ -1, xb_to_gfp(flags), PAGE_KERNEL);
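
(xb_to_gfp() is the existing helper in fs/xfs/xfs_buf.c that maps buffer
flags to a gfp mask; from memory it is roughly the following, though the
exact definition may differ:)

  #define xb_to_gfp(flags) \
          ((((flags) & XBF_READ_AHEAD) ? __GFP_NORETRY : GFP_NOFS) | \
           __GFP_NOWARN)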

Otherwise looks fine to me...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-14 02:21:47

by Tejun Heo

Subject: Re: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

Hello, guys.

On Thu, Jun 14, 2012 at 11:07:53AM +0900, Minchan Kim wrote:
> It shouldn't work because vmap_page_range still can allocate
> GFP_KERNEL by pud_alloc in vmap_pud_range. For it, I tried [1] but
> other mm guys want to add WARNING [2] so let's avoiding gfp context
> passing.
>
> [1] https://lkml.org/lkml/2012/4/23/77
> [2] https://lkml.org/lkml/2012/5/2/340

Yeah, vmalloc area doesn't support !GFP_KERNEL allocations and as
Minchan said, changing this would require updating page table
allocation functions on all archs. This is the same reason why the percpu
allocator doesn't support !GFP_KERNEL allocations, which in turn made
blk-throttle implement its own private percpu pool.

If xfs can't live without GFP_NOFS vmalloc allocations, either it has
to implement its own pool or maybe it's time to implement !GFP_KERNEL
allocs for vmalloc area. I don't know.

Thanks.

--
tejun

2012-06-14 02:39:38

by Minchan Kim

Subject: Re: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

On 06/14/2012 11:21 AM, Tejun Heo wrote:

> Hello, guys.
>
> On Thu, Jun 14, 2012 at 11:07:53AM +0900, Minchan Kim wrote:
>> It shouldn't work because vmap_page_range still can allocate
>> GFP_KERNEL by pud_alloc in vmap_pud_range. For it, I tried [1] but
>> other mm guys want to add WARNING [2] so let's avoiding gfp context
>> passing.
>>
>> [1] https://lkml.org/lkml/2012/4/23/77
>> [2] https://lkml.org/lkml/2012/5/2/340
>
> Yeah, vmalloc area doesn't support !GFP_KERNEL allocations and as
> Minchan said, changing this would require updating page table
> allocation functions on all archs. This is the same reason why percpu
> allocator doesn't support !GFP_KERNEL allocations which in turn made
> blk-throttle implement its own private percpu pool.
>
> If xfs can't live without GFP_NOFS vmalloc allocations, either it has
> to implement its own pool or maybe it's time to implement !GFP_KERNEL
> allocs for vmalloc area. I don't know.


There is another example in ARM:
http://www.spinics.net/lists/arm-kernel/msg179202.html
They are trying to build a pool for atomic vmalloc support. :(
vmalloc supporting only GFP_KERNEL is spreading ad-hoc pools all over the system, sigh.

--
Kind regards,
Minchan Kim

2012-06-14 03:34:37

by Dave Chinner

Subject: Re: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

On Thu, Jun 14, 2012 at 11:07:53AM +0900, Minchan Kim wrote:
> Hi Fengguang,
>
> On 06/14/2012 10:49 AM, Fengguang Wu wrote:
>
> > On Thu, Jun 14, 2012 at 11:20:26AM +1000, Dave Chinner wrote:
> >> On Wed, Jun 13, 2012 at 08:39:32PM +0800, Fengguang Wu wrote:
> >>> Hi Christoph, Dave,
> >>>
> >>> I got this lockdep warning on XFS when running the xfs tests:

[rant warning]

> >> Bug in vm_map_ram - it does an unconditional GFP_KERNEL allocation
> >> here, and we are in a GFP_NOFS context. We can't pass a gfp_mask to
> >> vm_map_ram(), so until vm_map_ram() grows that we can't fix it...
> >
> > This trivial patch should fix it.
.....
>
> It shouldn't work because vmap_page_range still can allocate GFP_KERNEL by pud_alloc in vmap_pud_range.
> For it, I tried [1] but other mm guys want to add WARNING [2] so let's avoiding gfp context passing.

Oh, wonderful, you're pulling the "it's not a MM issue, don't use
vmalloc" card.

> [1] https://lkml.org/lkml/2012/4/23/77

https://lkml.org/lkml/2012/4/24/29

"vmalloc was never supposed to use gfp flags for allocation
"context" restriction. I.e., it was always supposed to have
blocking, fs, and io capable allocation context."

vmalloc always was a badly neglected, ugly step-sister of kmalloc
that was kept in the basement and only brought out when the tax
collector called. But that inner ugliness doesn't change the fact
that beautiful things have been built around it. XFS has used
vm_map_ram() and its predecessor since it was first ported to Linux
some 13 or 14 years ago, so the above claim is way out of date. I.e.
vmalloc has been used in GFP_NOFS context since before that flag
even existed....

http://lkml.org/lkml/2012/4/24/67

"I would say add a bit of warnings and documentation, and see what
can be done about callers."

Wonderful. Well, there's about 2 years of work queued up for me
before I even get to do the open heart surgery that would allow
XFS to handle memory allocation failures at this level without
causing the filesystem to shut down.

Andrew Morton's response:

https://lkml.org/lkml/2012/4/24/413

"There are gruesome problems in block/blk-throttle.c (thread
"mempool, percpu, blkcg: fix percpu stat allocation and remove
stats_lock"). It wants to do an alloc_percpu()->vmalloc() from the
IO submission path, under GFP_NOIO.

Changing vmalloc() to take a gfp_t does make lots of sense, although
I worry a bit about making vmalloc() easier to use!"

OK, so according to Andrew there is no technical reason why it can't
be done, it's just handwaving about "vmalloc is bad"....


> [2] https://lkml.org/lkml/2012/5/2/340

https://lkml.org/lkml/2012/5/2/452

"> Where are these offending callsites?

dm:
...
ubi:
....
ext4:
....
ntfs:
....
ubifs:
....
mm:
....
ceph:
...."

So, we've got a bunch of filesystems that require vmalloc under
GFP_NOFS conditions. Perhaps there's a reason for needing to be able
to do this in filesystem code? Like, perhaps, avoiding memory
reclaim deadlocks?

https://lkml.org/lkml/2012/5/3/27

"Note that in writeback paths, a "good citizen" filesystem should
not require any allocations, or at least it should be able to
tolerate allocation failures. So fixing that would be a good idea
anyway."

Oh, please. I have been hearing this for years, and are we any
closer to it? No, we are further away from being able to
achieve this than ever. Face it, filesystems require memory
allocation to write dirty data to disk, and the amount is almost
impossible to define. Hence mempools can't be used because we can't
give any guarantees of forward progress. And for vmalloc?

Filesystems widely use vmalloc/vm_map_ram because kmalloc fails on
large contiguous allocations. This renders kmalloc unfit for
purpose, so we have to fall back to single page allocation and
vm_map_ram or vmalloc so that the filesystem can function properly.
And to avoid deadlocks, all memory allocation must be able to
specify GFP_NOFS to prevent the MM subsystem from recursing into the
filesystem. Therefore, vmalloc needs to support GFP_NOFS.

I don't care how you make it happen, just fix it. Trying to place
the blame on the filesystem folk for using vmalloc in GFP_NOFS
contexts is a total and utter cop-out, because mm folk of all people
should know that non-zero order kmalloc is not a reliable
alternative....

[end rant]

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-14 03:53:46

by David Rientjes

Subject: Re: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

On Thu, 14 Jun 2012, Dave Chinner wrote:

> Oh, please. I have been hearing this for years, and are we any
> closer to it? No, we are further away from ever being able to
> acheive this than ever. Face it, filesystems require memory
> allocation to write dirty data to disk, and the amount is almost
> impossible to define. Hence mempools can't be used because we can't
> give any guarantees of forward progress. And for vmalloc?
>
> Filesystems widely use vmalloc/vm_map_ram because kmalloc fails on
> large contiguous allocations. This renders kmalloc unfit for
> purpose, so we have to fall back to single page allocation and
> vm_map_ram or vmalloc so that the filesystem can function properly.
> And to avoid deadlocks, all memory allocation must be able to
> specify GFP_NOFS to prevent the MM subsystem from recursing into the
> filesystem. Therefore, vmalloc needs to support GFP_NOFS.
>
> I don't care how you make it happen, just fix it. Trying to place
> the blame on the filesystem folk for using vmalloc in GFP_NOFS
> contexts is a total and utter cop-out, because mm folk of all people
> should know that non-zero order kmalloc is not a reliable
> alternative....
>

I'd actually like to see a demonstrated problem (i.e. not theoretical)
where vmalloc() stalls indefinitely because it's passed GFP_NOFS. I've
never seen one reported.

This is because the per-arch pte allocators have hardwired GFP_KERNEL
flags, but then again they also have __GFP_REPEAT which would cause them
to loop infinitely in the page allocator if a page was not reclaimed,
which has little success without __GFP_FS. But nobody has ever reported a
livelock that was triaged back to passing !__GFP_FS to vmalloc().
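
For example, on x86 the hardwiring looks roughly like this (a from-memory
sketch of arch/x86/mm/pgtable.c around this time, so treat the exact flag
set as an assumption):

  #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)

  pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
  {
          return (pte_t *)__get_free_page(PGALLOC_GFP);
  }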

2012-06-14 05:51:07

by Minchan Kim

Subject: Re: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

On 06/14/2012 12:34 PM, Dave Chinner wrote:

> On Thu, Jun 14, 2012 at 11:07:53AM +0900, Minchan Kim wrote:
>> Hi Fengguang,
>>
>> On 06/14/2012 10:49 AM, Fengguang Wu wrote:
>>
>>> On Thu, Jun 14, 2012 at 11:20:26AM +1000, Dave Chinner wrote:
>>>> On Wed, Jun 13, 2012 at 08:39:32PM +0800, Fengguang Wu wrote:
>>>>> Hi Christoph, Dave,
>>>>>
>>>>> I got this lockdep warning on XFS when running the xfs tests:
>
> [rant warning]
>
>>>> Bug in vm_map_ram - it does an unconditional GFP_KERNEL allocation
>>>> here, and we are in a GFP_NOFS context. We can't pass a gfp_mask to
>>>> vm_map_ram(), so until vm_map_ram() grows that we can't fix it...
>>>
>>> This trivial patch should fix it.
> .....
>>
>> It shouldn't work because vmap_page_range still can allocate GFP_KERNEL by pud_alloc in vmap_pud_range.
>> For it, I tried [1] but other mm guys want to add WARNING [2] so let's avoiding gfp context passing.
>
> Oh, wonderful, you're pulling the "it's not a MM issue, don't use
> vmalloc" card.
>
>> [1] https://lkml.org/lkml/2012/4/23/77
>
> https://lkml.org/lkml/2012/4/24/29
>
> "vmalloc was never supposed to use gfp flags for allocation
> "context" restriction. I.e., it was always supposed to have
> blocking, fs, and io capable allocation context."
>
> vmalloc always was a badly neglected, ugly step-sister of kmalloc
> that was kept in the basement and only brought out when the tax
> collector called. But that inner ugliness doesn't change the fact
> that beatiful things have been built around it. XFS has used
> vm_map_ram() and it's predecessor since it was first ported to linux
> some 13 or 14 years ago, so the above claim is way out of date. i.e.
> vmalloc has been used in GFP_NOFS context since before that flag
> even existed....
>
> http://lkml.org/lkml/2012/4/24/67
>
> "I would say add a bit of warnings and documentation, and see what
> can be done about callers."
>
> Wonderful. Well, there's about 2 years of work queued up for me
> before I even get to the do the open heart surgery that would allow
> XFS to handle memory allocation failures at this level without
> causing the filesystem to shut down.
>
> Andrew Morton's response:
>
> https://lkml.org/lkml/2012/4/24/413
>
> "There are gruesome problems in block/blk-throttle.c (thread
> "mempool, percpu, blkcg: fix percpu stat allocation and remove
> stats_lock"). It wants to do an alloc_percpu()->vmalloc() from the
> IO submission path, under GFP_NOIO.
>
> Changing vmalloc() to take a gfp_t does make lots of sense, although
> I worry a bit about making vmalloc() easier to use!"
>
> OK, so according to Andrew there is no technical reason why it can't
> be done, it's just handwaving about "vmalloc is bad"....
>
>
>> [2] https://lkml.org/lkml/2012/5/2/340
>
> https://lkml.org/lkml/2012/5/2/452
>
> "> Where are these offending callsites?
>
> dm:
> ...
> ubi:
> ....
> ext4:
> ....
> ntfs:
> ....
> ubifs:
> ....
> mm:
> ....
> ceph:
> ...."
>
> So, we've got a bunch of filesystems that require vmalloc under
> GFP_NOFS conditions. Perhaps there's a reason for needing to be able
> to do this in filesystem code? Like, perhaps, avoiding memory
> reclaim deadlocks?
>
> https://lkml.org/lkml/2012/5/3/27
>
> "Note that in writeback paths, a "good citizen" filesystem should
> not require any allocations, or at least it should be able to
> tolerate allocation failures. So fixing that would be a good idea
> anyway."
>
> Oh, please. I have been hearing this for years, and are we any
> closer to it? No, we are further away from ever being able to
> acheive this than ever. Face it, filesystems require memory
> allocation to write dirty data to disk, and the amount is almost
> impossible to define. Hence mempools can't be used because we can't
> give any guarantees of forward progress. And for vmalloc?
>
> Filesystems widely use vmalloc/vm_map_ram because kmalloc fails on
> large contiguous allocations. This renders kmalloc unfit for
> purpose, so we have to fall back to single page allocation and
> vm_map_ram or vmalloc so that the filesystem can function properly.
> And to avoid deadlocks, all memory allocation must be able to
> specify GFP_NOFS to prevent the MM subsystem from recursing into the
> filesystem. Therefore, vmalloc needs to support GFP_NOFS.
>
> I don't care how you make it happen, just fix it. Trying to place
> the blame on the filesystem folk for using vmalloc in GFP_NOFS
> contexts is a total and utter cop-out, because mm folk of all people
> should know that non-zero order kmalloc is not a reliable
> alternative....
>
> [end rant]
>
> Cheers,
>
> Dave.


Again, this is a hot potato.

I understand your claim and I'm sympathetic to your side.
Day by day, people are adding their own memory pools to avoid this problem.
IMHO, that's not good. If we could pass a gfp context through vmalloc, we could save a lot of code.
That means less error-prone code as well as smaller code size.
Ccing Nick. I hope we can reach an agreement about this problem while we have the chance.

--
Kind regards,
Minchan Kim

2012-06-14 07:52:22

by Andreas Dilger

Subject: Re: [PATCH] mm: add gfp_mask parameter to vm_map_ram()

On 2012-06-13, at 9:34 PM, Dave Chinner wrote:
> On Thu, Jun 14, 2012 at 11:07:53AM +0900, Minchan Kim wrote:
>> Hi Fengguang,
>>
>> On 06/14/2012 10:49 AM, Fengguang Wu wrote:
>>
>>> On Thu, Jun 14, 2012 at 11:20:26AM +1000, Dave Chinner wrote:
>>>> On Wed, Jun 13, 2012 at 08:39:32PM +0800, Fengguang Wu wrote:
>>>>> Hi Christoph, Dave,
>>>>>
>>>>> I got this lockdep warning on XFS when running the xfs tests:
>
> [rant warning]

I heartily agree with Dave on this. There are plenty of places
that would be simplified by making vmalloc() work better. Instead,
there are workarounds (e.g. allocating 2-level trees of pointers
to smaller kmalloc'd elements, instead of a single large array)
spread around that add to code complexity and memory inefficiency.
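
The usual shape of that workaround is something like the sketch below. It is
purely illustrative (the names are made up, not lifted from any particular
filesystem):

  #define CHUNK_SLOTS (PAGE_SIZE / sizeof(void *))

  /* Stand-in for one large contiguous array: page-sized chunks of slots. */
  struct split_array {
          size_t  nr;             /* total number of entries */
          void    ***chunks;      /* chunks[i] is one kmalloc'd page of slots */
  };

  static inline void **split_array_slot(struct split_array *a, size_t i)
  {
          return &a->chunks[i / CHUNK_SLOTS][i % CHUNK_SLOTS];
  }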

I don't think there is a real danger of people starting to abuse
vmalloc() for everything, but does it really make sense to keep
some component of the kernel at sub-par functionality/efficiency
just to discourage its use?

Cheers, Andreas

>>>> Bug in vm_map_ram - it does an unconditional GFP_KERNEL allocation
>>>> here, and we are in a GFP_NOFS context. We can't pass a gfp_mask to
>>>> vm_map_ram(), so until vm_map_ram() grows that we can't fix it...
>>>
>>> This trivial patch should fix it.
> .....
>>
>> It shouldn't work because vmap_page_range still can allocate GFP_KERNEL by pud_alloc in vmap_pud_range.
>> For it, I tried [1] but other mm guys want to add WARNING [2] so let's avoiding gfp context passing.
>
> Oh, wonderful, you're pulling the "it's not a MM issue, don't use
> vmalloc" card.
>
>> [1] https://lkml.org/lkml/2012/4/23/77
>
> https://lkml.org/lkml/2012/4/24/29
>
> "vmalloc was never supposed to use gfp flags for allocation
> "context" restriction. I.e., it was always supposed to have
> blocking, fs, and io capable allocation context."
>
> vmalloc always was a badly neglected, ugly step-sister of kmalloc
> that was kept in the basement and only brought out when the tax
> collector called. But that inner ugliness doesn't change the fact
> that beatiful things have been built around it. XFS has used
> vm_map_ram() and it's predecessor since it was first ported to linux
> some 13 or 14 years ago, so the above claim is way out of date. i.e.
> vmalloc has been used in GFP_NOFS context since before that flag
> even existed....
>
> http://lkml.org/lkml/2012/4/24/67
>
> "I would say add a bit of warnings and documentation, and see what
> can be done about callers."
>
> Wonderful. Well, there's about 2 years of work queued up for me
> before I even get to the do the open heart surgery that would allow
> XFS to handle memory allocation failures at this level without
> causing the filesystem to shut down.
>
> Andrew Morton's response:
>
> https://lkml.org/lkml/2012/4/24/413
>
> "There are gruesome problems in block/blk-throttle.c (thread
> "mempool, percpu, blkcg: fix percpu stat allocation and remove
> stats_lock"). It wants to do an alloc_percpu()->vmalloc() from the
> IO submission path, under GFP_NOIO.
>
> Changing vmalloc() to take a gfp_t does make lots of sense, although
> I worry a bit about making vmalloc() easier to use!"
>
> OK, so according to Andrew there is no technical reason why it can't
> be done, it's just handwaving about "vmalloc is bad"....
>
>
>> [2] https://lkml.org/lkml/2012/5/2/340
>
> https://lkml.org/lkml/2012/5/2/452
>
> "> Where are these offending callsites?
>
> dm:
> ...
> ubi:
> ....
> ext4:
> ....
> ntfs:
> ....
> ubifs:
> ....
> mm:
> ....
> ceph:
> ...."
>
> So, we've got a bunch of filesystems that require vmalloc under
> GFP_NOFS conditions. Perhaps there's a reason for needing to be able
> to do this in filesystem code? Like, perhaps, avoiding memory
> reclaim deadlocks?
>
> https://lkml.org/lkml/2012/5/3/27
>
> "Note that in writeback paths, a "good citizen" filesystem should
> not require any allocations, or at least it should be able to
> tolerate allocation failures. So fixing that would be a good idea
> anyway."
>
> Oh, please. I have been hearing this for years, and are we any
> closer to it? No, we are further away from ever being able to
> acheive this than ever. Face it, filesystems require memory
> allocation to write dirty data to disk, and the amount is almost
> impossible to define. Hence mempools can't be used because we can't
> give any guarantees of forward progress. And for vmalloc?
>
> Filesystems widely use vmalloc/vm_map_ram because kmalloc fails on
> large contiguous allocations. This renders kmalloc unfit for
> purpose, so we have to fall back to single page allocation and
> vm_map_ram or vmalloc so that the filesystem can function properly.
> And to avoid deadlocks, all memory allocation must be able to
> specify GFP_NOFS to prevent the MM subsystem from recursing into the
> filesystem. Therefore, vmalloc needs to support GFP_NOFS.
>
> I don't care how you make it happen, just fix it. Trying to place
> the blame on the filesystem folk for using vmalloc in GFP_NOFS
> contexts is a total and utter cop-out, because mm folk of all people
> should know that non-zero order kmalloc is not a reliable
> alternative....
>
> [end rant]
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]


Cheers, Andreas