On Mon, 29 Feb 2016, Kirill A. Shutemov wrote:
> On Sun, Feb 28, 2016 at 08:49:10PM -0800, Hugh Dickins wrote:
> > Commit e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount()
> > for compound pages") changed the famous BUG_ON(page_mapped(page)) in
> > __delete_from_page_cache() to VM_BUG_ON_PAGE(page_mapped(page)): which
> > gives us more info when CONFIG_DEBUG_VM=y, but nothing at all when not.
> >
> > Although it has not usually been very helpul, being hit long after the
> > error in question, we do need to know if it actually happens on users'
> > systems; but reinstating a crash there is likely to be opposed :)
> >
> > In the non-debug case, use WARN_ON() plus dump_page() and add_taint() -
> > I don't really believe LOCKDEP_NOW_UNRELIABLE, but that seems to be the
> > standard procedure now.
>
> So you put here TAINT_WARN plus TAINT_BAD_PAGE. I guess just the second
> would be enough.
You're right, I hadn't thought about the over-tainting at all:
one's enough, yes.
>
> We can replace WARN_ON() with plain page_mapped(page), plus dump_stack()
> below add_taint().
Okay, I'll post another now, but it does remind me why I used WARN_ON():
that was an easy way of printing a standard format, without having to
think too much. Now I'm adding a "BUG: Bad page cache" header line to
make it fit in with the "Bad page map" and "Bad page state" messages.
>
> > Move that, or the VM_BUG_ON_PAGE(), up before
> > the deletion from tree: so that the unNULLified page->mapping gives a
> > little more information.
> >
> > If the inode is being evicted (rather than truncated), it won't have
> > any vmas left, so it's safe(ish) to assume that the raised mapcount is
> > erroneous, and we can discount it from page_count to avoid leaking the
> > page (I'm less worried by leaking the occasional 4kB, than losing a
> > potential 2MB page with each 4kB page leaked).
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> Otherwise,
>
> Acked-by: Kirill A. Shutemov <[email protected]>
Thank you, but since I haven't done it exactly the way you suggest,
I won't assume your Ack carries over to the "Bad page cache" version.
>
> > ---
> > I think this should go into v4.5, so I've written it with an atomic_sub
> > on page->_count; but Joonsoo will probably want some page_ref thingy.
And thanks to Joonsoo for taking it on board.
Hugh
Commit e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount()
for compound pages") changed the famous BUG_ON(page_mapped(page)) in
__delete_from_page_cache() to VM_BUG_ON_PAGE(page_mapped(page)): which
gives us more info when CONFIG_DEBUG_VM=y, but nothing at all when not.
Although it has not usually been very helpul, being hit long after the
error in question, we do need to know if it actually happens on users'
systems; but reinstating a crash there is likely to be opposed :)
In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
but that seems to be the standard procedure now. Move that, or the
VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
unNULLified page->mapping gives a little more information.
If the inode is being evicted (rather than truncated), it won't have
any vmas left, so it's safe(ish) to assume that the raised mapcount is
erroneous, and we can discount it from page_count to avoid leaking the
page (I'm less worried by leaking the occasional 4kB, than losing a
potential 2MB page with each 4kB page leaked).
Signed-off-by: Hugh Dickins <[email protected]>
---
I think this should go into v4.5, so I've written it with an atomic_sub
on page->_count; Joonsoo has noticed, and kindly agreed to page_ref'ify
it for mmotm after it's merged.
mm/filemap.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
--- 4.5-rc6/mm/filemap.c 2016-02-28 09:04:38.816707844 -0800
+++ linux/mm/filemap.c 2016-02-29 22:04:30.229738939 -0800
@@ -195,6 +195,30 @@ void __delete_from_page_cache(struct pag
else
cleancache_invalidate_page(mapping, page);
+ VM_BUG_ON_PAGE(page_mapped(page), page);
+ if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(page_mapped(page))) {
+ int mapcount;
+
+ pr_alert("BUG: Bad page cache in process %s pfn:%05lx\n",
+ current->comm, page_to_pfn(page));
+ dump_page(page, "still mapped when deleted");
+ dump_stack();
+ add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
+
+ mapcount = page_mapcount(page);
+ if (mapping_exiting(mapping) &&
+ page_count(page) >= mapcount + 2) {
+ /*
+ * All vmas have already been torn down, so it's
+ * a good bet that actually the page is unmapped,
+ * and we'd prefer not to leak it: if we're wrong,
+ * some other bad page check should catch it later.
+ */
+ page_mapcount_reset(page);
+ atomic_sub(mapcount, &page->_count);
+ }
+ }
+
page_cache_tree_delete(mapping, page, shadow);
page->mapping = NULL;
@@ -205,7 +229,6 @@ void __delete_from_page_cache(struct pag
__dec_zone_page_state(page, NR_FILE_PAGES);
if (PageSwapBacked(page))
__dec_zone_page_state(page, NR_SHMEM);
- VM_BUG_ON_PAGE(page_mapped(page), page);
/*
* At this point page must be either written or cleaned by truncate.
On Mon, Feb 29, 2016 at 10:45:59PM -0800, Hugh Dickins wrote:
> Commit e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount()
> for compound pages") changed the famous BUG_ON(page_mapped(page)) in
> __delete_from_page_cache() to VM_BUG_ON_PAGE(page_mapped(page)): which
> gives us more info when CONFIG_DEBUG_VM=y, but nothing at all when not.
>
> Although it has not usually been very helpul, being hit long after the
> error in question, we do need to know if it actually happens on users'
> systems; but reinstating a crash there is likely to be opposed :)
>
> In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
> dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
> but that seems to be the standard procedure now. Move that, or the
> VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
> unNULLified page->mapping gives a little more information.
>
> If the inode is being evicted (rather than truncated), it won't have
> any vmas left, so it's safe(ish) to assume that the raised mapcount is
> erroneous, and we can discount it from page_count to avoid leaking the
> page (I'm less worried by leaking the occasional 4kB, than losing a
> potential 2MB page with each 4kB page leaked).
>
> Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On 03/01/2016 01:45 AM, Hugh Dickins wrote:
> Commit e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount()
> for compound pages") changed the famous BUG_ON(page_mapped(page)) in
> __delete_from_page_cache() to VM_BUG_ON_PAGE(page_mapped(page)): which
> gives us more info when CONFIG_DEBUG_VM=y, but nothing at all when not.
>
> Although it has not usually been very helpul, being hit long after the
> error in question, we do need to know if it actually happens on users'
> systems; but reinstating a crash there is likely to be opposed :)
>
> In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
> dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
> but that seems to be the standard procedure now. Move that, or the
> VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
> unNULLified page->mapping gives a little more information.
>
> If the inode is being evicted (rather than truncated), it won't have
> any vmas left, so it's safe(ish) to assume that the raised mapcount is
> erroneous, and we can discount it from page_count to avoid leaking the
> page (I'm less worried by leaking the occasional 4kB, than losing a
> potential 2MB page with each 4kB page leaked).
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> I think this should go into v4.5, so I've written it with an atomic_sub
> on page->_count; Joonsoo has noticed, and kindly agreed to page_ref'ify
> it for mmotm after it's merged.
Hey Hugh,
I seem to be hitting this while fuzzing:
[ 817.413969] BUG: Bad rss-counter state mm:ffff8801b2a1d000 idx:0 val:1
[ 817.413974] BUG: Bad rss-counter state mm:ffff8801b2a1d000 idx:1 val:2
[ 817.413977] BUG: non-zero nr_ptes on freeing mm: 2
[ 817.606318] page:ffffea0002fb6cc0 count:3 mapcount:1 mapping:ffff8801b58129e8 index:0x0
[ 817.606328] flags: 0x1fffff8000002d(locked|referenced|uptodate|lru)
[ 817.606333] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
[ 817.606336] page->mem_cgroup:ffff8801d1ce1be0
[ 817.606489] ------------[ cut here ]------------
[ 817.606493] kernel BUG at mm/filemap.c:196!
[ 817.606509] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 817.606519] Modules linked in:
[ 817.606529] CPU: 0 PID: 23148 Comm: modprobe Tainted: G B 4.6.0-rc3-next-20160412-sasha-00023-g0b02d6d-dirty #2998
[ 817.606535] task: ffff8801acc31000 ti: ffff8801acc38000 task.ti: ffff8801acc38000
[ 817.606573] RIP: __delete_from_page_cache (mm/filemap.c:196 (discriminator 1))
[ 817.606577] RSP: 0000:ffff8801acc3f660 EFLAGS: 00010082
[ 817.606582] RAX: 0000000000000000 RBX: ffffea0002fb6cc0 RCX: 0000000000000000
[ 817.606586] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffffed0035987eb2
[ 817.606591] RBP: ffff8801acc3f7a0 R08: ffff8801d3ddfbf7 R09: ffff8801d3ddfbf5
[ 817.606596] R10: 000000000013000d R11: 00000000e949c223 R12: ffff8801b58129e8
[ 817.606601] R13: ffff8800c7296620 R14: ffff8801acc3f778 R15: ffffea0002b298a0
[ 817.606612] FS: 00007f5159f41700(0000) GS:ffff8801d4200000(0000) knlGS:0000000000000000
[ 817.606617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 817.606622] CR2: 000055c523664d20 CR3: 00000001c28ec000 CR4: 00000000000406f0
[ 817.606629] Stack:
[ 817.606638] ffff8800cb5d43c0 0000000000000282 1ffff10035987ed3 ffffea0002fb6cc8
[ 817.606646] ffff8801acc31000 ffffea0002fb6ce0 0000000000000000 0000000041b58ab3
[ 817.606654] ffffffffa967ddb8 ffffffff9d637a20 ffffffff9d3e26e0 0000000000000000
[ 817.606655] Call Trace:
[ 817.606717] delete_from_page_cache (include/linux/spinlock.h:362 mm/filemap.c:264)
[ 817.606726] truncate_inode_page (mm/truncate.c:165)
[ 817.606735] truncate_inode_pages_range (mm/truncate.c:290)
[ 817.606817] truncate_inode_pages_final (mm/truncate.c:447)
[ 817.606826] v9fs_evict_inode (fs/9p/vfs_inode.c:455)
[ 817.606837] evict (fs/inode.c:548)
[ 817.606855] iput (fs/inode.c:1483 fs/inode.c:1510)
[ 817.606872] __dentry_kill (fs/dcache.c:345 fs/dcache.c:532)
[ 817.606917] dput (fs/dcache.c:786)
[ 817.606938] __fput (fs/file_table.c:227)
[ 817.606947] ____fput (fs/file_table.c:245)
[ 817.606958] task_work_run (kernel/task_work.c:117 (discriminator 1))
[ 817.606969] do_exit (kernel/exit.c:749)
[ 817.607031] do_group_exit (kernel/exit.c:862)
[ 817.607048] SyS_exit_group (kernel/exit.c:889)
[ 817.607056] do_syscall_64 (arch/x86/entry/common.c:350)
[ 817.607076] entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:251)
[ 817.607151] Code: e7 e8 8c 7d 19 00 e8 97 6f f6 ff 48 89 df e8 af 00 07 00 84 c0 74 22 e8 86 6f f6 ff 48 c7 c6 c0 5f 30 a7 48 89 df e8 87 8d 09 00 <0f> 0b 48 c7 c7 60 3a 42
ab e8 cb c7 a8 01 e8 64 6f f6 ff 48 8b
All code
========
0: e7 e8 out %eax,$0xe8
2: 8c 7d 19 mov %?,0x19(%rbp)
5: 00 e8 add %ch,%al
7: 97 xchg %eax,%edi
8: 6f outsl %ds:(%rsi),(%dx)
9: f6 ff idiv %bh
b: 48 89 df mov %rbx,%rdi
e: e8 af 00 07 00 callq 0x700c2
13: 84 c0 test %al,%al
15: 74 22 je 0x39
17: e8 86 6f f6 ff callq 0xfffffffffff66fa2
1c: 48 c7 c6 c0 5f 30 a7 mov $0xffffffffa7305fc0,%rsi
23: 48 89 df mov %rbx,%rdi
26: e8 87 8d 09 00 callq 0x98db2
2b:* 0f 0b ud2 <-- trapping instruction
2d: 48 c7 c7 60 3a 42 ab mov $0xffffffffab423a60,%rdi
34: e8 cb c7 a8 01 callq 0x1a8c804
39: e8 64 6f f6 ff callq 0xfffffffffff66fa2
3e: 48 8b 00 mov (%rax),%rax
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 c7 c7 60 3a 42 ab mov $0xffffffffab423a60,%rdi
9: e8 cb c7 a8 01 callq 0x1a8c7d9
e: e8 64 6f f6 ff callq 0xfffffffffff66f77
13: 48 8b 00 mov (%rax),%rax
[ 817.607160] RIP __delete_from_page_cache (mm/filemap.c:196 (discriminator 1))
[ 817.607162] RSP <ffff8801acc3f660>