2004-03-29 15:07:11

by Andrea Arcangeli

Subject: 2.6.5-rc2-aa5

More fixes. Notably there is a BUG_ON(page->mapping) triggering in
page_remove_rmap in the pagecache case. That could be an ex-pagecache page
being removed from pagecache before all ptes have been zapped; in fact the
page_remove_rmap triggers in the vmtruncate path.

I believe that is a harmless controlled race that could happen in 2.4
too IIRC, so I turned the BUG_ON into a WARN_ON and I added a further
WARN_ON(page->mapcount) in remove_from_page_cache so that I will see two
WARN_ONs instead of just one, and the first WARN_ON will be the new one.
This will give confirmation that it's not a deadly condition but just a
controlled race, and I will see the path that is generating this
controlled race to verify it's not a mistake.
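
The difference between the two kinds of check can be sketched in userspace. This is only a model, not the kernel macros: MODEL_BUG_ON, MODEL_WARN_ON, struct model_page and model_page_remove_rmap are hypothetical names, and the real check in page_remove_rmap also has to account for anonymous pages, which this sketch ignores.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counter so the effect of a warning is observable. */
static int model_warnings;

/* Fatal check: stands in for BUG_ON(), which kills the current task. */
#define MODEL_BUG_ON(cond)                                              \
	do {                                                            \
		if (cond) {                                             \
			fprintf(stderr, "kernel BUG at %s:%d!\n",       \
				__FILE__, __LINE__);                    \
			abort();                                        \
		}                                                       \
	} while (0)

/* Non-fatal check: stands in for WARN_ON(), which reports and continues. */
#define MODEL_WARN_ON(cond)                                             \
	do {                                                            \
		if (cond) {                                             \
			model_warnings++;                               \
			fprintf(stderr, "Badness in %s at %s:%d\n",     \
				__func__, __FILE__, __LINE__);          \
		}                                                       \
	} while (0)

/* Simplified pagecache page: only the two fields discussed in the mail. */
struct model_page {
	void *mapping;  /* non-NULL while the page is in the pagecache */
	int mapcount;   /* number of ptes still mapping the page */
};

/* Drop one pte reference; warn (do not die) if the page already left the pagecache. */
static int model_page_remove_rmap(struct model_page *page)
{
	MODEL_WARN_ON(!page->mapping);
	if (page->mapcount > 0)
		page->mapcount--;
	return page->mapcount;
}
```

With MODEL_WARN_ON the caller keeps running after the report, which is what allows a second warning from another path to show up in the same log.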

Note that this was triggering even before the anon-vma patch was
applied, at the very least this triggers with just objrmap from Dave
applied, and probably it can trigger with mainline 2.6 too.

Comments from Andrew, Hugh, and others about this race are welcome,
thanks.

this is the oops on top of my current code:

Mar 29 09:40:47 linux kernel: ------------[ cut here ]------------
Mar 29 09:40:47 linux kernel: kernel BUG at mm/objrmap.c:354!
Mar 29 09:40:47 linux kernel: invalid operand: 0000 [#1]
Mar 29 09:40:47 linux kernel: CPU: 0
Mar 29 09:40:47 linux kernel: EIP: 0060:[page_remove_rmap+117/128] Not tainted
Mar 29 09:40:47 linux kernel: EIP: 0060:[<c014a1e5>] Not tainted
Mar 29 09:40:47 linux kernel: EFLAGS: 00010246 (2.6.4-32.4-default)
Mar 29 09:40:47 linux kernel: EIP is at page_remove_rmap+0x75/0x80
Mar 29 09:40:47 linux kernel: eax: 00000000 ebx: cbcd24e0 ecx: 000013cd edx: c1093640
Mar 29 09:40:47 linux kernel: esi: 00000000 edi: 00001000 ebp: c1093640 esp: c79d9d8c
Mar 29 09:40:47 linux kernel: ds: 007b es: 007b ss: 0068
Mar 29 09:40:47 linux kernel: Process ld-linux.so.2 (pid: 3751, threadinfo=c79d8000 task=c217f830)
Mar 29 09:40:47 linux kernel: Stack: c0145155 049b2025 40538000 c53a6400 40139000 40138000 c53a6400 40139000
Mar 29 09:40:47 linux kernel: c03f9b38 00001000 40139000 40139000 40139000 c014539b 40139000 00000000
Mar 29 09:40:47 linux kernel: c79d8000 00000001 00000001 ffffffff cbd12200 cc6a0780 c79d9e08 cbd12200
Mar 29 09:40:47 linux kernel: Call Trace:
Mar 29 09:40:47 linux kernel: [unmap_page_range+405/688] unmap_page_range+0x195/0x2b0
Mar 29 09:40:47 linux kernel: [<c0145155>] unmap_page_range+0x195/0x2b0
Mar 29 09:40:47 linux kernel: [unmap_vmas+299/640] unmap_vmas+0x12b/0x280
Mar 29 09:40:47 linux kernel: [<c014539b>] unmap_vmas+0x12b/0x280
Mar 29 09:40:47 linux kernel: [zap_page_range+116/192] zap_page_range+0x74/0xc0
Mar 29 09:40:47 linux kernel: [<c0145564>] zap_page_range+0x74/0xc0
Mar 29 09:40:47 linux kernel: [invalidate_mmap_range_list+162/176] invalidate_mmap_range_list+0xa2/0xb0
Mar 29 09:40:47 linux kernel: [<c0145652>] invalidate_mmap_range_list+0xa2/0xb0
Mar 29 09:40:47 linux kernel: [invalidate_mmap_range+144/160] invalidate_mmap_range+0x90/0xa0
Mar 29 09:40:47 linux kernel: [<c01456f0>] invalidate_mmap_range+0x90/0xa0
Mar 29 09:40:47 linux kernel: [vmtruncate+62/208] vmtruncate+0x3e/0xd0
Mar 29 09:40:47 linux kernel: [<c014573e>] vmtruncate+0x3e/0xd0
Mar 29 09:40:47 linux kernel: [__crc_unregister_framebuffer+5198340/6716426] linvfs_setattr+0x191/0x1a0 [xfs]
Mar 29 09:40:47 linux kernel: [<cfa84841>] linvfs_setattr+0x191/0x1a0 [xfs]
Mar 29 09:40:47 linux kernel: [notify_change+514/576] notify_change+0x202/0x240
Mar 29 09:40:47 linux kernel: [<c016a172>] notify_change+0x202/0x240
Mar 29 09:40:47 linux kernel: [do_truncate+78/128] do_truncate+0x4e/0x80
Mar 29 09:40:47 linux kernel: [<c0151dde>] do_truncate+0x4e/0x80
Mar 29 09:40:47 linux kernel: [sys_ftruncate+225/384] sys_ftruncate+0xe1/0x180
Mar 29 09:40:47 linux kernel: [<c0152731>] sys_ftruncate+0xe1/0x180
Mar 29 09:40:47 linux kernel: [generic_file_llseek+0/208] generic_file_llseek+0x0/0xd0
Mar 29 09:40:47 linux kernel: [<c0152ed0>] generic_file_llseek+0x0/0xd0
Mar 29 09:40:47 linux kernel: [sys_llseek+171/212] sys_llseek+0xab/0xd4
Mar 29 09:40:47 linux kernel: [<c0153f5b>] sys_llseek+0xab/0xd4
Mar 29 09:40:47 linux kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Mar 29 09:40:47 linux kernel: [<c0107e27>] syscall_call+0x7/0xb
Mar 29 09:40:47 linux kernel:
Mar 29 09:40:47 linux kernel: Code: 0f 0b 62 01 1e 3b 30 c0 eb d8 90 55 57 56 53 89 c3 89 cd 8b

This is the same oops but with a kernel before anon-vma was applied (so with
only the objrmap from Dave, and I think it would trigger with mainline 2.6
regardless of objrmap):

Mar 19 18:33:48 linux kernel: ------------[ cut here ]------------
Mar 19 18:33:48 linux kernel: kernel BUG at mm/rmap.c:393!
Mar 19 18:33:48 linux kernel: invalid operand: 0000 [#1]
Mar 19 18:33:48 linux kernel: CPU: 0
Mar 19 18:33:48 linux kernel: EIP: 0060:[page_remove_rmap+219/384] Not tainted
Mar 19 18:33:48 linux kernel: EIP: 0060:[<c0148beb>] Not tainted
Mar 19 18:33:48 linux kernel: EFLAGS: 00010246 (2.6.4-9.19-default)
Mar 19 18:33:48 linux kernel: EIP is at page_remove_rmap+0xdb/0x180
Mar 19 18:33:48 linux kernel: eax: 00000000 ebx: 000004e0 ecx: 00000001 edx: c1000000
Mar 19 18:33:48 linux kernel: esi: c115fce0 edi: 00000000 ebp: 0b3d54e0 esp: c4e79d74
Mar 19 18:33:48 linux kernel: ds: 007b es: 007b ss: 0068
Mar 19 18:33:48 linux kernel: Process ld-linux.so.2 (pid: 2899, threadinfo=c4e78000 task=c66fe780)
Mar 19 18:33:48 linux kernel: Stack: 00000e09 000001d0 c123a800 00000001 40138000 cb3d54e0 00000000 00001000
Mar 19 18:33:48 linux kernel: c0142e50 c115fce0 0afe7025 40538000 40139000 c97ac400 c97ac400 40139000
Mar 19 18:33:48 linux kernel: c03f5c18 00001000 40139000 40139000 40139000 c014304b 40139000 00000000
Mar 19 18:33:48 linux kernel: Call Trace:
Mar 19 18:33:48 linux kernel: [unmap_page_range+464/672] unmap_page_range+0x1d0/0x2a0
Mar 19 18:33:48 linux kernel: [<c0142e50>] unmap_page_range+0x1d0/0x2a0
Mar 19 18:33:48 linux kernel: [unmap_vmas+299/640] unmap_vmas+0x12b/0x280
Mar 19 18:33:48 linux kernel: [<c014304b>] unmap_vmas+0x12b/0x280
Mar 19 18:33:48 linux kernel: [zap_page_range+116/192] zap_page_range+0x74/0xc0
Mar 19 18:33:48 linux kernel: [<c0143214>] zap_page_range+0x74/0xc0
Mar 19 18:33:48 linux kernel: [invalidate_mmap_range_list+162/176] invalidate_mmap_range_list+0xa2/0xb0
Mar 19 18:33:48 linux kernel: [<c0143302>] invalidate_mmap_range_list+0xa2/0xb0
Mar 19 18:33:48 linux kernel: [invalidate_mmap_range+144/160] invalidate_mmap_range+0x90/0xa0
Mar 19 18:33:48 linux kernel: [<c01433a0>] invalidate_mmap_range+0x90/0xa0
Mar 19 18:33:48 linux kernel: [vmtruncate+62/208] vmtruncate+0x3e/0xd0
Mar 19 18:33:48 linux kernel: [<c01433ee>] vmtruncate+0x3e/0xd0
Mar 19 18:33:48 linux kernel: [__crc_cap_capset_set+547932/2369266] linvfs_setattr+0x191/0x1a0 [xfs]
Mar 19 18:33:48 linux kernel: [<cfa3a7b1>] linvfs_setattr+0x191/0x1a0 [xfs]
Mar 19 18:33:48 linux kernel: [notify_change+504/576] notify_change+0x1f8/0x240
Mar 19 18:33:48 linux kernel: [<c0168708>] notify_change+0x1f8/0x240
Mar 19 18:33:48 linux kernel: [do_truncate+54/80] do_truncate+0x36/0x50
Mar 19 18:33:48 linux kernel: [<c0150436>] do_truncate+0x36/0x50
Mar 19 18:33:48 linux kernel: [sys_fstat64+35/48] sys_fstat64+0x23/0x30
Mar 19 18:33:48 linux kernel: [<c015a2d3>] sys_fstat64+0x23/0x30
Mar 19 18:33:48 linux kernel: [sys_ftruncate+225/384] sys_ftruncate+0xe1/0x180
Mar 19 18:33:48 linux kernel: [<c0150d71>] sys_ftruncate+0xe1/0x180
Mar 19 18:33:48 linux kernel: [generic_file_llseek+0/208] generic_file_llseek+0x0/0xd0
Mar 19 18:33:48 linux kernel: [<c01514f0>] generic_file_llseek+0x0/0xd0
Mar 19 18:33:48 linux kernel: [sys_llseek+171/212] sys_llseek+0xab/0xd4
Mar 19 18:33:48 linux kernel: [<c015257b>] sys_llseek+0xab/0xd4
Mar 19 18:33:48 linux kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Mar 19 18:33:48 linux kernel: [<c0106e27>] syscall_call+0x7/0xb
Mar 19 18:33:48 linux kernel:
Mar 19 18:33:48 linux kernel: Code: 0f 0b 89 01 d7 f4 2f c0 3d 60 cb 33 c0 74 2d 85 c9 75 08 0f
Mar 19 18:33:48 linux kernel: <6>note: ld-linux.so.2[2899] exited with preempt_count 1


the testcase is a:

rpm -i glibc-2.3.2-92.src.rpm
rpmbuild -ba /usr/src/packages/SPECS/glibc.spec

I will wait for feedback after my WARN_ON addition; if my theory is right this
new tree will survive the load with just two WARN_ON triggers (one in vmtruncate
like the above, and the other new one in remove_from_page_cache).

Hugh, I haven't looked into your nonlinear-more-graceful-swapout algorithm
yet, but my first prio at the moment is to get the nonlinear swapout working
in an I/O bound manner; making it graceful is very low prio, especially if it
makes the code less obvious.

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.6/2.6.5-rc2-aa5.gz
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.6/2.6.5-rc2-aa5/

Binary files 2.6.5-rc2-aa4/anon-vma.gz and 2.6.5-rc2-aa5/anon-vma.gz differ

Add bugcheck for page->mapcount == 0 in alloc/free page paths
(just in case).

Convert BUG_ON to WARN_ON in page_remove_rmap and added a
WARN_ON(page->mapcount) in remove_from_page_cache, to try
to trap the case of pages being removed from pagecache while
they still had some mapping.

Fix ppc (patch was superfluous apparently, so backed out, only
ppc64 required fixing).

Binary files 2.6.5-rc2-aa4/prio-tree.gz and 2.6.5-rc2-aa5/prio-tree.gz differ

Fixup nonlinear swapout from Rajesh Venkatasubramanian.

Fixed a bugcheck after a split_vma of a MAP_PRIVATE with some anon page
on it.


2004-03-29 20:45:54

by Andrew Morton

Subject: Re: 2.6.5-rc2-aa5

Andrea Arcangeli <[email protected]> wrote:
>
> Notably there is a BUG_ON(page->mapping) triggering in
> page_remove_rmap in the pagecache case. that could be ex-pagecache being
> removed from pagecache before all ptes have been zapped, in fact the
> page_remove_rmap triggers in the vmtruncate path.

Confused. vmtruncate zaps the ptes before removing pages from pagecache,
so I'd expect a non-null ->mapping in page_remove_rmap() is a very common
thing. truncate a file which someone has mmapped and it'll happen every
time, will it not?
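
The access pattern described here (truncating a file while a mapping of it is still live) can be exercised from userspace. A minimal sketch follows; it is not the testcase from the original report (that was the glibc rpmbuild), and the function name truncate_while_mapped and the temporary file path are hypothetical. It only drives the same mmap plus ftruncate sequence and makes no claim about whether a given kernel will log anything.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one page of a temporary file, fault it in, then truncate the file
 * to zero length while the mapping is still in place.
 * Returns 0 on success, -1 on any failure.
 */
static int truncate_while_mapped(void)
{
	char path[] = "/tmp/trunc-mapped-XXXXXX";   /* hypothetical location */
	long page = sysconf(_SC_PAGESIZE);
	int fd = mkstemp(path);
	int ret = -1;
	char *map;

	if (fd < 0)
		return -1;
	unlink(path);                     /* file lives on until close() */

	if (ftruncate(fd, page) == 0) {
		map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map != MAP_FAILED) {
			map[0] = 1;       /* fault the page in: a pte now maps it */
			ret = ftruncate(fd, 0);   /* truncate under the live mapping */
			/* Do not touch map[] here: the page is beyond EOF now. */
			munmap(map, (size_t)page);
		}
	}
	close(fd);
	return ret;
}
```

The mapping is deliberately not accessed after the second ftruncate, since touching a page beyond end of file would raise SIGBUS.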

2004-03-29 22:45:55

by Andrea Arcangeli

Subject: Re: 2.6.5-rc2-aa5

On Mon, Mar 29, 2004 at 12:48:03PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > Notably there is a BUG_ON(page->mapping) triggering in
> > page_remove_rmap in the pagecache case. that could be ex-pagecache being
> > removed from pagecache before all ptes have been zapped, in fact the
> > page_remove_rmap triggers in the vmtruncate path.
>
> Confused. vmtruncate zaps the ptes before removing pages from pagecache,
> so I'd expect a non-null ->mapping in page_remove_rmap() is a very common

the bugcheck was for NULL ->mapping in page_remove_rmap:

BUG_ON(!page->mapping);

I tend to forget the ! in the pseudocode in emails, sorry (today I did it
twice, luckily I didn't get it wrong in the actual patches ;).

> thing. truncate a file which someone has mmapped and it'll happen every
> time, will it not?

as you say vmtruncate zaps the pte _first_, so the page->mapcount should
be down to 0 by the time we set page->mapping = NULL.

the thing I was wondering about is the controlled race where some page
can go out of pagecache despite still being mapped somewhere, that could
happen in the past IIRC.

2004-03-30 16:11:08

by Andrea Arcangeli

Subject: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, Mar 30, 2004 at 12:45:26AM +0200, Andrea Arcangeli wrote:
> On Mon, Mar 29, 2004 at 12:48:03PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > Notably there is a BUG_ON(page->mapping) triggering in
> > > page_remove_rmap in the pagecache case. that could be ex-pagecache being
> > > removed from pagecache before all ptes have been zapped, in fact the
> > > page_remove_rmap triggers in the vmtruncate path.
> >
> > Confused. vmtruncate zaps the ptes before removing pages from pagecache,
> > so I'd expect a non-null ->mapping in page_remove_rmap() is a very common
>
> the bugcheck was for NULL ->mapping in page_remove_rmap:
>
> BUG_ON(!page->mapping);
>
> I tend to forget the ! in the pseudocode in emails sorry (today I did it
> twice, luckily I didn't get it wrong in the actual patches ;).
>
> > thing. truncate a file which someone has mmapped and it'll happen every
> > time, will it not?
>
> as you say vmtruncate zaps the pte _first_, so the page->mapcount should
> be down to 0 by the time we set page->mapping = NULL.
>
> the thing I was wondering about is the controlled race where some page
> can go out of pagecache despite still being mapped somewhere, that could
> happen in the past IIRC.

here we go, my new debugging WARN_ON in __remove_from_page_cache
triggered just before the other one in page_remove_rmap; as I expected,
it was truncate removing pages from pagecache before all mappings were
dropped:

Mar 30 01:27:56 linux -- MARK --
Mar 30 01:37:18 linux kernel: Badness in __remove_from_page_cache at mm/filemap.c:107
Mar 30 01:37:18 linux kernel: Call Trace:
Mar 30 01:37:18 linux kernel: [__remove_from_page_cache+138/160] __remove_from_page_cache+0x8a/0xa0
Mar 30 01:37:18 linux kernel: [<c013bc4a>] __remove_from_page_cache+0x8a/0xa0
Mar 30 01:37:18 linux kernel: [remove_from_page_cache+27/39] remove_from_page_cache+0x1b/0x27
Mar 30 01:37:18 linux kernel: [<c013bc7b>] remove_from_page_cache+0x1b/0x27
Mar 30 01:37:18 linux kernel: [truncate_complete_page+45/192] truncate_complete_page+0x2d/0xc0
Mar 30 01:37:18 linux kernel: [<c0141d6d>] truncate_complete_page+0x2d/0xc0
Mar 30 01:37:18 linux kernel: [truncate_inode_pages+163/768] truncate_inode_pages+0xa3/0x300
Mar 30 01:37:18 linux kernel: [<c0141ea3>] truncate_inode_pages+0xa3/0x300
Mar 30 01:37:18 linux kernel: [__crc_tty_wait_until_sent+6263840/8120074] xfs_bmap_last_offset+0xed/0x100 [xfs]
Mar 30 01:37:18 linux kernel: [<cfa31f5d>] xfs_bmap_last_offset+0xed/0x100 [xfs]
Mar 30 01:37:18 linux kernel: [__crc_tty_wait_until_sent+6459933/8120074] xfs_itruncate_start+0x6a/0x90 [xfs]
Mar 30 01:37:18 linux kernel: [<cfa61d5a>] xfs_itruncate_start+0x6a/0x90 [xfs]
Mar 30 01:37:18 linux kernel: [__crc_tty_wait_until_sent+6571075/8120074] xfs_setattr+0xc30/0xe40 [xfs]
Mar 30 01:37:18 linux kernel: [<cfa7cf80>] xfs_setattr+0xc30/0xe40 [xfs]
Mar 30 01:37:18 linux kernel: [__crc_tty_wait_until_sent+6601861/8120074] linvfs_setattr+0x112/0x1a0 [xfs]
Mar 30 01:37:18 linux kernel: [<cfa847c2>] linvfs_setattr+0x112/0x1a0 [xfs]
Mar 30 01:37:18 linux kernel: [notify_change+514/576] notify_change+0x202/0x240
Mar 30 01:37:18 linux kernel: [<c016b4d2>] notify_change+0x202/0x240
Mar 30 01:37:18 linux kernel: [do_truncate+78/128] do_truncate+0x4e/0x80
Mar 30 01:37:18 linux kernel: [<c01530de>] do_truncate+0x4e/0x80
Mar 30 01:37:18 linux kernel: [sys_ftruncate+225/384] sys_ftruncate+0xe1/0x180
Mar 30 01:37:18 linux kernel: [<c0153a31>] sys_ftruncate+0xe1/0x180
Mar 30 01:37:18 linux kernel: [generic_file_llseek+0/208] generic_file_llseek+0x0/0xd0
Mar 30 01:37:18 linux kernel: [<c01541d0>] generic_file_llseek+0x0/0xd0
Mar 30 01:37:18 linux kernel: [sys_llseek+171/212] sys_llseek+0xab/0xd4
Mar 30 01:37:18 linux kernel: [<c015525b>] sys_llseek+0xab/0xd4
Mar 30 01:37:18 linux kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Mar 30 01:37:18 linux kernel: [<c0107e27>] syscall_call+0x7/0xb
Mar 30 01:37:18 linux kernel:
Mar 30 01:37:18 linux kernel: Badness in page_remove_rmap at mm/objrmap.c:379
Mar 30 01:37:18 linux kernel: Call Trace:
Mar 30 01:37:18 linux kernel: [page_remove_rmap+137/152] page_remove_rmap+0x89/0x98
Mar 30 01:37:18 linux kernel: [<c014bdb9>] page_remove_rmap+0x89/0x98
Mar 30 01:37:18 linux kernel: [unmap_page_range+405/688] unmap_page_range+0x195/0x2b0
Mar 30 01:37:18 linux kernel: [<c0145f55>] unmap_page_range+0x195/0x2b0
Mar 30 01:37:18 linux kernel: [unmap_vmas+299/640] unmap_vmas+0x12b/0x280
Mar 30 01:37:18 linux kernel: [<c014619b>] unmap_vmas+0x12b/0x280
Mar 30 01:37:18 linux kernel: [zap_page_range+116/192] zap_page_range+0x74/0xc0
Mar 30 01:37:18 linux kernel: [<c0146364>] zap_page_range+0x74/0xc0
Mar 30 01:37:18 linux kernel: [invalidate_mmap_range_list+163/208] invalidate_mmap_range_list+0xa3/0xd0
Mar 30 01:37:18 linux kernel: [<c0146453>] invalidate_mmap_range_list+0xa3/0xd0
Mar 30 01:37:18 linux kernel: [invalidate_mmap_range+148/176] invalidate_mmap_range+0x94/0xb0
Mar 30 01:37:18 linux kernel: [<c0146514>] invalidate_mmap_range+0x94/0xb0
Mar 30 01:37:18 linux kernel: [vmtruncate+62/208] vmtruncate+0x3e/0xd0
Mar 30 01:37:18 linux kernel: [<c014656e>] vmtruncate+0x3e/0xd0
Mar 30 01:37:18 linux kernel: [__crc_tty_wait_until_sent+6601988/8120074] linvfs_setattr+0x191/0x1a0 [xfs]
Mar 30 01:37:18 linux kernel: [<cfa84841>] linvfs_setattr+0x191/0x1a0 [xfs]
Mar 30 01:37:18 linux kernel: [notify_change+514/576] notify_change+0x202/0x240
Mar 30 01:37:18 linux kernel: [<c016b4d2>] notify_change+0x202/0x240
Mar 30 01:37:18 linux kernel: [do_truncate+78/128] do_truncate+0x4e/0x80
Mar 30 01:37:18 linux kernel: [<c01530de>] do_truncate+0x4e/0x80
Mar 30 01:37:18 linux kernel: [sys_ftruncate+225/384] sys_ftruncate+0xe1/0x180
Mar 30 01:37:18 linux kernel: [<c0153a31>] sys_ftruncate+0xe1/0x180
Mar 30 01:37:18 linux kernel: [generic_file_llseek+0/208] generic_file_llseek+0x0/0xd0
Mar 30 01:37:18 linux kernel: [<c01541d0>] generic_file_llseek+0x0/0xd0
Mar 30 01:37:18 linux kernel: [sys_llseek+171/212] sys_llseek+0xab/0xd4
Mar 30 01:37:18 linux kernel: [<c015525b>] sys_llseek+0xab/0xd4
Mar 30 01:37:18 linux kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Mar 30 01:37:18 linux kernel: [<c0107e27>] syscall_call+0x7/0xb
Mar 30 01:37:18 linux kernel:

I haven't focused too much on the mremap race yet (that's the primary
reason why I asked Hugh to extract the fix, so I can focus on the fix
without being distracted by the handle_mm_fault and other stuff for
early-cow in mremap that anonmm requires), but are you sure mremap
will lead to removing pages from pagecache with mapping still intact?
(we're not talking about random ptes and random reference counts in
page->count here, we're talking about ptes being actively tracked by
objrmap and pages going away from pagecache! In fact in my tree mremap
never calls page_remove_rmap or page_add_rmap ever, there's no need of
doing so with anon-vma+objrmap in mremap since anon-vma+objrmap is
by design mremap-aware, I dropped both of them).

the funny thing is that it seems to be the same truncate doing
truncate_inode_pages first, and zap_page_range later. It would be better
if WARN_ON would show the pid of the task too, if they were two
different tasks that would be more realistic. Maybe an xfs screwup of
some sort? I could ask the tester to try again with ext2, but then if it
doesn't trigger anymore we still have to wonder about timings.
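
A warning that also records the reporting task, as wished for above, could look roughly like the following userspace sketch. It is an illustration only: model_warn, MODEL_WARN_ON_PID and last_warning are hypothetical names, getpid() stands in for whatever the kernel would read from the current task, and the message layout merely imitates the "Badness in" lines in the log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Last formatted warning, kept so callers can inspect what was reported. */
static char last_warning[160];

/* Format and print one warning line, including the pid of the reporter. */
static void model_warn(const char *func, const char *file, int line, pid_t pid)
{
	snprintf(last_warning, sizeof(last_warning),
		 "Badness in %s at %s:%d (pid: %ld)",
		 func, file, line, (long)pid);
	fprintf(stderr, "%s\n", last_warning);
}

/* Non-fatal check that tags the report with the calling process id. */
#define MODEL_WARN_ON_PID(cond)                                         \
	do {                                                            \
		if (cond)                                               \
			model_warn(__func__, __FILE__, __LINE__,        \
				   getpid());                           \
	} while (0)
```

With the pid in both reports, two warnings from the same truncate call could be told apart from two warnings raised by different tasks racing with each other.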

Anyway, the kernel is now solid; it just prints those warnings so we
don't forget. I don't think it's a bug in my tree.

Comments welcome.

2004-03-30 18:01:48

by Hugh Dickins

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, 30 Mar 2004, Andrea Arcangeli wrote:
>
> the funny thing is that it seems to be the same truncate doing
> truncate_inode_pages first, and zap_page_range later. It would be better
> if WARN_ON would show the pid of the task too, if they were two
> different tasks that would be more realistic. Maybe an xfs screwup of
> some sort? I could ask the tester to try again with ext2, but then if it
> doesn't trigger anymore we still have to wonder about timings.
>
> Anyways now the kernel is solid, it just bugs out those warnings so we
> don't forget. I don't think it's a bug in my tree.

I do think it's something to worry about, it does seem peculiar.

Dunno why, but I never received the first mail in this thread,
neither directly nor via the list, but have now got it from MARC.

I doubt this is the cause of the problem (would not, I think,
cause all of the associated symptoms you describe), but I think it
is a bug in your code which could cause the WARN_ON(!page->mapping):

Imagine if the filesystem (or driver) nopage gave you the empty zero
page for a private writable mapping (it better not give it you for a
shared writable mapping!), perhaps to represent a hole in the file.

I think it will pass the various tests in your do_no_page, and if
it's a write_access, that will correctly copy the page and set_pte
for this private copy: but it doesn't update pageable (from 0 to 1)
for it, so skips the page_add_rmap; and eventually page_remove_rmap
will be passed this page with neither PageAnon nor page->mapping.

As I say, I doubt it's your case, but worth fixing:
force pageable on where you set anon in do_no_page.

Hugh

2004-03-30 18:20:31

by Andrea Arcangeli

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, Mar 30, 2004 at 07:01:44PM +0100, Hugh Dickins wrote:
> On Tue, 30 Mar 2004, Andrea Arcangeli wrote:
> >
> > the funny thing is that it seems to be the same truncate doing
> > truncate_inode_pages first, and zap_page_range later. It would be better
> > if WARN_ON would show the pid of the task too, if they were two
> > different tasks that would be more realistic. Maybe an xfs screwup of
> > some sort? I could ask the tester to try again with ext2, but then if it
> > doesn't trigger anymore we still have to wonder about timings.
> >
> > Anyways now the kernel is solid, it just bugs out those warnings so we
> > don't forget. I don't think it's a bug in my tree.
>
> I do think it's something to worry about, it does seem peculiar.
>
> Dunno why, but I never received the first mail in this thread,
> neither directly nor via the list, but have now got it from MARC.
>
> I doubt this is the cause of the problem (would not, I think,
> cause all of the associated symptoms you describe), but I think it
> is a bug in your code which could cause the WARN_ON(!page->mapping):

note that the very same bug triggers with only objrmap applied (before I
applied anon-vma and prio-tree on top of it), so at the very least this is a
bug in Dave's code too. See the same BUG_ON triggering in rmap.c before
I replaced it with objrmap.c in anon-vma. Almost certainly it will trigger with
your patches applied too, and probably it happens with mainline 2.6 too,
but nobody has tested that yet.

> Imagine if the filesystem (or driver) nopage gave you the empty zero
> page for a private writable mapping (it better not give it you for a
> shared writable mapping!), perhaps to represent a hole in the file.

zero page is reserved, so page_add_rmap does nothing on it, zero page is
guaranteed to have page->mapcount == 0.

> I think it will pass the various tests in your do_no_page, and if

zero page will get intercepted in the page-reserved checks, so
page_add_rmap is a noop and the zero page has always page->mapcount == 0.

Furthermore I WARN_ON if anybody returns a page->mapping == NULL on a
non-reserved VMA, so it can't be the case (zeropage has page->mapping =
NULL).

> it's a write_access, that will correctly copy the page and set_pte
> for this private copy: but it doesn't update pageable (from 0 to 1)
> for it, so skips the page_add_rmap; and eventually page_remove_rmap
> will be passed this page with neither PageAnon nor page->mapping.

A COW on a zero page will call page_add_rmap on the copy just fine and
it will call page_remove_rmap on the old page as well, though the
latter will be a noop for a reserved page or for a page with
page->mapping == 0 like the zero page is.

> As I say, I doubt it's your case, but worth fixing:
> force pageable on where you set anon in do_no_page.

there's no bug to fix in my code as far as I can tell. You probably
overlooked that the zeropage is reserved.

2004-03-30 18:28:39

by Andrew Morton

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

Andrea Arcangeli <[email protected]> wrote:
>
> here we go, my new debugging WARN_ON in __remove_from_page_cache
> triggered just before the other one in page_remove_rmap, as I expected
> it was truncate removing pages from pagecache before all mappings were
> dropped:

XFS is doing peculiar things - xfs_setattr calls truncate_inode_pages()
before running vmtruncate().

xfs_setattr
  ->xfs_itruncate_start
    ->VOP_TOSS_PAGES
      ->fs_tosspages
        ->truncate_inode_pages
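
Why the order of the two steps matters can be shown with a small userspace model. This is a sketch under stated assumptions, not kernel code: struct model_page and all function names are hypothetical, zap_ptes stands in for the zap_page_range side (with the page_remove_rmap check inside it) and drop_from_pagecache stands in for the truncate_inode_pages side (with the added mapcount check).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified pagecache page: only the two fields the checks look at. */
struct model_page {
	void *mapping;  /* non-NULL while the page is in the pagecache */
	int mapcount;   /* number of ptes still mapping the page */
};

/* Number of checks that fired during one modelled truncate. */
static int warnings;

/* Remove every pte; each removal checks the page is still in the pagecache. */
static void zap_ptes(struct model_page *page)
{
	while (page->mapcount > 0) {
		if (!page->mapping)
			warnings++;   /* models the check in page_remove_rmap */
		page->mapcount--;
	}
}

/* Remove the page from the pagecache; checks that no pte maps it any more. */
static void drop_from_pagecache(struct model_page *page)
{
	if (page->mapcount)
		warnings++;           /* models the added mapcount check */
	page->mapping = NULL;
}

/* Order described for vmtruncate: zap the ptes first, then drop the page. */
static int truncate_generic_order(struct model_page *page)
{
	warnings = 0;
	zap_ptes(page);
	drop_from_pagecache(page);
	return warnings;
}

/* Order seen in the XFS trace: drop the page first, zap the ptes later. */
static int truncate_xfs_order(struct model_page *page)
{
	warnings = 0;
	drop_from_pagecache(page);
	zap_ptes(page);
	return warnings;
}
```

For a page mapped by one pte, the first order reports nothing and the second reports twice, once from each check, which matches the pair of warnings in the log above.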


2004-03-30 18:48:47

by Hugh Dickins

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, 30 Mar 2004, Andrea Arcangeli wrote:
>
> note that the very same bug triggers with objrmap only applied (before I
> applied anon-vma and prio-tree on top of it), so at very least this is a
> bug in Dave's code too. See the same BUG_ON triggering in rmap.c before
> I replace it with objrmap.c in anon-vma. Almost certainly it will trigger with
> your patches applied too and probably it happens with mainline 2.6 too
> but nobody tested that yet.

Do you have enough evidence that it's the very same bug?
I believe there were other loopholes in the original objrmap code,
we've both moved on from there (e.g. we both decided it's safer to
set and clear PageAnon inside the maplock), so I'm not concerned
about the original objrmap.

> > Imagine if the filesystem (or driver) nopage gave you the empty zero
> > page for a private writable mapping (it better not give it you for a
> > shared writable mapping!), perhaps to represent a hole in the file.
>
> zero page is reserved, so page_add_rmap does nothing on it, zero page is
> guaranteed to have page->mapcount == 0.

Yes, I'm not talking about page_add_rmap on the zero page, but
about how with pageable 0 you skip page_add_rmap of copy of zero page.

> > I think it will pass the various tests in your do_no_page, and if
>
> zero page will get intercepted in the page-reserved checks, so
> page_add_rmap is a noop and the zero page has always page->mapcount == 0.

Yes, repeat sentence above.

> Furthermore I WARN_ON if anybody returns a page->mapping == NULL on a
> non-reserved VMA, so it can't be the case (zeropage has page->mapping =
> NULL).

Perhaps that's in a different version than I'm looking at, 2.6.5-rc2-aa5.

> > it's a write_access, that will correctly copy the page and set_pte
> > for this private copy: but it doesn't update pageable (from 0 to 1)
> > for it, so skips the page_add_rmap; and eventually page_remove_rmap
> > will be passed this page with neither PageAnon nor page->mapping.
>
> A COW on a zero page will call page_add_rmap on the copy just fine and

Where does "pageable" get set to 1 in do_no_page's COW case?

> it will call page_remove_rmap on the old page as well, though the
> latter will be a noop for a reserved page or for a page with
> page->mapping == 0 like the zero page is.

It'll get a page which is !PageReserved !PageAnon !page->mapping.

> > As I say, I doubt it's your case, but worth fixing:
> > force pageable on where you set anon in do_no_page.
>
> there's no bug to fix in my code as far as I can tell. You probably
> overlooked the zeropage is reserved.

Er, probably not. I may be wrong, but you're not convincing me.
Go slower, read my mail and read your code, then explain where
"pageable" gets set to 1 if do_no_page is COWing the zero page.

Hugh

2004-03-30 18:53:20

by Andrea Arcangeli

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, Mar 30, 2004 at 10:28:34AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > here we go, my new debugging WARN_ON in __remove_from_page_cache
> > triggered just before the other one in page_remove_rmap, as I expected
> > it was truncate removing pages from pagecache before all mappings were
> > dropped:
>
> XFS is doing peculiar things - xfs_setattr calls truncate_inode_pages()
> before running vmtruncate().
>
> xfs_setattr
> ->xfs_itruncate_start
> ->VOP_TOSS_PAGES
> ->fs_tosspages
> ->truncate_inode_pages

Ok, so objrmap needs my WARN_ON changes to survive the above. I believe
I can close the bug as fixed now (however I will leave the WARN_ON in
the code).

Still xfs seems pretty broken doing the above.

2004-03-30 19:02:20

by Andrea Arcangeli

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, Mar 30, 2004 at 07:48:42PM +0100, Hugh Dickins wrote:
> On Tue, 30 Mar 2004, Andrea Arcangeli wrote:
> >
> > note that the very same bug triggers with objrmap only applied (before I
> > applied anon-vma and prio-tree on top of it), so at very least this is a
> > bug in Dave's code too. See the same BUG_ON triggering in rmap.c before
> > I replace it with objrmap.c in anon-vma. Almost certainly it will trigger with
> > your patches applied too and probably it happens with mainline 2.6 too
> > but nobody tested that yet.
>
> Do you have enough evidence that it's the very same bug?

yes, see the two stack traces, they trigger in the same place and it's
the very same workload. Andrew just noticed that xfs indeed calls
truncate_inode_pages before vmtruncate. It will trigger with your
patches too.

> I believe there were other loopholes in the original objrmap code,
> we've both moved on from there (e.g. we both decided it's safer to
> set and clear PageAnon inside the maplock), so I'm not concerned
> about the original objrmap.

Ok I see what you mean, this should fix it, agreed?

--- x/mm/memory.c.~1~	2004-03-29 19:24:39.000000000 +0200
+++ x/mm/memory.c	2004-03-30 20:57:35.889344056 +0200
@@ -1448,6 +1448,7 @@ retry:
 		lru_cache_add_active(page);
 		new_page = page;
 		anon = 1;
+		pageable = 1;
 	}

 	spin_lock(&mm->page_table_lock);


In practice doing a cow on a dma region is pretty useless, so I doubt
anybody would run into it but I agree the above is making the anon page
swappable and it's more correct.
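
The effect of the one-line fix can be modelled in userspace. The sketch below is built only from the description in this thread, so its details are assumptions: struct model_page and model_do_no_page are hypothetical, pageable is assumed to start out as "source page is not reserved", and page_add_rmap is reduced to a mapcount increment.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified page: whether it is reserved and how many ptes are tracked. */
struct model_page {
	bool reserved;
	int mapcount;
};

/*
 * Model of the fault path discussed above.  Returns the tracked mapcount
 * of the page that ends up mapped by the new pte.
 */
static int model_do_no_page(bool source_reserved, bool write_access,
			    bool with_fix)
{
	struct model_page source = { source_reserved, 0 };
	struct model_page copy = { false, 0 };
	struct model_page *new_page = &source;
	bool pageable = !source.reserved;   /* assumption, see lead-in */

	if (write_access) {
		/* Private writable mapping: map a private copy instead. */
		new_page = &copy;
		if (with_fix)
			pageable = true;    /* the added "pageable = 1" */
	}

	if (pageable)
		new_page->mapcount++;       /* stands in for page_add_rmap */

	return new_page->mapcount;
}
```

Without the fix, a write fault on a reserved source page leaves the private copy mapped but untracked (mapcount 0), so the later page_remove_rmap sees a page that is neither anonymous-tracked nor in the pagecache; with the fix the copy is tracked like any other anonymous page.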

2004-03-30 19:06:47

by Hugh Dickins

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, 30 Mar 2004, Andrea Arcangeli wrote:
> On Tue, Mar 30, 2004 at 07:48:42PM +0100, Hugh Dickins wrote:
> >
> > Do you have enough evidence that it's the very same bug?
>
> yes, see the two stack traces, they trigger in the same place and it's
> the very same workload. Andrew just noticed that xfs indeed calls
> truncate_inode_pages before vmtruncate. It will trigger with your
> patches too.

Yes, Andrew has got it (and I agree XFS is wrong to be doing that).

> Ok I see what you mean, this should fix it, agreed?

Yes, that's exactly the fix (for when COWing a reserved page).

Hugh

2004-03-30 19:12:31

by Andrea Arcangeli

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, Mar 30, 2004 at 08:06:42PM +0100, Hugh Dickins wrote:
> On Tue, 30 Mar 2004, Andrea Arcangeli wrote:
> > On Tue, Mar 30, 2004 at 07:48:42PM +0100, Hugh Dickins wrote:
> > >
> > > Do you have enough evidence that it's the very same bug?
> >
> > yes, see the two stack traces, they trigger in the same place and it's
> > the very same workload. Andrew just noticed that xfs indeed calls
> > truncate_inode_pages before vmtruncate. It will trigger with your
> > patches too.
>
> Yes, Andrew has got it (and I agree XFS is wrong to be doing that).

indeed.

> > Ok I see what you mean, this should fix it, agreed?
>
> Yes, that's exactly the fix (for when COWing a reserved page).

thanks, great spotting!

2004-03-30 20:13:41

by Nathan Scott

Subject: Re: mapped pages being truncated [was Re: 2.6.5-rc2-aa5]

On Tue, Mar 30, 2004 at 10:28:34AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > here we go, my new debugging WARN_ON in __remove_from_page_cache
> > triggered just before the other one in page_remove_rmap, as I expected
> > it was truncate removing pages from pagecache before all mappings were
> > dropped:
>
> XFS is doing peculiar things - xfs_setattr calls truncate_inode_pages()
> before running vmtruncate().
>
> xfs_setattr
> ->xfs_itruncate_start
> ->VOP_TOSS_PAGES
> ->fs_tosspages
> ->truncate_inode_pages

Oh. Fix in progress...

thanks.

--
Nathan