Hi Hugh,
Your recent patch 9a1ea439b16b "mm: put_and_wait_on_page_locked() while
page is migrated" seems to have introduced a race into page migration
process. I have a host that eagerly reproduces the following BUG under
stress:
[ 302.847402] page:f000000000021700 count:0 mapcount:0 mapping:c0000000b2710bb0 index:0x19
[ 302.848096] xfs_address_space_operations [xfs]
[ 302.848100] name:"libc-2.28.so"
[ 302.848244] flags: 0x3ffff800000006(referenced|uptodate)
[ 302.848521] raw: 003ffff800000006 5deadbeef0000100 5deadbeef0000200 0000000000000000
[ 302.848724] raw: 0000000000000019 0000000000000000 00000001ffffffff c0000000bc0b1000
[ 302.848919] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[ 302.849076] page->mem_cgroup:c0000000bc0b1000
[ 302.849269] ------------[ cut here ]------------
[ 302.849397] kernel BUG at include/linux/mm.h:546!
[ 302.849586] Oops: Exception in kernel mode, sig: 5 [#1]
[ 302.849711] LE SMP NR_CPUS=2048 NUMA pSeries
[ 302.849839] Modules linked in: pseries_rng sunrpc xts vmx_crypto virtio_balloon xfs libcrc32c virtio_net net_failover virtio_console failover virtio_blk
[ 302.850400] CPU: 3 PID: 8759 Comm: cc1 Not tainted 5.0.0-rc4+ #36
[ 302.850571] NIP: c00000000039c8b8 LR: c00000000039c8b4 CTR: c00000000080a0e0
[ 302.850758] REGS: c0000000b0d7f7e0 TRAP: 0700 Not tainted (5.0.0-rc4+)
[ 302.850952] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48024422 XER: 00000000
[ 302.851150] CFAR: c0000000003ff584 IRQMASK: 0
[ 302.851150] GPR00: c00000000039c8b4 c0000000b0d7fa70 c000000001bcca00 0000000000000021
[ 302.851150] GPR04: c0000000b044c628 0000000000000007 55555555555555a0 c000000001fc3760
[ 302.851150] GPR08: 0000000000000007 0000000000000000 c0000000b0d7c000 c0000000b0d7f5ff
[ 302.851150] GPR12: 0000000000004400 c00000003fffae80 0000000000000000 0000000000000000
[ 302.851150] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 302.851150] GPR20: c0000000689f5aa8 c00000002a13ee48 0000000000000000 c000000001da29b0
[ 302.851150] GPR24: c000000001bf7d80 c0000000689f5a00 0000000000000000 0000000000000000
[ 302.851150] GPR28: c000000001bf9e80 c0000000b0d7fab8 0000000000000001 f000000000021700
[ 302.852914] NIP [c00000000039c8b8] put_and_wait_on_page_locked+0x398/0x3d0
[ 302.853080] LR [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0
[ 302.853235] Call Trace:
[ 302.853305] [c0000000b0d7fa70] [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0 (unreliable)
[ 302.853540] [c0000000b0d7fb10] [c00000000047b838] __migration_entry_wait+0x178/0x250
[ 302.853738] [c0000000b0d7fb50] [c00000000040c928] do_swap_page+0xd78/0xf60
[ 302.853997] [c0000000b0d7fbd0] [c000000000411078] __handle_mm_fault+0xbf8/0xe80
[ 302.854187] [c0000000b0d7fcb0] [c000000000411548] handle_mm_fault+0x248/0x450
[ 302.854379] [c0000000b0d7fd00] [c000000000078ca4] __do_page_fault+0x2d4/0xdf0
[ 302.854877] [c0000000b0d7fde0] [c0000000000797f8] do_page_fault+0x38/0xf0
[ 302.855057] [c0000000b0d7fe20] [c00000000000a7c4] handle_page_fault+0x18/0x38
[ 302.855300] Instruction dump:
[ 302.855432] 4bfffcf0 60000000 3948ffff 4bfffd20 60000000 60000000 3c82ff36 7fe3fb78
[ 302.855689] fb210068 38843b78 48062f09 60000000 <0fe00000> 60000000 3b400001 3b600001
[ 302.855950] ---[ end trace a52140e0f9751ae0 ]---
What seems to be happening is migrate_page_move_mapping() calling
page_ref_freeze() on another cpu somewhere between __migration_entry_wait()
taking a reference and wait_on_page_bit_common() calling page_put().
--
Artem
On Mon, 4 Feb 2019, Artem Savkov wrote:
> Hi Hugh,
>
> Your recent patch 9a1ea439b16b "mm: put_and_wait_on_page_locked() while
> page is migrated" seems to have introduced a race into page migration
> process. I have a host that eagerly reproduces the following BUG under
> stress:
>
> [ 302.847402] page:f000000000021700 count:0 mapcount:0 mapping:c0000000b2710bb0 index:0x19
> [ 302.848096] xfs_address_space_operations [xfs]
> [ 302.848100] name:"libc-2.28.so"
> [ 302.848244] flags: 0x3ffff800000006(referenced|uptodate)
> [ 302.848521] raw: 003ffff800000006 5deadbeef0000100 5deadbeef0000200 0000000000000000
> [ 302.848724] raw: 0000000000000019 0000000000000000 00000001ffffffff c0000000bc0b1000
> [ 302.848919] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> [ 302.849076] page->mem_cgroup:c0000000bc0b1000
> [ 302.849269] ------------[ cut here ]------------
> [ 302.849397] kernel BUG at include/linux/mm.h:546!
> [ 302.849586] Oops: Exception in kernel mode, sig: 5 [#1]
> [ 302.849711] LE SMP NR_CPUS=2048 NUMA pSeries
> [ 302.849839] Modules linked in: pseries_rng sunrpc xts vmx_crypto virtio_balloon xfs libcrc32c virtio_net net_failover virtio_console failover virtio_blk
> [ 302.850400] CPU: 3 PID: 8759 Comm: cc1 Not tainted 5.0.0-rc4+ #36
> [ 302.850571] NIP: c00000000039c8b8 LR: c00000000039c8b4 CTR: c00000000080a0e0
> [ 302.850758] REGS: c0000000b0d7f7e0 TRAP: 0700 Not tainted (5.0.0-rc4+)
> [ 302.850952] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48024422 XER: 00000000
> [ 302.851150] CFAR: c0000000003ff584 IRQMASK: 0
> [ 302.851150] GPR00: c00000000039c8b4 c0000000b0d7fa70 c000000001bcca00 0000000000000021
> [ 302.851150] GPR04: c0000000b044c628 0000000000000007 55555555555555a0 c000000001fc3760
> [ 302.851150] GPR08: 0000000000000007 0000000000000000 c0000000b0d7c000 c0000000b0d7f5ff
> [ 302.851150] GPR12: 0000000000004400 c00000003fffae80 0000000000000000 0000000000000000
> [ 302.851150] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 302.851150] GPR20: c0000000689f5aa8 c00000002a13ee48 0000000000000000 c000000001da29b0
> [ 302.851150] GPR24: c000000001bf7d80 c0000000689f5a00 0000000000000000 0000000000000000
> [ 302.851150] GPR28: c000000001bf9e80 c0000000b0d7fab8 0000000000000001 f000000000021700
> [ 302.852914] NIP [c00000000039c8b8] put_and_wait_on_page_locked+0x398/0x3d0
> [ 302.853080] LR [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0
> [ 302.853235] Call Trace:
> [ 302.853305] [c0000000b0d7fa70] [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0 (unreliable)
> [ 302.853540] [c0000000b0d7fb10] [c00000000047b838] __migration_entry_wait+0x178/0x250
> [ 302.853738] [c0000000b0d7fb50] [c00000000040c928] do_swap_page+0xd78/0xf60
> [ 302.853997] [c0000000b0d7fbd0] [c000000000411078] __handle_mm_fault+0xbf8/0xe80
> [ 302.854187] [c0000000b0d7fcb0] [c000000000411548] handle_mm_fault+0x248/0x450
> [ 302.854379] [c0000000b0d7fd00] [c000000000078ca4] __do_page_fault+0x2d4/0xdf0
> [ 302.854877] [c0000000b0d7fde0] [c0000000000797f8] do_page_fault+0x38/0xf0
> [ 302.855057] [c0000000b0d7fe20] [c00000000000a7c4] handle_page_fault+0x18/0x38
> [ 302.855300] Instruction dump:
> [ 302.855432] 4bfffcf0 60000000 3948ffff 4bfffd20 60000000 60000000 3c82ff36 7fe3fb78
> [ 302.855689] fb210068 38843b78 48062f09 60000000 <0fe00000> 60000000 3b400001 3b600001
> [ 302.855950] ---[ end trace a52140e0f9751ae0 ]---
>
> What seems to be happening is migrate_page_move_mapping() calling
> page_ref_freeze() on another cpu somewhere between __migration_entry_wait()
> taking a reference and wait_on_page_bit_common() calling page_put().
Thank you for reporting, Artem.
And see the mm thread https://marc.info/?l=linux-mm&m=154821775401218&w=2
That was on arm64, you are on power I think: both point towards xfs
(Cai could not reproduce it on ext4), but that should not be taken too
seriously - it could just be easier to reproduce on one than the other.
Your description in your last paragraph is what I imagined happening too.
And nothing wrong with that, except that the page_ref_freeze() should
have failed, but succeeded. We believe that something has done an
improper put_page(), on a libc-2.28.so page that's normally always
in use, and the put_and_wait_on_page_locked() commit has exposed that
by making its migration possible when it was almost impossible before
(Cai has reproduced it without the put_and_wait_on_page_locked commit).
I don't think any of us have made progress on this since the 25th.
I'll wrap up what I'm working on in the next hour or two, and switch
my attention to this. Even if put_and_wait_on_page_locked() happens to
be correct, and just makes a pre-existing bug much easier to hit, we
shall have to revert it from 5.0 if we cannot find the right answer
in the next week or so. Which would be sad: I'll try to rescue it,
but don't have great confidence that I'll be successful.
I'll be looking through the source, thinking around it, and trying
to find a surplus put_page(). I don't have any experiments in mind
to try at this stage.
Something I shall not be doing, is verifying the correctness of the
low-level get_page_unless_zero() versus page_ref_freeze() protocol
on arm64 and power - nobody has reported on x86, and I do wonder if
there's a barrier missing somewhere, that could manifest in this way -
but I'm unlikely to be the one to find that (and also think that any
weakness there should have shown up long before now).
Hugh
On 2/4/19 3:42 PM, Hugh Dickins wrote:
> On Mon, 4 Feb 2019, Artem Savkov wrote:
>
>> Hi Hugh,
>>
>> Your recent patch 9a1ea439b16b "mm: put_and_wait_on_page_locked() while
>> page is migrated" seems to have introduced a race into page migration
>> process. I have a host that eagerly reproduces the following BUG under
>> stress:
>>
>> [ 302.847402] page:f000000000021700 count:0 mapcount:0 mapping:c0000000b2710bb0 index:0x19
>> [ 302.848096] xfs_address_space_operations [xfs]
>> [ 302.848100] name:"libc-2.28.so"
>> [ 302.848244] flags: 0x3ffff800000006(referenced|uptodate)
>> [ 302.848521] raw: 003ffff800000006 5deadbeef0000100 5deadbeef0000200 0000000000000000
>> [ 302.848724] raw: 0000000000000019 0000000000000000 00000001ffffffff c0000000bc0b1000
>> [ 302.848919] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>> [ 302.849076] page->mem_cgroup:c0000000bc0b1000
>> [ 302.849269] ------------[ cut here ]------------
>> [ 302.849397] kernel BUG at include/linux/mm.h:546!
>> [ 302.849586] Oops: Exception in kernel mode, sig: 5 [#1]
>> [ 302.849711] LE SMP NR_CPUS=2048 NUMA pSeries
>> [ 302.849839] Modules linked in: pseries_rng sunrpc xts vmx_crypto virtio_balloon xfs libcrc32c virtio_net net_failover virtio_console failover virtio_blk
>> [ 302.850400] CPU: 3 PID: 8759 Comm: cc1 Not tainted 5.0.0-rc4+ #36
>> [ 302.850571] NIP: c00000000039c8b8 LR: c00000000039c8b4 CTR: c00000000080a0e0
>> [ 302.850758] REGS: c0000000b0d7f7e0 TRAP: 0700 Not tainted (5.0.0-rc4+)
>> [ 302.850952] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48024422 XER: 00000000
>> [ 302.851150] CFAR: c0000000003ff584 IRQMASK: 0
>> [ 302.851150] GPR00: c00000000039c8b4 c0000000b0d7fa70 c000000001bcca00 0000000000000021
>> [ 302.851150] GPR04: c0000000b044c628 0000000000000007 55555555555555a0 c000000001fc3760
>> [ 302.851150] GPR08: 0000000000000007 0000000000000000 c0000000b0d7c000 c0000000b0d7f5ff
>> [ 302.851150] GPR12: 0000000000004400 c00000003fffae80 0000000000000000 0000000000000000
>> [ 302.851150] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 302.851150] GPR20: c0000000689f5aa8 c00000002a13ee48 0000000000000000 c000000001da29b0
>> [ 302.851150] GPR24: c000000001bf7d80 c0000000689f5a00 0000000000000000 0000000000000000
>> [ 302.851150] GPR28: c000000001bf9e80 c0000000b0d7fab8 0000000000000001 f000000000021700
>> [ 302.852914] NIP [c00000000039c8b8] put_and_wait_on_page_locked+0x398/0x3d0
>> [ 302.853080] LR [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0
>> [ 302.853235] Call Trace:
>> [ 302.853305] [c0000000b0d7fa70] [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0 (unreliable)
>> [ 302.853540] [c0000000b0d7fb10] [c00000000047b838] __migration_entry_wait+0x178/0x250
>> [ 302.853738] [c0000000b0d7fb50] [c00000000040c928] do_swap_page+0xd78/0xf60
>> [ 302.853997] [c0000000b0d7fbd0] [c000000000411078] __handle_mm_fault+0xbf8/0xe80
>> [ 302.854187] [c0000000b0d7fcb0] [c000000000411548] handle_mm_fault+0x248/0x450
>> [ 302.854379] [c0000000b0d7fd00] [c000000000078ca4] __do_page_fault+0x2d4/0xdf0
>> [ 302.854877] [c0000000b0d7fde0] [c0000000000797f8] do_page_fault+0x38/0xf0
>> [ 302.855057] [c0000000b0d7fe20] [c00000000000a7c4] handle_page_fault+0x18/0x38
>> [ 302.855300] Instruction dump:
>> [ 302.855432] 4bfffcf0 60000000 3948ffff 4bfffd20 60000000 60000000 3c82ff36 7fe3fb78
>> [ 302.855689] fb210068 38843b78 48062f09 60000000 <0fe00000> 60000000 3b400001 3b600001
>> [ 302.855950] ---[ end trace a52140e0f9751ae0 ]---
>>
>> What seems to be happening is migrate_page_move_mapping() calling
>> page_ref_freeze() on another cpu somewhere between __migration_entry_wait()
>> taking a reference and wait_on_page_bit_common() calling page_put().
>
> Thank you for reporting, Artem.
>
> And see the mm thread https://marc.info/?l=linux-mm&m=154821775401218&w=2
>
> That was on arm64, you are on power I think: both point towards xfs
> (Cai could not reproduce it on ext4), but that should not be taken too
> seriously - it could just be easier to reproduce on one than the other.
Agree, although I have never been able to trigger it for ext4 running LTP
migrate_pages03 exclusively overnight (500+ iterations) and spontaneously for a
few weeks now. It might just be lucky.
On Mon, Feb 4, 2019 at 8:43 PM Hugh Dickins <[email protected]> wrote:
>
> Something I shall not be doing, is verifying the correctness of the
> low-level get_page_unless_zero() versus page_ref_freeze() protocol
> on arm64 and power - nobody has reported on x86, and I do wonder if
> there's a barrier missing somewhere, that could manifest in this way -
> but I'm unlikely to be the one to find that (and also think that any
> weakness there should have shown up long before now).
Remind me what the page_ref_freeze() rules even _are_?
It's a very special thing, setting the page count down to zero if it
matches the "expected" count.
Now, if another CPU does a put_page() at that point, that certainly
will hit the "oops, we dropped the ref to something that was zero".
So the "expected" count had better be only references we have and own
100%, but some of those references aren't really necessarily private
to our thread.
For example, what happens if
(a) one CPU is doing migration_entry_wait() (counting expected page
refs etc, before doing page_ref_freeze)
(b) another CPU is dirtying a page that was in the swap cache and
takes a reference to it, but drops it from the swap cache
Note how (b) does not change the refcount on the page at all, because
it just moves the ref-count from "swap cache entry" to "I own the page
in my page tables". Which means that when (a) does the "count expected
count, and match it", it happily matches, and the page_ref_freeze()
succeeds and makes the page count be zero.
But now (b) has a private reference to that page, and can drop it, so
the "freeze" isn't a freeze at all.
Ok, so clearly the above cannot happen, and there's something I'm
missing with the freezing. I think we hold the page lock while this is
going on, which means those two things cannot happen at the same time.
But maybe there is something else that does the above kind of "move
page ref from one owner to another"?
The page_ref_freeze() rules don't seem to be documented anywhere.
Linus
On Mon, Feb 04, 2019 at 12:42:50PM -0800, Hugh Dickins wrote:
> On Mon, 4 Feb 2019, Artem Savkov wrote:
>
> > Hi Hugh,
> >
> > Your recent patch 9a1ea439b16b "mm: put_and_wait_on_page_locked() while
> > page is migrated" seems to have introduced a race into page migration
> > process. I have a host that eagerly reproduces the following BUG under
> > stress:
> >
> > [ 302.847402] page:f000000000021700 count:0 mapcount:0 mapping:c0000000b2710bb0 index:0x19
> > [ 302.848096] xfs_address_space_operations [xfs]
> > [ 302.848100] name:"libc-2.28.so"
> > [ 302.848244] flags: 0x3ffff800000006(referenced|uptodate)
> > [ 302.848521] raw: 003ffff800000006 5deadbeef0000100 5deadbeef0000200 0000000000000000
> > [ 302.848724] raw: 0000000000000019 0000000000000000 00000001ffffffff c0000000bc0b1000
> > [ 302.848919] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> > [ 302.849076] page->mem_cgroup:c0000000bc0b1000
> > [ 302.849269] ------------[ cut here ]------------
> > [ 302.849397] kernel BUG at include/linux/mm.h:546!
> > [ 302.849586] Oops: Exception in kernel mode, sig: 5 [#1]
> > [ 302.849711] LE SMP NR_CPUS=2048 NUMA pSeries
> > [ 302.849839] Modules linked in: pseries_rng sunrpc xts vmx_crypto virtio_balloon xfs libcrc32c virtio_net net_failover virtio_console failover virtio_blk
> > [ 302.850400] CPU: 3 PID: 8759 Comm: cc1 Not tainted 5.0.0-rc4+ #36
> > [ 302.850571] NIP: c00000000039c8b8 LR: c00000000039c8b4 CTR: c00000000080a0e0
> > [ 302.850758] REGS: c0000000b0d7f7e0 TRAP: 0700 Not tainted (5.0.0-rc4+)
> > [ 302.850952] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48024422 XER: 00000000
> > [ 302.851150] CFAR: c0000000003ff584 IRQMASK: 0
> > [ 302.851150] GPR00: c00000000039c8b4 c0000000b0d7fa70 c000000001bcca00 0000000000000021
> > [ 302.851150] GPR04: c0000000b044c628 0000000000000007 55555555555555a0 c000000001fc3760
> > [ 302.851150] GPR08: 0000000000000007 0000000000000000 c0000000b0d7c000 c0000000b0d7f5ff
> > [ 302.851150] GPR12: 0000000000004400 c00000003fffae80 0000000000000000 0000000000000000
> > [ 302.851150] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > [ 302.851150] GPR20: c0000000689f5aa8 c00000002a13ee48 0000000000000000 c000000001da29b0
> > [ 302.851150] GPR24: c000000001bf7d80 c0000000689f5a00 0000000000000000 0000000000000000
> > [ 302.851150] GPR28: c000000001bf9e80 c0000000b0d7fab8 0000000000000001 f000000000021700
> > [ 302.852914] NIP [c00000000039c8b8] put_and_wait_on_page_locked+0x398/0x3d0
> > [ 302.853080] LR [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0
> > [ 302.853235] Call Trace:
> > [ 302.853305] [c0000000b0d7fa70] [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0 (unreliable)
> > [ 302.853540] [c0000000b0d7fb10] [c00000000047b838] __migration_entry_wait+0x178/0x250
> > [ 302.853738] [c0000000b0d7fb50] [c00000000040c928] do_swap_page+0xd78/0xf60
> > [ 302.853997] [c0000000b0d7fbd0] [c000000000411078] __handle_mm_fault+0xbf8/0xe80
> > [ 302.854187] [c0000000b0d7fcb0] [c000000000411548] handle_mm_fault+0x248/0x450
> > [ 302.854379] [c0000000b0d7fd00] [c000000000078ca4] __do_page_fault+0x2d4/0xdf0
> > [ 302.854877] [c0000000b0d7fde0] [c0000000000797f8] do_page_fault+0x38/0xf0
> > [ 302.855057] [c0000000b0d7fe20] [c00000000000a7c4] handle_page_fault+0x18/0x38
> > [ 302.855300] Instruction dump:
> > [ 302.855432] 4bfffcf0 60000000 3948ffff 4bfffd20 60000000 60000000 3c82ff36 7fe3fb78
> > [ 302.855689] fb210068 38843b78 48062f09 60000000 <0fe00000> 60000000 3b400001 3b600001
> > [ 302.855950] ---[ end trace a52140e0f9751ae0 ]---
> >
> > What seems to be happening is migrate_page_move_mapping() calling
> > page_ref_freeze() on another cpu somewhere between __migration_entry_wait()
> > taking a reference and wait_on_page_bit_common() calling page_put().
>
> Thank you for reporting, Artem.
>
> And see the mm thread https://marc.info/?l=linux-mm&m=154821775401218&w=2
Ah, thank you. Should have searched through linux-mm, not just lkml.
> That was on arm64, you are on power I think: both point towards xfs
> (Cai could not reproduce it on ext4), but that should not be taken too
> seriously - it could just be easier to reproduce on one than the other.
>
> Your description in your last paragraph is what I imagined happening too.
> And nothing wrong with that, except that the page_ref_freeze() should
> have failed, but succeeded. We believe that something has done an
> improper put_page(), on a libc-2.28.so page that's normally always
> in use, and the put_and_wait_on_page_locked() commit has exposed that
> by making its migration possible when it was almost impossible before
> (Cai has reproduced it without the put_and_wait_on_page_locked commit).
This is what I saw as well, only reproduces on xfs and page_ref_count == 0
BUG through generic_file_buffered_read() when your patch is reverted.
Wasn't sure that's the same issue though.
> I don't think any of us have made progress on this since the 25th.
> I'll wrap up what I'm working on in the next hour or two, and switch
> my attention to this. Even if put_and_wait_on_page_locked() happens to
> be correct, and just makes a pre-existing bug much easier to hit, we
> shall have to revert it from 5.0 if we cannot find the right answer
> in the next week or so. Which would be sad: I'll try to rescue it,
> but don't have great confidence that I'll be successful.
>
> I'll be looking through the source, thinking around it, and trying
> to find a surplus put_page(). I don't have any experiments in mind
> to try at this stage.
>
> Something I shall not be doing, is verifying the correctness of the
> low-level get_page_unless_zero() versus page_ref_freeze() protocol
> on arm64 and power - nobody has reported on x86, and I do wonder if
> there's a barrier missing somewhere, that could manifest in this way -
> but I'm unlikely to be the one to find that (and also think that any
> weakness there should have shown up long before now).
I tried reproducing it with 5.0-rc5 and failed. There is one patch that
seems to be fixing an xfs page reference issue which to me sounds a lot
like what you describe. The patch is 8e47a457321c "iomap: get/put the
page in iomap_page_create/release()". That would explain why
page_ref_freeze() and all the expected_page_refs() checks succeed when
they shouldn't.
Apart from no longer reproducing the bug I also see a drastic reduce in
pgmigrate_fails in /proc/vmstat (from tens of thousands and
being >pgmigrate_success, to just tens) so I assume it is possible for it
to be just masking the problem by performing less retries. What do you think?
Cai, can you please check if you can reproduce this issue in your
environment with 5.0-rc5?
--
Artem
On Tue, 5 Feb 2019, Artem Savkov wrote:
> On Mon, Feb 04, 2019 at 12:42:50PM -0800, Hugh Dickins wrote:
> > On Mon, 4 Feb 2019, Artem Savkov wrote:
> >
> > > Hi Hugh,
> > >
> > > Your recent patch 9a1ea439b16b "mm: put_and_wait_on_page_locked() while
> > > page is migrated" seems to have introduced a race into page migration
> > > process. I have a host that eagerly reproduces the following BUG under
> > > stress:
> > >
> > > [ 302.847402] page:f000000000021700 count:0 mapcount:0 mapping:c0000000b2710bb0 index:0x19
> > > [ 302.848096] xfs_address_space_operations [xfs]
> > > [ 302.848100] name:"libc-2.28.so"
> > > [ 302.848244] flags: 0x3ffff800000006(referenced|uptodate)
> > > [ 302.848521] raw: 003ffff800000006 5deadbeef0000100 5deadbeef0000200 0000000000000000
> > > [ 302.848724] raw: 0000000000000019 0000000000000000 00000001ffffffff c0000000bc0b1000
> > > [ 302.848919] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> > > [ 302.849076] page->mem_cgroup:c0000000bc0b1000
> > > [ 302.849269] ------------[ cut here ]------------
> > > [ 302.849397] kernel BUG at include/linux/mm.h:546!
> > > [ 302.849586] Oops: Exception in kernel mode, sig: 5 [#1]
> > > [ 302.849711] LE SMP NR_CPUS=2048 NUMA pSeries
> > > [ 302.849839] Modules linked in: pseries_rng sunrpc xts vmx_crypto virtio_balloon xfs libcrc32c virtio_net net_failover virtio_console failover virtio_blk
> > > [ 302.850400] CPU: 3 PID: 8759 Comm: cc1 Not tainted 5.0.0-rc4+ #36
> > > [ 302.850571] NIP: c00000000039c8b8 LR: c00000000039c8b4 CTR: c00000000080a0e0
> > > [ 302.850758] REGS: c0000000b0d7f7e0 TRAP: 0700 Not tainted (5.0.0-rc4+)
> > > [ 302.850952] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48024422 XER: 00000000
> > > [ 302.851150] CFAR: c0000000003ff584 IRQMASK: 0
> > > [ 302.851150] GPR00: c00000000039c8b4 c0000000b0d7fa70 c000000001bcca00 0000000000000021
> > > [ 302.851150] GPR04: c0000000b044c628 0000000000000007 55555555555555a0 c000000001fc3760
> > > [ 302.851150] GPR08: 0000000000000007 0000000000000000 c0000000b0d7c000 c0000000b0d7f5ff
> > > [ 302.851150] GPR12: 0000000000004400 c00000003fffae80 0000000000000000 0000000000000000
> > > [ 302.851150] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > [ 302.851150] GPR20: c0000000689f5aa8 c00000002a13ee48 0000000000000000 c000000001da29b0
> > > [ 302.851150] GPR24: c000000001bf7d80 c0000000689f5a00 0000000000000000 0000000000000000
> > > [ 302.851150] GPR28: c000000001bf9e80 c0000000b0d7fab8 0000000000000001 f000000000021700
> > > [ 302.852914] NIP [c00000000039c8b8] put_and_wait_on_page_locked+0x398/0x3d0
> > > [ 302.853080] LR [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0
> > > [ 302.853235] Call Trace:
> > > [ 302.853305] [c0000000b0d7fa70] [c00000000039c8b4] put_and_wait_on_page_locked+0x394/0x3d0 (unreliable)
> > > [ 302.853540] [c0000000b0d7fb10] [c00000000047b838] __migration_entry_wait+0x178/0x250
> > > [ 302.853738] [c0000000b0d7fb50] [c00000000040c928] do_swap_page+0xd78/0xf60
> > > [ 302.853997] [c0000000b0d7fbd0] [c000000000411078] __handle_mm_fault+0xbf8/0xe80
> > > [ 302.854187] [c0000000b0d7fcb0] [c000000000411548] handle_mm_fault+0x248/0x450
> > > [ 302.854379] [c0000000b0d7fd00] [c000000000078ca4] __do_page_fault+0x2d4/0xdf0
> > > [ 302.854877] [c0000000b0d7fde0] [c0000000000797f8] do_page_fault+0x38/0xf0
> > > [ 302.855057] [c0000000b0d7fe20] [c00000000000a7c4] handle_page_fault+0x18/0x38
> > > [ 302.855300] Instruction dump:
> > > [ 302.855432] 4bfffcf0 60000000 3948ffff 4bfffd20 60000000 60000000 3c82ff36 7fe3fb78
> > > [ 302.855689] fb210068 38843b78 48062f09 60000000 <0fe00000> 60000000 3b400001 3b600001
> > > [ 302.855950] ---[ end trace a52140e0f9751ae0 ]---
> > >
> > > What seems to be happening is migrate_page_move_mapping() calling
> > > page_ref_freeze() on another cpu somewhere between __migration_entry_wait()
> > > taking a reference and wait_on_page_bit_common() calling page_put().
> >
> > Thank you for reporting, Artem.
> >
> > And see the mm thread https://marc.info/?l=linux-mm&m=154821775401218&w=2
>
> Ah, thank you. Should have searched through linux-mm, not just lkml.
>
> > That was on arm64, you are on power I think: both point towards xfs
> > (Cai could not reproduce it on ext4), but that should not be taken too
> > seriously - it could just be easier to reproduce on one than the other.
> >
> > Your description in your last paragraph is what I imagined happening too.
> > And nothing wrong with that, except that the page_ref_freeze() should
> > have failed, but succeeded. We believe that something has done an
> > improper put_page(), on a libc-2.28.so page that's normally always
> > in use, and the put_and_wait_on_page_locked() commit has exposed that
> > by making its migration possible when it was almost impossible before
> > (Cai has reproduced it without the put_and_wait_on_page_locked commit).
>
> This is what I saw as well, only reproduces on xfs and page_ref_count == 0
> BUG through generic_file_buffered_read() when your patch is reverted.
> Wasn't sure that's the same issue though.
>
> > I don't think any of us have made progress on this since the 25th.
> > I'll wrap up what I'm working on in the next hour or two, and switch
> > my attention to this. Even if put_and_wait_on_page_locked() happens to
> > be correct, and just makes a pre-existing bug much easier to hit, we
> > shall have to revert it from 5.0 if we cannot find the right answer
> > in the next week or so. Which would be sad: I'll try to rescue it,
> > but don't have great confidence that I'll be successful.
> >
> > I'll be looking through the source, thinking around it, and trying
> > to find a surplus put_page(). I don't have any experiments in mind
> > to try at this stage.
> >
> > Something I shall not be doing, is verifying the correctness of the
> > low-level get_page_unless_zero() versus page_ref_freeze() protocol
> > on arm64 and power - nobody has reported on x86, and I do wonder if
> > there's a barrier missing somewhere, that could manifest in this way -
> > but I'm unlikely to be the one to find that (and also think that any
> > weakness there should have shown up long before now).
>
> I tried reproducing it with 5.0-rc5 and failed. There is one patch that
> seems to be fixing an xfs page reference issue which to me sounds a lot
> like what you describe. The patch is 8e47a457321c "iomap: get/put the
> page in iomap_page_create/release()". That would explain why
> page_ref_freeze() and all the expected_page_refs() checks succeed when
> they shouldn't.
Yes! I'm sure you've got it, that fits exactly, thank you so much Artem.
iomap_migrate_page() was very much on my search path yesterday, but I
was looking at latest source, had missed checking it for recent fixes.
>
> Apart from no longer reproducing the bug I also see a drastic reduce in
> pgmigrate_fails in /proc/vmstat (from tens of thousands and
> being >pgmigrate_success, to just tens) so I assume it is possible for it
> to be just masking the problem by performing less retries. What do you think?
I'm a little surprised that it managed to get as far as so many fails
without crashing, when it had the refcounting wrong like that: but I'm
sure you've found the root fix, not something masking the bug.
>
> Cai, can you please check if you can reproduce this issue in your
> environment with 5.0-rc5?
Yes, please do - practical confirmation more convincing than my certainty.
(Linus, I'll reply a little to yours later.)
Hugh
>> Cai, can you please check if you can reproduce this issue in your
>> environment with 5.0-rc5?
>
> Yes, please do - practical confirmation more convincing than my certainty.
Indeed, I am no longer be able to reproduce this anymore.
On Tue, 5 Feb 2019, Qian Cai wrote:
>
> >> Cai, can you please check if you can reproduce this issue in your
> >> environment with 5.0-rc5?
> >
> > Yes, please do - practical confirmation more convincing than my certainty.
>
> Indeed, I am no longer be able to reproduce this anymore.
Great, thanks. I'll add a message to the other thread to wrap that up too.
Hugh
On Tue, 5 Feb 2019, Linus Torvalds wrote:
> On Mon, Feb 4, 2019 at 8:43 PM Hugh Dickins <[email protected]> wrote:
> >
> > Something I shall not be doing, is verifying the correctness of the
> > low-level get_page_unless_zero() versus page_ref_freeze() protocol
> > on arm64 and power - nobody has reported on x86, and I do wonder if
> > there's a barrier missing somewhere, that could manifest in this way -
> > but I'm unlikely to be the one to find that (and also think that any
> > weakness there should have shown up long before now).
>
> Remind me what the page_ref_freeze() rules even _are_?
>
> It's a very special thing, setting the page count down to zero if it
> matches the "expected" count.
>
> Now, if another CPU does a put_page() at that point, that certainly
> will hit the "oops, we dropped the ref to something that was zero".
>
> So the "expected" count had better be only references we have and own
> 100%, but some of those references aren't really necessarily private
> to our thread.
>
> For example, what happens if
>
> (a) one CPU is doing migration_entry_wait() (counting expected page
> refs etc, before doing page_ref_freeze)
s/migration_entry_wait/migrate_page_move_mapping/
>
> (b) another CPU is dirtying a page that was in the swap cache and
> takes a reference to it, but drops it from the swap cache
This is reuse_swap_page() called from do_wp_page(), I presume.
>
> Note how (b) does not change the refcount on the page at all, because
> it just moves the ref-count from "swap cache entry" to "I own the page
> in my page tables". Which means that when (a) does the "count expected
> count, and match it", it happily matches, and the page_ref_freeze()
> succeeds and makes the page count be zero.
>
> But now (b) has a private reference to that page, and can drop it, so
> the "freeze" isn't a freeze at all.
>
> Ok, so clearly the above cannot happen, and there's something I'm
> missing with the freezing. I think we hold the page lock while this is
> going on, which means those two things cannot happen at the same time.
> But maybe there is something else that does the above kind of "move
> page ref from one owner to another"?
You're right that the page lock prevents even getting there (and is
essential whenever mucking around with PageSwapCache), but more to
the point is that the expected_count passed to page_ref_freeze()
does not include any user mapping references (mapcount).
All user mappings (known of at that instant) have been unmapped before
migrate_page_move_mapping() is called, and if any got added since
(difficult without page lock, but I wouldn't assume impossible),
their associated page references are sure to make the page_ref_freeze()
fail (so long as the page refcounting has not been broken).
reuse_swap_page() is called while holding the page table lock: so
although do_wp_page() cannot quite be said to own the page, it is
making sure that it cannot be racily unmapped at that point. So
until the pte_unmap_unlock() (by which time it has done its own
get_page()) it can treat the page reference associated with the
user mapping as safe, as if it were its own. And no racing
page_ref_freeze() could succeed while it's there, page lock or not.
Page lock is more important at the "outer" level of the page
migration protocol: holding together the "atomic" switch from old
to new page with the copying of data and flags from old to new.
And more important with anon (swapless) pages, for which there's no
shared visible cache, so migrate_page_move_mapping() does not even
bother with a page_ref_freeze() (though sometimes I want it to).
>
> The page_ref_freeze() rules don't seem to be documented anywhere.
I would not enjoy documenting what has to be done at what stage
in the page migration sequence: it has evolved, it is subtle,
and we're grateful just to have working code.
At the inner level (where I worried we might have some barrier problem),
the relation between page_ref_freeze() and get_page_unless_zero():
the best documentation I can think of on page_ref_freeze() indeed does
not mention it as such at all (I think it has gone through renamings):
the big comment Nick wrote above page_cache_get_speculative() in
include/linux/pagemap.h. And there's a helpful little comment in
include/linux/mm.h get_page() too.
(Ah, those innocent days, when the word "speculative" filled us
with delight and hope, instead of horror and dread.)
By the way, I won't put this bug out of my mind, until I've done
an audit of PagePrivate: I've a feeling that fs/iomap.c is not the
first place to forget that PagePrivate must be associated with a
page reference (which is not a special demand of page migration:
page reclaim won't work without it either); and put_and_wait_blah
may now expose such incorrectness as crashes.
Hugh