2015-12-12 16:24:15

by Chris Mason

[permalink] [raw]
Subject: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

We have two reports of frequent crashes in btrfs where asserts in
clear_page_dirty_for_io() were triggering on missing page locks.

The crashes were much easier to trigger when processes were catching
ctrl-c's, and after much debugging it really looked like lock_page was a
noop.

This recent commit looks pretty suspect to me, and I confirmed that we
were exiting __wait_on_bit_lock() with -EINTR when it was called with
TASK_UNINTERRUPTIBLE

commit 68985633bccb6066bf1803e316fbc6c1f5b796d6
Author: Peter Zijlstra <[email protected]>
Date: Tue Dec 1 14:04:04 2015 +0100

sched/wait: Fix signal handling in bit wait helpers

The patch below is mostly untested, and probably not the right solution.
Dave's trinity run doesn't explode immediately anymore, and I wanted to
get this out for discussion. A quick look on the list doesn't show
anyone else has tracked this down, sorry if it's a dup.

Reported-by: Dave Jones <[email protected]>,
Reported-by: Jon Christopherson <[email protected]>
Signed-off-by: Chris Mason <[email protected]>

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f10bd87..12f69df 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -434,6 +434,8 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
ret = action(&q->key);
if (!ret)
continue;
+ if (ret == -EINTR && mode == TASK_UNINTERRUPTIBLE)
+ continue;
abort_exclusive_wait(wq, &q->wait, mode, &q->key);
return ret;
} while (test_and_set_bit(q->key.bit_nr, q->key.flags));


2015-12-12 18:33:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Sat, Dec 12, 2015 at 8:23 AM, Chris Mason <[email protected]> wrote:
> We have two reports of frequent crashes in btrfs where asserts in
> clear_page_dirty_for_io() were triggering on missing page locks.
>
> The crashes were much easier to trigger when processes were catching
> ctrl-c's, and after much debugging it really looked like lock_page was a
> noop.

Hmm. PeterZ has another patch for wait_ob_bit signal handling pending,
but I *think* that one only affects the "killable()" case.

Peter, did that patch also handle just plain "lock_page()" case?

Linus

2015-12-12 19:41:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Sat, Dec 12, 2015 at 10:33 AM, Linus Torvalds
<[email protected]> wrote:
>
> Peter, did that patch also handle just plain "lock_page()" case?

Looking more at it, I think this all goes back to commit 743162013d40
("sched: Remove proliferation of wait_on_bit() action functions").

Before that, we had wait_on_page_bit() doing:

__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
TASK_UNINTERRUPTIBLE);

and after that, the "sleep_on_page" got changed to "bit_wait_io".

But that is bogus, because sleep_on_page() used to look like this:

static int sleep_on_page(void *word)
{
io_schedule();
return 0;
}

while bit_wait_io() looks like this:

__sched int bit_wait_io(void *word)
{
if (signal_pending_state(current->state, current))
return 1;
io_schedule();
return 0;
}

which is ok, because as long as the task state is
TASK_UNINTERRUPTIBLE, the whole signal_pending_state() thing turns
into a no-op.

So far, so fine.

However, then commit 68985633bccb ("sched/wait: Fix signal handling in
bit wait helpers") _really_ screwed up, and changed the function to

__sched int bit_wait(struct wait_bit_key *word)
{
schedule();
if (signal_pending(current))
return -EINTR;
return 0;
}

so now it returns an error when no error should happen. Which in turn
makes __wait_on_bit() exit the bit-wait loop early.

It looks like PeterZ's pending patch should fix this, by passing in
the proper TASK_UNINTERRUPTIBLE to the bit_wait_io function, and going
back to signal_pending_state(). PeterZ, did I follow the history of
this correctly?

Linus

2015-12-13 00:08:31

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Sat, Dec 12, 2015 at 11:41:26AM -0800, Linus Torvalds wrote:
> On Sat, Dec 12, 2015 at 10:33 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Peter, did that patch also handle just plain "lock_page()" case?
>
> Looking more at it, I think this all goes back to commit 743162013d40
> ("sched: Remove proliferation of wait_on_bit() action functions").
>
> It looks like PeterZ's pending patch should fix this, by passing in
> the proper TASK_UNINTERRUPTIBLE to the bit_wait_io function, and going
> back to signal_pending_state(). PeterZ, did I follow the history of
> this correctly?

Looks right to me, I found Peter's patch and have it running now. After
about 6 hours my patch did eventually crash again under trinity. Btrfs has a
very old (from 2011) bug in the error handling path that trinity is
banging on.

Doing another run with Peter's patch and btrfs fixed up. The btrfs patch is
small, but not urgent enough to shove in on Sunday. I'll send for rc6
along with a few others we've queued up.

-chris

2015-12-13 09:50:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Sat, Dec 12, 2015 at 11:41:26AM -0800, Linus Torvalds wrote:
> It looks like PeterZ's pending patch should fix this, by passing in
> the proper TASK_UNINTERRUPTIBLE to the bit_wait_io function, and going
> back to signal_pending_state(). PeterZ, did I follow the history of
> this correctly?

Yes. I made a right mess of things with my 'fix'.

2015-12-13 15:56:30

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Sun, Dec 13, 2015 at 10:50:01AM +0100, Peter Zijlstra wrote:
> On Sat, Dec 12, 2015 at 11:41:26AM -0800, Linus Torvalds wrote:
> > It looks like PeterZ's pending patch should fix this, by passing in
> > the proper TASK_UNINTERRUPTIBLE to the bit_wait_io function, and going
> > back to signal_pending_state(). PeterZ, did I follow the history of
> > this correctly?
>
> Yes. I made a right mess of things with my 'fix'.

Our two fixes together made it overnight. So at least for the
lock_page() case things are extra super fixed.

-chris

2015-12-13 20:51:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Sun, Dec 13, 2015 at 1:50 AM, Peter Zijlstra <[email protected]> wrote:
>
> Yes. I made a right mess of things with my 'fix'.

Can I get the latest version of your fix as a patch (or pull request,
I'm not picky)? I'd like to get this fixed for 4.4-rc5, which is
supposed to go out today..

Linus

2015-12-13 21:12:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Sun, Dec 13, 2015 at 12:51:08PM -0800, Linus Torvalds wrote:
> On Sun, Dec 13, 2015 at 1:50 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Yes. I made a right mess of things with my 'fix'.
>
> Can I get the latest version of your fix as a patch (or pull request,
> I'm not picky)? I'd like to get this fixed for 4.4-rc5, which is
> supposed to go out today..

On its way (or there already, since I send it before this one). I did an
allmodconfig build with it on your latest just in case.

2015-12-14 18:34:36

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Sat, Dec 12, 2015 at 07:07:46PM -0500, Chris Mason wrote:
> On Sat, Dec 12, 2015 at 11:41:26AM -0800, Linus Torvalds wrote:
> > On Sat, Dec 12, 2015 at 10:33 AM, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > Peter, did that patch also handle just plain "lock_page()" case?
> >
> > Looking more at it, I think this all goes back to commit 743162013d40
> > ("sched: Remove proliferation of wait_on_bit() action functions").
> >
> > It looks like PeterZ's pending patch should fix this, by passing in
> > the proper TASK_UNINTERRUPTIBLE to the bit_wait_io function, and going
> > back to signal_pending_state(). PeterZ, did I follow the history of
> > this correctly?
>
> Looks right to me, I found Peter's patch and have it running now. After
> about 6 hours my patch did eventually crash again under trinity. Btrfs has a
> very old (from 2011) bug in the error handling path that trinity is
> banging on.

Is the other bug this one ? I've hit this quite a lot over the last 12 months,
and now that the lock_page bug is fixed this is showing up again.

page:ffffea00110d2700 count:4 mapcount:0 mapping:ffff88045b5160a0 index:0x0
flags: 0x8000000000000806(error|referenced|private)
page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:812!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
CPU: 1 PID: 22062 Comm: trinity-c4 Tainted: G W 4.4.0-rc5-think+ #1
task: ffff8800bce51b80 ti: ffff8803f7210000 task.ti: ffff8803f7210000
RIP: 0010:[<ffffffffad25e3c7>] [<ffffffffad25e3c7>] unlock_page+0xa7/0xb0
RSP: 0018:ffff8803f72176b8 EFLAGS: 00010292
RAX: fffff9400221a4e0 RBX: 0000000000001000 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: ffffffffad144aee RDI: ffffea00110d2700
RBP: ffff8803f72176d8 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffffea00110d2700
R13: 0000000000000000 R14: ffffea00110d2700 R15: 0000000000000000
FS: 00007f82b1f73700(0000) GS:ffff880468a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000389120 CR3: 000000040bc54000 CR4: 00000000001406e0
Stack:
0000000000001000 0000000000000000 0000000000000000 ffffea00110d2700
ffff8803f7217898 ffffffffc010b777 0000000000000000 ffff880302400040
0000000000000000 ffff8803f7217728 ffff8803f7217728 ffff8803f7217760
Call Trace:
[<ffffffffc010b777>] __do_readpage+0xa97/0xd80 [btrfs]
[<ffffffffad13351f>] ? mark_lock+0x6f/0x8a0
[<ffffffffadcea0ac>] ? _raw_spin_unlock_irq+0x2c/0x50
[<ffffffffc00db310>] ? btrfs_real_readdir+0x8d0/0x8d0 [btrfs]
[<ffffffffc010ace0>] ? extent_write_cache_pages.isra.37.constprop.54+0x540/0x540 [btrfs]
[<ffffffffad153fea>] ? rcu_read_lock_sched_held+0x8a/0xa0
[<ffffffffad25e9c4>] ? __add_to_page_cache_locked+0x464/0x500
[<ffffffffc010bb66>] __extent_read_full_page+0x106/0x120 [btrfs]
[<ffffffffc00db310>] ? btrfs_real_readdir+0x8d0/0x8d0 [btrfs]
[<ffffffffc010c09d>] extent_read_full_page+0xad/0x110 [btrfs]
[<ffffffffc010bff0>] ? set_page_extent_mapped+0x30/0x30 [btrfs]
[<ffffffffad122f70>] ? __wake_up_locked_key+0x20/0x20
[<ffffffffad25ea80>] ? add_to_page_cache_locked+0x20/0x20
[<ffffffffad25f4fb>] ? find_get_entry+0x14b/0x270
[<ffffffffad25f3b5>] ? find_get_entry+0x5/0x270
[<ffffffffc00d7a00>] btrfs_readpage+0x40/0x50 [btrfs]
[<ffffffffc00f17f9>] prepare_uptodate_page+0x39/0x80 [btrfs]
[<ffffffffc00f19de>] prepare_pages+0x19e/0x210 [btrfs]
[<ffffffffc00f2d21>] __btrfs_buffered_write+0x351/0x8a0 [btrfs]
[<ffffffffc00f29d0>] ? btrfs_dirty_pages+0xf0/0xf0 [btrfs]
[<ffffffffad2619aa>] ? generic_file_direct_write+0x1aa/0x2c0
[<ffffffffad261800>] ? generic_file_read_iter+0xa00/0xa00
[<ffffffffc00f866d>] btrfs_file_write_iter+0x6dd/0x800 [btrfs]
[<ffffffffad2f694d>] __vfs_write+0x21d/0x260
[<ffffffffad2f6730>] ? __vfs_read+0x260/0x260
[<ffffffffad12ed32>] ? __lock_is_held+0x92/0xd0
[<ffffffffad0ee3b1>] ? preempt_count_sub+0xc1/0x120
[<ffffffffad12cd17>] ? percpu_down_read+0x57/0xa0
[<ffffffffad2fbd24>] ? __sb_start_write+0xb4/0xf0
[<ffffffffad2f7736>] vfs_write+0xf6/0x260
[<ffffffffad2f8d4f>] SyS_write+0xbf/0x160
[<ffffffffad2f8c90>] ? SyS_read+0x160/0x160
[<ffffffffad002017>] ? trace_hardirqs_on_thunk+0x17/0x19
[<ffffffffadceab17>] entry_SYSCALL_64_fastpath+0x12/0x6b
Code: 48 8d 14 80 48 8d 04 50 31 d2 49 8d 3c c6 e8 c1 4b ec ff 5b 41 5c 41 5d 41 5e 5d c3 48 c7 c6 e0 3a ec ad 4c 89 e7 e8 49 22 04 00 <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89
RIP [<ffffffffad25e3c7>] unlock_page+0xa7/0xb0
RSP <ffff8803f72176b8>
---[ end trace 5d1ded6801ca2e6c ]---

2015-12-14 20:02:29

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Mon, Dec 14, 2015 at 01:33:56PM -0500, Dave Jones wrote:
> On Sat, Dec 12, 2015 at 07:07:46PM -0500, Chris Mason wrote:
> > On Sat, Dec 12, 2015 at 11:41:26AM -0800, Linus Torvalds wrote:
> > > On Sat, Dec 12, 2015 at 10:33 AM, Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > Peter, did that patch also handle just plain "lock_page()" case?
> > >
> > > Looking more at it, I think this all goes back to commit 743162013d40
> > > ("sched: Remove proliferation of wait_on_bit() action functions").
> > >
> > > It looks like PeterZ's pending patch should fix this, by passing in
> > > the proper TASK_UNINTERRUPTIBLE to the bit_wait_io function, and going
> > > back to signal_pending_state(). PeterZ, did I follow the history of
> > > this correctly?
> >
> > Looks right to me, I found Peter's patch and have it running now. After
> > about 6 hours my patch did eventually crash again under trinity. Btrfs has a
> > very old (from 2011) bug in the error handling path that trinity is
> > banging on.
>
> Is the other bug this one ? I've hit this quite a lot over the last 12 months,
> and now that the lock_page bug is fixed this is showing up again.
>
> page:ffffea00110d2700 count:4 mapcount:0 mapping:ffff88045b5160a0 index:0x0
> flags: 0x8000000000000806(error|referenced|private)
> page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))

[ snip ]

> [<ffffffffc00f17f9>] prepare_uptodate_page+0x39/0x80 [btrfs]
> [<ffffffffc00f19de>] prepare_pages+0x19e/0x210 [btrfs]

This should be the second call to prepare_uptodate_page() in
prepare_pages(). If we get an error on the first call, and the write
only spans a single page, we'll call prepare_uptodate_page a second time
on an unlocked page.

I'll send out the patch a little later this afternoon.

> [<ffffffffc00f2d21>] __btrfs_buffered_write+0x351/0x8a0 [btrfs]
> [<ffffffffc00f29d0>] ? btrfs_dirty_pages+0xf0/0xf0 [btrfs]
> [<ffffffffad2619aa>] ? generic_file_direct_write+0x1aa/0x2c0
> [<ffffffffad261800>] ? generic_file_read_iter+0xa00/0xa00
> [<ffffffffc00f866d>] btrfs_file_write_iter+0x6dd/0x800 [btrfs]
> [<ffffffffad2f694d>] __vfs_write+0x21d/0x260
> [<ffffffffad2f6730>] ? __vfs_read+0x260/0x260
> [<ffffffffad12ed32>] ? __lock_is_held+0x92/0xd0
> [<ffffffffad0ee3b1>] ? preempt_count_sub+0xc1/0x120
> [<ffffffffad12cd17>] ? percpu_down_read+0x57/0xa0
> [<ffffffffad2fbd24>] ? __sb_start_write+0xb4/0xf0
> [<ffffffffad2f7736>] vfs_write+0xf6/0x260
> [<ffffffffad2f8d4f>] SyS_write+0xbf/0x160
> [<ffffffffad2f8c90>] ? SyS_read+0x160/0x160
> [<ffffffffad002017>] ? trace_hardirqs_on_thunk+0x17/0x19
> [<ffffffffadceab17>] entry_SYSCALL_64_fastpath+0x12/0x6b

-chris

2015-12-15 00:00:20

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

On Mon, Dec 14, 2015 at 01:33:56PM -0500, Dave Jones wrote:
> On Sat, Dec 12, 2015 at 07:07:46PM -0500, Chris Mason wrote:
> > On Sat, Dec 12, 2015 at 11:41:26AM -0800, Linus Torvalds wrote:
> > > On Sat, Dec 12, 2015 at 10:33 AM, Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > Peter, did that patch also handle just plain "lock_page()" case?
> > >
> > > Looking more at it, I think this all goes back to commit 743162013d40
> > > ("sched: Remove proliferation of wait_on_bit() action functions").
> > >
> > > It looks like PeterZ's pending patch should fix this, by passing in
> > > the proper TASK_UNINTERRUPTIBLE to the bit_wait_io function, and going
> > > back to signal_pending_state(). PeterZ, did I follow the history of
> > > this correctly?
> >
> > Looks right to me, I found Peter's patch and have it running now. After
> > about 6 hours my patch did eventually crash again under trinity. Btrfs has a
> > very old (from 2011) bug in the error handling path that trinity is
> > banging on.
>
> Is the other bug this one ? I've hit this quite a lot over the last 12 months,
> and now that the lock_page bug is fixed this is showing up again.

Linus, I'll send this in a pull request, but just to close the loop in
this thread:

From: Chris Mason <[email protected]>
Subject: [PATCH] Btrfs: check prepare_uptodate_page() error code earlier

prepare_pages() may end up calling prepare_uptodate_page() twice if our
write only spans a single page. But if the first call returns an error,
our page will be unlocked and its not safe to call it again.

This bug goes all the way back to 2011, and it's not something commonly
hit.

While we're here, add a more explicit check for the page being truncated
away. The bare lock_page() alone is protected only by good thoughts and
i_mutex, which we're sure to regret eventually.

Reported-by: Dave Jones <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
---
fs/btrfs/file.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 72e7346..0f09526 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1291,7 +1291,8 @@ out:
* on error we return an unlocked page and the error value
* on success we return a locked page and 0
*/
-static int prepare_uptodate_page(struct page *page, u64 pos,
+static int prepare_uptodate_page(struct inode *inode,
+ struct page *page, u64 pos,
bool force_uptodate)
{
int ret = 0;
@@ -1306,6 +1307,10 @@ static int prepare_uptodate_page(struct page *page, u64 pos,
unlock_page(page);
return -EIO;
}
+ if (page->mapping != inode->i_mapping) {
+ unlock_page(page);
+ return -EAGAIN;
+ }
}
return 0;
}
@@ -1324,6 +1329,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
int faili;

for (i = 0; i < num_pages; i++) {
+again:
pages[i] = find_or_create_page(inode->i_mapping, index + i,
mask | __GFP_WRITE);
if (!pages[i]) {
@@ -1333,13 +1339,17 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
}

if (i == 0)
- err = prepare_uptodate_page(pages[i], pos,
+ err = prepare_uptodate_page(inode, pages[i], pos,
force_uptodate);
- if (i == num_pages - 1)
- err = prepare_uptodate_page(pages[i],
+ if (!err && i == num_pages - 1)
+ err = prepare_uptodate_page(inode, pages[i],
pos + write_bytes, false);
if (err) {
page_cache_release(pages[i]);
+ if (err == -EAGAIN) {
+ err = 0;
+ goto again;
+ }
faili = i - 1;
goto fail;
}
--
2.4.6