2014-01-16 01:13:09

by Alan Ott

[permalink] [raw]
Subject: Deadlock in do_page_fault() on ARM (old kernel)

Hello,

I have a deadlock that I'm trying to understand. The symptom is multiple
tasks trying to acquire a read lock (down_read()) on mm->mmap_sem in
do_page_fault(). I'll be right up front and say that this is a fairly
old kernel (2.6.37 TI PSP kernel) on a fairly old processor DaVinci 6446.

At the time of the deadlock, sysrq's show-all-tasks shows the following
for three of the tasks which are deadlocked (there are more, but I just
picked the interesting ones; the full output is at [1]):

ui D c0ea8208 0 1405 1293 0x00000000
[<c0ea8208>] (schedule+0x33c/0x3c4) from [<c0eaa3b4>]
(__down_read+0xbc/0xd4)
[<c0eaa3b4>] (__down_read+0xbc/0xd4) from [<c0c0b378>]
(do_page_fault+0x94/0x248)
[<c0c0b378>] (do_page_fault+0x94/0x248) from [<c0c052e0>]
(do_DataAbort+0x34/0x94)
[<c0c052e0>] (do_DataAbort+0x34/0x94) from [<c0c05b0c>]
(__dabt_svc+0x4c/0x60)
Exception stack(0xc048dce8 to 0xc048dd30)
dce0: 400e9a94 c048ddb0 ffffffec 00000000 c048c000
c048dda4
dd00: 400e9a94 00000000 ffffff92 c048c000 00000000 00000001 00000014
c048dd34
dd20: 00000000 c0d1f68c 00000013 ffffffff
[<c0c05b0c>] (__dabt_svc+0x4c/0x60) from [<c0d1f68c>]
(__copy_to_user_std+0xcc/0x3a8)

ui D c0ea8208 0 1406 1293 0x00000000
[<c0ea8208>] (schedule+0x33c/0x3c4) from [<c0eaa3b4>]
(__down_read+0xbc/0xd4)
[<c0eaa3b4>] (__down_read+0xbc/0xd4) from [<c0c0b378>]
(do_page_fault+0x94/0x248)
[<c0c0b378>] (do_page_fault+0x94/0x248) from [<c0c052e0>]
(do_DataAbort+0x34/0x94)
[<c0c052e0>] (do_DataAbort+0x34/0x94) from [<c0c05f0c>]
(ret_from_exception+0x0/0x10)
Exception stack(0xc048ffb0 to 0xc048fff8)
ffa0: 00000060 0000000a 000000a8
0010d000
ffc0: 00c23d80 00c23de8 405af06c 00000000 405af03c 405af074 00000050
000001ff
ffe0: 405ae000 40185748 404f5c4c 404f393c 80000010 ffffffff

ui D c0ea8208 0 1411 1293 0x00000000
[<c0ea8208>] (schedule+0x33c/0x3c4) from [<c0eaa3b4>]
(__down_read+0xbc/0xd4)
[<c0eaa3b4>] (__down_read+0xbc/0xd4) from [<c0c0b378>]
(do_page_fault+0x94/0x248)
[<c0c0b378>] (do_page_fault+0x94/0x248) from [<c0c052e0>]
(do_DataAbort+0x34/0x94)
[<c0c052e0>] (do_DataAbort+0x34/0x94) from [<c0c05f0c>]
(ret_from_exception+0x0/0x10)
Exception stack(0xc053bfb0 to 0xc053bff8)
bfa0: 00000000 00000001 00ba3610
00000000
bfc0: 00000000 00ba3610 00bb6020 00ba3610 40074000 00b91024 415e4930
00000583
bfe0: 00b611a0 415e38e0 4005f3e4 ffff0fc0 60000010 ffffffff

---- [snip] ----

Showing all locks held in the system:
1 lock held by getty/1294:
#0: (&tty->atomic_read_lock){+.+...}, at: [<c0d45bf0>]
n_tty_read+0x21c/0x670
1 lock held by ui/1405:
#0: (&mm->mmap_sem){++++++}, at: [<c0c0b378>] do_page_fault+0x94/0x248
1 lock held by ui/1406:
#0: (&mm->mmap_sem){++++++}, at: [<c0c0b378>] do_page_fault+0x94/0x248
1 lock held by ui/1408:
#0: (&mm->mmap_sem){++++++}, at: [<c0c0b378>] do_page_fault+0x94/0x248
1 lock held by ui/1409:
#0: (&mm->mmap_sem){++++++}, at: [<c0c0b378>] do_page_fault+0x94/0x248
1 lock held by ui/1411:
#0: (&mm->mmap_sem){++++++}, at: [<c0c0b378>] do_page_fault+0x94/0x248
1 lock held by ui/1416:
#0: (&mm->mmap_sem){++++++}, at: [<c0c6e604>] sys_mmap_pgoff+0x70/0xc0
1 lock held by ui/1418:
#0: (&mm->mmap_sem){++++++}, at: [<c0c0b378>] do_page_fault+0x94/0x248
1 lock held by ui/1420:
#0: (&mm->mmap_sem){++++++}, at: [<c0c6e604>] sys_mmap_pgoff+0x70/0xc0
1 lock held by ui/1434:
#0: (&tty->atomic_read_lock){+.+...}, at: [<c0d45bf0>]
n_tty_read+0x21c/0x670


Note that above, do_page_fault() takes out a read lock (down_read()) and
sys_mmap_pgoff() takes out a write lock (down_write()).

I've searched for this kind of problem and found two patches which seem
to be related to this issue[2]. I have applied both with no better results.

So my questions are:
1. Why don't I see a full backtrace beyond the exception stack? It's the
same when dump_stack() is called manually.
2. __copy_to_user_memcpy() takes a read lock (down_read()) on
mm->mmap_sem. While that lock is held, __copy_to_user_memcpy() can
generate a page fault, causing do_page_fault() to get called, which will
also try to get a read lock (down_read()) on mm->mmap_sem. Multiple read
locks can be taken on an rw_semaphore, but deadlock will occur if
another thread tries to get a write lock (down_write()) in between. For
example:
Task 1: Task 2:
down_read(sem)
down_write(sem) <-- Goes to sleep
down_read(sem) <-- Goes to sleep

There is a thread from 2005[3] which seems to discuss the same concept
of recursive rw_semaphores, but for futexes.

Other comments:
1. My analysis of this probably wrong. Otherwise it seems many others
would have the same problem, and they don't seem to. I'm hoping this
email will help to correct my understanding.
2. I looked through the git logs for recent (since 2.6.37 time frame)
and nothing else jumped out at me as being an obvious fix for this
situation.

Thanks for any insight you can give,

Alan.

[1] http://www.signal11.us/~alan/show-all-tasks-deadlock.txt

[2] Some websites/bugtrackers mention this commit with a similar issue,
but I'm not entirely sure how it's related:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8878a539ff19a43cf3729e7562cd528f490246ae

This one seems obviously related, but has no effect on my system:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=435a7ef52db7d86e67a009b36cac1457f8972391

[3] http://thread.gmane.org/gmane.linux.kernel/280900


2014-01-17 13:46:52

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Deadlock in do_page_fault() on ARM (old kernel)

On Wed, Jan 15, 2014 at 08:13:04PM -0500, Alan Ott wrote:
> So my questions are:
> 1. Why don't I see a full backtrace beyond the exception stack? It's the
> same when dump_stack() is called manually.

No idea - it looks like you're not using frame pointers, but are using
the unwinder. Full backtraces can always be created with frame pointers,
it's just that unwinding seems unreliable.

I think we do need to see the full backtrace here - from looking at the
full state dump, I don't see any sign of the mmap_sem being held except
by an attempt to process a fault, and two threads trying to do a
sys_mmap_pgoff().

My suspicion therefore is that some other thread must have died while
holding the mmap_sem, so there's probably a kernel oops earlier...
that's my best guess at the moment without seeing the full backtrace.

--
FTTC broadband for 0.8mile line: 5.8Mbps down 500kbps up. Estimation
in database were 13.1 to 19Mbit for a good line, about 7.5+ for a bad.
Estimate before purchase was "up to 13.2Mbit".

2014-01-18 00:57:17

by Alan Ott

[permalink] [raw]
Subject: Re: Deadlock in do_page_fault() on ARM (old kernel)

On 01/17/2014 08:46 AM, Russell King - ARM Linux wrote:
> On Wed, Jan 15, 2014 at 08:13:04PM -0500, Alan Ott wrote:
>> So my questions are:
>> 1. Why don't I see a full backtrace beyond the exception stack? It's the
>> same when dump_stack() is called manually.
> No idea - it looks like you're not using frame pointers, but are using
> the unwinder. Full backtraces can always be created with frame pointers,
> it's just that unwinding seems unreliable.

Hi Russell,

I managed to get frame pointers turned on, which in this kernel version,
seems like it requires the unwinder to be turned off[1].

When I built with frame pointers, the backtraces show differently. It
only shows the frames which were _not_ shown with the unwinder.
Backtrace at [2]. As you can see, it shows the non-exception stack dumps
ending at places which can't page fault (and if it did page fault in
those locations, it wouldn't try to take the mmap_sem lock). Its as
though it shows what's in the modules but not what's in the image, where
with the unwinder, it showed what's in the image and not what's in the
modules[3].

> I think we do need to see the full backtrace here - from looking at the
> full state dump, I don't see any sign of the mmap_sem being held except
> by an attempt to process a fault, and two threads trying to do a
> sys_mmap_pgoff().

The lockdep printout at the end of [4] (which is the original log I sent
the other day) shows do_page_fault() holding mmap_sem in six tasks and
sys_mmap_pgoff() holding it in two. (I don't like the term "held" in the
lockdep printout and believe it to mean "held or waiting for," based on
my analysis of the rw_semaphore code.)

Each of those tasks seems to be blocking for it.

> My suspicion therefore is that some other thread must have died while
> holding the mmap_sem, so there's probably a kernel oops earlier...
> that's my best guess at the moment without seeing the full backtrace.

There's no oops that I'm able to see.

Each of the tasks which lockdep reports as "holding" mmap_sem are
blocking for it. If some other task had taken it and then crashed, I
assume lockdep would list the crashed task as also holding the resource
in the printout.

Thanks for having a look at this,

Alan.

[1] It seems a little strange to me even in the newest kernel:
1: lib/Kconfig.debug specifies FRAME_POINTER . The text of this instance
shows up when one searches for FRAME_POINTER in menuconfig. It can
become selected even though the dependencies listed here are not met.
2: arch/arm/Kconfig.debug also specifies FRAME_POINTER, with a
dependency only on !THUMB2_KERNEL, defaulting to yes on !ARM_UNWIND
3: ARM doens't set ARCH_WANT_FRAME_POINTERS

[2] http://www.signal11.us/~alan/stack_dump_with_frame_pointers.txt
Note: This is a sysrq dump of the locks held at lockup time and then the
automatic hung task detection.

[3] The modules are being built in the standard way for out-of-tree:
$(MAKE) -C $(PRJDIR)/linux M=$(CWD) modules

[4] http://www.signal11.us/~alan/show-all-tasks-deadlock.txt

2014-01-18 01:20:52

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Deadlock in do_page_fault() on ARM (old kernel)

On Fri, Jan 17, 2014 at 07:57:16PM -0500, Alan Ott wrote:
> On 01/17/2014 08:46 AM, Russell King - ARM Linux wrote:
>> My suspicion therefore is that some other thread must have died while
>> holding the mmap_sem, so there's probably a kernel oops earlier...
>> that's my best guess at the moment without seeing the full backtrace.
>
> There's no oops that I'm able to see.
>
> Each of the tasks which lockdep reports as "holding" mmap_sem are
> blocking for it. If some other task had taken it and then crashed, I
> assume lockdep would list the crashed task as also holding the resource
> in the printout.

My point is this:

- the five (or six) threads which are trying to take the mmap_sem in
read-mode in the fault handler are all blocked on it - they haven't
taken the lock, which will only happen because there's a pending writer.
- of these in your original post, there are two which faulted from
__copy_to_user_std(). __copy_to_user_std() doesn't take the mmap_sem -
this is the non-uaccess-with-memcpy path.
- the pending writers are the two threads in sys_mmap_pgoff(), both of
which are blocked waiting to gain the write lock.
- there are no *other* threads holding the mmap_sem lock.

So... there's a question here how we got into this state - and frankly
I don't know. What I do see from your latest dump is that there's two
unknown modules there - something called rcu2m and another called
buttoms, and there are two threads inside ioctls there. Both have
faulted from the function at 0xc0d2a394 (which won't appear in the
backtrace, but is most likely __copy_to_user_std.)

So, in the absence of you saying anything about there being any preceding
oopses, my conclusion now is that one of those modules is taking the
mmap_sem itself, and is the culpret inducing this deadlock.

Note that your dump ([2]) in your reply was just the hung task detector
printing out the stacktrace for a few tasks, not the full all-threads
stack dump which I was expecting.

So I'm pulling out these conclusions from the very little information
you're supplying.

--
FTTC broadband for 0.8mile line: 5.8Mbps down 500kbps up. Estimation
in database were 13.1 to 19Mbit for a good line, about 7.5+ for a bad.
Estimate before purchase was "up to 13.2Mbit".

2014-01-20 10:15:13

by Michal Hocko

[permalink] [raw]
Subject: Re: Deadlock in do_page_fault() on ARM (old kernel)

On Wed 15-01-14 20:13:04, Alan Ott wrote:
[...]
> 2. __copy_to_user_memcpy() takes a read lock (down_read()) on

This looks like a bug. copy_to_user_* shouldn't take mmap_sem at all
Check the might_fault annotation used in generic code. Arm version of
copy_to_user* doesn't seem to use the annotation and I do not see a good
reason for that.

> mm->mmap_sem. While that lock is held, __copy_to_user_memcpy() can
> generate a page fault, causing do_page_fault() to get called, which
> will also try to get a read lock (down_read()) on mm->mmap_sem.
> Multiple read locks can be taken on an rw_semaphore, but deadlock
> will occur if another thread tries to get a write lock
> (down_write()) in between. For example:
> Task 1: Task 2:
> down_read(sem)
> down_write(sem) <-- Goes to sleep
> down_read(sem) <-- Goes to sleep
>
> There is a thread from 2005[3] which seems to discuss the same
> concept of recursive rw_semaphores, but for futexes.
>
> Other comments:
> 1. My analysis of this probably wrong. Otherwise it seems many
> others would have the same problem, and they don't seem to. I'm
> hoping this email will help to correct my understanding.
> 2. I looked through the git logs for recent (since 2.6.37 time
> frame) and nothing else jumped out at me as being an obvious fix for
> this situation.
>
> Thanks for any insight you can give,
>
> Alan.
>
> [1] http://www.signal11.us/~alan/show-all-tasks-deadlock.txt
>
> [2] Some websites/bugtrackers mention this commit with a similar
> issue, but I'm not entirely sure how it's related:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8878a539ff19a43cf3729e7562cd528f490246ae
>
> This one seems obviously related, but has no effect on my system:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=435a7ef52db7d86e67a009b36cac1457f8972391
>
> [3] http://thread.gmane.org/gmane.linux.kernel/280900
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Michal Hocko
SUSE Labs

2014-01-20 18:45:41

by Michal Hocko

[permalink] [raw]
Subject: Re: Deadlock in do_page_fault() on ARM (old kernel)

On Mon 20-01-14 11:15:09, Michal Hocko wrote:
> On Wed 15-01-14 20:13:04, Alan Ott wrote:
> [...]
> > 2. __copy_to_user_memcpy() takes a read lock (down_read()) on
>
> This looks like a bug. copy_to_user_* shouldn't take mmap_sem at all
> Check the might_fault annotation used in generic code. Arm version of
> copy_to_user* doesn't seem to use the annotation and I do not see a good
> reason for that.

OK, so I have looked at the implementation of __copy_to_user_memcpy and
it drops the semaphore before it does __put_user to fault memory in. It
then reacquires the lock to make sure that the pte doesn't vanish during
memcpy. It holds pte lock to ensure that.

The mmap_sem reacquire happens with pte lock held though and this smells
like a deadlock situation because the page fault takes mmap_sem first
and only then takes ptl. I am not sure this is exactly what happens in
your case though because you seem to have tasks blocked on the mmap_sem
already.

> > mm->mmap_sem. While that lock is held, __copy_to_user_memcpy() can
> > generate a page fault, causing do_page_fault() to get called, which
> > will also try to get a read lock (down_read()) on mm->mmap_sem.
> > Multiple read locks can be taken on an rw_semaphore, but deadlock
> > will occur if another thread tries to get a write lock
> > (down_write()) in between. For example:
> > Task 1: Task 2:
> > down_read(sem)
> > down_write(sem) <-- Goes to sleep
> > down_read(sem) <-- Goes to sleep
> >
> > There is a thread from 2005[3] which seems to discuss the same
> > concept of recursive rw_semaphores, but for futexes.
> >
> > Other comments:
> > 1. My analysis of this probably wrong. Otherwise it seems many
> > others would have the same problem, and they don't seem to. I'm
> > hoping this email will help to correct my understanding.
> > 2. I looked through the git logs for recent (since 2.6.37 time
> > frame) and nothing else jumped out at me as being an obvious fix for
> > this situation.
> >
> > Thanks for any insight you can give,
> >
> > Alan.
> >
> > [1] http://www.signal11.us/~alan/show-all-tasks-deadlock.txt
> >
> > [2] Some websites/bugtrackers mention this commit with a similar
> > issue, but I'm not entirely sure how it's related:
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8878a539ff19a43cf3729e7562cd528f490246ae
> >
> > This one seems obviously related, but has no effect on my system:
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=435a7ef52db7d86e67a009b36cac1457f8972391
> >
> > [3] http://thread.gmane.org/gmane.linux.kernel/280900
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
>
> --
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Michal Hocko
SUSE Labs

2014-01-20 23:58:24

by Alan Ott

[permalink] [raw]
Subject: Re: Deadlock in do_page_fault() on ARM (old kernel)

On 01/17/2014 08:20 PM, Russell King - ARM Linux wrote:
> On Fri, Jan 17, 2014 at 07:57:16PM -0500, Alan Ott wrote:
>> On 01/17/2014 08:46 AM, Russell King - ARM Linux wrote:
>>> My suspicion therefore is that some other thread must have died while
>>> holding the mmap_sem, so there's probably a kernel oops earlier...
>>> that's my best guess at the moment without seeing the full backtrace.
>> There's no oops that I'm able to see.
>>
>> Each of the tasks which lockdep reports as "holding" mmap_sem are
>> blocking for it. If some other task had taken it and then crashed, I
>> assume lockdep would list the crashed task as also holding the resource
>> in the printout.
> My point is this:
>
> - the five (or six) threads which are trying to take the mmap_sem in
> read-mode in the fault handler are all blocked on it - they haven't
> taken the lock, which will only happen because there's a pending writer.
> - of these in your original post, there are two which faulted from
> __copy_to_user_std(). __copy_to_user_std() doesn't take the mmap_sem -
> this is the non-uaccess-with-memcpy path.
> - the pending writers are the two threads in sys_mmap_pgoff(), both of
> which are blocked waiting to gain the write lock.
> - there are no *other* threads holding the mmap_sem lock.

Yes, all true. I don't remember why I started looking at the memcpy() case.

> So... there's a question here how we got into this state - and frankly
> I don't know. What I do see from your latest dump is that there's two
> unknown modules there - something called rcu2m and another called
> buttoms, and there are two threads inside ioctls there. Both have
> faulted from the function at 0xc0d2a394 (which won't appear in the
> backtrace, but is most likely __copy_to_user_std.)

Yes, there are a handful of out-of-tree modules.

> So, in the absence of you saying anything about there being any preceding
> oopses, my conclusion now is that one of those modules is taking the
> mmap_sem itself, and is the culpret inducing this deadlock.

Yes, I came to that as well. I had checked for the presence of mmap_sem
in the sources of the out-of-tree modules and didn't see it. However,
upon closer inspection, my grep-fu failed me as there were some backward
symlinks I didn't account for. TI's cmemk module _is_ taking out
mmap_sem. I wish I had seen this days ago. That's my new investigation path.

> Note that your dump ([2]) in your reply was just the hung task detector
> printing out the stacktrace for a few tasks, not the full all-threads
> stack dump which I was expecting.

Yes, in a misguided attempt to keep the SNR high, I didn't include the
full dump, but only what I thought was the interesting part. I did
another capture and the full dump is at [1] .

> So I'm pulling out these conclusions from the very little information
> you're supplying.

I appreciate it. Thank you for taking the time to reply.

Alan.

[1]
http://www.signal11.us/~alan/stack_dump_all_tasks_with_frame_pointers.txt