From: Martin Schwidefsky <[email protected]>
The walk_page_range function may only be called while holding the mmap
semaphore. Otherwise a concurrent munmap could free a page table that
is read by the generic page table walker.
Cc: Matt Mackall <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
---
fs/proc/task_mmu.c | 2 ++
1 file changed, 2 insertions(+)
diff -urpN linux-2.6/fs/proc/task_mmu.c linux-2.6-patched/fs/proc/task_mmu.c
--- linux-2.6/fs/proc/task_mmu.c 2009-03-12 11:32:51.000000000 +0100
+++ linux-2.6-patched/fs/proc/task_mmu.c 2009-03-12 11:33:16.000000000 +0100
@@ -716,7 +716,9 @@ static ssize_t pagemap_read(struct file
* user buffer is tracked in "pm", and the walk
* will stop when we hit the end of the buffer.
*/
+ down_read(&mm->mmap_sem);
ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
+ up_read(&mm->mmap_sem);
if (ret == PM_END_OF_BUFFER)
ret = 0;
/* don't need mmap_sem for these, but this looks cleaner */
On Thu, Mar 12, 2009 at 11:33:08AM +0100, Martin Schwidefsky wrote:
> --- linux-2.6/fs/proc/task_mmu.c
> +++ linux-2.6-patched/fs/proc/task_mmu.c
> @@ -716,7 +716,9 @@ static ssize_t pagemap_read(struct file
> * user buffer is tracked in "pm", and the walk
> * will stop when we hit the end of the buffer.
> */
> + down_read(&mm->mmap_sem);
> ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
> + up_read(&mm->mmap_sem);
This will introduce "put_user under mmap_sem" which is deadlockable.
On Thu, 12 Mar 2009 14:45:33 +0300
Alexey Dobriyan <[email protected]> wrote:
> On Thu, Mar 12, 2009 at 11:33:08AM +0100, Martin Schwidefsky wrote:
> > --- linux-2.6/fs/proc/task_mmu.c
> > +++ linux-2.6-patched/fs/proc/task_mmu.c
> > @@ -716,7 +716,9 @@ static ssize_t pagemap_read(struct file
> > * user buffer is tracked in "pm", and the walk
> > * will stop when we hit the end of the buffer.
> > */
> > + down_read(&mm->mmap_sem);
> > ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
> > + up_read(&mm->mmap_sem);
>
> This will introduce "put_user under mmap_sem" which is deadlockable.
Hmm, interesting. In this case the pagemap interface is fundamentally broken.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Thu, 2009-03-12 at 12:54 +0100, Martin Schwidefsky wrote:
> On Thu, 12 Mar 2009 14:45:33 +0300
> Alexey Dobriyan <[email protected]> wrote:
>
> > On Thu, Mar 12, 2009 at 11:33:08AM +0100, Martin Schwidefsky wrote:
> > > --- linux-2.6/fs/proc/task_mmu.c
> > > +++ linux-2.6-patched/fs/proc/task_mmu.c
> > > @@ -716,7 +716,9 @@ static ssize_t pagemap_read(struct file
> > > * user buffer is tracked in "pm", and the walk
> > > * will stop when we hit the end of the buffer.
> > > */
> > > + down_read(&mm->mmap_sem);
> > > ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
> > > + up_read(&mm->mmap_sem);
> >
> > This will introduce "put_user under mmap_sem" which is deadlockable.
>
> Hmm, interesting. In this case the pagemap interface is fundamentally broken.
Well it means we may have to reintroduce the very annoying double
buffering from various earlier implementations. But let's leave this
discussion until after we've figured out what to do about the walker
code.
--
http://selenic.com : development and support for Mercurial and Linux
On Thu, 12 Mar 2009 10:23:34 -0500
Matt Mackall <[email protected]> wrote:
> On Thu, 2009-03-12 at 12:54 +0100, Martin Schwidefsky wrote:
> > On Thu, 12 Mar 2009 14:45:33 +0300
> > Alexey Dobriyan <[email protected]> wrote:
> >
> > > On Thu, Mar 12, 2009 at 11:33:08AM +0100, Martin Schwidefsky wrote:
> > > > --- linux-2.6/fs/proc/task_mmu.c
> > > > +++ linux-2.6-patched/fs/proc/task_mmu.c
> > > > @@ -716,7 +716,9 @@ static ssize_t pagemap_read(struct file
> > > > * user buffer is tracked in "pm", and the walk
> > > > * will stop when we hit the end of the buffer.
> > > > */
> > > > + down_read(&mm->mmap_sem);
> > > > ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
> > > > + up_read(&mm->mmap_sem);
> > >
> > > This will introduce "put_user under mmap_sem" which is deadlockable.
> >
> > Hmm, interesting. In this case the pagemap interface is fundamentally broken.
>
> Well it means we may have to reintroduce the very annoying double
> buffering from various earlier implementations. But let's leave this
> discussion until after we've figured out what to do about the walker
> code.
Which would be really ugly. I still have not grasped why this will
introduce a deadlock though. The worst the put_user can do is to cause
a page fault, no? I do not see where the fault handler acquires the
mmap_sem as writer. It takes the mmap_sem as reader and two readers
should be fine.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Thu, 12 Mar 2009 16:41:31 +0100
Brice Goglin <[email protected]> wrote:
> > Which would be really ugly. I still have not grasped why this will
> > introduce a deadlock though. The worst the put_user can do is to cause
> > a page fault, no? I do not see where the fault handler acquires the
> > mmap_sem as writer. It takes the mmap_sem as reader and two readers
> > should be fine.
>
> Somebody else can acquire for write in the meantime, for instance
> another thread doing mprotect. This writer is blocked by the first
> reader, and the second reader is blocked by the writer. So both
> tasks are blocked.
I see, fair r/w locks. So nested down_read is a no-no.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Thu, 12 Mar 2009 10:23:34 -0500
Matt Mackall <[email protected]> wrote:
> Well it means we may have to reintroduce the very annoying double
> buffering from various earlier implementations. But let's leave this
> discussion until after we've figured out what to do about the walker
> code.
About the walker code. I've realized that there is another way to fix
this. The TASK_SIZE definition is currently used for two things: 1) as
a maximum mappable address, 2) the size of the address space for a
process. And there lies a problem: while a process is using a reduced
page table 1) and 2) differ. If I make TASK_SIZE give you the current
size of the address space then it is not possible to mmap an object
beyond 4TB and the page table upgrade never happens. If I make
TASK_SIZE return the maximum mappable address the page table walker
breaks. The solution could be to introduce MAX_TASK_SIZE and use that
in the mmap code to find out what can be mapped.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> On Thu, 12 Mar 2009 10:23:34 -0500
> Matt Mackall <[email protected]> wrote:
>
>> On Thu, 2009-03-12 at 12:54 +0100, Martin Schwidefsky wrote:
>>> On Thu, 12 Mar 2009 14:45:33 +0300
>>> Alexey Dobriyan <[email protected]> wrote:
>>>
>>>> On Thu, Mar 12, 2009 at 11:33:08AM +0100, Martin Schwidefsky wrote:
>>>>> --- linux-2.6/fs/proc/task_mmu.c
>>>>> +++ linux-2.6-patched/fs/proc/task_mmu.c
>>>>> @@ -716,7 +716,9 @@ static ssize_t pagemap_read(struct file
>>>>> * user buffer is tracked in "pm", and the walk
>>>>> * will stop when we hit the end of the buffer.
>>>>> */
>>>>> + down_read(&mm->mmap_sem);
>>>>> ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
>>>>> + up_read(&mm->mmap_sem);
>>>> This will introduce "put_user under mmap_sem" which is deadlockable.
>>> Hmm, interesting. In this case the pagemap interface is fundamentally broken.
>> Well it means we may have to reintroduce the very annoying double
>> buffering from various earlier implementations. But let's leave this
>> discussion until after we've figured out what to do about the walker
>> code.
>
> Which would be really ugly. I still have not grasped why this will
> introduce a deadlock though. The worst the put_user can do is to cause
> a page fault, no? I do not see where the fault handler acquires the
> mmap_sem as writer. It takes the mmap_sem as reader and two readers
> should be fine.
Somebody else can acquire for write in the meantime, for instance
another thread doing mprotect. This writer is blocked by the first
reader, and the second reader is blocked by the writer. So both
tasks are blocked.
Brice
On Thu, 12 Mar 2009 16:54:51 +0100
Martin Schwidefsky <[email protected]> wrote:
> On Thu, 12 Mar 2009 10:23:34 -0500
> Matt Mackall <[email protected]> wrote:
>
> > Well it means we may have to reintroduce the very annoying double
> > buffering from various earlier implementations. But let's leave this
> > discussion until after we've figured out what to do about the walker
> > code.
>
> About the walker code. I've realized that there is another way to fix
> this. The TASK_SIZE definition is currently used for two things: 1) as
> a maximum mappable address, 2) the size of the address space for a
> process. And there lies a problem: while a process is using a reduced
> page table 1) and 2) differ. If I make TASK_SIZE give you the current
> size of the address space then it is not possible to mmap an object
> beyond 4TB and the page table upgrade never happens. If I make
> TASK_SIZE return the maximum mappable address the page table walker
> breaks. The solution could be to introduce MAX_TASK_SIZE and use that
> in the mmap code to find out what can be mapped.
I got around the TASK_SIZE checks with arch code only. In total I'll
need two fixes, one that makes TASK_SIZE to reflect the size of the
current address space and another one to get the page table upgrades
working again. I'll push these patches via git390 as no common code
changes are required.
Which leaves the mmap_sem issue as the only problem left.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tue, 2009-03-17 at 13:04 +0100, Martin Schwidefsky wrote:
> On Thu, 12 Mar 2009 16:54:51 +0100
> Martin Schwidefsky <[email protected]> wrote:
>
> > On Thu, 12 Mar 2009 10:23:34 -0500
> > Matt Mackall <[email protected]> wrote:
> >
> > > Well it means we may have to reintroduce the very annoying double
> > > buffering from various earlier implementations. But let's leave this
> > > discussion until after we've figured out what to do about the walker
> > > code.
> >
> > About the walker code. I've realized that there is another way to fix
> > this. The TASK_SIZE definition is currently used for two things: 1) as
> > a maximum mappable address, 2) the size of the address space for a
> > process. And there lies a problem: while a process is using a reduced
> > page table 1) and 2) differ. If I make TASK_SIZE give you the current
> > size of the address space then it is not possible to mmap an object
> > beyond 4TB and the page table upgrade never happens. If I make
> > TASK_SIZE return the maximum mappable address the page table walker
> > breaks. The solution could be to introduce MAX_TASK_SIZE and use that
> > in the mmap code to find out what can be mapped.
>
> I got around the TASK_SIZE checks with arch code only. In total I'll
> need two fixes, one that makes TASK_SIZE to reflect the size of the
> current address space and another one to get the page table upgrades
> working again. I'll push these patches via git390 as no common code
> changes are required.
Thanks, Martin.
> Which leaves the mmap_sem issue as the only problem left.
Yeah, this one needs more thought. I'd really rather not go back to
double-buffering here as it was much more complicated, not to mention
slow.
--
http://selenic.com : development and support for Mercurial and Linux