2007-10-31 00:45:46

by Duane Griffin

[permalink] [raw]
Subject: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

Accessing a memory mapped region past the last page containing a valid
file mapping produces a SIGBUS fault (as it should). Running a program
that does this under gdb, then accessing the invalid memory from gdb,
causes it to start consuming 100% CPU and become unkillable. Once in
that state, SysRq-T doesn't show a stack trace for gdb, although it is
shown as running and stack traces are dumped for other tasks.

2.6.22 does not have this bug (gdb just prints '\0' as the contents,
although arguably that is also a bug, and it should instead report the
SIGBUS).

Bisection indicates the problem was introduced by:

54cb8821de07f2ffcd28c380ce9b93d5784b40d7
"mm: merge populate and nopage into fault (fixes nonlinear)"

The following program demonstrates the issue:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
int fd;
struct stat buf;

if (argc != 2) {
fprintf(stderr, "usage: %s <filename>\n", argv[0]);
exit(1);
}

fd = open(argv[1], O_RDONLY);
fstat(fd, &buf);
int count = buf.st_size + sysconf(_SC_PAGE_SIZE);
char *file = (char *) mmap(NULL, count, PROT_READ, MAP_PRIVATE, fd, 0);
if (file == MAP_FAILED) {
fprintf(stderr, "mmap failed: %s\n", strerror(errno));
} else {
char ch;
fprintf(stderr, "using offset %d\n", (count - 1));
ch = file[count - 1];
munmap(file, count);
}
close(fd);
return 0;
}

To reproduce the bug, run it under gdb, go up a couple of frames to
the main function, then access invalid memory, for e.g. with: "print
file[4095]", or whatever offset was reported.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan


2007-10-31 04:19:43

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Wed, Oct 31, 2007 at 12:45:35AM +0000, Duane Griffin wrote:
> Accessing a memory mapped region past the last page containing a valid
> file mapping produces a SIGBUS fault (as it should). Running a program
> that does this under gdb, then accessing the invalid memory from gdb,
> causes it to start consuming 100% CPU and become unkillable. Once in
> that state, SysRq-T doesn't show a stack trace for gdb, although it is
> shown as running and stack traces are dumped for other tasks.
>
> 2.6.22 does not have this bug (gdb just prints '\0' as the contents,
> although arguably that is also a bug, and it should instead report the
> SIGBUS).
>
> Bisection indicates the problem was introduced by:
>
> 54cb8821de07f2ffcd28c380ce9b93d5784b40d7
> "mm: merge populate and nopage into fault (fixes nonlinear)"
>
> The following program demonstrates the issue:
>
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/stat.h>
>
> int main(int argc, char *argv[])
> {
> int fd;
> struct stat buf;
>
> if (argc != 2) {
> fprintf(stderr, "usage: %s <filename>\n", argv[0]);
> exit(1);
> }
>
> fd = open(argv[1], O_RDONLY);
> fstat(fd, &buf);
> int count = buf.st_size + sysconf(_SC_PAGE_SIZE);
> char *file = (char *) mmap(NULL, count, PROT_READ, MAP_PRIVATE, fd, 0);
> if (file == MAP_FAILED) {
> fprintf(stderr, "mmap failed: %s\n", strerror(errno));
> } else {
> char ch;
> fprintf(stderr, "using offset %d\n", (count - 1));
> ch = file[count - 1];
> munmap(file, count);
> }
> close(fd);
> return 0;
> }
>
> To reproduce the bug, run it under gdb, go up a couple of frames to
> the main function, then access invalid memory, for e.g. with: "print
> file[4095]", or whatever offset was reported.


Well that's probably the best bug report I've ever had, thanks Duane!

The issue is a silly thinko -- I didn't pay enough attention to the ptrace
rules in filemap_fault :( partly I think that's because I don't understand
them and am scared of them ;)

The following minimal patch fixes it here, and should probably be applied to
2.6.23 stable as well as 2.6.24. If you could verify it, that would be much
appreciated.

However I actually don't really like how this all works. I don't like that
filemap.c should have to know about ptrace, or exactly what ptrace wants here.
It's a bit hairy to force insert page into pagecache and pte into pagetables
here, given the races.

In access_process_vm, can't we just zero fill in the case of a sigbus? Linus?
That will also avoid changing applicatoin behaviour due to a gdb read...

Thanks,
Nick

--
Duane Griffin noticed a 2.6.23 regression that will cause gdb to hang when
it tries to access the memory of another process beyond i_size.

This is because the solution to the fault vs invalidate race requires that
we recheck i_size after the page lock is taken. However in that case, I
had not accounted for the fact that ptracers are granted an exception to this
rule.

Cc: Duane Griffin <[email protected]>
Cc: [email protected]
Signed-off-by: Nick Piggin <[email protected]
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1374,7 +1374,7 @@ retry_find:

/* Must recheck i_size under page lock */
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
+ if (unlikely(vmf->pgoff >= size) && vma->vm_mm == current->mm) {
unlock_page(page);
page_cache_release(page);
goto outside_data_content;

2007-10-31 06:42:33

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Wed, Oct 31, 2007 at 12:45:35AM +0000, Duane Griffin wrote:
> Accessing a memory mapped region past the last page containing a valid
> file mapping produces a SIGBUS fault (as it should). Running a program
> that does this under gdb, then accessing the invalid memory from gdb,
> causes it to start consuming 100% CPU and become unkillable. Once in
> that state, SysRq-T doesn't show a stack trace for gdb, although it is
> shown as running and stack traces are dumped for other tasks.


BTW. this has come up for me before, and I have found it useful on
a number of occasions to print the stack of running tasks when they
are looping in the kernel...

Any reason we can't do this?

--
Sysrq+T fails to show the stack trace of a running task. Presumably this
is to avoid a garbled stack, however it can often be useful, and besides
there is no guarantee that the task won't start running in the middle of
show_stack(). If there are any correctness issues, then the archietcture
would have to take further steps to ensure the task is not running.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c 2007-10-31 06:53:22.000000000 +1100
+++ linux-2.6/kernel/sched.c 2007-10-31 06:56:02.000000000 +1100
@@ -4900,8 +4900,7 @@
printk(KERN_CONT "%5lu %5d %6d\n", free,
task_pid_nr(p), task_pid_nr(p->parent));

- if (state != TASK_RUNNING)
- show_stack(p, NULL);
+ show_stack(p, NULL);
}

void show_state_filter(unsigned long state_filter)

2007-10-31 06:56:16

by David Miller

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

From: Nick Piggin <[email protected]>
Date: Wed, 31 Oct 2007 07:42:21 +0100

> Sysrq+T fails to show the stack trace of a running task. Presumably this
> is to avoid a garbled stack, however it can often be useful, and besides
> there is no guarantee that the task won't start running in the middle of
> show_stack(). If there are any correctness issues, then the archietcture
> would have to take further steps to ensure the task is not running.
>
> Signed-off-by: Nick Piggin <[email protected]>

This is useful.

Even more useful would be a show_regs() on the cpu where running tasks
are running. If not a full show_regs() at least a program counter.

That's usually what you're trying to debug and we provide nearly no
way to handle: some task is stuck in a loop in kernel mode and you
need to know exactly where that is.

This is pretty easy to do on sparc64. In fact I can capture remote
cpu registers even when that CPU's interrupts are disabled. I suppose
other arches could do a NMI'ish register capture like this as well.

I have a few bug reports that I can't make more progress on because I
currently can't ask users to do something to fetch the registers on
the seemingly hung processor. This is why I'm harping on this so
much :-)

Anyways, my core suggestion is to add a hook here so platforms can
do the remote register fetch if they want.

2007-10-31 07:41:19

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Tue, Oct 30, 2007 at 11:56:00PM -0700, David Miller wrote:
> From: Nick Piggin <[email protected]>
> Date: Wed, 31 Oct 2007 07:42:21 +0100
>
> > Sysrq+T fails to show the stack trace of a running task. Presumably this
> > is to avoid a garbled stack, however it can often be useful, and besides
> > there is no guarantee that the task won't start running in the middle of
> > show_stack(). If there are any correctness issues, then the archietcture
> > would have to take further steps to ensure the task is not running.
> >
> > Signed-off-by: Nick Piggin <[email protected]>
>
> This is useful.
>
> Even more useful would be a show_regs() on the cpu where running tasks
> are running. If not a full show_regs() at least a program counter.

Yes, that's something I've needed as well. And I doubt that we're
very unusual among kernel developers in this regard ;)


> That's usually what you're trying to debug and we provide nearly no
> way to handle: some task is stuck in a loop in kernel mode and you
> need to know exactly where that is.
>
> This is pretty easy to do on sparc64. In fact I can capture remote
> cpu registers even when that CPU's interrupts are disabled. I suppose
> other arches could do a NMI'ish register capture like this as well.
>
> I have a few bug reports that I can't make more progress on because I
> currently can't ask users to do something to fetch the registers on
> the seemingly hung processor. This is why I'm harping on this so
> much :-)
>
> Anyways, my core suggestion is to add a hook here so platforms can
> do the remote register fetch if they want.

You could possibly even do a generic "best effort" kind of thing with
regular IPIs, that will timeout and continue if some CPUs don't handle
them, and should be pretty easy to get working with existing smp_call_
function stuff. Not exactly clean, but it would be better than nothing.

2007-10-31 07:44:33

by David Miller

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

From: Nick Piggin <[email protected]>
Date: Wed, 31 Oct 2007 08:41:06 +0100

> On Tue, Oct 30, 2007 at 11:56:00PM -0700, David Miller wrote:
> > From: Nick Piggin <[email protected]>
> > Date: Wed, 31 Oct 2007 07:42:21 +0100
> >
> > Anyways, my core suggestion is to add a hook here so platforms can
> > do the remote register fetch if they want.
>
> You could possibly even do a generic "best effort" kind of thing with
> regular IPIs, that will timeout and continue if some CPUs don't handle
> them, and should be pretty easy to get working with existing smp_call_
> function stuff. Not exactly clean, but it would be better than nothing.

Without a doubt.

2007-10-31 10:27:57

by Duane Griffin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On 31/10/2007, Nick Piggin <[email protected]> wrote:
> Well that's probably the best bug report I've ever had, thanks Duane!

Aw, shucks!

> The issue is a silly thinko -- I didn't pay enough attention to the ptrace
> rules in filemap_fault :( partly I think that's because I don't understand
> them and am scared of them ;)
>
> The following minimal patch fixes it here, and should probably be applied to
> 2.6.23 stable as well as 2.6.24. If you could verify it, that would be much
> appreciated.

I can confirm that it fixes the problem, thanks Nick!

Feel free to add:
Tested-by: Duane Griffin <[email protected]>

> However I actually don't really like how this all works. I don't like that
> filemap.c should have to know about ptrace, or exactly what ptrace wants here.
> It's a bit hairy to force insert page into pagecache and pte into pagetables
> here, given the races.
>
> In access_process_vm, can't we just zero fill in the case of a sigbus? Linus?
> That will also avoid changing applicatoin behaviour due to a gdb read...
>
> Thanks,
> Nick
>
> --
> Duane Griffin noticed a 2.6.23 regression that will cause gdb to hang when
> it tries to access the memory of another process beyond i_size.
>
> This is because the solution to the fault vs invalidate race requires that
> we recheck i_size after the page lock is taken. However in that case, I
> had not accounted for the fact that ptracers are granted an exception to this
> rule.
>
> Cc: Duane Griffin <[email protected]>
> Cc: [email protected]
> Signed-off-by: Nick Piggin <[email protected]
> ---
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -1374,7 +1374,7 @@ retry_find:
>
> /* Must recheck i_size under page lock */
> size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> - if (unlikely(vmf->pgoff >= size)) {
> + if (unlikely(vmf->pgoff >= size) && vma->vm_mm == current->mm) {
> unlock_page(page);
> page_cache_release(page);
> goto outside_data_content;
>

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2007-10-31 15:12:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning



On Wed, 31 Oct 2007, Nick Piggin wrote:
>
> However I actually don't really like how this all works. I don't like that
> filemap.c should have to know about ptrace, or exactly what ptrace wants here.

It shouldn't. It should just fail when it fails. Then, handle_mm_fault()
should return an error code, which should cause get_user_pages() to return
an error code. Which should make ptrace just stop.

So I think your patch is wrong. mm/filemap.c should *not* care about who
does the fault. I think the proper patch is something untested like the
appended...

> It's a bit hairy to force insert page into pagecache and pte into pagetables
> here, given the races.

It's also wrong. They shouldn't be in the page cache, since that can cause
problems with truncate etc. Maybe it doesn't any more, but it's reasonable
to believe that a page outside of i_size should not exist.

> In access_process_vm, can't we just zero fill in the case of a sigbus? Linus?
> That will also avoid changing applicatoin behaviour due to a gdb read...

access_process_vm() should just return how many bytes it could fill (which
means a truncated copy - very including zero bytes - for an error), and
the caller should decide what the right thing to do is.

But who knows, maybe I missed something.

Duane? Does this fix things for you?

Linus

---
mm/filemap.c | 13 ++-----------
1 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9940895..188cf5f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1300,7 +1300,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)

size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (vmf->pgoff >= size)
- goto outside_data_content;
+ return VM_FAULT_SIGBUS;

/* If we don't want any read-ahead, don't bother */
if (VM_RandomReadHint(vma))
@@ -1377,7 +1377,7 @@ retry_find:
if (unlikely(vmf->pgoff >= size)) {
unlock_page(page);
page_cache_release(page);
- goto outside_data_content;
+ return VM_FAULT_SIGBUS;
}

/*
@@ -1388,15 +1388,6 @@ retry_find:
vmf->page = page;
return ret | VM_FAULT_LOCKED;

-outside_data_content:
- /*
- * An external ptracer can access pages that normally aren't
- * accessible..
- */
- if (vma->vm_mm == current->mm)
- return VM_FAULT_SIGBUS;
-
- /* Fall through to the non-read-ahead case */
no_cached_page:
/*
* We're only likely to ever get here if MADV_RANDOM is in

2007-10-31 15:19:44

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Wed, Oct 31, 2007 at 08:11:10AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 31 Oct 2007, Nick Piggin wrote:
> >
> > However I actually don't really like how this all works. I don't like that
> > filemap.c should have to know about ptrace, or exactly what ptrace wants here.
>
> It shouldn't. It should just fail when it fails. Then, handle_mm_fault()
> should return an error code, which should cause get_user_pages() to return
> an error code. Which should make ptrace just stop.
>
> So I think your patch is wrong. mm/filemap.c should *not* care about who
> does the fault. I think the proper patch is something untested like the
> appended...

Well the patch is right, in the context of the regression I introduced
(and so it should probably go into 2.6.23).

I'd love to get rid of that outside data content crap if possible in
2.6.24. I think you're the one who has the best feeling for the ptrace
cases we have in the VM, so I trust you ;)


> > It's a bit hairy to force insert page into pagecache and pte into pagetables
> > here, given the races.
>
> It's also wrong. They shouldn't be in the page cache, since that can cause
> problems with truncate etc. Maybe it doesn't any more, but it's reasonable
> to believe that a page outside of i_size should not exist.

No I believe it could still be a problem (and at least, it is quite
fragile).


> > In access_process_vm, can't we just zero fill in the case of a sigbus? Linus?
> > That will also avoid changing applicatoin behaviour due to a gdb read...
>
> access_process_vm() should just return how many bytes it could fill (which
> means a truncated copy - very including zero bytes - for an error), and
> the caller should decide what the right thing to do is.
>
> But who knows, maybe I missed something.

I hope it works.


> Duane? Does this fix things for you?
>
> Linus
>
> ---
> mm/filemap.c | 13 ++-----------
> 1 files changed, 2 insertions(+), 11 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9940895..188cf5f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1300,7 +1300,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>
> size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> if (vmf->pgoff >= size)
> - goto outside_data_content;
> + return VM_FAULT_SIGBUS;
>
> /* If we don't want any read-ahead, don't bother */
> if (VM_RandomReadHint(vma))
> @@ -1377,7 +1377,7 @@ retry_find:
> if (unlikely(vmf->pgoff >= size)) {
> unlock_page(page);
> page_cache_release(page);
> - goto outside_data_content;
> + return VM_FAULT_SIGBUS;
> }
>
> /*
> @@ -1388,15 +1388,6 @@ retry_find:
> vmf->page = page;
> return ret | VM_FAULT_LOCKED;
>
> -outside_data_content:
> - /*
> - * An external ptracer can access pages that normally aren't
> - * accessible..
> - */
> - if (vma->vm_mm == current->mm)
> - return VM_FAULT_SIGBUS;
> -
> - /* Fall through to the non-read-ahead case */
> no_cached_page:
> /*
> * We're only likely to ever get here if MADV_RANDOM is in

2007-10-31 16:00:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning



On Wed, 31 Oct 2007, Nick Piggin wrote:
>
> Well the patch is right, in the context of the regression I introduced
> (and so it should probably go into 2.6.23).

Yeah, it probably is fine for -stable.

And if mine (which actually changes behaviour, in that it makes ptrace get
an access error) causes regressions, I guess we'll have to use that
compatible-with-old-behaviour one for 2.6.24 too.

But I just rebooted and tested - the cleaned-up patch does seem to work
fine, and I get "Cannot access memory at address <xyz>" rather than any
reported problem.

So I think I'll commit my version asap, and see if anybody reports that
they have a situation where they use ptrace() and expect zero back from a
shared mapping past the end.. And if there are issues, we can switch back
to the old broken behaviour with your patch,

Linus

2007-10-31 17:19:33

by Duane Griffin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On 31/10/2007, Linus Torvalds <[email protected]> wrote:
> But I just rebooted and tested - the cleaned-up patch does seem to work
> fine, and I get "Cannot access memory at address <xyz>" rather than any
> reported problem.

I can confirm the same thing here, FWIW.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2007-10-31 22:55:52

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Wed, Oct 31, 2007 at 08:59:41AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 31 Oct 2007, Nick Piggin wrote:
> >
> > Well the patch is right, in the context of the regression I introduced
> > (and so it should probably go into 2.6.23).
>
> Yeah, it probably is fine for -stable.
>
> And if mine (which actually changes behaviour, in that it makes ptrace get
> an access error) causes regressions, I guess we'll have to use that
> compatible-with-old-behaviour one for 2.6.24 too.
>
> But I just rebooted and tested - the cleaned-up patch does seem to work
> fine, and I get "Cannot access memory at address <xyz>" rather than any
> reported problem.
>
> So I think I'll commit my version asap, and see if anybody reports that
> they have a situation where they use ptrace() and expect zero back from a
> shared mapping past the end.. And if there are issues, we can switch back
> to the old broken behaviour with your patch,

No that would be great. Fingers crossed it won't cause any problems.

Thanks, all.

2007-10-31 23:08:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning



On Wed, 31 Oct 2007, Nick Piggin wrote:
>
> No that would be great. Fingers crossed it won't cause any problems.

I actually doubt it will cause problems.

We made much bigger changes to ptrace support when we disallowed writing
to read-only shared memory areas (we used to do the magic per-page COW
thing).

But if people have test-cases for ptrace and/or other magical users of
access_vm_pages() (things like core-dumping comes to mind - although I
think we don't dump pure file mappings at all, do we?) it would certainly
be good to run any such tests on the current -git tree...

Linus

2007-11-01 02:38:16

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 31 Oct 2007, Nick Piggin wrote:
> >
> > No that would be great. Fingers crossed it won't cause any problems.
>
> I actually doubt it will cause problems.
>
> We made much bigger changes to ptrace support when we disallowed writing
> to read-only shared memory areas (we used to do the magic per-page COW
> thing).

Really? No, we still do that magic COW thing which creates anonymous
pages in MAP_SHARED vmas, don't we?

But I think that is pretty arbitrary as well. What the "force" argument to
get_user_pages on a MAP_SHARED vma effectively does is punt the permission
check from the vma permissions to the file permissions. If you've opened the
file WR but mmap()ed PROT_READ, it goes ahead and COWs, wheras if you've
opened the file RDONLY it -EFAULTs.

This seems strange wrong, because the act of COWing seems to be exactly
done in order to *absolve* us of file permissions (if we're COWing, why
should it matter if the file happened to be readonly or not).

Could we just try disallow COW for MAP_SHARED mappings?
Seeing as we're doing the break-ptrace thing this release already ;)
All the stuff I know of that requires forced COW (ie. gdb for breakpoints),
does so on MAP_PRIVATE mappings.

I know this is one of Hugh's bugbears. I don't know whether he's thought
out how to remove it, but I wonder if the appended patch is roughly in
the right direction?

> But if people have test-cases for ptrace and/or other magical users of
> access_vm_pages() (things like core-dumping comes to mind - although I
> think we don't dump pure file mappings at all, do we?) it would certainly
> be good to run any such tests on the current -git tree...

We do for MAP_SHARED|MAP_ANONYMOUS, by the looks.

---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1563,13 +1563,11 @@ static int do_wp_page(struct mm_struct *
reuse = can_share_swap_page(old_page);
unlock_page(old_page);
}
- } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
- (VM_WRITE|VM_SHARED))) {
+ } else if (unlikely((vma->vm_flags & VM_SHARED))) {
/*
- * Only catch write-faults on shared writable pages,
- * read-only shared pages can get COWed by
- * get_user_pages(.write=1, .force=1).
+ * Only catch write-faults on shared writable pages.
*/
+ BUG_ON(!(vma->vm_flags & VM_WRITE));
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
/*
* Notify the address space that the page is about to
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -980,9 +980,11 @@ unsigned long do_mmap_pgoff(struct file
if (locks_verify_locked(inode))
return -EAGAIN;

- vm_flags |= VM_SHARED | VM_MAYSHARE;
- if (!(file->f_mode & FMODE_WRITE))
- vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
+ vm_flags |= VM_MAYSHARE;
+ if (file->f_mode & FMODE_WRITE)
+ vm_flags |= VM_SHARED;
+ if (!(vm_flags & VM_WRITE))
+ vm_flags &= ~VM_MAYWRITE;

/* fall through */
case MAP_PRIVATE:
@@ -998,6 +1000,7 @@ unsigned long do_mmap_pgoff(struct file

if (!file->f_op || !file->f_op->mmap)
return -ENODEV;
+
break;

default:
@@ -1007,6 +1010,8 @@ unsigned long do_mmap_pgoff(struct file
switch (flags & MAP_TYPE) {
case MAP_SHARED:
vm_flags |= VM_SHARED | VM_MAYSHARE;
+ if (!(vm_flags & VM_WRITE))
+ vm_flags &= ~VM_MAYWRITE;
break;
case MAP_PRIVATE:
/*

2007-11-01 15:15:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning



On Thu, 1 Nov 2007, Nick Piggin wrote:

> On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote:
> >
> > We made much bigger changes to ptrace support when we disallowed writing
> > to read-only shared memory areas (we used to do the magic per-page COW
> > thing).
>
> Really? No, we still do that magic COW thing which creates anonymous
> pages in MAP_SHARED vmas, don't we?

No, we don't. I'm pretty sure. It didn't work with the VM cleanups, since
the MAP_SHARED vma's won't be on the anonymous list any more, and cannot
be swapped out.

So now, if you try to write to a read-only shared page through ptrace,
you'll get "Unable to access".

Of course, I didn't really look closely, so maybe I just don't remember
things right..

> > access_vm_pages() (things like core-dumping comes to mind - although I
> > think we don't dump pure file mappings at all, do we?) it would certainly
> > be good to run any such tests on the current -git tree...
>
> We do for MAP_SHARED|MAP_ANONYMOUS, by the looks.

Well, as we should. There's no way for a debugger to get those pages back.
So that all looks sane.

> - vm_flags |= VM_SHARED | VM_MAYSHARE;
> - if (!(file->f_mode & FMODE_WRITE))
> - vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
> + vm_flags |= VM_MAYSHARE;
> + if (file->f_mode & FMODE_WRITE)
> + vm_flags |= VM_SHARED;
> + if (!(vm_flags & VM_WRITE))
> + vm_flags &= ~VM_MAYWRITE;

This looks totally bogus. What was the intent of this patch?

The VM_MAYWRITE flag is *not* supposed to track the VM_WRITE flag: that
would defeat the whole purpose of it! The whole point of that flag is to
say whether mprotect() could turn it into a VM_WRITE mapping, and it
depends on the file mode, not VM_WRITE!

> vm_flags |= VM_SHARED | VM_MAYSHARE;
> + if (!(vm_flags & VM_WRITE))
> + vm_flags &= ~VM_MAYWRITE;

More apparent bogosities.

Linus

2007-11-01 15:47:20

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Thu, Nov 01, 2007 at 08:14:47AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 1 Nov 2007, Nick Piggin wrote:
>
> > On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote:
> > >
> > > We made much bigger changes to ptrace support when we disallowed writing
> > > to read-only shared memory areas (we used to do the magic per-page COW
> > > thing).
> >
> > Really? No, we still do that magic COW thing which creates anonymous
> > pages in MAP_SHARED vmas, don't we?
>
> No, we don't. I'm pretty sure. It didn't work with the VM cleanups, since
> the MAP_SHARED vma's won't be on the anonymous list any more, and cannot
> be swapped out.
>
> So now, if you try to write to a read-only shared page through ptrace,
> you'll get "Unable to access".

No, it COWs it (the file is RW).

I believe do_wp_page will still attach an anon_vma to the vma, which
will make the pte discoverable via rmap.


> Of course, I didn't really look closely, so maybe I just don't remember
> things right..
>
> > > access_vm_pages() (things like core-dumping comes to mind - although I
> > > think we don't dump pure file mappings at all, do we?) it would certainly
> > > be good to run any such tests on the current -git tree...
> >
> > We do for MAP_SHARED|MAP_ANONYMOUS, by the looks.
>
> Well, as we should. There's no way for a debugger to get those pages back.
> So that all looks sane.
>
> > - vm_flags |= VM_SHARED | VM_MAYSHARE;
> > - if (!(file->f_mode & FMODE_WRITE))
> > - vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
> > + vm_flags |= VM_MAYSHARE;
> > + if (file->f_mode & FMODE_WRITE)
> > + vm_flags |= VM_SHARED;
> > + if (!(vm_flags & VM_WRITE))
> > + vm_flags &= ~VM_MAYWRITE;
>
> This looks totally bogus. What was the intent of this patch?
>
> The VM_MAYWRITE flag is *not* supposed to track the VM_WRITE flag: that
> would defeat the whole purpose of it! The whole point of that flag is to
> say whether mprotect() could turn it into a VM_WRITE mapping, and it
> depends on the file mode, not VM_WRITE!

Yeah of course that won't work, stupid...

The intent is to stop get_user_pages from proceeding with a write fault (and
subsequent COW) to readonly shared mappings, when force is set. I think it
can be done simply via get_user_pages(), which is what I should have done
to begin with.

Untested patch follows
---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1031,7 +1031,9 @@ int get_user_pages(struct task_struct *t
}

if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP))
- || !(vm_flags & vma->vm_flags))
+ || !(vm_flags & vma->vm_flags)
+ || (write && ((vma->vm_flags &
+ (VM_SHARED|VM_MAYSHARE)) == VM_MAYSHARE)))
return i ? : -EFAULT;

if (is_vm_hugetlb_page(vma)) {
@@ -1563,13 +1565,11 @@ static int do_wp_page(struct mm_struct *
reuse = can_share_swap_page(old_page);
unlock_page(old_page);
}
- } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
- (VM_WRITE|VM_SHARED))) {
+ } else if (unlikely((vma->vm_flags & VM_SHARED))) {
/*
- * Only catch write-faults on shared writable pages,
- * read-only shared pages can get COWed by
- * get_user_pages(.write=1, .force=1).
+ * Only catch write-faults on shared writable pages.
*/
+ BUG_ON(!(vma->vm_flags & VM_WRITE));
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
/*
* Notify the address space that the page is about to

2007-11-01 16:10:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning



On Thu, 1 Nov 2007, Nick Piggin wrote:
>
> Untested patch follows

Ok, this looks ok.

Except I would remove the VM_MAYSHARE bit from the test.

That whole bit should go, in fact.

We used to make it something different: iirc, a read-only SHARED mapping
was downgraded to a non-shared mapping, because we wanted to avoid some of
the costs we used to have with the VM implementation (actually, I think it
was various filesystems that don't like shared mappings because they don't
have a per-page writeback). But we left the VM_MAYSHARE bit on, to get
/proc/<pid>/mmap things right.

Or something like that. I forget the details. But I *think* we don't
actually need this any more.

But basically, the "right" way to test for shared mappings is historically
to just test the VM_MAYSHARE bit - but not *both* bits. Because VM_SHARE
may have been artificially cleared.

Somebody should double-check my memory.

Linus

2007-11-01 23:56:56

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Thu, Nov 01, 2007 at 09:08:45AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 1 Nov 2007, Nick Piggin wrote:
> >
> > Untested patch follows
>
> Ok, this looks ok.
>
> Except I would remove the VM_MAYSHARE bit from the test.

But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
uses this for inserting breakpoints (but fortunately, a COW page in a
MAP_PRIVATE mapping is a much more natural thing for the VM).


> That whole bit should go, in fact.
>
> We used to make it something different: iirc, a read-only SHARED mapping
> was downgraded to a non-shared mapping, because we wanted to avoid some of
> the costs we used to have with the VM implementation (actually, I think it
> was various filesystems that don't like shared mappings because they don't
> have a per-page writeback). But we left the VM_MAYSHARE bit on, to get
> /proc/<pid>/mmap things right.
>
> Or something like that. I forget the details. But I *think* we don't
> actually need this any more.
>
> But basically, the "right" way to test for shared mappings is historically
> to just test the VM_MAYSHARE bit - but not *both* bits. Because VM_SHARE
> may have been artificially cleared.

I think you're right -- VM_MAYSHARE is basically testing for MAP_SHARED.
I just don't know exactly what you're proposing here.

Thanks,
Nick

2007-11-02 01:18:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning



On Fri, 2 Nov 2007, Nick Piggin wrote:
>
> But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
> uses this for inserting breakpoints (but fortunately, a COW page in a
> MAP_PRIVATE mapping is a much more natural thing for the VM).

Yes, I phrased that badly. I meant that I'd be happier if we got rid of
VM_MAYSHARE entirely, and just used VM_SHARED. I thought we already made
them always be the same (and any VM_MAYSHARE use is historical).

Linus

2007-11-02 05:02:18

by David Miller

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

From: David Miller <[email protected]>
Date: Wed, 31 Oct 2007 00:44:25 -0700 (PDT)

> From: Nick Piggin <[email protected]>
> Date: Wed, 31 Oct 2007 08:41:06 +0100
>
> > You could possibly even do a generic "best effort" kind of thing with
> > regular IPIs, that will timeout and continue if some CPUs don't handle
> > them, and should be pretty easy to get working with existing smp_call_
> > function stuff. Not exactly clean, but it would be better than nothing.
>
> Without a doubt.

Putting my code where my mouth is, here is an example implementation
of a special SysRQ "g" "dump regs globally" debugging tool for
sparc64.

The only thing that has to happen is the SysRQ trigger. So if you can
either SysRQ-'g' at the console or "echo 'g' >/proc/sysrq-trigger" you
can get the registers from the cpus in the system.

The only case the remote cpu registers would not be capturable would
be if they were stuck looping in the trap entry, trap exit, or low
level TLB handler code.

This means that even if some cpu is stuck in a spinlock loop with
interrupts disabled, you'd see it with this thing. The way it works
is that cross cpu vectored interrupts are disabled independently of
the processor interrupt level on sparc64.

This version just records the absolute minimum processor state, it
could be trivially extended to record all of pt_regs etc.

Even on my 64 cpu niagara box, the output is reasonable and fills up
one full screen of my console window. Full pt_regs dumps are too
much.

diff --git a/arch/sparc64/kernel/process.c b/arch/sparc64/kernel/process.c
index ca7cdfd..67bf91d 100644
--- a/arch/sparc64/kernel/process.c
+++ b/arch/sparc64/kernel/process.c
@@ -1,7 +1,6 @@
-/* $Id: process.c,v 1.131 2002/02/09 19:49:30 davem Exp $
- * arch/sparc64/kernel/process.c
+/* arch/sparc64/kernel/process.c
*
- * Copyright (C) 1995, 1996 David S. Miller ([email protected])
+ * Copyright (C) 1995, 1996, 2007 David S. Miller ([email protected])
* Copyright (C) 1996 Eddie C. Dost ([email protected])
* Copyright (C) 1997, 1998 Jakub Jelinek ([email protected])
*/
@@ -31,6 +30,7 @@
#include <linux/tick.h>
#include <linux/init.h>
#include <linux/cpu.h>
+#include <linux/sysrq.h>

#include <asm/oplib.h>
#include <asm/uaccess.h>
@@ -48,6 +48,7 @@
#include <asm/unistd.h>
#include <asm/hypervisor.h>
#include <asm/sstate.h>
+#include <asm/irq_regs.h>

/* #define VERBOSE_SHOWREGS */

@@ -388,6 +389,76 @@ void show_regs32(struct pt_regs32 *regs)
regs->u_regs[15]);
}

+#ifdef CONFIG_MAGIC_SYSRQ
+struct global_reg_snapshot {
+ unsigned long tstate;
+ unsigned long tpc;
+ unsigned long tnpc;
+ struct thread_info *thread;
+} global_reg_snapshot[NR_CPUS];
+static DEFINE_SPINLOCK(global_reg_snapshot_lock);
+
+static void sysrq_handle_globreg(int key, struct tty_struct *tty)
+{
+ struct pt_regs *regs = get_irq_regs();
+#ifdef CONFIG_KALLSYMS
+ char buffer[KSYM_SYMBOL_LEN];
+#endif
+ unsigned long flags;
+ int cpu;
+
+ spin_lock_irqsave(&global_reg_snapshot_lock, flags);
+ cpu = raw_smp_processor_id();
+ if (regs) {
+ global_reg_snapshot[cpu].tstate = regs->tstate;
+ global_reg_snapshot[cpu].tpc = regs->tpc;
+ global_reg_snapshot[cpu].tnpc = regs->tnpc;
+ } else {
+ global_reg_snapshot[cpu].tstate = 0;
+ global_reg_snapshot[cpu].tpc = 0;
+ global_reg_snapshot[cpu].tnpc = 0;
+ }
+ global_reg_snapshot[cpu].thread = current_thread_info();
+
+ smp_fetch_global_regs();
+
+ for_each_online_cpu(cpu) {
+ struct global_reg_snapshot *gp = &global_reg_snapshot[cpu];
+ struct thread_info *tp = gp->thread;
+
+ printk("%c CPU[%3d]: TSTATE[%016lx] TPC[%016lx] TNPC[%016lx] TASK[%s:%d]\n",
+ (cpu == raw_smp_processor_id() ? '*' : ' '), cpu,
+ gp->tstate, gp->tpc, gp->tnpc,
+ ((tp && tp->task) ? tp->task->comm : "NULL"),
+ ((tp && tp->task) ? tp->task->pid : -1));
+#ifdef CONFIG_KALLSYMS
+ if ((gp->tstate & TSTATE_PRIV) && (gp->tpc != 0UL)) {
+ sprint_symbol(buffer, gp->tpc);
+ printk(" TPC[%s]\n", buffer);
+ }
+#endif
+ }
+
+ memset(global_reg_snapshot, 0, sizeof(global_reg_snapshot));
+
+ spin_unlock_irqrestore(&global_reg_snapshot_lock, flags);
+}
+
+static struct sysrq_key_op sparc_globalreg_op = {
+ .handler = sysrq_handle_globreg,
+ .help_msg = "Globalregs",
+ .action_msg = "Show Global CPU Regs",
+};
+
+static int __init sparc_globreg_init(void)
+{
+ return register_sysrq_key('g', &sparc_globalreg_op);
+}
+
+core_initcall(sparc_globreg_init);
+
+#endif
+
unsigned long thread_saved_pc(struct task_struct *tsk)
{
struct thread_info *ti = task_thread_info(tsk);
diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
index c73b7a4..cbedf27 100644
--- a/arch/sparc64/kernel/smp.c
+++ b/arch/sparc64/kernel/smp.c
@@ -894,6 +894,7 @@ extern unsigned long xcall_flush_tlb_mm;
extern unsigned long xcall_flush_tlb_pending;
extern unsigned long xcall_flush_tlb_kernel_range;
extern unsigned long xcall_report_regs;
+extern unsigned long xcall_fetch_glob_regs;
extern unsigned long xcall_receive_signal;
extern unsigned long xcall_new_mmu_context_version;

@@ -1064,6 +1065,11 @@ void smp_report_regs(void)
smp_cross_call(&xcall_report_regs, 0, 0, 0);
}

+void smp_fetch_global_regs(void)
+{
+ smp_cross_call(&xcall_fetch_glob_regs, 0, 0, 0);
+}
+
/* We know that the window frames of the user have been flushed
* to the stack before we get here because all callers of us
* are flush_tlb_*() routines, and these run after flush_cache_*()
diff --git a/arch/sparc64/mm/ultra.S b/arch/sparc64/mm/ultra.S
index 737c269..7a079ed 100644
--- a/arch/sparc64/mm/ultra.S
+++ b/arch/sparc64/mm/ultra.S
@@ -1,7 +1,6 @@
-/* $Id: ultra.S,v 1.72 2002/02/09 19:49:31 davem Exp $
- * ultra.S: Don't expand these all over the place...
+/* ultra.S: Don't expand these all over the place...
*
- * Copyright (C) 1997, 2000 David S. Miller ([email protected])
+ * Copyright (C) 1997, 2000, 2007 David S. Miller ([email protected])
*/

#include <asm/asi.h>
@@ -15,6 +14,7 @@
#include <asm/thread_info.h>
#include <asm/cacheflush.h>
#include <asm/hypervisor.h>
+#include <asm/cpudata.h>

/* Basically, most of the Spitfire vs. Cheetah madness
* has to do with the fact that Cheetah does not support
@@ -523,6 +523,27 @@ xcall_report_regs:
b rtrap_xcall
ldx [%sp + PTREGS_OFF + PT_V9_TSTATE], %l1

+ .globl xcall_fetch_glob_regs
+xcall_fetch_glob_regs:
+ sethi %hi(global_reg_snapshot), %g1
+ or %g1, %lo(global_reg_snapshot), %g1
+ __GET_CPUID(%g2)
+ sllx %g2, 5, %g3
+ add %g1, %g3, %g1
+ rdpr %tstate, %g7
+ stx %g7, [%g1 + 0x00]
+ rdpr %tpc, %g7
+ stx %g7, [%g1 + 0x08]
+ rdpr %tnpc, %g7
+ stx %g7, [%g1 + 0x10]
+ sethi %hi(trap_block), %g7
+ or %g7, %lo(trap_block), %g7
+ sllx %g2, TRAP_BLOCK_SZ_SHIFT, %g2
+ add %g7, %g2, %g7
+ ldx [%g7 + TRAP_PER_CPU_THREAD], %g3
+ stx %g3, [%g1 + 0x18]
+ retry
+
#ifdef DCACHE_ALIASING_POSSIBLE
.align 32
.globl xcall_flush_dcache_page_cheetah
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index 39cc318..7f871a5 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -335,6 +335,7 @@ static struct sysrq_key_op *sysrq_key_table[36] = {
&sysrq_term_op, /* e */
&sysrq_moom_op, /* f */
/* g: May be registered by ppc for kgdb */
+ /* May be registered by sparc for global register dump */
NULL, /* g */
NULL, /* h */
&sysrq_kill_op, /* i */
diff --git a/include/asm-sparc64/smp.h b/include/asm-sparc64/smp.h
index e8a96a3..29393e3 100644
--- a/include/asm-sparc64/smp.h
+++ b/include/asm-sparc64/smp.h
@@ -1,6 +1,6 @@
/* smp.h: Sparc64 specific SMP stuff.
*
- * Copyright (C) 1996 David S. Miller ([email protected])
+ * Copyright (C) 1996, 2007 David S. Miller ([email protected])
*/

#ifndef _SPARC64_SMP_H
@@ -43,6 +43,8 @@ extern int hard_smp_processor_id(void);
extern void smp_fill_in_sib_core_maps(void);
extern void cpu_play_dead(void);

+extern void smp_fetch_global_regs(void);
+
#ifdef CONFIG_HOTPLUG_CPU
extern int __cpu_disable(void);
extern void __cpu_die(unsigned int cpu);
@@ -54,6 +56,7 @@ extern void __cpu_die(unsigned int cpu);

#define hard_smp_processor_id() 0
#define smp_fill_in_sib_core_maps() do { } while (0)
+#define smp_fetch_global_regs() do { } while (0)

#endif /* !(CONFIG_SMP) */

2007-11-02 06:30:41

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Thu, Nov 01, 2007 at 06:17:42PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 2 Nov 2007, Nick Piggin wrote:
> >
> > But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
> > uses this for inserting breakpoints (but fortunately, a COW page in a
> > MAP_PRIVATE mapping is a much more natural thing for the VM).
>
> Yes, I phrased that badly. I meant that I'd be happier if we got rid of
> VM_MAYSHARE entirely, and just used VM_SHARED. I thought we already made
> them always be the same (and any VM_MAYSHARE use is historical).

Oh yeah, I think it would probably be clearer to use VM_SHARED == MAP_SHARED,
and test the write permission explicitly. Though there could be something
I missed that makes it not as easy as it sounds... probably something best
left for Hugh ;)

2007-11-02 10:46:11

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

On Thu, Nov 01, 2007 at 10:02:04PM -0700, David Miller wrote:
> From: David Miller <[email protected]>
> Date: Wed, 31 Oct 2007 00:44:25 -0700 (PDT)
>
> > From: Nick Piggin <[email protected]>
> > Date: Wed, 31 Oct 2007 08:41:06 +0100
> >
> > > You could possibly even do a generic "best effort" kind of thing with
> > > regular IPIs, that will timeout and continue if some CPUs don't handle
> > > them, and should be pretty easy to get working with existing smp_call_
> > > function stuff. Not exactly clean, but it would be better than nothing.
> >
> > Without a doubt.
>
> Putting my code where my mouth is, here is an example implementation
> of a special SysRQ "g" "dump regs globally" debugging tool for
> sparc64.
>
> The only thing that has to happen is the SysRQ trigger. So if you can
> either SysRQ-'g' at the console or "echo 'g' >/proc/sysrq-trigger" you
> can get the registers from the cpus in the system.
>
> The only case the remote cpu registers would not be capturable would
> be if they were stuck looping in the trap entry, trap exit, or low
> level TLB handler code.
>
> This means that even if some cpu is stuck in a spinlock loop with
> interrupts disabled, you'd see it with this thing. The way it works
> is that cross cpu vectored interrupts are disabled independently of
> the processor interrupt level on sparc64.

That's really sweet. I'd better put my code where my mouth is as well...
this patch seems to be about the best we can do as a generic fallback
without getting a dedicated vector (which has to be done in arch
code anyway).

Maybe it can be improved, I don't know... I also don't know if I'm doing
exactly the right thing with pt_regs, but it seems to work.

One thing I'm doing is just dumping a single CPU, on a rotational basis.
Don't know whether that's better or worse, but just an idea.

Now I don't know what's the best way for an arch to override this with
their own enhanced handler (an ARCH_HAVE ifdef?)

Index: linux-2.6/arch/x86/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -504,8 +504,10 @@ asmlinkage void smp_reschedule_interrupt
add_pda(irq_resched_count, 1);
}

-asmlinkage void smp_call_function_interrupt(void)
+asmlinkage void smp_call_function_interrupt(struct pt_regs *regs)
{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
void (*func) (void *info) = call_data->func;
void *info = call_data->info;
int wait = call_data->wait;
@@ -515,7 +517,7 @@ asmlinkage void smp_call_function_interr
* Notify initiating CPU that I've grabbed the data and am
* about to execute the function
*/
- mb();
+ smp_rmb();
atomic_inc(&call_data->started);
/*
* At this point the info structure may be out of scope unless wait==1
@@ -526,8 +528,10 @@ asmlinkage void smp_call_function_interr
add_pda(irq_call_count, 1);
irq_exit();
if (wait) {
- mb();
+ smp_mb();
atomic_inc(&call_data->finished);
}
+
+ set_irq_regs(old_regs);
}

Index: linux-2.6/drivers/char/sysrq.c
===================================================================
--- linux-2.6.orig/drivers/char/sysrq.c
+++ linux-2.6/drivers/char/sysrq.c
@@ -196,12 +196,72 @@ static struct sysrq_key_op sysrq_showloc
#define sysrq_showlocks_op (*(struct sysrq_key_op *)0)
#endif

-static void sysrq_handle_showregs(int key, struct tty_struct *tty)
+static void show_registers(struct pt_regs *regs)
{
- struct pt_regs *regs = get_irq_regs();
if (regs)
show_regs(regs);
}
+
+static void show_cpu_registers(void)
+{
+ show_registers(get_irq_regs());
+}
+
+static void show_global_regs(void *arg)
+{
+ show_cpu_registers();
+}
+
+static DEFINE_SPINLOCK(global_rotate_lock);
+static cpumask_t global_rotate_cpu;
+
+static void sysrq_handle_global_showregs_work_fn(struct work_struct *work)
+{
+ int cpu, next, thiscpu;
+
+ spin_lock(&global_rotate_lock);
+ if (cpus_weight(global_rotate_cpu) == 0)
+ cpu = smp_processor_id();
+ else
+ cpu = first_cpu(global_rotate_cpu);
+
+ next = next_cpu(cpu, cpu_online_map);
+ if (next == NR_CPUS)
+ next = first_cpu(cpu_online_map);
+ cpu_clear(cpu, global_rotate_cpu);
+ cpu_set(next, global_rotate_cpu);
+ BUG_ON(cpus_weight(global_rotate_cpu) != 1);
+ spin_unlock(&global_rotate_lock);
+
+ thiscpu = get_cpu();
+ if (cpu == thiscpu)
+ show_registers(task_pt_regs(current));
+ else {
+ if (smp_call_function_single(cpu, show_global_regs, NULL, 1, 1)
+ == -1) {
+ printk("Could not interrogate CPU registers\n");
+ return;
+ }
+ }
+ put_cpu();
+}
+static DECLARE_WORK(global_showregs_work, sysrq_handle_global_showregs_work_fn);
+static void sysrq_handle_global_showregs(int key, struct tty_struct *tty)
+{
+ schedule_work(&global_showregs_work);
+}
+static struct sysrq_key_op sysrq_global_showregs_op = {
+ .handler = sysrq_handle_global_showregs,
+ .help_msg = "showGlobalPc",
+ .action_msg = "Show Global CPU Regs",
+ .enable_mask = SYSRQ_ENABLE_DUMP,
+};
+
+
+static void sysrq_handle_showregs(int key, struct tty_struct *tty)
+{
+ show_cpu_registers();
+}
static struct sysrq_key_op sysrq_showregs_op = {
.handler = sysrq_handle_showregs,
.help_msg = "showPc",
@@ -336,7 +396,7 @@ static struct sysrq_key_op *sysrq_key_ta
&sysrq_term_op, /* e */
&sysrq_moom_op, /* f */
/* g: May be registered by ppc for kgdb */
- NULL, /* g */
+ &sysrq_global_showregs_op, /* g */
NULL, /* h */
&sysrq_kill_op, /* i */
NULL, /* j */

2007-11-02 15:37:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning


* Nick Piggin <[email protected]> wrote:

> > This means that even if some cpu is stuck in a spinlock loop with
> > interrupts disabled, you'd see it with this thing. The way it works
> > is that cross cpu vectored interrupts are disabled independently of
> > the processor interrupt level on sparc64.
>
> That's really sweet. I'd better put my code where my mouth is as
> well... this patch seems to be about the best we can do as a generic
> fallback without getting a dedicated vector (which has to be done in
> arch code anyway).

btw., in -rt we had something like this implemented as an NMI mechanism
(for x86) - that way it's guaranteed that we get some output even from
locked up CPUs.

Ingo