While looking at the new bits'n'pieces which has appeared in 2.5.71, I
noticed the following in alpha and ia64:
#define alpha_task_regs(task) \
((struct pt_regs *) ((long) (task)->thread_info + 2*PAGE_SIZE) - 1)
#define force_successful_syscall_return() (alpha_task_regs(current)->r0 = 0)
# define ia64_task_regs(t) (((struct pt_regs *) ((char *) (t) + IA64_STK_OFFSET)) - 1)
#define force_successful_syscall_return() \
do { \
ia64_task_regs(current)->r8 = 0; \
} while (0)
I don't know what happens on these architectures, but I have a suspicion
that there is a case which the above will fail, maybe with dramatic
consequences.
Consider what happens when a userspace program is started from kernel
space, eg the init(8) or hotplug programs. In these, we call execve()
from within kernel space function. This implies that we have some
frames already on the stack.
AFAIK, sys_execve() does not ensure that the kernel stack will be empty
before starting the user space thread, so these programs are running with
a slightly reduced kernel stack.
In turn, this means that the user registers are not stored at the top
of the kernel stack when the user space program subsequently calls a
kernel system call, which means the *_task_regs() macro doesn't point
at the saved user registers.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
On Sun, Jun 15, 2003 at 07:36:04PM +0100, Russell King wrote:
> AFAIK, sys_execve() does not ensure that the kernel stack will be empty
> before starting the user space thread, so these programs are running with
> a slightly reduced kernel stack.
Indeed. This is fixed in a rewrite I have of entry.S, but
that's not particularly stable at the moment...
r~
Russell King writes:
> #define force_successful_syscall_return() \
> do { \
> ia64_task_regs(current)->r8 = 0; \
> } while (0)
>
> I don't know what happens on these architectures, but I have a suspicion
> that there is a case which the above will fail, maybe with dramatic
> consequences.
On PPC, I am going to use a bit in the thread_info flags field to
indicate that the current syscall should not return an error. The
syscall entry and exit paths already look at the thread_info flags
(testing the syscall trace bit, among others) so it's convenient to
have the "no error" flag there too. The flag bit gets cleared on
syscall entry, set by force_successful_syscall_return() and tested on
syscall exit. The only restriction is that kernel code should not do
a system call between where it calls force_successful_syscall_return
and where it returns from the syscall. But I don't believe we ever do
recursive system calls anyway, so that should be fine.
Regards,
Paul.
>>>>> On Sun, 15 Jun 2003 19:36:04 +0100, Russell King <[email protected]> said:
Russell> Consider what happens when a userspace program is started
Russell> from kernel space, eg the init(8) or hotplug programs. In
Russell> these, we call execve() from within kernel space function.
Russell> This implies that we have some frames already on the stack.
Russell> AFAIK, sys_execve() does not ensure that the kernel stack
Russell> will be empty before starting the user space thread, so
Russell> these programs are running with a slightly reduced kernel
Russell> stack.
Russell> In turn, this means that the user registers are not stored
Russell> at the top of the kernel stack when the user space program
Russell> subsequently calls a kernel system call, which means the
Russell> *_task_regs() macro doesn't point at the saved user
Russell> registers.
That's a limitation that was described in the change log entry:
The only limitation of force_successful_syscall_return() is
that it doesn't help with system calls performed by the
kernel. But the kernel does that so rarely and for such a
limited set of syscalls that this is not a real problem.
Alpha and ia64 have used pt_regs for "force-success" purposes for a
long time, but if you want to add support to another platform, I'd
also recommend using the task_info instead. Dave Miller proposed a
scheme that could work nicely: force_success flag is off by default
and turned on by force_successful_syscall_return(). When you
"consume" the flag in the syscall exit path, you turn it off
afterwards. The advantage of this scheme is that you don't have any
extra stores on the performance-critical syscall entry path. The
disadvantage is that you'll have to check the flag even when the
return-value is non-negative (probably not much of a disadvantage if
you store the flag along with the other flags).
Having said that, there is one real problem: Linus pointed out that
the current scheme may be in correct for certain platforms: if
sizeof(loff_t) > sizeof(off_t) then drivers/char/mem.c:memory_lseek()
may do force_successful_syscall_return() yet sys_lseek() would have to
return EOVERFLOW if it turns out that the offset doesn't fit in off_t.
I think this would only affect 32-bit platforms with some sort of
physical address-extensions. So it would affect x86, but perhaps
nothing else (but of course, x86 doesn't use
force_successful_syscall() anyhow, so for now, that's OK).
The clean way to fix this would probably be to return the offset via a
pointer argument.
--david
On Mon, Jun 16, 2003 at 10:38:27AM -0700, David Mosberger wrote:
> >>>>> On Sun, 15 Jun 2003 19:36:04 +0100, Russell King <[email protected]> said:
>
> Russell> Consider what happens when a userspace program is started
> Russell> from kernel space, eg the init(8) or hotplug programs. In
> Russell> these, we call execve() from within kernel space function.
> Russell> This implies that we have some frames already on the stack.
>
> Russell> AFAIK, sys_execve() does not ensure that the kernel stack
> Russell> will be empty before starting the user space thread, so
> Russell> these programs are running with a slightly reduced kernel
> Russell> stack.
>
> Russell> In turn, this means that the user registers are not stored
> Russell> at the top of the kernel stack when the user space program
> Russell> subsequently calls a kernel system call, which means the
> Russell> *_task_regs() macro doesn't point at the saved user
> Russell> registers.
>
> That's a limitation that was described in the change log entry:
>
> The only limitation of force_successful_syscall_return() is
> that it doesn't help with system calls performed by the
> kernel. But the kernel does that so rarely and for such a
> limited set of syscalls that this is not a real problem.
I'm not actually talking about subsequent syscalls issued by the kernel.
I'm talking about stuff like init, bash, and the module tools. If
any of these call any of the affected syscalls which expect user
registers to be at the top of the kernel stack, they'll be accessing
the wrong data. A corrected comment would be:
The only limitation of force_successful_syscall_return() is
that it doesn't help with system calls performed by the
kernel or user threads exec'd from the kernel.
I'd suggest such a comment is added to force_successful_syscall_return()
to ensure that anyone thinking it'll work for all user space processes
is sufficiently deterred.
> Alpha and ia64 have used pt_regs for "force-success" purposes for a
> long time, but if you want to add support to another platform, I'd
> also recommend using the task_info instead.
Oh, I'm currently not thinking about implementing this on ARM; this
touches a similar area which I investigated a number of months ago.
I through out the idea of accessing user registers for user space
programs at the top of the kernel stack because it does not work for
processes exec'd from kernel space.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
On Mon, 2003-06-16 at 10:55, Russell King wrote:
> I'm not actually talking about subsequent syscalls issued by the kernel.
> I'm talking about stuff like init, bash, and the module tools.
Wrong, after the go for the first time into user space, the
next trap into the kernel will put the pt_regs at the top at
the stack where we expect it to be.
--
David S. Miller <[email protected]>
>>>>> On Mon, 16 Jun 2003 18:55:49 +0100, Russell King <[email protected]> said:
Russell> I through out the idea of accessing user registers for user space
Russell> programs at the top of the kernel stack because it does not work for
Russell> processes exec'd from kernel space.
It doesn't work for the execve() itself, but it works for all
subsequent syscalls. force_successful_syscall() not working for
execve() is of course not a problem, since it returns only on error.
--david
David S. Miller wrote:
> On Mon, 2003-06-16 at 10:55, Russell King wrote:
>
>>I'm not actually talking about subsequent syscalls issued by the kernel.
>>I'm talking about stuff like init, bash, and the module tools.
>
>
> Wrong, after the go for the first time into user space, the
> next trap into the kernel will put the pt_regs at the top at
> the stack where we expect it to be.
>
I was facing a simillar problem with ptrace on Alpha (ptrace on alpha
expect the pt_regs at current + 2*PAGE_SIZE for 2.4. kernel ) w.r.t
http://www.openssi.org project. What i found was that even after we return to
user space subsequent syscalls are not putting pt_regs at that offset. I
guess while entering the kernel kernel stack pointer always point to
value stored in thread_struct.ksp ?
-aneesh
>>>>> On Tue, 17 Jun 2003 12:24:23 +0530, "Aneesh Kumar K.V" <[email protected]> said:
Aneesh> I was facing a simillar problem with ptrace on Alpha (ptrace
Aneesh> on alpha expect the pt_regs at current + 2*PAGE_SIZE for
Aneesh> 2.4. kernel ) w.r.t http://www.openssi.org project. What i found
Aneesh> was that even after we return to user space subsequent
Aneesh> syscalls are not putting pt_regs at that offset. I guess
Aneesh> while entering the kernel kernel stack pointer always point
Aneesh> to value stored in thread_struct.ksp ?
If a platform doesn't start with an empty kernel stack on entry from
user-space, that platform will be wasting (precious) stack space and
ptrace() most likely won't work reliably. Personally, I'd consider
such behavior a bug, but I suppose it is to some degree a
platform-choice.
--david
From: David Mosberger <[email protected]>
Date: Tue, 17 Jun 2003 12:01:12 -0700
Personally, I'd consider such behavior a bug,
Me too.
On Tue, Jun 17, 2003 at 12:01:12PM -0700, David Mosberger wrote:
> Aneesh> I was facing a simillar problem with ptrace on Alpha (ptrace
> Aneesh> on alpha expect the pt_regs at current + 2*PAGE_SIZE for
> Aneesh> 2.4. kernel ) w.r.t http://www.openssi.org project. What i found
> Aneesh> was that even after we return to user space subsequent
> Aneesh> syscalls are not putting pt_regs at that offset. I guess
> Aneesh> while entering the kernel kernel stack pointer always point
> Aneesh> to value stored in thread_struct.ksp ?
>
> If a platform doesn't start with an empty kernel stack on entry from
> user-space, that platform will be wasting (precious) stack space and
> ptrace() most likely won't work reliably. Personally, I'd consider
> such behavior a bug...
So would I. It's now fixed.
r~