Hi -
I'm experimenting with some code to track down stack overruns in the
kernel, and I've stumbled across some stuff in the i386 kernel_thread
code that strikes me as very suspicious. This is 2.4.6.
First off, here's kernel_thread from arch/i386/kernel/process.c:
int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
        long retval, d0;

        __asm__ __volatile__(
                "movl %%esp,%%esi\n\t"
                "int $0x80\n\t"         /* Linux/i386 system call */
                "cmpl %%esp,%%esi\n\t"  /* child or parent? */
                "je 1f\n\t"             /* parent - jump */
                ... stuff omitted ...
                "1:\t"
                :"=&a" (retval), "=&S" (d0)
                :"0" (__NR_clone), "i" (__NR_exit),
                 "r" (arg), "r" (fn),
                 "b" (flags | CLONE_VM)
                : "memory");
        return retval;
}
The register constraints make sure that the "a" register (eax) is
operand 0 and contains __NR_clone to start with. The "b" register (ebx)
contains (flags | CLONE_VM). We save the stack pointer in ESI and execute
INT 0x80, which (combined with the __NR_clone in eax) lands us in
sys_clone (same file):
asmlinkage int sys_clone(struct pt_regs regs)
{
        unsigned long clone_flags;
        unsigned long newsp;

        clone_flags = regs.ebx;
        newsp = regs.ecx;
        if (!newsp)
                newsp = regs.esp;
        return do_fork(clone_flags, newsp, &regs, 0);
}
The first thing I notice is that this function refers not only to the
clone flags in ebx, but also to a "newsp" in ecx - and ecx went
completely unmentioned in kernel_thread()! A disassembly of
kernel_thread shows that "arg" winds up in ecx before the system call,
so I guess this is what gets passed to do_fork(), where (I think) it
ultimately ends up being the child's stack pointer.
In the case of bdflush_init() (end of fs/buffer.c), what gets passed in
as "arg" is the address of a semaphore on the stack - the only variable
allocated by the function. That means that the child's stack pointer
starts at the bottom of the parent's stack in bdflush_init() and grows
down from there. And if the parent never goes deeper into its stack
than bdflush_init(), I guess it works - sort of.
Anyway, I'm confused. My analysis might be wrong, since I don't spend
that much time in the Linux kernel, but bottom line - doesn't
kernel_thread() need to allocate stack space for the child? I mean,
even if everything else is shared, doesn't the child at least need its
own stack?
I need to stare at this more, but maybe somebody else can explain what's
going on here. At the very least, I think kernel_thread() needs to
explicitly specify what goes into the ECX register, because it looks to
me like it's just the luck of the compiler's draw...
--
-bwb
Brent Baccala
[email protected]
==============================================================================
For news from freesoft.org, subscribe to [email protected]:
mailto:[email protected]?subject=subscribe&body=subscribe
==============================================================================
On 18 Jul 01 at 3:16, Brent Baccala wrote:
> The first thing I notice is that this function refers not only to the
> clone flags in ebx, but also to a "newsp" in ecx - and ecx went
> completely unmentioned in kernel_thread()! A disassembly of
>
> Anyway, I'm confused. My analysis might be wrong, since I don't spend
> that much time in the Linux kernel, but bottom line - doesn't
> kernel_thread() need to allocate stack space for the child? I mean,
> even if everything else is shared, doesn't the child at least need its
> own stack?
ecx specifies where the userspace stack lives, not the kernel one, and
each process gets its own kernel stack automagically. As you must never
return to userspace from kernel_thread(), this is not a problem.
Because exiting from kernel_thread() to userspace is not a trivial
task, I do not think that is worth the effort.
If you are in doubt, you can trace the code down to copy_thread. It copies
the parent's registers as saved on entry via int 0x80, and then uses
ret_from_fork as the child's return path. So the child returns to the same
place the call was made from: in the kernel_thread() case, into the kernel,
with esp near the top of the kernel stack. The value passed for esp into
clone() is lost in that case: it was stored in the esp field of the return
stack frame (oldesp in entry.S), which the processor does not use because
it does not change its CPL to userspace (stack->cs & 3 is equal to
%cs & 3, so the stack pointer is not fetched from the stack).
The only problem I see is that we could give 8 more bytes to the child by
doing add $8,%esp in the child path in kernel_thread(), as currently we
leave these 8 bytes (oldesp, oldss) filled with garbage. But on the other
hand, if 8 bytes is all that saves you from a stack overflow, you are
in serious trouble anyway (in Linux, of course; 8 bytes is half of
your stack on i8048). And by leaving it this way we ensure that the child
has a 16-byte aligned stack on entry, which can speed up the code a bit
too.
Best regards,
Petr Vandrovec
[email protected]
Petr Vandrovec wrote:
>
> On 18 Jul 01 at 3:16, Brent Baccala wrote:
>
> > The first thing I notice is that this function refers not only to the
> > clone flags in ebx, but also to a "newsp" in ecx - and ecx went
> > completely unmentioned in kernel_thread()! A disassembly of
> >
> > Anyway, I'm confused. My analysis might be wrong, since I don't spend
> > that much time in the Linux kernel, but bottom line - doesn't
> > kernel_thread() need to allocate stack space for the child? I mean,
> > even if everything else is shared, doesn't the child at least need its
> > own stack?
>
> ecx specifies where userspace stack lives, not kernel space one, and
> each process gets its own kernel stack automagically. As you must not
> ever return to userspace from kernel_thread(), it is not a problem.
> Because of exiting from kernel_thread() to userspace is not trivial
> task, I do not think that is worth of effort.
OK, now I see it. The kernel stack shares an allocation with the task
structure: do_fork() starts by allocating a block (two pages, 8 KB, on
i386) and casting its base to a struct task_struct. The copy_thread
code looks past the end of the task_struct and sets up esp0 to point to
the end of that block, so the stack grows down toward the task_struct.
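
To make that layout concrete, here is a small user-space sketch of it.
None of this is kernel code: struct task_model and the three functions
are hypothetical stand-ins, and the 8 KB block size is my assumption
about THREAD_SIZE on 2.4/i386. The masking in task_from_sp() is meant to
mirror how the kernel finds the current task from esp, as far as I
understand it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed size of the combined task structure + kernel stack block. */
#define THREAD_SIZE 8192u

/* Hypothetical stand-in for struct task_struct (the real one is far larger). */
struct task_model {
    int pid;
    uintptr_t esp0;  /* initial kernel stack pointer: one past the block's end */
};

/* Allocate one block, aligned to its own size, with the task at its base. */
struct task_model *alloc_task_model(void)
{
    struct task_model *p = aligned_alloc(THREAD_SIZE, THREAD_SIZE);

    if (p == NULL)
        return NULL;
    assert(((uintptr_t)p & (THREAD_SIZE - 1)) == 0);
    p->pid = 0;
    p->esp0 = (uintptr_t)p + THREAD_SIZE;  /* stack grows down from here */
    return p;
}

/* Recover the owning task from any stack pointer inside its block. */
struct task_model *task_from_sp(uintptr_t sp)
{
    return (struct task_model *)(sp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Bytes the stack can use before it runs into the task structure. */
size_t stack_bytes_available(const struct task_model *p)
{
    return THREAD_SIZE - sizeof(*p);
}
```

Note that task_from_sp() only works for addresses strictly below esp0,
since esp0 itself already belongs to the next block. The point for stack
overrun hunting is that nothing separates the stack from the task
structure: a stack that grows past stack_bytes_available() starts
overwriting the task's own bookkeeping.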
Thanks.
--
-bwb
Brent Baccala
[email protected]
If you are talking about the kernel stack for the child, alloc_task_struct()
allocates it in do_fork().
--
Amol
Brent Baccala <[email protected]> on 07/18/2001 01:46:27 PM
To: [email protected]
cc: (bcc: Amol Lad/HSS)
Subject: Do kernel threads need their own stack?