[with apologies to folks Cc'd, resent due to mis-autoexpanded l-k address
on the original posting ;-/ Mea culpa...]
There's an interesting ongoing project around kernel_thread() and
friends, including execve() variants. I really need help from architecture
maintainers on that one; I'd been able to handle (and test) quite a few
architectures on my own [alpha, arm, m68k, powerpc, s390, sparc, x86, um]
plus two more untested [frv, mn10300]. c6x patches had been supplied by
Mark Salter; everything else remains to be done. Right now it's at
minus 1.2KLoC, quite a bit of that removed from asm glue and other black magic.
What it promises:
* unified kernel_thread() implementation
* unified sys_execve() implementation
* no more struct pt_regs * passing for execve/fork/vfork/clone and stuff
called by those (do_execve(), do_fork() and downstream from those)
* unified kernel_execve() implementation (if we leave that function alive
at all; see below for details)
* much simpler and more uniform logic around kernel thread creation.
* lots and lots of code removed.
Some notes on kernel_thread() are needed to explain the rest;
first of all, it's really a very low-level interface with few users.
Practically everything is using either kthread_run() and its friends
(linux/kthread.h stuff, for things that really do work as kernel threads)
or call_usermodehelper() (linux/kmod.h stuff, for things that are
intended to set things up and exec, turning into a userland process).
Both are implemented via kernel_thread(), of course, but that's about
all we use kernel_thread() for. Outside of those mechanisms there are
only 3 users:
* original thread spawns future init via kernel_thread() before
turning itself into idle thread for boot CPU
* /linuxrc from initramfs is spawned via kernel_thread(); that's
trivially converted to call_usermodehelper_fns().
* powerpc has one kernel thread that should've been done via
kthread_run(), but is implemented via raw kernel_thread() instead.
That's all. Everything else is about kthread.c and kmod.c machinery.
In other words, kernel_thread() should be strictly kernel-internal.
Moreover, *becoming* a kernel thread is inherently racy and while we
have a leftover primitive for that (daemonize()), it's practically
unused. The only remaining caller is in drivers/staging and it can
easily be eliminated; it's done in kernel thread (kthread.h variety),
so all it does is change the thread name. I think we ought to kill
daemonize() and be done with that.
kernel_execve() is also very limited in visibility - it's used by kmod.c
machinery and pretty much everything else is using that instead.
The only exceptions right now are exec of userland init and exec of
/linuxrc; the latter actually should be using kmod.c stuff.
In other words, the desired situation is
* kernel threads spawn kernel threads via kernel_thread(); it
happens in a very limited number of places, everything else uses
kthread.h and kmod.h functions.
* userland processes spawn only userland processes. It's done
by sys_fork/sys_vfork/sys_clone/sys_clone2; none of those is ever called
by a kernel thread. No userland process ever calls kernel_thread().
* a kernel thread either runs forever or calls do_exit() or
does successful kernel_execve(). In the last case it becomes a userland
process.
* a userland process cannot become a kernel thread; it can ask
an existing kernel thread to spawn a new one (that's how kthread.h mechanism
works), but that's it.
* kernel_execve() is limited to kernel/*.c - not even seen in
linux/*.h
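The rules above can be restated as a toy model in C. This is only an
illustration of the intended invariants, not kernel code; the enum and
function names are made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, used only for this illustration. */
enum task_kind { KERNEL_THREAD, USER_PROCESS };
enum spawn_via { VIA_KERNEL_THREAD, VIA_FORK_FAMILY };

/*
 * Kernel threads spawn children only via kernel_thread(); userland
 * processes only via sys_fork/sys_vfork/sys_clone/sys_clone2.
 */
static bool spawn_allowed(enum task_kind parent, enum spawn_via via)
{
	if (parent == KERNEL_THREAD)
		return via == VIA_KERNEL_THREAD;
	return via == VIA_FORK_FAMILY;
}

/* The child starts out as the same kind of task as its parent. */
static enum task_kind child_kind(enum task_kind parent)
{
	return parent;
}

/*
 * The only change of kind is kernel thread -> userland process
 * (a successful kernel_execve()); the reverse never happens.
 */
static bool kind_change_allowed(enum task_kind from, enum task_kind to)
{
	return from == to || (from == KERNEL_THREAD && to == USER_PROCESS);
}
```

With that model, spawn_allowed(USER_PROCESS, VIA_KERNEL_THREAD) and
kind_change_allowed(USER_PROCESS, KERNEL_THREAD) are both false, which is
exactly the pair of things the list above forbids.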
We are fairly close to that already. Now, let's recall how process creation
of any kind works. Anything other than an idle thread is created by
do_fork(), called one way or another. do_fork() gets a new task_struct
and kernel stack allocated and calls copy_thread() to do the arch-dependent
work. Eventually, do_fork() makes the new process visible to scheduler
and off we go. New process is set up (by copy_thread()) so that it looks
as if it's just lost CPU in the middle of switch_to(). When scheduler
decides to pick it for execution, switch_to() done in a process losing CPU
resumes the execution of the newborn. However, it's *not* in the middle
of switch_to() - it doesn't have a call chain on its stack that would
contain schedule(), etc. Instead of faking all that, we pretend that
all we have in our call chain is ret_from_fork() and we are resuming the
execution in the beginning of ret_from_fork(), which does what schedule
would've done after calling switch_to() (i.e. calls
schedule_tail(previous_task)) and proceeds to normal return from syscall
codepath. There are variations on that, but they are fairly minor (e.g.
amd64 recognizes that there are only two addresses where execution could
be resumed after switch_to(), one much more frequent than another. So
instead of storing an address where we'll resume the task and jumping
to such address picked from the new task, it checks a thread flag and
does conditional branch to ret_from_fork, where the flag is immediately
removed. It's still the same logic, just much nicer on branch predictor
in CPU).
That's fine for fork()/clone()/vfork(); we have saved the register state
on syscall entry (into struct pt_regs on caller's kernel stack), we pass
it to copy_thread() and slap it on child's stack. When the child gets
born, it'll wake up in ret_from_fork() and proceed to return from syscall.
Which will pick said copy of registers from its stack, restore them as
usual and return to userland. If we have callee-saved registers we don't
bother to put into pt_regs (C part of the kernel won't have them changed
anyway, so why bother?) we just do a wrapper for sys_fork() that saves
those registers next to pt_regs on stack, so that copy_thread() would pick
them up as well and arrange for them to be set up - either by switch_to() or
by ret_from_fork. The only painful part is finding those pt_regs...
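A userland model of the fork() side of that, as a rough sketch. Everything
here is an assumption made for illustration: the three-field struct pt_regs,
the stack size, the placement of pt_regs at the very top of the stack and
the *_model names. Real layouts are architecture-specific.

```c
#include <assert.h>
#include <string.h>

#define THREAD_WORDS 1024	/* assumed kernel stack size, in words */

/* Toy register frame; the real struct pt_regs differs per architecture. */
struct pt_regs {
	unsigned long ret;	/* syscall return value register */
	unsigned long sp;	/* userland stack pointer */
	unsigned long pc;	/* userland resume address */
};

/* Toy task: a kernel stack plus the saved kernel stack pointer. */
struct task_model {
	unsigned long stack[THREAD_WORDS];
	void *ksp;		/* where switch_to() would pick the task up */
};

/* In this model pt_regs sit at the very top of the kernel stack. */
static struct pt_regs *task_regs_model(struct task_model *t)
{
	return (struct pt_regs *)(t->stack + THREAD_WORDS) - 1;
}

/*
 * Model of copy_thread() for fork/vfork/clone: slap the parent's saved
 * registers on the child's stack, make the child see 0 as the syscall
 * return value and optionally give it a new userland stack pointer.
 */
static void copy_thread_user_model(struct task_model *child,
				   const struct pt_regs *parent_regs,
				   unsigned long usp)
{
	struct pt_regs *childregs = task_regs_model(child);

	memcpy(childregs, parent_regs, sizeof(*childregs));
	childregs->ret = 0;
	if (usp)
		childregs->sp = usp;
	/* the child "wakes up" with nothing but pt_regs on its stack */
	child->ksp = childregs;
}
```

The return-from-syscall path then restores registers from that copy, which
is why the child comes back to userland at the same pc as the parent, just
with a different return value.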
For kernel threads it gets more hairy. Some (thankfully very few now - only
2 such architectures left in my tree) just issue a syscall instruction right
in the kernel mode. Another variant: set struct pt_regs in a local variable,
fill it so that it would look like one set by e.g. interrupt taken in kernel,
shove the kernel_thread() callback and its argument into a couple of registers,
set the "return address" at a helper function and call do_fork(). We are
still relying on the following convoluted sequence of events:
* copy_thread() sets the child's kernel stack according to what
we'd set in pt_regs. Would be nice if it could just do what it does for
normal fork(), but in practice it needs to check whether it's going to be
a kernel thread or not anyway.
* child gets woken up at the ret_from_fork() entry, does that
schedule_tail() call and goes to the path that would normally lead to
userland.
* due to the way we'd set pt_regs up, we'll actually do nothing
on that path other than setting all registers according to what's in
pt_regs, *NOT* switching out of the kernel mode and jumping to helper that'll
call the callback we'd passed to kernel_thread(). Taking its address (and
the value of its argument) from the registers we'd just set. Assuming we
didn't have a bug somewhere.
It's *way* too convoluted.
Simpler variant: have copy_thread() set things up to resume in
a separate function (ret_from_kernel_thread()) if we are dealing with the
kernel thread. Which function will call schedule_tail(), then call the
callback, then call do_exit(). Without going through the
return-to-userland-but-not-really contortions.
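The control flow of such a ret_from_kernel_thread() can be mimicked in
plain C. This is a userland sketch only: the real thing is a few lines of
per-architecture assembler, and the *_model functions below merely record
the order of events instead of doing any real work.

```c
#include <assert.h>
#include <string.h>

static char trace[64];	/* records the order of events */

static void note(const char *what)
{
	strcat(trace, what);
	strcat(trace, ";");
}

/* Stand-ins for the real functions; they only leave a trace. */
static void schedule_tail_model(void)
{
	note("schedule_tail");
}

static void do_exit_model(int code)
{
	(void)code;
	note("do_exit");
}

/* What copy_thread() would have stashed for the newborn kernel thread. */
struct kthread_start {
	int (*fn)(void *);
	void *arg;
};

/* schedule_tail(), then the callback, then do_exit(); nothing else. */
static void ret_from_kernel_thread_model(const struct kthread_start *s)
{
	schedule_tail_model();
	do_exit_model(s->fn(s->arg));
}

/* Sample kernel_thread() callback. */
static int payload(void *arg)
{
	note((const char *)arg);
	return 0;
}
```

Note that nothing in this path pretends to be a return to userland; the
callback is called directly.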
Even simpler: if we are spawning a kernel thread (which can be
checked as current->flags & PF_KTHREAD), don't even look at pt_regs we'd
got - it has been filled based on kernel_thread() callback and argument
anyway. Just pass those in two unsigned long arguments do_fork() blindly
passes to copy_thread() (normally one is used for userland stack pointer
and another is not used at all, other than on itanic sys_clone2()) and have
copy_thread() use them in kthread case. That variant has an additional
benefit of kernel_thread() being 100% identical on all architectures using it.
I have that done on architectures listed above; the rest needs
to be done.
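A sketch of that last variant, again as a userland model rather than the
code in the tree. The struct thread_model, the resume_point enum and the
*_model functions are invented for the illustration; the only thing taken
from the description above is the idea of smuggling the callback and its
argument through the two unsigned long arguments.

```c
#include <assert.h>
#include <stddef.h>

#define PF_KTHREAD 0x00200000	/* only used symbolically in this model */

enum resume_point { RET_FROM_FORK, RET_FROM_KERNEL_THREAD };

/* Hypothetical per-thread state set up by copy_thread(). */
struct thread_model {
	enum resume_point resume;
	int (*fn)(void *);	/* kernel thread callback, if any */
	void *arg;		/* its argument */
	unsigned long usp;	/* userland stack pointer, if any */
};

/*
 * In the kthread case the two unsigned long arguments carry the
 * callback and its argument; pt_regs are not looked at at all.
 * (Round-tripping a function pointer through unsigned long is what
 * the kernel relies on; it is not guaranteed by ISO C.)
 */
static void copy_thread_model(unsigned long current_flags,
			      unsigned long usp, unsigned long arg,
			      struct thread_model *child)
{
	if (current_flags & PF_KTHREAD) {
		child->resume = RET_FROM_KERNEL_THREAD;
		child->fn = (int (*)(void *))usp;
		child->arg = (void *)arg;
		child->usp = 0;
	} else {
		child->resume = RET_FROM_FORK;
		child->fn = NULL;
		child->arg = NULL;
		child->usp = usp;
	}
}

/* Nothing architecture-specific is left in here. */
static void kernel_thread_model(int (*fn)(void *), void *arg,
				struct thread_model *child)
{
	/* callers of kernel_thread() are kernel threads by construction */
	copy_thread_model(PF_KTHREAD, (unsigned long)fn,
			  (unsigned long)arg, child);
}

/* Sample callback for trying the model out. */
static int sample_fn(void *arg)
{
	return *(int *)arg + 1;
}
```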
The next thing is kernel_execve(). Some architectures take the
"issue a syscall instruction in kernel mode" approach. It kinda-sorta works,
but it complicates things a lot on the syscall exit path, as well as for
do_execve() and friends.
Another variant: set empty pt_regs in local variable, pass them to
do_execve() and if it succeeds, do black magic. Namely, copy them to normal
location, reset stack pointer so that it'll look like return from syscall
and jump to return from syscall. And black magic it is - *everything* starting
from "copy to normal location" has to be in assembler, since the normal
location might bloody well overlap the current stack frame! And you'd
better pray that nothing like unwinder sees the poor abused stack in the
middle of that fun.
Simpler variant: do the aforementioned conversion of kernel_thread()
and set the kernel stack pointer of child the same way whether it's a
kthread or not. Then we are guaranteed to have the normal pt_regs at the
location normal for process in the middle of syscall. Under the kernel
thread stack. Just pass the normal location of pt_regs to do_execve()
and the only black magic left is essentially a longjmp() all the way out to
stack frame of ret_from_kernel_thread(). I.e. reset the stack pointer
and off we go to return from syscall. I've gone that way; the remaining
bit of black magic is called ret_from_kernel_execve() and it's fairly
simple. Same architectures converted, the rest needs help.
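The longjmp() analogy can be taken literally in a userland model. The
sketch below uses ISO C setjmp()/longjmp() to stand in for "reset the stack
pointer and go to return from syscall"; the *_model names and the rule that
only "/sbin/init" can be exec'ed are assumptions made up for the example.

```c
#include <assert.h>
#include <setjmp.h>
#include <string.h>

/* Stands in for the stack frame of ret_from_kernel_thread(). */
static jmp_buf kthread_frame;

/* Pretend exec: succeeds only for one well-known path. */
static int do_execve_model(const char *path)
{
	return strcmp(path, "/sbin/init") == 0 ? 0 : -2;
}

/*
 * On failure return the error to the caller as usual. On success
 * throw away the whole call chain and resume in the outermost frame;
 * that jump plays the role of ret_from_kernel_execve().
 */
static int kernel_execve_model(const char *path)
{
	int err = do_execve_model(path);

	if (err)
		return err;
	longjmp(kthread_frame, 1);
}

/* A kernel_thread() callback that tries to become a userland process. */
static int init_payload(void *arg)
{
	return kernel_execve_model(arg);	/* returns only on failure */
}

/* Reports which way the thread ended up leaving. */
static const char *run_kernel_thread_model(int (*fn)(void *), void *arg)
{
	if (setjmp(kthread_frame))
		return "userland";	/* "return from syscall" path */
	fn(arg);
	return "exited";		/* do_exit() would be called here */
}
```

Calling run_kernel_thread_model(init_payload, (void *)"/sbin/init") takes
the longjmp() route, while any other path falls through to the exit route.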
If we get all architectures converted wrt kernel_thread(), we'll
be able to do even simpler:
* step 1: take do_exit() calls from the 30-odd assembler functions
into 3 or 4 places in thread payloads where they can return without calling
do_exit() themselves. All of which are in init/*.c and kernel/*.c
* step 2: turn ret_from_kernel_thread() instances into
"call schedule_tail(), call the callback, jump to return from syscall",
killing all ret_from_kernel_execve, turning kernel_execve() callers
(all 2 or 3 of them) into "call do_execve() on normal pt_regs" and letting
the kernel_thread() callbacks return if and only if they'd succeeded in
do_execve().
At that point we'll have all longjmp-like crap gone and ret_from_kernel_thread
becomes really very similar to ret_from_fork - the only difference is the
call of kernel_thread() callback before going to userland. If that callback
returns, that is, rather than doing do_exit()...
Further benefits of that will be that *all* do_execve() callers will
be passing it the normal location of pt_regs. So we can bloody well get rid
of that argument and just use current_pt_regs() when something like
load_elf_binary() really needs to access pt_regs. There goes struct pt_regs *
argument of do_execve(), ->load_binary(), etc.
A bit more about the black magic: the ways sys_execve() and friends
locate struct pt_regs to pass to do_{execve,fork} are definitely not fit
for describing in polite company. They are architecture-dependent and bloody
kludgy more often than not. Large part of the kludginess comes from the
architectures doing kernel_execve()/kernel_thread() as syscall-in-kernel-mode,
since you get pt_regs in unusual location. Switching to generic variants
gets rid of that, of course.
I've introduced a new helper in that series - current_pt_regs().
Usually it's the same thing as task_pt_regs(current), but it should
work at any time (task_pt_regs() is guaranteed to work only when the task
is stopped by ptrace). That, of course, allowed us to merge sys_execve()
on all converted architectures, with the result living in fs/exec.c and
having no magical struct pt_regs * arguments. The same thing happens
to compat variant of execve() on biarch targets, of course.
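To show why such a helper can be cheap, here is a userland model of one
possible layout: a size-aligned kernel stack with pt_regs at its very top,
so that the location can be computed from any address within the stack.
That layout, THREAD_SIZE, the four-word struct pt_regs and both function
names are assumptions for the sketch; actual definitions are
architecture-specific.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE ((uintptr_t)8192)	/* assumed kernel stack size */

/* Toy register frame; the real one differs per architecture. */
struct pt_regs {
	unsigned long regs[4];
};

/* Twice the stack size, so an aligned stack always fits inside. */
static unsigned long backing[2 * 8192 / sizeof(unsigned long)];

/* Base of the THREAD_SIZE-aligned model stack inside backing[]. */
static uintptr_t stack_base_model(void)
{
	return ((uintptr_t)backing + THREAD_SIZE - 1) & ~(THREAD_SIZE - 1);
}

/*
 * Given any stack pointer value within the stack, find pt_regs: mask
 * down to the base of the stack, go to its end, step back one frame.
 * No arguments need to be passed down the call chain for that.
 */
static struct pt_regs *pt_regs_from_sp_model(uintptr_t sp)
{
	uintptr_t base = sp & ~(THREAD_SIZE - 1);

	return (struct pt_regs *)(base + THREAD_SIZE) - 1;
}
```

In the kernel the "any address within the stack" would simply be the
current stack pointer, or the result would be derived from current.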
We could also get rid of struct pt_regs * in do_fork() once we are
done with the conversion; copy_thread() can bloody well do current_pt_regs()
itself. And yes, the cost of calculating it is going to be lower than
passing it through at least 3 function calls. On anything. We also can
do architecture-agnostic variants of sys_fork()/sys_clone()/sys_vfork(),
getting rid of another load of black magic and code duplication.
I've some further plans for that series, but that's really a separate
story; the above is more than enough for the coming cycle.
Right now the tree lives in
git.kernel.org/pub/scm/linux/kernel/git/viro/signal experimental-kernel_thread
Some of that had been in -next for a while.
Folks, help with review, testing and filling the missing bits, please.
Al,
On 1 October 2012 22:38, Al Viro <[email protected]> wrote:
> Right now the tree lives in
> git.kernel.org/pub/scm/linux/kernel/git/viro/signal experimental-kernel_thread
> Some of that had been in -next for a while.
>
> Folks, help with review, testing and filling the missing bits, please.
I tested this branch with the arm64 kernel and it runs ok (I get a couple
of compiler warnings because reparent_to_kthread and
__set_special_pids are no longer used with the removal of daemonize).
I have three arm64 patches on this branch:
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64.git execve
Catalin Marinas (3):
arm64: Use generic kernel_thread() implementation
arm64: Use generic kernel_execve() implementation
arm64: Use generic sys_execve() implementation
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/processor.h | 5 --
arch/arm64/include/asm/syscalls.h | 3 -
arch/arm64/include/asm/unistd.h | 3 +
arch/arm64/kernel/entry.S | 28 ++++++++++---
arch/arm64/kernel/process.c | 78 +++++++++++------------------------
arch/arm64/kernel/sys.c | 65 ------------------------------
arch/arm64/kernel/sys32.S | 7 +---
arch/arm64/kernel/sys_compat.c | 18 --------
9 files changed, 52 insertions(+), 156 deletions(-)
--
Catalin
On Fri, Oct 05, 2012 at 05:07:47PM +0100, Catalin Marinas wrote:
> Al,
>
> On 1 October 2012 22:38, Al Viro <[email protected]> wrote:
> > Right now the tree lives in
> > git.kernel.org/pub/scm/linux/kernel/git/viro/signal experimental-kernel_thread
> > Some of that had been in -next for a while.
> >
> > Folks, help with review, testing and filling the missing bits, please.
>
> I tested this branch with the arm64 kernel and runs ok (I get couple
> of compiler warnings because reparent_to_kthread and
> __set_special_pids are no longer used with the removal of daemonize).
Applied and pushed. Also there - folded fixes from jejb into parisc part,
optimizations of the fork-related paths in parisc on top of that (completely
untested), fixed the sparc64 breakage davem had spotted yesterday.
On 10/1/2012 5:38 PM, Al Viro wrote:
> There's an interesting ongoing project around kernel_thread() and
> friends, including execve() variants. I really need help from architecture
> maintainers on that one; I'd been able to handle (and test) quite a few
> architectures on my own [alpha, arm, m68k, powerpc, s390, sparc, x86, um]
> plus two more untested [frv, mn10300]. c6x patches had been supplied by
> Mark Salter; everything else remains to be done. Right now it's at
> minus 1.2KLoC, quite a bit of that removed from asm glue and other black
> magic.
I'll take a look at this for arch/tile this week.
--
Chris Metcalf, Tilera Corp.
http://www.tilera.com
On Tue, Oct 09, 2012 at 03:48:16PM -0400, Chris Metcalf wrote:
> On 10/1/2012 5:38 PM, Al Viro wrote:
> > There's an interesting ongoing project around kernel_thread() and
> > friends, including execve() variants. I really need help from architecture
> > maintainers on that one; I'd been able to handle (and test) quite a few
> > architectures on my own [alpha, arm, m68k, powerpc, s390, sparc, x86, um]
> > plus two more untested [frv, mn10300]. c6x patches had been supplied by
> > Mark Salter; everything else remains to be done. Right now it's at
> > minus 1.2KLoC, quite a bit of that removed from asm glue and other black
> > magic.
>
> I'll take a look at this for arch/tile this week.
Thanks. FWIW, changes since the last posting:
* untested conversion for cris added [me]
* conversion for mips added [Ralf has done execve side, I've added
kernel_thread() one]
* conversion for parisc added [aka "Al has generated broken patches,
jejb has tested and fixed that crap"]
* arm64 conversion added [Catalin Marinas]
IOW, right now we have
* alpha, arm, arm64, c6x, m68k, mips, parisc, powerpc, s390, sparc,
um, x86 - apparently over and done with (IOW, all old architectures except for
itanic are converted by now)
* avr32, blackfin, hexagon, h8300, ia64, m32r, microblaze, openrisc,
score, sh, tile, unicore32, xtensa - need to be done
* cris, frv, mn10300 - need to be tested (and very likely will need
fixing)
I just looked through the "powerpc: split ret_from_fork" commit in
your for-next branch, and I have a couple of comments.
First, on 64-bit powerpc, if kernel_thread() is called on a function
in a module, and that function returns, we'll then jump to do_exit
with r2 pointing to the module's TOC rather than the main kernel's
TOC, with disastrous results. It would be easily solved by adding a
ld r2,PACATOC(r13)
before the branch to .do_exit in the 64-bit ret_from_kernel_thread().
Maybe no-one ever calls kernel_thread() on a function in a module
today, but I don't want to rely on that being true forever, especially
since it's only one more instruction to guard against the possibility.
Secondly, while the code as you have it is correct, to my mind it
would be cleaner to put the function descriptor address in r14 on
64-bit, rather than the function text address, and then dereference it
to get the TOC and text address at the point of doing the call. That
would reduce the differences between 32-bit and 64-bit in
copy_thread() and make it more obvious that ret_from_kernel_thread()
is setting the right TOC value in r2.
In fact copy_thread() doesn't need to set r2 in the 32-bit case, since
_switch() sets r2 to the task_struct for the task it's switching to,
and ignores the r2 value in the stack frame. So with my suggestion
copy_thread() wouldn't need to set childregs->gpr[2] at all (for the
kernel thread case).
Paul.
On Thu, Oct 11, 2012 at 08:00:23PM +1100, Paul Mackerras wrote:
> I just looked through the "powerpc: split ret_from_fork" commit in
> your for-next branch, and I have a couple of comments.
>
> First, on 64-bit powerpc, if kernel_thread() is called on a function
> in a module, and that function returns, we'll then jump to do_exit
> with r2 pointing to the module's TOC rather than the main kernel's
> TOC, with disastrous results. It would be easily solved by adding a
>
> ld r2,PACATOC(r13)
>
> before the branch to .do_exit in the 64-bit ret_from_kernel_thread().
> Maybe no-one ever calls kernel_thread() on a function in a module
> today, but I don't want to rely on that being true forever, especially
> since it's only one more instruction to guard against the possibility.
It's unexported, and for a good reason. So no, there won't be any callers
in modules. The good reasons in question:
* it's trivial to convert any such caller to kthread_run() and
kthread.h in general
* there are only ~7 callers in the mainline, all but one in
{init,kernel}/*.c. Said exception is in arch/powerpc, also non-modular
and the conversion there is simply
- set_task_comm(current, "eehd");
in eeh_event_handler,
- if (kernel_thread(eeh_event_handler, NULL, CLONE_KERNEL) < 0)
+ if (IS_ERR(kthread_run(eeh_event_handler, NULL, "eehd")))
in eeh_thread_launcher, plus adding #include <linux/kthread.h>.
All there is to it.
* life is going to become much simpler if we change kernel_thread()
callback semantics (while leaving kthread.h and kmod.h behaviour intact)
and I'm going to do that later in the series. This call of do_exit() is
going away, to start with - from any asm glue. Better have it done in
kernel/kmod.c - all it takes is a couple of lines in there, everything
else already never returns, either by looping forever or by successful
execve(2) or by exit(2) (or panic(), in one case). Again, I'm talking about
kernel_thread() callbacks - kthread.h ones already have do_exit() done by
their caller (which is a kernel_thread() callback in kernel/kthread.c).
And then we can do even better - check signal.git#experimental-kernel_thread
and you'll see.
IOW, being able to treat that as internal low-level mechanism rather than
any kind of stable API for modules is a very good thing.
> Secondly, while the code as you have it is correct, to my mind it
> would be cleaner to put the function descriptor address in r14 on
> 64-bit, rather than the function text address, and then dereference it
> to get the TOC and text address at the point of doing the call. That
> would reduce the differences between 32-bit and 64-bit in
> copy_thread() and make it more obvious that ret_from_kernel_thread()
> is setting the right TOC value in r2.
Umm... Maybe, but let's do that as subsequent cleanup. Again,
we almost certainly don't need to mess with TOC at all - the callbacks
are in the main kernel, there are very few of them and they really are
low-level details of exported mechanisms (i.e. kthread_create/run/etc.
in kthread.h and call_usermode... in kmod.h). Again, we are talking
about out-of-tree modules, they've had a better mechanism for at least
6 years and conversion to it is bloody trivial. Hell, it was even
in late unlamented feature-removal-schedule.txt - since 2006. If that's
not enough to retire an export, what is?
On Thu, Oct 11, 2012 at 01:53:06PM +0100, Al Viro wrote:
> Umm... Maybe, but let's do that as subsequent cleanup. Again,
> we almost certainly don't need to mess with TOC at all - the callbacks
> are in the main kernel, there are very few of them and they really are
> low-level details of exported mechanisms (i.e. kthread_create/run/etc.
> in kthread.h and call_usermode... in kmod.h). Again, we are talking
> about out-of-tree modules, they had better mechanism for at least
> 6 years and conversion to it is bloody trivial. Hell, it was even
> in late unlamented feature-removal-schedule.txt - since 2006. If that's
> not enough to retire an export, what is?
OK... yes we can fix things up in a subsequent cleanup.
We will need to fix the TOC handling when we go to using multiple TOCs
in the main kernel, with the linker managing the transitions between
TOCs. Our toolchain guys have been pushing us to do that for years,
because it should make things run faster, but first we'll have to stop
using ld -r to combine objects in subdirectories.
Paul.
On Fri, Oct 12, 2012 at 11:16:33AM +1100, Paul Mackerras wrote:
> On Thu, Oct 11, 2012 at 01:53:06PM +0100, Al Viro wrote:
>
> > Umm... Maybe, but let's do that as subsequent cleanup. Again,
> > we almost certainly don't need to mess with TOC at all - the callbacks
> > are in the main kernel, there are very few of them and they really are
> > low-level details of exported mechanisms (i.e. kthread_create/run/etc.
> > in kthread.h and call_usermode... in kmod.h). Again, we are talking
> > about out-of-tree modules, they had better mechanism for at least
> > 6 years and conversion to it is bloody trivial. Hell, it was even
> > in late unlamented feature-removal-schedule.txt - since 2006. If that's
> > not enough to retire an export, what is?
>
> OK... yes we can fix things up in a subsequent cleanup.
>
> We will need to fix the TOC handling when we go to using multiple TOCs
> in the main kernel, with the linker managing the transitions between
> TOCs. Our toolchain guys have been pushing us to do that for years,
> because it should make things run faster, but first we'll have to stop
> using ld -r to combine objects in subdirectories.
How granular are you planning to make that? I mean, we are talking about
3 objects here - init/main.o, kernel/kthread.o and kernel/kmod.o. Do they
get TOC separate from that of arch/powerpc/kernel/entry_64.o?
Anyway, if ppc folks can live with that stuff in its current form for now,
here's the second signal.git pull request. Stuff in there: kernel_thread/
kernel_execve/sys_execve conversions for several more architectures plus
assorted signal fixes and cleanups. There'll be more (in particular, real
fixes for alpha do_notify_resume() irq mess)... Linus, could you pull that
queue? It's in the usual place -
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal for-linus
Shortlog:
Al Viro (38):
powerpc: split ret_from_fork
powerpc: switch to generic sys_execve()/kernel_execve()
m68k: split ret_from_fork(), simplify kernel_thread()
m68k: switch to generic sys_execve()/kernel_execve()
frv: split ret_from_fork, simplify kernel_thread() a lot
frv: switch to generic sys_execve()
frv: switch to generic kernel_execve
frv: switch to generic kernel_thread()
mn10300: split ret_from_fork, simplify kernel_thread()
mn10300: switch to generic sys_execve()
mn10300: switch to generic kernel_execve()
mn10300: convert to generic kernel_thread()
c6x: switch to generic kernel_thread()
xtensa: can't get to do_notify_resume() when user_mode(regs) is not true
mn10300: get rid of calling do_notify_resume() when returning to kernel mode
score: fix bogus restarts on sigreturn()
ia64: can't reach do_signal() when returning to kernel mode
mips: prevent hitting do_notify_resume() with !user_mode(regs)
mips: unobfuscate _TIF..._MASK
mips: merge the identical "return from syscall" per-ABI code
mips: NOTIFY_RESUME is not needed in TIF masks
unicore32: unobfuscate _TIF_WORK_MASK
bury _TIF_RESTORE_SIGMASK
sanitize tsk_is_polling()
bury the rest of TIF_IRET
parisc: fix double restarts
parisc: don't bother looping in do_signal()
parisc: decide whether to go to slow path (tracesys) based on thread flags
h8300: trim _TIF_WORK_MASK
unicore32: remove pointless test
x86: get rid of duplicate code in case of CONFIG_VM86
frv: no need to raise SIGTRAP in setup_frame()
mn10300: don't bother with SIGTRAP in setup_frame()
microblaze: don't bother with SIGTRAP in setup_rt_frame()
tile: don't bother with SIGTRAP in setup_frame
avr32: trim masks
m32r: trim masks
alpha: don't open-code trace_report_syscall_{enter,exit}
Greg Ungerer (1):
m68k: always set stack frame format for ColdFire on thread start
Mark Salter (3):
c6x: add ret_from_kernel_thread(), simplify kernel_thread()
c6x: switch to generic kernel_execve
c6x: switch to generic sys_execve
Richard Weinberger (1):
Uninclude linux/freezer.h
Diffstat:
arch/alpha/include/asm/thread_info.h | 3 +-
arch/alpha/kernel/entry.S | 13 ++--
arch/alpha/kernel/ptrace.c | 32 ++++-----
arch/arm/include/asm/thread_info.h | 2 -
arch/arm/kernel/signal.c | 1 -
arch/avr32/include/asm/thread_info.h | 18 ++---
arch/avr32/kernel/signal.c | 1 -
arch/blackfin/include/asm/thread_info.h | 4 -
arch/blackfin/kernel/signal.c | 1 -
arch/c6x/Kconfig | 1 +
arch/c6x/include/asm/processor.h | 2 -
arch/c6x/include/asm/syscalls.h | 5 --
arch/c6x/include/asm/thread_info.h | 1 -
arch/c6x/include/asm/unistd.h | 3 +
arch/c6x/kernel/asm-offsets.c | 1 -
arch/c6x/kernel/entry.S | 56 +++++++--------
arch/c6x/kernel/process.c | 72 +++-----------------
arch/cris/include/asm/thread_info.h | 3 -
arch/frv/Kconfig | 1 +
arch/frv/include/asm/processor.h | 9 +--
arch/frv/include/asm/ptrace.h | 1 +
arch/frv/include/asm/thread_info.h | 3 -
arch/frv/include/asm/unistd.h | 2 +
arch/frv/kernel/Makefile | 4 +-
arch/frv/kernel/entry.S | 13 ++++
arch/frv/kernel/frv_ksyms.c | 1 -
arch/frv/kernel/kernel_execve.S | 33 ---------
arch/frv/kernel/kernel_thread.S | 77 ---------------------
arch/frv/kernel/process.c | 66 ++++++------------
arch/frv/kernel/signal.c | 9 ---
arch/h8300/include/asm/thread_info.h | 7 +--
arch/h8300/kernel/signal.c | 1 -
arch/hexagon/include/asm/thread_info.h | 5 --
arch/hexagon/kernel/signal.c | 1 -
arch/ia64/include/asm/thread_info.h | 2 -
arch/ia64/kernel/signal.c | 8 --
arch/m32r/include/asm/thread_info.h | 9 +--
arch/m32r/kernel/signal.c | 3 -
arch/m68k/Kconfig | 1 +
arch/m68k/include/asm/processor.h | 25 +++----
arch/m68k/include/asm/ptrace.h | 2 +
arch/m68k/include/asm/unistd.h | 2 +
arch/m68k/kernel/entry.S | 16 +++++
arch/m68k/kernel/process.c | 104 +++++++----------------------
arch/m68k/kernel/sys_m68k.c | 17 -----
arch/microblaze/include/asm/thread_info.h | 3 +-
arch/microblaze/kernel/signal.c | 7 +--
arch/mips/include/asm/thread_info.h | 9 +--
arch/mips/kernel/entry.S | 15 +++-
arch/mips/kernel/scall32-o32.S | 13 +---
arch/mips/kernel/scall64-64.S | 13 +---
arch/mips/kernel/scall64-n32.S | 13 +---
arch/mips/kernel/scall64-o32.S | 13 +---
arch/mips/kernel/signal.c | 8 --
arch/mn10300/Kconfig | 1 +
arch/mn10300/include/asm/frame.inc | 2 +-
arch/mn10300/include/asm/processor.h | 18 +----
arch/mn10300/include/asm/ptrace.h | 1 +
arch/mn10300/include/asm/thread_info.h | 3 +-
arch/mn10300/include/asm/unistd.h | 2 +
arch/mn10300/kernel/Makefile | 4 +-
arch/mn10300/kernel/entry.S | 18 +++++
arch/mn10300/kernel/internal.h | 6 +--
arch/mn10300/kernel/kernel_execve.S | 37 ----------
arch/mn10300/kernel/kthread.S | 31 ---------
arch/mn10300/kernel/process.c | 91 ++++++-------------------
arch/mn10300/kernel/signal.c | 13 ----
arch/openrisc/include/asm/thread_info.h | 3 +-
arch/parisc/hpux/gate.S | 2 +-
arch/parisc/include/asm/thread_info.h | 5 +-
arch/parisc/kernel/signal.c | 45 ++++--------
arch/parisc/kernel/syscall.S | 9 ++-
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/processor.h | 3 -
arch/powerpc/include/asm/ptrace.h | 2 +
arch/powerpc/include/asm/syscalls.h | 3 -
arch/powerpc/include/asm/thread_info.h | 2 +
arch/powerpc/include/asm/unistd.h | 2 +
arch/powerpc/kernel/entry_32.S | 16 +++++
arch/powerpc/kernel/entry_64.S | 16 +++++
arch/powerpc/kernel/misc.S | 7 --
arch/powerpc/kernel/misc_32.S | 33 ---------
arch/powerpc/kernel/misc_64.S | 34 ---------
arch/powerpc/kernel/ppc_ksyms.c | 1 -
arch/powerpc/kernel/process.c | 59 ++++++++---------
arch/powerpc/kernel/signal_32.c | 1 -
arch/powerpc/kernel/sys_ppc32.c | 22 ------
arch/s390/include/asm/thread_info.h | 4 -
arch/score/include/asm/thread_info.h | 4 -
arch/score/kernel/signal.c | 1 +
arch/sh/include/asm/thread_info.h | 3 +
arch/sh/kernel/signal_32.c | 1 -
arch/sh/kernel/signal_64.c | 1 -
arch/sparc/include/asm/thread_info_32.h | 3 +-
arch/sparc/include/asm/thread_info_64.h | 3 +
arch/tile/kernel/compat_signal.c | 9 ---
arch/tile/kernel/signal.c | 12 +---
arch/um/include/asm/thread_info.h | 3 -
arch/unicore32/include/asm/thread_info.h | 4 +-
arch/unicore32/kernel/entry.S | 2 -
arch/unicore32/kernel/signal.c | 1 -
arch/x86/kernel/entry_32.S | 17 ++---
arch/xtensa/include/asm/thread_info.h | 5 --
arch/xtensa/kernel/signal.c | 4 -
kernel/sched/core.c | 2 +-
105 files changed, 363 insertions(+), 944 deletions(-)
delete mode 100644 arch/frv/kernel/kernel_execve.S
delete mode 100644 arch/frv/kernel/kernel_thread.S
delete mode 100644 arch/mn10300/kernel/kernel_execve.S
delete mode 100644 arch/mn10300/kernel/kthread.S
On Fri, Oct 12, 2012 at 02:09:58AM +0100, Al Viro wrote:
>
> How granular are you planning to make that? I mean, we are talking about
> 3 objects here - init/main.o, kernel/kthread.o and kernel/kmod.o. Do they
> get TOC separate from that of arch/powerpc/kernel/entry_64.o?
Potentially, yes, it would be up to the linker.
> Anyway, if ppc folks can live with that stuff in its current form for now,
Yes, we can.
Paul.
On Fri, 2012-10-12 at 02:09 +0100, Al Viro wrote:
> Anyway, if ppc folks can live with that stuff in its current form for now,
> here's the second signal.git pull request. Stuff in there: kernel_thread/
> kernel_execve/sys_execve conversions for several more architectures plus
> assorted signal fixes and cleanups. There'll be more (in particular, real
> fixes for alpha do_notify_resume() irq mess)... Linus, could you pull that
> queue? It's in the usual place -
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal for-linus
Yes, we can live with what's in there, it won't break anything. We can
do the proposed cleanups subsequently.
Cheers,
Ben.
On Mon, Oct 01, 2012 at 10:38:09PM +0100, Al Viro wrote:
> [with apologies for folks Cc'd, resent due to mis-autoexpanded l-k address
> on the original posting ;-/ Mea culpa...]
>
> There's an interesting ongoing project around kernel_thread() and
> friends, including execve() variants. I really need help from architecture
> maintainers on that one; I'd been able to handle (and test) quite a few
> architectures on my own [alpha, arm, m68k, powerpc, s390, sparc, x86, um]
> plus two more untested [frv, mn10300]. c6x patches had been supplied by
> Mark Salter; everything else remains to be done. Right now it's at
> minus 1.2KLoC, quite a bit of that removed from asm glue and other black magic.
Update:
* all infrastructure is in mainline now, along with the conversion of
kernel_thread() callbacks to a form that allows a really simple model for
kernel_execve() _without_ flagday changes.
* #experimental-kernel_thread is gone; this stuff is in for-next
now.
* a lot of architecture conversions have been done and some are
even tested. Currently missing are only 7 - avr32, hexagon, m32r, openrisc,
score, tile and xtensa. OTOH, a lot are completely untested. I've put
per-architecture stuff into separate branches and I promise never to rebase
those once arch maintainers are OK with the stuff in them. IOW, they'll
be safe to pull into respective architecture trees.
Folks, *please* review the stuff in signal.git#arch-*. All of them are
completely independent. I'll be glad to get ACKs/fixes/replacements/etc.
I've merged some of those into for-next, but that can change at any time -
it's not final; for-next will be rebased. Obviously, I hope to get to
the situation when all of those branches (plus currently missing ones)
get into shape that satisfies architecture maintainers. Once that happens,
all those branches will be merged into for-next.
I think the model is about final wrt kernel_thread()/kernel_execve()/
sys_execve(). There's one possible change on top of it, but it's reasonably
well-isolated from the rest. As it is, the model to aim for is this:
* select GENERIC_KERNEL_THREAD and GENERIC_KERNEL_EXECVE
* kill local kernel_thread()/kernel_execve() implementations
* generic kernel_thread() will call your copy_thread() with
NULL regs and fn/arg passed in the pair of arguments that are blindly
passed all the way through to copy_thread() - usp and stack_size resp.
In such case copy_thread() should arrange for the newborn to be woken
up in a function that is very similar to ret_from_fork(). The only
difference is that between the call of schedule_tail() and jumping into
the "return from syscall" code it should call fn(arg), using the data
left for it by copy_thread().
* unlike the previous variant, ret_from_kernel_execve() is not
needed at all; no need to play longjmp()-like games when kernel_thread()
callbacks had been taught to return normally all the way out when
kernel_execve() returns 0; any updates of sp/manipulations of register
windows/etc. will happen without any magic.
* provide current_pt_regs() if needed. The default is
task_pt_regs(current), but you might want to optimize it; unlike
task_pt_regs(), it must work whenever we are in a syscall or in a kernel thread.
task_pt_regs(task), OTOH, is required to work only when the task can be
interrogated by a tracer.
* no more syscalls-from-kernel, which often allows for simplifications
in the syscall entry/exit logic. I haven't done any of those; that's up to
the architecture maintainers.
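To make the copy_thread() convention above concrete, here is a minimal
userland sketch in C. It is not kernel code: struct thread_model,
copy_thread_model(), first_wakeup_model() and bump() are hypothetical names
invented for illustration, struct pt_regs is cut down to two made-up fields,
and uintptr_t stands in for the unsigned long the kernel passes. It only
shows the two cases: regs == NULL means "kernel thread, fn/arg arrive in
usp/stack_size", anything else is the ordinary fork path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for an arch-specific type. */
struct pt_regs {
	unsigned long sp;
	unsigned long syscall_ret;
};

/* Hypothetical per-thread state that the first wakeup will consume. */
struct thread_model {
	int (*fn)(void *);	/* payload; NULL for a forked user task */
	void *arg;
	struct pt_regs regs;	/* state the "return from syscall" path restores */
};

/*
 * Model of copy_thread().  With regs == NULL the caller is the generic
 * kernel_thread(), and usp/stack_size carry fn/arg instead of their usual
 * meaning.  Otherwise this is fork()/clone()/vfork() and regs is userland
 * state to be copied for the child.
 */
void copy_thread_model(struct thread_model *child, uintptr_t usp,
		       uintptr_t stack_size, const struct pt_regs *regs)
{
	memset(child, 0, sizeof(*child));
	if (!regs) {
		child->fn = (int (*)(void *))usp;
		child->arg = (void *)stack_size;
		return;
	}
	child->regs = *regs;
	child->regs.syscall_ret = 0;	/* the child sees 0 from fork() */
	if (usp)
		child->regs.sp = usp;	/* clone() with a new stack */
}

/*
 * Model of the ret_from_fork()-like entry point: the schedule_tail() step,
 * then fn(arg) if there is one, then the common "return from syscall" exit.
 * Returns 1 if a payload was called and 0 otherwise, purely for illustration.
 */
int first_wakeup_model(struct thread_model *t)
{
	int called = 0;

	/* schedule_tail() would run here */
	if (t->fn) {
		/* in the model described above, fn comes back only when the
		 * thread is about to leave for userland */
		t->fn(t->arg);
		called = 1;
	}
	/* both cases fall through to the same return-to-userland path */
	return called;
}

/* Example payload: a hypothetical thread function that bumps a counter. */
int bump(void *p)
{
	++*(int *)p;
	return 0;
}
```

Calling copy_thread_model(&t, (uintptr_t)bump, (uintptr_t)&counter, NULL)
followed by first_wakeup_model(&t) runs bump() once; passing a non-NULL
regs instead leaves t.fn NULL and skips the call. The point of the single
entry function is that the kernel-thread case and the fork case share the
exit path, so no separate ret_from_kernel_execve() is needed.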
One thing to keep in mind is that right now on SMP architectures
there's the third caller of copy_thread(), besides fork()/clone()/vfork()
(all pass userland pt_regs, with the address being current_pt_regs()) and
kernel_thread() (pass NULL pt_regs, kthread creation time). It's fork_idle()
and it passes zero-filled pt_regs. Frankly, I'm not even sure we want to
call copy_thread() in that case - the stuff set up by it goes nowhere.
We do that for each possible secondary CPU on SMP and we do *not* expose
those threads to scheduler. When CPU gets initialized we have the
secondary bootstrap take that task_struct as current. Its kernel stack,
thread_info, etc. are set up by said secondary bootstrap, overriding whatever
copy_thread() has done. Eventually the bootstrap reaches cpu_idle(),
which is where we schedule away. switch_to() done by schedule() is what
completes setting the things up; at that point they are ready to be woken
up - and not in ret_from_fork(), of course.
For the majority of architectures nothing done by copy_thread() in
that case is used afterwards, so we might as well stop calling it when
copy_process() is called by fork_idle(). I know of only one dubious case -
powerpc sets thread->ksp_limit in copy_thread() and I'm not sure if
that gets overwritten in secondary bootstrap - the value would be still
correct and I don't see any obvious places where it would be reassigned
on that codepath. There might be other cases like that, though. I would
argue that for this kind of stuff the right place is arch_dup_task_struct(),
not copy_thread()... Hell knows. Note that we are pretty much hitting
the random path in copy_thread() in that case - what zeroed pt_regs look
like to user_regs() is arch-dependent.
This is the possible change I've mentioned above. Not sure; I'd
really like comments on that one.
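A small illustration of why zero-filled pt_regs lands on an arbitrary path.
This is a sketch under loud assumptions: the two "architectures" below are
hypothetical, only loosely modeled on a privilege-level-in-segment-register
scheme and a supervisor-bit-in-status-register scheme, and the check is
written as a user_mode()-style helper (which helper the text above means by
user_regs() is an assumption here). The same all-zero register frame reads
as "kernel" on one and "userland" on the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical arch A: privilege level lives in the low two bits of a
 * segment register, and 3 means userland.
 */
struct pt_regs_a {
	unsigned long cs;
};

bool user_mode_a(const struct pt_regs_a *regs)
{
	return (regs->cs & 3) == 3;
}

/*
 * Hypothetical arch B: a supervisor bit in the status register; the bit
 * being clear means userland.
 */
#define SUPERVISOR_BIT_B 0x2000UL

struct pt_regs_b {
	unsigned long sr;
};

bool user_mode_b(const struct pt_regs_b *regs)
{
	return !(regs->sr & SUPERVISOR_BIT_B);
}
```

With both structures zeroed via memset(), user_mode_a() returns false and
user_mode_b() returns true, so a copy_thread() that branches on such a
check would take the kernel-thread path on A and the fork-from-userland
path on B for the very same fork_idle() call.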
Branches in there:
arch-blackfin   - conversion; completely untested
arch-cris       - conversion; completely untested
arch-h8300      - conversion; completely untested
arch-microblaze - conversion; completely untested
arch-sh         - conversion; completely untested
arch-unicore32  - conversion; completely untested
arch-ia64       - conversion; tested only on ski, which is worth very little
arch-c6x        - followup to mainline; while it's minor, it's pretty much done
                  blindly and *really* needs review by maintainer.
arch-arm        - contains heroic fix by rmk and nothing else. Seems to work fine.
arch-m68k       - minor followup to stuff already in mainline; works on aranym
arch-parisc     - mostly the stuff tested by parisc folks + minor followup
                  similar to m68k one.
arch-s390       - minor followup to mainline; works in hercules
arch-arm64      - patches from maintainer with minor followup folded
arch-frv        - minor followup to mainline, needs testing
arch-mn10300    - minor followup to mainline, needs testing
arch-mips       - patches from me and Ralf; works on qemu
arch-sparc      - conversions for sparc32 and sparc64, plus the syscall_noerror
                  optimization
arch-powerpc    - minor followups to mainline, need review by maintainers
"Completely untested" in the above reads "no promises it even compiles, let
alone isn't horribly broken". Please, treat that as a possible starting
point for doing the conversion for arch in question. I might have misread
the CPU manuals, your switch_to() implementation, etc., or just have been
temporarily insane from digging through dozens of architectures. Hopefully
temporary, that is...
And folks, for pity's sake, do the remaining seven. The merge window is
over, so...
Al, buggering off to get some VFS work done.
Hi Al,
On 15/10/12 11:30, Al Viro wrote:
> On Mon, Oct 01, 2012 at 10:38:09PM +0100, Al Viro wrote:
>> [with apologies for folks Cc'd, resent due to mis-autoexpanded l-k address
>> on the original posting ;-/ Mea culpa...]
>>
>> There's an interesting ongoing project around kernel_thread() and
>> friends, including execve() variants. I really need help from architecture
>> maintainers on that one; I'd been able to handle (and test) quite a few
>> architectures on my own [alpha, arm, m68k, powerpc, s390, sparc, x86, um]
>> plus two more untested [frv, mn10300]. c6x patches had been supplied by
>> Mark Salter; everything else remains to be done. Right now it's at
>> minus 1.2KLoC, quite a bit of that removed from asm glue and other black magic.
>
> Update:
> * all infrastructure is in mainline now, along with conversion for
> kernel_thread() callbacks to the form that allows really simple model for
> kernel_execve() _without_ flagday changes.
> * #experimental-kernel_thread is gone; this stuff is in for-next
> now.
> * a lot of architecture conversions had been done and some are
> even tested. Currently missing are only 7 - avr32, hexagon, m32r, openrisc,
> score, tile and xtensa. OTOH, a lot are completely untested. I've put
> per-architecture stuff into separate branches and I promise never rebase
> those once arch maintainers will be OK with the stuff in them. IOW, they'll
> be safe to pull into respective architecture trees.
>
> Folks, *please* review the stuff in signal.git#arch-*. All of them are
> completely independent. I'll be glad to get ACKs/fixes/replacements/etc.
I have checked arch-m68k on ColdFire with and without MMU, and it
is all fine. So for those:
Acked-by: Greg Ungerer <[email protected]>
Regards
Greg
--
------------------------------------------------------------------------
Greg Ungerer -- Principal Engineer EMAIL: [email protected]
SnapGear Group, McAfee PHONE: +61 7 3435 2888
8 Gardner Close FAX: +61 7 3217 5323
Milton, QLD, 4064, Australia WEB: http://www.SnapGear.com
Hi Al,
On 15 October 2012 02:30, Al Viro <[email protected]> wrote:
> arch-arm64 - patches from maintainer with minor followup folded
Thanks for updating the arm64 branch. I've adapted the changes, tested
and folded them into the branch below (the AArch64 instruction set
does not have conditional instructions):
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64.git
execve
Top commit: 6a872777ffff6184f4ac10bd71d926d5e6f2491e (arm64: Use
generic sys_execve() implementation)
The diff between your branch and mine (including one cosmetic change
as the ARM64 Kconfig selects are alphabetically sorted):
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 31e3b5e..75b212d 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -6,8 +6,8 @@ config ARM64
 	select GENERIC_IOMAP
 	select GENERIC_IRQ_PROBE
 	select GENERIC_IRQ_SHOW
-	select GENERIC_KERNEL_THREAD
 	select GENERIC_KERNEL_EXECVE
+	select GENERIC_KERNEL_THREAD
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_TIME_VSYSCALL
 	select HARDIRQS_SW_RESEND
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index c3d650a..6165318 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -611,10 +611,10 @@ ENDPROC(ret_to_user)
  */
 ENTRY(ret_from_fork)
 	bl	schedule_tail
-	cmp	x19, #0
-	movne	x0, x20
-	blne	x19
-	get_thread_info tsk
+	cbz	x19, 1f				// not a kernel thread
+	mov	x0, x20
+	blr	x19
+1:	get_thread_info tsk
 	b	ret_to_user
 ENDPROC(ret_from_fork)
--
Catalin
On Wed, Oct 17, 2012 at 03:02:15PM +0100, Catalin Marinas wrote:
> Hi Al,
>
> On 15 October 2012 02:30, Al Viro <[email protected]> wrote:
> > arch-arm64 - patches from maintainer with minor followup folded
>
> Thanks for updating the arm64 branch. I've adapted the changes, tested
> and folded them into the branch below (the AArch64 instruction set
> does not have conditional instructions):
>
> git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64.git
> execve
Fetched and merged, thanks. I'll probably drop the arch-arm64 branch, now
that I have yours directly merged. In any case, right now arch-arm64
points to the current head of yours and for-next has
commit 0cea479343cbfaec3a9b8accdeadae53322d98a8
Merge: 43cc05a 6a87277
Author: Al Viro <[email protected]>
Date: Wed Oct 17 12:29:32 2012 -0400
Merge branch 'execve' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch
in it...
On Wed, Oct 17, 2012 at 05:34:24PM +0100, Al Viro wrote:
> Fetched and merged, thanks. I'll probably drop arch-arm64 branch, now
> that I have yours directly merged. In any case, right now arch-arm64
> points to the current head or yours and for-next has
>
> commit 0cea479343cbfaec3a9b8accdeadae53322d98a8
> Merge: 43cc05a 6a87277
> Author: Al Viro <[email protected]>
> Date: Wed Oct 17 12:29:32 2012 -0400
>
> Merge branch 'execve' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch
>
> in it...
Sounds fine. Thanks.
--
Catalin