2010-06-01 19:24:33

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

Albert Cahalan [[email protected]] wrote:
| Sukadev Bhattiprolu writes:
|
| > Randy Dunlap [randy.dunlap at oracle.com] wrote:
| >>> base of the region allocated for stack. These architectures
| >>> must pass in the size of the stack-region in ->child_stack_size.
| >>
| >> stack region
| >>
| >> Seems unfortunate that different architectures use
| >> the fields differently.
| >
| > Yes and no. The field still has a single purpose, just that
| > some architectures may not need it. We enforce that if unused
| > on an architecture, the field must be 0. It looked like
| > the easiest way to keep the API common across architectures.
|
| Yuck. You're forcing userspace to have #ifdef messes or,
| more likely, just not work on all architectures.

There is going to be #ifdef code in the library interface to eclone().
But applications should not need any #ifdefs. Please see the test cases
for eclone in

git://git.sr71.net/~hallyn/cr_tests.git

There is no #ifdef and the tests work on x86, x86_64, ppc, s390.

These use the libeclone.a built from following git-tree, which has the
arch-dependent user space code.

git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Is that the #ifdef mess you are talking about ? I don't see that as a
consequence of the API. So maybe you can elaborate.

| There is no reason to have field usage vary by architecture. The

The field usage does not vary by architecture. Some architectures
don't use some fields and those fields must be 0. A simple

memset(&clone_args, 0, sizeof(clone_args))

before initializing fields is all that is required.

| original clone syscall was not designed with ia64 and hppa
| in mind, and has been causing trouble ever since. Let's not
| perpetuate the problem.

and lot of folks contributed to this new API to try and make sure
it is portable and meets the forseeable requirements.

|
| Given code like this: stack_base = malloc(stack_size);
| stack_base and stack_size are what the kernel needs.
|
| I suspect that you chose the defective method for some reason
| related to restarting processes that were created with the
| older system calls. I can't say most of us even care, but in
| that broken-already case your process restarter can make up
| some numbers that will work. (for i386, the base could be the
| lowest address in the vma in which %esp lies, or even address 0)

I don't understand how "making up some numbers (pids) that will work"
is more portable/cleaner than the proposed eclone().

Sukadev


2010-06-01 19:59:52

by Albert Cahalan

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

On Tue, Jun 1, 2010 at 3:32 PM, Sukadev Bhattiprolu
<[email protected]> wrote:
> Albert Cahalan [[email protected]] wrote:
> | Sukadev Bhattiprolu writes:
> | > Randy Dunlap [randy.dunlap at oracle.com] wrote:

> | >>> base of the region allocated for stack. These architectures
> | >>> must pass in the size of the stack-region in ->child_stack_size.
> | >>
> | >> stack region
> | >>
> | >> Seems unfortunate that different architectures use
> | >> the fields differently.
> | >
> | > Yes and no. The field still has a single purpose, just that
> | > some architectures may not need it. We enforce that if unused
> | > on an architecture, the field must be 0. It looked like
> | > the easiest way to keep the API common across architectures.
> |
> | Yuck. You're forcing userspace to have #ifdef messes or,
> | more likely, just not work on all architectures.
>
> There is going to be #ifdef code in the library interface to eclone().
> But applications should not need any #ifdefs. Please see the test cases
> for eclone in
>
> git://git.sr71.net/~hallyn/cr_tests.git
>
> There is no #ifdef and the tests work on x86, x86_64, ppc, s390.

Come on, seriously, you know it's ia64 and hppa that
have issues. Maybe the nommu ports also have issues.

The only portable way to specify the stack is base and offset,
with flags or magic values for "share" and "kernel managed".

> | There is no reason to have field usage vary by architecture. The
>
> The field usage does not vary by architecture. Some architectures
> don't use some fields and those fields must be 0.

It looks like you contradict yourself. Please explain how
those two sentences are compatible.

> | original clone syscall was not designed with ia64 and hppa
> | in mind, and has been causing trouble ever since. Let's not
> | perpetuate the problem.
>
> and lot of folks contributed to this new API to try and make sure
> it is portable and meets the forseeable requirements.

Right, and some folks were ignored.

> | Given code like this: stack_base = malloc(stack_size);
> | stack_base and stack_size are what the kernel needs.
> |
> | I suspect that you chose the defective method for some reason
> | related to restarting processes that were created with the
> | older system calls. I can't say most of us even care, but in
> | that broken-already case your process restarter can make up
> | some numbers that will work. (for i386, the base could be the
> | lowest address in the vma in which %esp lies, or even address 0)
>
> I don't understand how "making up some numbers (pids) that will work"
> is more portable/cleaner than the proposed eclone().

It isolates the cross-platform problems to an obscure tool
instead of polluting the kernel interface that everybody uses.

2010-06-02 01:30:37

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

| Come on, seriously, you know it's ia64 and hppa that
| have issues. Maybe the nommu ports also have issues.
|
| The only portable way to specify the stack is base and offset,
| with flags or magic values for "share" and "kernel managed".

Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
comes in.

But are you saying that we should force x86 and other architectures to
specify base and offset for eclone() even though they currently specify
just the stack pointer to clone() ?

That would remove the ifdef, but could be a big change to applications
on x86 and other architectures.

|
| > | There is no reason to have field usage vary by architecture. The
| >
| > The field usage does not vary by architecture. Some architectures
| > don't use some fields and those fields must be 0.
|
| It looks like you contradict yourself. Please explain how
| those two sentences are compatible.
|
| > | original clone syscall was not designed with ia64 and hppa
| > | in mind, and has been causing trouble ever since. Let's not
| > | perpetuate the problem.
| >
| > and lot of folks contributed to this new API to try and make sure
| > it is portable and meets the forseeable requirements.
|
| Right, and some folks were ignored.

I don't think your comment was ignored. The ->child_stack_size field was
added specifically for IA64 and my understanding was that ->clone_flags_high
could be used to specify the "kernel managed" or "shared" mode you mention
above.

| >
| > I don't understand how "making up some numbers (pids) that will work"
| > is more portable/cleaner than the proposed eclone().
|
| It isolates the cross-platform problems to an obscure tool
| instead of polluting the kernel interface that everybody uses.

Sure, there was talk about using an approach like /proc/<pid>/next_pid
where you write your target pid into the file and the next time you
fork() you get that target pid. But it was considered racy and ugly.

Sukadev

2010-06-05 11:50:16

by Albert Cahalan

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
<[email protected]> wrote:
> | Come on, seriously, you know it's ia64 and hppa that
> | have issues. Maybe the nommu ports also have issues.
> |
> | The only portable way to specify the stack is base and offset,
> | with flags or magic values for "share" and "kernel managed".
>
> Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
> comes in.
>
> But are you saying that we should force x86 and other architectures to
> specify base and offset for eclone() even though they currently specify
> just the stack pointer to clone() ?

Even for x86, it's an easier API. Callers would be specifying
two numbers they already have: the argument and return value
for malloc. Currently the numbers must be added together,
destroying information, except on hppa (must not add size)
and ia64 (must use what I'm proposing already).

This also provides the opportunity for the kernel (perhaps not
in the initial implementation) to have a bit of extra info about
some processes. The info could be supplied to gdb, used to
harden the system against some types of security exploits,
presented in /proc, and so on.

> That would remove the ifdef, but could be a big change to applications
> on x86 and other architectures.

It's no change at all until somebody decides to use the new
system call. At that point, you're making changes anyway.
It's certainly not a big change compared to eclone() itself.

> | > I don't understand how "making up some numbers (pids) that will work"
> | > is more portable/cleaner than the proposed eclone().
> |
> | It isolates the cross-platform problems to an obscure tool
> | instead of polluting the kernel interface that everybody uses.
>
> Sure, there was talk about using an approach like /proc/<pid>/next_pid
> where you write your target pid into the file and the next time you
> fork() you get that target pid. But it was considered racy and ugly.

Oh, you misunderstood what I meant by making up numbers
and I didn't catch it. I wasn't meaning PID numbers. I was meaning
stack numbers for processes that your strange tool is restarting.

You ignored my long-ago request to use base/size to specify
the stack. My guess was that this was because you're focused
on restarting processes, many of which will lack stack base info.
I thus suggested that you handle this obscure legacy case by
making up some reasonable numbers.

For example, suppose a process allocates 0x40000000 to
0x7fffffff (a 1 GiB chunk) and uses 0x50000000 to 0x5fffffff as
a thread stack. If done using the old clone() syscall on i386,
you're only told that 0x5fffffff is the last stack address. You
know nothing of 0x50000000. Your tool can see the size and
base of the whole mapping though, so 0x40000000...0x5fffffff
is a reasonable place to assume the stack lives. You therefore
call eclone with base=0x40000000 size=0x2000000 when
restarting the process.

For everybody NOT writing an obscure tool to restart processes,
my requested change eliminates #ifdef mess and/or needless
failure to support some architectures.

Right now user code must be like this:

base=malloc(size);
#if defined(__hppa__)
tid=clone(fn,base,flags,arg);
#elif defined(__ia64__)
tid=clone2(fn,base,size,flags,arg);
#else
tid=clone(fn,base+size,flags,arg);
#endif

The man page is likewise messy.

Note that if clone2 were available for all architectures,
we wouldn't have this mess. Let's not perpetuate the
mistakes that led to the mess. Please provide an API
that, like clone2, uses base and size. It'll work for every
architecture. It'll even be less trouble to document.

2010-06-05 12:01:49

by Albert Cahalan

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
<[email protected]> wrote:
> | Come on, seriously, you know it's ia64 and hppa that
> | have issues. Maybe the nommu ports also have issues.
> |
> | The only portable way to specify the stack is base and offset,
> | with flags or magic values for "share" and "kernel managed".
>
> Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
> comes in.
>
> But are you saying that we should force x86 and other architectures to
> specify base and offset for eclone() even though they currently specify
> just the stack pointer to clone() ?

Even for x86, it's an easier API. Callers would be specifying
two numbers they already have: the argument and return value
for malloc. Currently the numbers must be added together,
destroying information, except on hppa (must not add size)
and ia64 (must use what I'm proposing already).

This also provides the opportunity for the kernel (perhaps not
in the initial implementation) to have a bit of extra info about
some processes. The info could be supplied to gdb, used to
harden the system against some types of security exploits,
presented in /proc, and so on.

> That would remove the ifdef, but could be a big change to applications
> on x86 and other architectures.

It's no change at all until somebody decides to use the new
system call. At that point, you're making changes anyway.
It's certainly not a big change compared to eclone() itself.

> | > I don't understand how "making up some numbers (pids) that will work"
> | > is more portable/cleaner than the proposed eclone().
> |
> | It isolates the cross-platform problems to an obscure tool
> | instead of polluting the kernel interface that everybody uses.
>
> Sure, there was talk about using an approach like /proc/<pid>/next_pid
> where you write your target pid into the file and the next time you
> fork() you get that target pid. But it was considered racy and ugly.

Oh, you misunderstood what I meant by making up numbers
and I didn't catch it. I wasn't meaning PID numbers. I was meaning
stack numbers for processes that your strange tool is restarting.

You ignored my long-ago request to use base/size to specify
the stack. My guess was that this was because you're focused
on restarting processes, many of which will lack stack base info.
I thus suggested that you handle this obscure legacy case by
making up some reasonable numbers.

For example, suppose a process allocates 0x40000000 to
0x7fffffff (a 1 GiB chunk) and uses 0x50000000 to 0x5fffffff as
a thread stack. If done using the old clone() syscall on i386,
you're only told that 0x5fffffff is the last stack address. You
know nothing of 0x50000000. Your tool can see the size and
base of the whole mapping though, so 0x40000000...0x5fffffff
is a reasonable place to assume the stack lives. You therefore
call eclone with base=0x40000000 size=0x2000000 when
restarting the process.

For everybody NOT writing an obscure tool to restart processes,
my requested change eliminates #ifdef mess and/or needless
failure to support some architectures.

Right now user code must be like this:

base=malloc(size);
#if defined(__hppa__)
tid=clone(fn,base,flags,arg);
#elif defined(__ia64__)
tid=clone2(fn,base,size,flags,arg);
#else
tid=clone(fn,base+size,flags,arg);
#endif

The man page is likewise messy.

Note that if clone2 were available for all architectures,
we wouldn't have this mess. Let's not perpetuate the
mistakes that led to the mess. Please provide an API
that, like clone2, uses base and size. It'll work for every
architecture. It'll even be less trouble to document.

2010-06-05 12:09:00

by Albert Cahalan

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
<[email protected]> wrote:
> | Come on, seriously, you know it's ia64 and hppa that
> | have issues. Maybe the nommu ports also have issues.
> |
> | The only portable way to specify the stack is base and offset,
> | with flags or magic values for "share" and "kernel managed".
>
> Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
> comes in.
>
> But are you saying that we should force x86 and other architectures to
> specify base and offset for eclone() even though they currently specify
> just the stack pointer to clone() ?

Even for x86, it's an easier API. Callers would be specifying
two numbers they already have: the argument and return value
for malloc. Currently the numbers must be added together,
destroying information, except on hppa (must not add size)
and ia64 (must use what I'm proposing already).

This also provides the opportunity for the kernel (perhaps not
in the initial implementation) to have a bit of extra info about
some processes. The info could be supplied to gdb, used to
harden the system against some types of security exploits,
presented in /proc, and so on.

> That would remove the ifdef, but could be a big change to applications
> on x86 and other architectures.

It's no change at all until somebody decides to use the new
system call. At that point, you're making changes anyway.
It's certainly not a big change compared to eclone() itself.

> | > I don't understand how "making up some numbers (pids) that will work"
> | > is more portable/cleaner than the proposed eclone().
> |
> | It isolates the cross-platform problems to an obscure tool
> | instead of polluting the kernel interface that everybody uses.
>
> Sure, there was talk about using an approach like /proc/<pid>/next_pid
> where you write your target pid into the file and the next time you
> fork() you get that target pid. But it was considered racy and ugly.

Oh, you misunderstood what I meant by making up numbers
and I didn't catch it. I wasn't meaning PID numbers. I was meaning
stack numbers for processes that your strange tool is restarting.

You ignored my long-ago request to use base/size to specify
the stack. My guess was that this was because you're focused
on restarting processes, many of which will lack stack base info.
I thus suggested that you handle this obscure legacy case by
making up some reasonable numbers.

For example, suppose a process allocates 0x40000000 to
0x7fffffff (a 1 GiB chunk) and uses 0x50000000 to 0x5fffffff as
a thread stack. If done using the old clone() syscall on i386,
you're only told that 0x5fffffff is the last stack address. You
know nothing of 0x50000000. Your tool can see the size and
base of the whole mapping though, so 0x40000000...0x5fffffff
is a reasonable place to assume the stack lives. You therefore
call eclone with base=0x40000000 size=0x2000000 when
restarting the process.

For everybody NOT writing an obscure tool to restart processes,
my requested change eliminates #ifdef mess and/or needless
failure to support some architectures.

Right now user code must be like this:

base=malloc(size);
#if defined(__hppa__)
tid=clone(fn,base,flags,arg);
#elif defined(__ia64__)
tid=clone2(fn,base,size,flags,arg);
#else
tid=clone(fn,base+size,flags,arg);
#endif

The man page is likewise messy.

Note that if clone2 were available for all architectures,
we wouldn't have this mess. Let's not perpetuate the
mistakes that led to the mess. Please provide an API
that, like clone2, uses base and size. It'll work for every
architecture. It'll even be less trouble to document.

2010-06-09 18:05:27

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

Albert Cahalan [[email protected]] wrote:
| On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
| <[email protected]> wrote:
| > | Come on, seriously, you know it's ia64 and hppa that
| > | have issues. Maybe the nommu ports also have issues.
| > |
| > | The only portable way to specify the stack is base and offset,
| > | with flags or magic values for "share" and "kernel managed".
| >
| > Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
| > comes in.
| >
| > But are you saying that we should force x86 and other architectures to
| > specify base and offset for eclone() even though they currently specify
| > just the stack pointer to clone() ?
|
| Even for x86, it's an easier API. Callers would be specifying
| two numbers they already have: the argument and return value
| for malloc. Currently the numbers must be added together,
| destroying information, except on hppa (must not add size)
| and ia64 (must use what I'm proposing already).

I agree its easier and would avoid #ifdefs in the applications.

Peter, Arnd, Roland - do you have any concerns with requiring all
architectures to specify the stack to eclone() as [base, offset]

To recap, currently we have

struct clone_args {
u64 clone_flags_high;
/*
* Architectures can use child_stack for either the stack pointer or
* the base of of stack. If child_stack is used as the stack pointer,
* child_stack_size must be 0. Otherwise child_stack_size must be
* set to size of allocated stack.
*/
u64 child_stack;
u64 child_stack_size;
u64 parent_tid_ptr;
u64 child_tid_ptr;
u32 nr_pids;
u32 reserved0;
};

sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
pid_t * __user pids)


Most architecutres would specify the stack pointer in ->child_stack and
ignore the ->child_stack_size.

IA64 specifies the *stack-base* in ->child_stack and the stack size in
->child_stack_size.

Albert and Randy point out that this would require #ifdefs in the
application code that intends to be portable across say IA64 and x86.

Can we instead have all architectures specify [base, size] ?

Thanks

Sukadev

2010-06-09 18:46:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

On 06/09/2010 11:14 AM, Sukadev Bhattiprolu wrote:
> |
> | Even for x86, it's an easier API. Callers would be specifying
> | two numbers they already have: the argument and return value
> | for malloc. Currently the numbers must be added together,
> | destroying information, except on hppa (must not add size)
> | and ia64 (must use what I'm proposing already).
>
> I agree its easier and would avoid #ifdefs in the applications.
>
> Peter, Arnd, Roland - do you have any concerns with requiring all
> architectures to specify the stack to eclone() as [base, offset]
>

Makes sense to me. There might be advantages to be able to track the
size of the "stack allocation" even for other architectures, too.

-hpa

2010-06-09 22:33:09

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

> Peter, Arnd, Roland - do you have any concerns with requiring all
> architectures to specify the stack to eclone() as [base, offset]

I can't see why that would be a problem.
It's consistent with the sigaltstack interface we already have.


Thanks,
Roland

2010-06-10 09:16:00

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone

On Wednesday 09 June 2010, Sukadev Bhattiprolu wrote:
> Albert and Randy point out that this would require #ifdefs in the
> application code that intends to be portable across say IA64 and x86.
>
> Can we instead have all architectures specify [base, size] ?

No objections from me on that.

Arnd