2021-04-27 21:18:51

by H. Peter Anvin

[permalink] [raw]
Subject: pt_regs->ax == -ENOSYS

Trying to stomp out some possible cargo cult programming?

In the process of going through the various entry code paths, I have to
admit to being a bit confused why pt_regs->ax is set to -ENOSYS very
early in the system call path.

What is perhaps even more confusing is:

__visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
{
nr = syscall_enter_from_user_mode(regs, nr);

instrumentation_begin();
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
#ifdef CONFIG_X86_X32_ABI
} else if (likely((nr & __X32_SYSCALL_BIT) &&
(nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
X32_NR_syscalls);
regs->ax = x32_sys_call_table[nr](regs);
#endif
}
instrumentation_end();
syscall_exit_to_user_mode(regs);
}
#endif

Now, unless I'm completely out to sea, it seems to me that if
syscall_enter_from_user_mode() changes the system call number to an
invalid number and pt_regs->ax to !-ENOSYS then the system call will
return a different value(!) depending on if it is out of range for the
table (whatever was poked into pt_regs->ax) or if it corresponds to a
hole in the table. This seems to me at least to be The Wrong Thing.

Calling regs->ax = sys_ni_syscall() in an else clause would arguably be
the right thing here, except possibly in the case where nr (or (int)nr,
see below) == -1 or < 0.

Now, syscall_get_nr() returns the low 32 bits of the system call number
unconditionally. There are places where we look at the sign of this
number, which means that 0xffffffff7fffffff is "positive" and
0x7fffffffffffffff is "negative". We have gone back and forth more than
once on if we should look at %rax or just %eax on a system call... I
have to admit that the current design makes me a bit nervous.

Finally, can anything bad happen in some weird corner case inside one of
the syscall_*_mode() calls or after an interrupt if someone tries to
call syscall(-1) or another negative number?

Food for thought or just my not being up to date?

Thanks,

-hpa


2021-04-27 21:29:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS


> On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <[email protected]> wrote:
>
> Trying to stomp out some possible cargo cult programming?
>
> In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
>

It has to get set to _something_, and copying orig_ax seems perhaps silly. There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.

> What is perhaps even more confusing is:
>
> __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
> {
> nr = syscall_enter_from_user_mode(regs, nr);
>
> instrumentation_begin();
> if (likely(nr < NR_syscalls)) {
> nr = array_index_nospec(nr, NR_syscalls);
> regs->ax = sys_call_table[nr](regs);
> #ifdef CONFIG_X86_X32_ABI
> } else if (likely((nr & __X32_SYSCALL_BIT) &&
> (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
> nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
> X32_NR_syscalls);
> regs->ax = x32_sys_call_table[nr](regs);
> #endif
> }
> instrumentation_end();
> syscall_exit_to_user_mode(regs);
> }
> #endif
>
> Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.

I think you’re right.

>
> Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.

I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?

>
> Now, syscall_get_nr() returns the low 32 bits of the system call number unconditionally. There are places where we look at the sign of this number, which means that 0xffffffff7fffffff is "positive" and 0x7fffffffffffffff is "negative". We have gone back and forth more than once on if we should look at %rax or just %eax on a system call... I have to admit that the current design makes me a bit nervous.
>
> Finally, can anything bad happen in some weird corner case inside one of the syscall_*_mode() calls or after an interrupt if someone tries to call syscall(-1) or another negative number?
>
> Food for thought or just my not being up to date?
>
> Thanks,
>
> -hpa
>

2021-04-27 23:00:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS

On 4/27/21 2:28 PM, Andy Lutomirski wrote:
>
>> On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <[email protected]> wrote:
>>
>> Trying to stomp out some possible cargo cult programming?
>>
>> In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
>>
>
> It has to get set to _something_, and copying orig_ax seems perhaps silly. There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.
>

Yeah. I obviously ran into this working on the common entry-exit code
for FRED; the frame has annoyingly different formats because of this,
and I wanted to avoid slowing down the system call path.

>> What is perhaps even more confusing is:
>>
>> __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
>> {
>> nr = syscall_enter_from_user_mode(regs, nr);
>>
>> instrumentation_begin();
>> if (likely(nr < NR_syscalls)) {
>> nr = array_index_nospec(nr, NR_syscalls);
>> regs->ax = sys_call_table[nr](regs);
>> #ifdef CONFIG_X86_X32_ABI
>> } else if (likely((nr & __X32_SYSCALL_BIT) &&
>> (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
>> nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
>> X32_NR_syscalls);
>> regs->ax = x32_sys_call_table[nr](regs);
>> #endif
>> }
>> instrumentation_end();
>> syscall_exit_to_user_mode(regs);
>> }
>> #endif
>>
>> Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.
>
> I think you’re right.
>
>>
>> Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.
>
> I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?

I'm thinking overall that depending on 64-bit %rax is once again a
mistake; I realize that the assembly code that did that kept breaking
because people messed with it, but we still have:

/*
* Only the low 32 bits of orig_ax are meaningful, so we return int.
* This importantly ignores the high bits on 64-bit, so comparisons
* sign-extend the low 32 bits.
*/
static inline int syscall_get_nr(struct task_struct *task, struct
pt_regs *regs)
{
return regs->orig_ax;
}

"Different interpretation of the same data" is a notorious security
trap. Zero-extending orig_ax would cause different behavior on 32 and 64
bits and differ from the above, so I'm thinking that just once and for
all defining the system call number as a signed int for all the x86 ABIs
would be the sanest.

It still doesn't really answer the question if "movq $-1,%rax; syscall"
or "movl $-1,%eax; syscall" could somehow cause bad things to happen,
though, which makes me a little bit nervous still.

-hpa

2021-04-27 23:25:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS

On Tue, Apr 27, 2021 at 3:58 PM H. Peter Anvin <[email protected]> wrote:
>
> On 4/27/21 2:28 PM, Andy Lutomirski wrote:
> >
> >> On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <[email protected]> wrote:
> >>
> >> Trying to stomp out some possible cargo cult programming?
> >>
> >> In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
> >>
> >
> > It has to get set to _something_, and copying orig_ax seems perhaps silly. There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.
> >
>
> Yeah. I obviously ran into this working on the common entry-exit code
> for FRED; the frame has annoyingly different formats because of this,
> and I wanted to avoid slowing down the system call path.
>
> >> What is perhaps even more confusing is:
> >>
> >> __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
> >> {
> >> nr = syscall_enter_from_user_mode(regs, nr);
> >>
> >> instrumentation_begin();
> >> if (likely(nr < NR_syscalls)) {
> >> nr = array_index_nospec(nr, NR_syscalls);
> >> regs->ax = sys_call_table[nr](regs);
> >> #ifdef CONFIG_X86_X32_ABI
> >> } else if (likely((nr & __X32_SYSCALL_BIT) &&
> >> (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
> >> nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
> >> X32_NR_syscalls);
> >> regs->ax = x32_sys_call_table[nr](regs);
> >> #endif
> >> }
> >> instrumentation_end();
> >> syscall_exit_to_user_mode(regs);
> >> }
> >> #endif
> >>
> >> Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.
> >
> > I think you’re right.
> >
> >>
> >> Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.
> >
> > I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?
>
> I'm thinking overall that depending on 64-bit %rax is once again a
> mistake; I realize that the assembly code that did that kept breaking
> because people messed with it, but we still have:
>
> /*
> * Only the low 32 bits of orig_ax are meaningful, so we return int.
> * This importantly ignores the high bits on 64-bit, so comparisons
> * sign-extend the low 32 bits.
> */
> static inline int syscall_get_nr(struct task_struct *task, struct
> pt_regs *regs)
> {
> return regs->orig_ax;
> }
>
> "Different interpretation of the same data" is a notorious security
> trap. Zero-extending orig_ax would cause different behavior on 32 and 64
> bits and differ from the above, so I'm thinking that just once and for
> all defining the system call number as a signed int for all the x86 ABIs
> would be the sanest.
>
> It still doesn't really answer the question if "movq $-1,%rax; syscall"
> or "movl $-1,%eax; syscall" could somehow cause bad things to happen,
> though, which makes me a little bit nervous still.
>

I much prefer the model of saying that the bits that make sense for
the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
everything else) are all valid. This way there are no weird reserved
bits, no weird ptrace() interactions, etc. I'm a tiny bit concerned
that this would result in a backwards compatibility issue, but not
very. This would involve changing syscall_get_nr(), but that doesn't
seem so bad. The biggest problem is that seccomp hardcoded syscall
nrs to 32 bit.

An alternative would be to declare that we always truncate to 32 bits,
except that 64-bit SYSCALL with high bits set is an error and results
in ENOSYS. The ptrace interaction there is potentially nasty.

Basically, all choices here kind of suck, and I haven't done a real
analysis of all the issues...

--Andy

2021-04-27 23:30:12

by Kees Cook

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS

On Tue, Apr 27, 2021 at 03:58:03PM -0700, H. Peter Anvin wrote:
> On 4/27/21 2:28 PM, Andy Lutomirski wrote:
> >
> > > On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <[email protected]> wrote:
> > >
> > > Trying to stomp out some possible cargo cult programming?
> > >
> > > In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
> > >
> >
> > It has to get set to _something_, and copying orig_ax seems perhaps silly. There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.
> >
>
> Yeah. I obviously ran into this working on the common entry-exit code for
> FRED; the frame has annoyingly different formats because of this, and I
> wanted to avoid slowing down the system call path.
>
> > > What is perhaps even more confusing is:
> > >
> > > __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
> > > {
> > > nr = syscall_enter_from_user_mode(regs, nr);
> > >
> > > instrumentation_begin();
> > > if (likely(nr < NR_syscalls)) {
> > > nr = array_index_nospec(nr, NR_syscalls);
> > > regs->ax = sys_call_table[nr](regs);
> > > #ifdef CONFIG_X86_X32_ABI
> > > } else if (likely((nr & __X32_SYSCALL_BIT) &&
> > > (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
> > > nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
> > > X32_NR_syscalls);
> > > regs->ax = x32_sys_call_table[nr](regs);
> > > #endif
> > > }
> > > instrumentation_end();
> > > syscall_exit_to_user_mode(regs);
> > > }
> > > #endif
> > >
> > > Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.
> >
> > I think you’re right.
> >
> > >
> > > Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.
> >
> > I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?

FWIW, there is some confusion with how syscall_trac_enter() signals the
"skip syscall" condition (-1L), vs actually calling "syscall -1". Right
now they're not really distinguished, and the early ENOSYS is there so that
"pretend it happened" can be implemented (by either ptrace or seccomp).
As in, "set return value to $whatever, and don't run a syscall".

> I'm thinking overall that depending on 64-bit %rax is once again a mistake;
> I realize that the assembly code that did that kept breaking because people
> messed with it, but we still have:
>
> /*
> * Only the low 32 bits of orig_ax are meaningful, so we return int.
> * This importantly ignores the high bits on 64-bit, so comparisons
> * sign-extend the low 32 bits.
> */
> static inline int syscall_get_nr(struct task_struct *task, struct pt_regs
> *regs)
> {
> return regs->orig_ax;
> }
>
> "Different interpretation of the same data" is a notorious security trap.
> Zero-extending orig_ax would cause different behavior on 32 and 64 bits and
> differ from the above, so I'm thinking that just once and for all defining
> the system call number as a signed int for all the x86 ABIs would be the
> sanest.
>
> It still doesn't really answer the question if "movq $-1,%rax; syscall" or
> "movl $-1,%eax; syscall" could somehow cause bad things to happen, though,
> which makes me a little bit nervous still.

I don't think this matters? What's the condition you're worried about
here? The syscall table lookup is going to be safe.

--
Kees Cook

2021-04-27 23:52:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS



> On Apr 27, 2021, at 4:29 PM, Kees Cook <[email protected]> wrote:
>
> On Tue, Apr 27, 2021 at 03:58:03PM -0700, H. Peter Anvin wrote:
>>> On 4/27/21 2:28 PM, Andy Lutomirski wrote:
>>>
>>>> On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <[email protected]> wrote:
>>>>
>>>> Trying to stomp out some possible cargo cult programming?
>>>>
>>>> In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
>>>>
>>>
>>> It has to get set to _something_, and copying orig_ax seems perhaps silly. There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.
>>>
>>
>> Yeah. I obviously ran into this working on the common entry-exit code for
>> FRED; the frame has annoyingly different formats because of this, and I
>> wanted to avoid slowing down the system call path.
>>
>>>> What is perhaps even more confusing is:
>>>>
>>>> __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
>>>> {
>>>> nr = syscall_enter_from_user_mode(regs, nr);
>>>>
>>>> instrumentation_begin();
>>>> if (likely(nr < NR_syscalls)) {
>>>> nr = array_index_nospec(nr, NR_syscalls);
>>>> regs->ax = sys_call_table[nr](regs);
>>>> #ifdef CONFIG_X86_X32_ABI
>>>> } else if (likely((nr & __X32_SYSCALL_BIT) &&
>>>> (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
>>>> nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
>>>> X32_NR_syscalls);
>>>> regs->ax = x32_sys_call_table[nr](regs);
>>>> #endif
>>>> }
>>>> instrumentation_end();
>>>> syscall_exit_to_user_mode(regs);
>>>> }
>>>> #endif
>>>>
>>>> Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.
>>>
>>> I think you’re right.
>>>
>>>>
>>>> Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.
>>>
>>> I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?
>
> FWIW, there is some confusion with how syscall_trac_enter() signals the
> "skip syscall" condition (-1L), vs actually calling "syscall -1".

Fortunately there is not, and never will be, a syscall -1. But I agree that calling max syscall + 1 should behave identically to calling a nonexistent syscall in the middle of the table.

2021-04-28 00:08:10

by H. Peter Anvin

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS

On 4/27/21 4:23 PM, Andy Lutomirski wrote:
>
> I much prefer the model of saying that the bits that make sense for
> the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
> everything else) are all valid. This way there are no weird reserved
> bits, no weird ptrace() interactions, etc. I'm a tiny bit concerned
> that this would result in a backwards compatibility issue, but not
> very. This would involve changing syscall_get_nr(), but that doesn't
> seem so bad. The biggest problem is that seccomp hardcoded syscall
> nrs to 32 bit.
>
> An alternative would be to declare that we always truncate to 32 bits,
> except that 64-bit SYSCALL with high bits set is an error and results
> in ENOSYS. The ptrace interaction there is potentially nasty.
>
> Basically, all choices here kind of suck, and I haven't done a real
> analysis of all the issues...
>

OK, I really don't understand this. The *current* way of doing it causes
a bunch of ugly corner conditions, including in ptrace, which this would
get rid of. It isn't any different than passing any other argument which
is an int -- in fact we have this whole machinery to deal with that subcase.

If it makes you feel better, we could even sign-extend the value in
orig_ax, but that seems unnecessary and a bit broken to me.

-hpa

2021-04-28 00:12:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS

On Tue, Apr 27, 2021 at 5:05 PM H. Peter Anvin <[email protected]> wrote:
>
> On 4/27/21 4:23 PM, Andy Lutomirski wrote:
> >
> > I much prefer the model of saying that the bits that make sense for
> > the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
> > everything else) are all valid. This way there are no weird reserved
> > bits, no weird ptrace() interactions, etc. I'm a tiny bit concerned
> > that this would result in a backwards compatibility issue, but not
> > very. This would involve changing syscall_get_nr(), but that doesn't
> > seem so bad. The biggest problem is that seccomp hardcoded syscall
> > nrs to 32 bit.
> >
> > An alternative would be to declare that we always truncate to 32 bits,
> > except that 64-bit SYSCALL with high bits set is an error and results
> > in ENOSYS. The ptrace interaction there is potentially nasty.
> >
> > Basically, all choices here kind of suck, and I haven't done a real
> > analysis of all the issues...
> >
>
> OK, I really don't understand this. The *current* way of doing it causes
> a bunch of ugly corner conditions, including in ptrace, which this would
> get rid of. It isn't any different than passing any other argument which
> is an int -- in fact we have this whole machinery to deal with that subcase.
>

Let's suppose we decide to truncate the syscall nr. What would the
actual semantics be? Would ptrace see the truncated value in orig_ax?
How about syscall user dispatch? What happens if ptrace writes a
value with high bits set to orig_ax? Do we truncate it again? Or do
we say that ptrace *can't* write too large a value?

For better for worse, RAX is 64 bits, orig_ax is a 64-bit field, and
it currently has nonsensical semantics. Redefining orig_ax as a
32-bit field is surely possible, but doing so cleanly is not
necessarily any easier than any other approach. If it weren't for
seccomp, I would say that the obviously correct answer is to just
treat it everywhere as a 64-bit number.

--Andy

2021-04-28 00:25:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS



On 4/27/21 5:11 PM, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 5:05 PM H. Peter Anvin <[email protected]> wrote:
>>
>> On 4/27/21 4:23 PM, Andy Lutomirski wrote:
>>>
>>> I much prefer the model of saying that the bits that make sense for
>>> the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
>>> everything else) are all valid. This way there are no weird reserved
>>> bits, no weird ptrace() interactions, etc. I'm a tiny bit concerned
>>> that this would result in a backwards compatibility issue, but not
>>> very. This would involve changing syscall_get_nr(), but that doesn't
>>> seem so bad. The biggest problem is that seccomp hardcoded syscall
>>> nrs to 32 bit.
>>>
>>> An alternative would be to declare that we always truncate to 32 bits,
>>> except that 64-bit SYSCALL with high bits set is an error and results
>>> in ENOSYS. The ptrace interaction there is potentially nasty.
>>>
>>> Basically, all choices here kind of suck, and I haven't done a real
>>> analysis of all the issues...
>>>
>>
>> OK, I really don't understand this. The *current* way of doing it causes
>> a bunch of ugly corner conditions, including in ptrace, which this would
>> get rid of. It isn't any different than passing any other argument which
>> is an int -- in fact we have this whole machinery to deal with that subcase.
>>
>
> Let's suppose we decide to truncate the syscall nr. What would the
> actual semantics be? Would ptrace see the truncated value in orig_ax?
> How about syscall user dispatch? What happens if ptrace writes a
> value with high bits set to orig_ax? Do we truncate it again? Or do
> we say that ptrace *can't* write too large a value?
>
> For better for worse, RAX is 64 bits, orig_ax is a 64-bit field, and
> it currently has nonsensical semantics. Redefining orig_ax as a
> 32-bit field is surely possible, but doing so cleanly is not
> necessarily any easier than any other approach. If it weren't for
> seccomp, I would say that the obviously correct answer is to just
> treat it everywhere as a 64-bit number.
>

We *used* to truncate the system call number; that was unsigned. It
causes massive headache to ptrace if a 32-bit ptrace wants to write -1,
which is a bit hacky.

I would personally like to see orig_ax to be the register passed in and
for the truncation to happen by syscall_get_nr().

I also note that kernel/seccomp.c and the tracing infrastructure all
expect a signed int as the system call number. Yes, orig_ax is a 64-bit
field, but so are the other register fields which doesn't necessarily
directly reflect the value of an argument -- like, say, %rdi in the case
of sys_write - it is an int argument so it gets sign extended; this is
*not* reflected in ptrace.

-hpa

2021-04-28 00:48:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS

On 4/27/21 5:20 PM, H. Peter Anvin wrote:
>
> We *used* to truncate the system call number; that was unsigned. It
> causes massive headache to ptrace if a 32-bit ptrace wants to write -1,
> which is a bit hacky.
>
> I would personally like to see orig_ax to be the register passed in and
> for the truncation to happen by syscall_get_nr().
>
> I also note that kernel/seccomp.c and the tracing infrastructure all
> expect a signed int as the system call number. Yes, orig_ax is a 64-bit
> field, but so are the other register fields which doesn't necessarily
> directly reflect the value of an argument -- like, say, %rdi in the case
> of sys_write - it is an int argument so it gets sign extended; this is
> *not* reflected in ptrace.
>
>     -hpa

We could even do this, to make it perhaps harder to mess up:

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 409f661481e1..4e8e5c2e35f4 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -41,7 +41,10 @@ struct pt_regs {
unsigned short gs;
unsigned short __gsh;
/* On interrupt, this is the error code. */
- unsigned long orig_ax;
+ union {
+ unsigned long orig_ax;
+ int syscall_nr;
+ };
unsigned long ip;
unsigned short cs;
unsigned short __csh;
@@ -78,7 +81,10 @@ struct pt_regs {
* On syscall entry, this is syscall#. On CPU exception, this is error
code.
* On hw interrupt, it's IRQ number:
*/
- unsigned long orig_ax;
+ union {
+ unsigned long orig_ax;
+ int syscall_nr;
+ };
/* Return frame for iretq */
unsigned long ip;
unsigned long cs;

2021-04-28 02:07:27

by Kees Cook

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS

On Tue, Apr 27, 2021 at 04:51:06PM -0700, Andy Lutomirski wrote:
> Fortunately there is not, and never will be, a syscall -1. But I
> agree that calling max syscall + 1 should behave identically to calling
> a nonexistent syscall in the middle of the table.

If that happens, we have to separate the meaning of -1L from ptrace,
seccomp, etc. (i.e. we can't just add an "else { result = -ENOSYS; }" to
the syscall table dispatching code, since that'll overwrite any written
return value when the syscall is meant to be skipped with a specific
return value set by ptrace/seccomp.

syscall_trace_enter() will currently return either -1 or the
syscall. Which means someone making a "syscall -1" will get the skip
semantics currently (though the preloaded -ENOSYS results in the
"expected" outcome).

arm64 recently had to untangle this too:

15956689a0e6 arm64: compat: Ensure upper 32 bits of x0 are zero on syscall return
59ee987ea47c arm64: ptrace: Add a comment describing our syscall entry/exit trap ABI
139dbe5d8ed3 arm64: syscall: Expand the comment about ptrace and syscall(-1)
d83ee6e3e75d arm64: ptrace: Use NO_SYSCALL instead of -1 in syscall_trace_enter()

--
Kees Cook

2021-04-28 02:09:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: pt_regs->ax == -ENOSYS

Earlier in the thread the suggestion was to have (int)pt_regs->orig_ax < 0 indicate a nonsyscall.

On April 27, 2021 7:05:56 PM PDT, Kees Cook <[email protected]> wrote:
>On Tue, Apr 27, 2021 at 04:51:06PM -0700, Andy Lutomirski wrote:
>> Fortunately there is not, and never will be, a syscall -1. But I
>> agree that calling max syscall + 1 should behave identically to
>calling
>> a nonexistent syscall in the middle of the table.
>
>If that happens, we have to separate the meaning of -1L from ptrace,
>seccomp, etc. (i.e. we can't just add an "else { result = -ENOSYS; }"
>to
>the syscall table dispatching code, since that'll overwrite any written
>return value when the syscall is meant to be skipped with a specific
>return value set by ptrace/seccomp.
>
>syscall_trace_enter() will currently return either -1 or the
>syscall. Which means someone making a "syscall -1" will get the skip
>semantics currently (though the preloaded -ENOSYS results in the
>"expected" outcome).
>
>arm64 recently had to untangle this too:
>
>15956689a0e6 arm64: compat: Ensure upper 32 bits of x0 are zero on
>syscall return
>59ee987ea47c arm64: ptrace: Add a comment describing our syscall
>entry/exit trap ABI
>139dbe5d8ed3 arm64: syscall: Expand the comment about ptrace and
>syscall(-1)
>d83ee6e3e75d arm64: ptrace: Use NO_SYSCALL instead of -1 in
>syscall_trace_enter()

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.