Hi folks,
I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
else could reproduce it.
The problem occurs with a seccomp-bpf filter program that's set up to return
an errno value -- an errno of 1 is always returned instead of what's in the
filter, plus other oddities (selftest output below).
The problem seems to need a combination of circumstances to occur:
- The seccomp-bpf userspace program needs to be 32-bit, running against a
64-bit kernel -- I'm testing with seccomp_bpf from
tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
- The kernel needs to be running as a VM guest -- it occurs inside my
VMware Fusion host, but not if I run on bare metal. Kees tells me he
cannot repro with a kvm guest though.
Bisecting indicates that the commit that induces the problem is
3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
64-bit logic", included in all the v4.2-rc* candidates.
Apologies if I've just got something odd with my local setup, but the
bisection was unequivocal enough that I thought it worth reporting...
Thanks,
David
seccomp_bpf failure outputs:
seccomp_bpf.c:533:global.ERRNO_valid:Expected 7 (7) ==
(*__errno_location ()) (1)
seccomp_bpf.c:560:global.ERRNO_zero:Expected 0 (0) == read(0, ((void
*)0), 0) (4294967295)
seccomp_bpf.c:587:global.ERRNO_capped:Expected 4095 (4095) ==
(*__errno_location ()) (1)
seccomp_bpf.c:905:precedence.errno_is_third:Expected 0 (0) ==
syscall(20) (4294967295)
seccomp_bpf.c:925:precedence.errno_is_third_in_any_order:Expected 0
(0) == syscall(20) (4294967295)
On 08/13/2015 10:30 AM, David Drysdale wrote:
> Hi folks,
>
> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
> else could reproduce it.
>
> The problem occurs with a seccomp-bpf filter program that's set up to return
> an errno value -- an errno of 1 is always returned instead of what's in the
> filter, plus other oddities (selftest output below).
>
> The problem seems to need a combination of circumstances to occur:
>
> - The seccomp-bpf userspace program needs to be 32-bit, running against a
> 64-bit kernel -- I'm testing with seccomp_bpf from
> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
Does it work correctly when built as 64-bit program?
>
> - The kernel needs to be running as a VM guest -- it occurs inside my
> VMware Fusion host, but not if I run on bare metal. Kees tells me he
> cannot repro with a kvm guest though.
>
> Bisecting indicates that the commit that induces the problem is
> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
> 64-bit logic", included in all the v4.2-rc* candidates.
>
> Apologies if I've just got something odd with my local setup, but the
> bisection was unequivocal enough that I thought it worth reporting...
>
> Thanks,
> David
>
>
> seccomp_bpf failure outputs:
>
> seccomp_bpf.c:533:global.ERRNO_valid:Expected 7 (7) ==
> (*__errno_location ()) (1)
Test source code:
TEST(ERRNO_valid)
{
struct sock_filter filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_read, 0, 1),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | E2BIG),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
.len = (unsigned short)ARRAY_SIZE(filter),
.filter = filter,
};
long ret;
pid_t parent = getppid();
ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
ASSERT_EQ(0, ret);
ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
ASSERT_EQ(0, ret);
EXPECT_EQ(parent, syscall(__NR_getppid));
EXPECT_EQ(-1, read(0, NULL, 0));
EXPECT_EQ(E2BIG, errno);
}
The last EXPECT expects 7 (E2BIG) but sees 1.
I'm trying to see how that happens.
SECCOMP_RET_ERRNO action is processed as follows:
static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd)
{
...
case SECCOMP_RET_ERRNO:
/* Set low-order bits as an errno, capped at MAX_ERRNO. */
if (data > MAX_ERRNO)
data = MAX_ERRNO;
syscall_set_return_value(current, task_pt_regs(current),
-data, 0);
goto skip;
...
skip:
audit_seccomp(this_syscall, 0, action);
return SECCOMP_PHASE1_SKIP; // "the syscall should not be invoked"
}
The above is called from:
unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
{
...
if (work & _TIF_SECCOMP) {
... ret = seccomp_phase1(&sd);
if (ret == SECCOMP_PHASE1_SKIP) {
regs->orig_ax = -1;
ret = 0;
}
...
}
/* Do our best to finish without phase 2. */
if (work == 0)
return ret; /* seccomp and/or nohz only (ret == 0 here) */
#ifdef CONFIG_AUDITSYSCALL
if (work == _TIF_SYSCALL_AUDIT) {
/*
* If there is no more work to be done except auditing,
* then audit in phase 1. Phase 2 always audits, so, if
* we audit here, then we can't go on to phase 2.
*/
do_audit_syscall_entry(regs, arch);
return 0;
}
#endif
return 1; /* Something is enabled that we can't handle in phase 1 */
}
...
long syscall_trace_enter(struct pt_regs *regs)
{
u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
if (phase1_result == 0)
return regs->orig_ax;
else
return syscall_trace_enter_phase2(regs, arch, phase1_result);
}
End result should be:
pt_regs->ax = -E2BIG (via syscall_set_return_value())
pt_regs->orig_ax = -1 ("skip syscall")
and syscall_trace_enter_phase1() usually returns with 0,
meaning "re-execute syscall at once, no phase2 needed".
This, in turn, is called from .S files, and when it returns there,
execution loops back to syscall dispatch.
Because of orig_ax = -1, syscall dispatch should skip calling syscall.
So -E2BIG should survive and be returned...
On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <[email protected]> wrote:
> On 08/13/2015 10:30 AM, David Drysdale wrote:
>> Hi folks,
>>
>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>> else could reproduce it.
>>
>> The problem occurs with a seccomp-bpf filter program that's set up to return
>> an errno value -- an errno of 1 is always returned instead of what's in the
>> filter, plus other oddities (selftest output below).
>>
>> The problem seems to need a combination of circumstances to occur:
>>
>> - The seccomp-bpf userspace program needs to be 32-bit, running against a
>> 64-bit kernel -- I'm testing with seccomp_bpf from
>> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>
> Does it work correctly when built as 64-bit program?
Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>
>> - The kernel needs to be running as a VM guest -- it occurs inside my
>> VMware Fusion host, but not if I run on bare metal. Kees tells me he
>> cannot repro with a kvm guest though.
>>
>> Bisecting indicates that the commit that induces the problem is
>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>> 64-bit logic", included in all the v4.2-rc* candidates.
>>
>> Apologies if I've just got something odd with my local setup, but the
>> bisection was unequivocal enough that I thought it worth reporting...
>>
>> Thanks,
>> David
>>
>>
>> seccomp_bpf failure outputs:
[snip]
> End result should be:
> pt_regs->ax = -E2BIG (via syscall_set_return_value())
> pt_regs->orig_ax = -1 ("skip syscall")
> and syscall_trace_enter_phase1() usually returns with 0,
> meaning "re-execute syscall at once, no phase2 needed".
>
> This, in turn, is called from .S files, and when it returns there,
> execution loops back to syscall dispatch.
>
> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
> So -E2BIG should survive and be returned...
So I was just about to send:
That makes sense, and given that exactly the same 32-bit binary
runs fine on a different machine, there's presumably something up
with my local setup. The failing machine is a VMware guest, but
maybe that's not the relevant interaction -- particularly if no-one
else can repro.
But then I noticed some odd audit entries in the main log:
Aug 13 16:52:56 ubuntu kernel: [ 20.687249] audit: type=1326
audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
compat=1 ip=0xf773cc90 code=0x0
Aug 13 16:52:56 ubuntu kernel: [ 20.691157] audit: type=1326
audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
compat=1 ip=0xf773cc90 code=0x10000000
...
I didn't think I had any audit stuff turned on, and indeed:
# auditctl -l
No rules
But as soon as I'd run that auditctl command, the 32-bit
seccomp_bpf binary started running fine!
So now I'm confused, and I can no longer reproduce the
problem. Which probably means this was a false alarm, in
which case, my apologies.
D.
On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <[email protected]> wrote:
> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <[email protected]> wrote:
>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>> Hi folks,
>>>
>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>> else could reproduce it.
>>>
>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>> filter, plus other oddities (selftest output below).
>>>
>>> The problem seems to need a combination of circumstances to occur:
>>>
>>> - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>> 64-bit kernel -- I'm testing with seccomp_bpf from
>>> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>
>> Does it work correctly when built as 64-bit program?
>
> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>
>>>
>>> - The kernel needs to be running as a VM guest -- it occurs inside my
>>> VMware Fusion host, but not if I run on bare metal. Kees tells me he
>>> cannot repro with a kvm guest though.
>>>
>>> Bisecting indicates that the commit that induces the problem is
>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>
>>> Apologies if I've just got something odd with my local setup, but the
>>> bisection was unequivocal enough that I thought it worth reporting...
>>>
>>> Thanks,
>>> David
>>>
>>>
>>> seccomp_bpf failure outputs:
>
> [snip]
>
>> End result should be:
>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>> pt_regs->orig_ax = -1 ("skip syscall")
>> and syscall_trace_enter_phase1() usually returns with 0,
>> meaning "re-execute syscall at once, no phase2 needed".
>>
>> This, in turn, is called from .S files, and when it returns there,
>> execution loops back to syscall dispatch.
>>
>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>> So -E2BIG should survive and be returned...
>
> So I was just about to send:
>
> That makes sense, and given that exactly the same 32-bit binary
> runs fine on a different machine, there's presumably something up
> with my local setup. The failing machine is a VMware guest, but
> maybe that's not the relevant interaction -- particularly if no-one
> else can repro.
>
> But then I noticed some odd audit entries in the main log:
>
> Aug 13 16:52:56 ubuntu kernel: [ 20.687249] audit: type=1326
> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
> compat=1 ip=0xf773cc90 code=0x0
> Aug 13 16:52:56 ubuntu kernel: [ 20.691157] audit: type=1326
> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
> compat=1 ip=0xf773cc90 code=0x10000000
> ...
>
> I didn't think I had any audit stuff turned on, and indeed:
> # auditctl -l
> No rules
>
> But as soon as I'd run that auditctl command, the 32-bit
> seccomp_bpf binary started running fine!
>
> So now I'm confused, and I can no longer reproduce the
> problem. Which probably means this was a false alarm, in
> which case, my apologies.
You might have triggered TIF_AUDIT or whatever it's called, which
causes a whole different path through the asm tangle, so you might
really have a problem.
Try auditctl -a task,never. If that doesn't change anything, try
rebooting the guest.
--Andy
>
> D.
--
Andy Lutomirski
AMA Capital Management, LLC
On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <[email protected]> wrote:
>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <[email protected]> wrote:
>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>> Hi folks,
>>>>
>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>> else could reproduce it.
>>>>
>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>> filter, plus other oddities (selftest output below).
>>>>
>>>> The problem seems to need a combination of circumstances to occur:
>>>>
>>>> - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>> 64-bit kernel -- I'm testing with seccomp_bpf from
>>>> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>
>>> Does it work correctly when built as 64-bit program?
>>
>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>
>>>>
>>>> - The kernel needs to be running as a VM guest -- it occurs inside my
>>>> VMware Fusion host, but not if I run on bare metal. Kees tells me he
>>>> cannot repro with a kvm guest though.
>>>>
>>>> Bisecting indicates that the commit that induces the problem is
>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>
>>>> Apologies if I've just got something odd with my local setup, but the
>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>> seccomp_bpf failure outputs:
>>
>> [snip]
>>
>>> End result should be:
>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>> pt_regs->orig_ax = -1 ("skip syscall")
>>> and syscall_trace_enter_phase1() usually returns with 0,
>>> meaning "re-execute syscall at once, no phase2 needed".
>>>
>>> This, in turn, is called from .S files, and when it returns there,
>>> execution loops back to syscall dispatch.
>>>
>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>> So -E2BIG should survive and be returned...
>>
>> So I was just about to send:
>>
>> That makes sense, and given that exactly the same 32-bit binary
>> runs fine on a different machine, there's presumably something up
>> with my local setup. The failing machine is a VMware guest, but
>> maybe that's not the relevant interaction -- particularly if no-one
>> else can repro.
>>
>> But then I noticed some odd audit entries in the main log:
>>
>> Aug 13 16:52:56 ubuntu kernel: [ 20.687249] audit: type=1326
>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>> compat=1 ip=0xf773cc90 code=0x0
>> Aug 13 16:52:56 ubuntu kernel: [ 20.691157] audit: type=1326
>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>> compat=1 ip=0xf773cc90 code=0x10000000
>> ...
>>
>> I didn't think I had any audit stuff turned on, and indeed:
>> # auditctl -l
>> No rules
>>
>> But as soon as I'd run that auditctl command, the 32-bit
>> seccomp_bpf binary started running fine!
>>
>> So now I'm confused, and I can no longer reproduce the
>> problem. Which probably means this was a false alarm, in
>> which case, my apologies.
>
> You might have triggered TIF_AUDIT or whatever it's called, which
> causes a whole different path through the asm tangle, so you might
> really have a problem.
>
> Try auditctl -a task,never. If that doesn't change anything, try
> rebooting the guest.
Aha, that seems to re-instate the problem -- with that auditctl setup
I get the 32-bit seccomp failures on two different machines (one VM,
one bare). So can anyone else repro?
I guess the relevant steps are thus:
- sudo auditctl -a task,never
- cd tools/testing/selftests/seccomp
- CFLAGS=-m32 make clean run_tests
On Thu, Aug 13, 2015 at 10:39 AM, David Drysdale <[email protected]> wrote:
> On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <[email protected]> wrote:
>> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <[email protected]> wrote:
>>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <[email protected]> wrote:
>>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>>> Hi folks,
>>>>>
>>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>>> else could reproduce it.
>>>>>
>>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>>> filter, plus other oddities (selftest output below).
>>>>>
>>>>> The problem seems to need a combination of circumstances to occur:
>>>>>
>>>>> - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>> 64-bit kernel -- I'm testing with seccomp_bpf from
>>>>> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>>
>>>> Does it work correctly when built as 64-bit program?
>>>
>>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>>
>>>>>
>>>>> - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>> VMware Fusion host, but not if I run on bare metal. Kees tells me he
>>>>> cannot repro with a kvm guest though.
>>>>>
>>>>> Bisecting indicates that the commit that induces the problem is
>>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>>
>>>>> Apologies if I've just got something odd with my local setup, but the
>>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
>>>>> seccomp_bpf failure outputs:
>>>
>>> [snip]
>>>
>>>> End result should be:
>>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>>> pt_regs->orig_ax = -1 ("skip syscall")
>>>> and syscall_trace_enter_phase1() usually returns with 0,
>>>> meaning "re-execute syscall at once, no phase2 needed".
>>>>
>>>> This, in turn, is called from .S files, and when it returns there,
>>>> execution loops back to syscall dispatch.
>>>>
>>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>>> So -E2BIG should survive and be returned...
>>>
>>> So I was just about to send:
>>>
>>> That makes sense, and given that exactly the same 32-bit binary
>>> runs fine on a different machine, there's presumably something up
>>> with my local setup. The failing machine is a VMware guest, but
>>> maybe that's not the relevant interaction -- particularly if no-one
>>> else can repro.
>>>
>>> But then I noticed some odd audit entries in the main log:
>>>
>>> Aug 13 16:52:56 ubuntu kernel: [ 20.687249] audit: type=1326
>>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>>> compat=1 ip=0xf773cc90 code=0x0
>>> Aug 13 16:52:56 ubuntu kernel: [ 20.691157] audit: type=1326
>>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>>> compat=1 ip=0xf773cc90 code=0x10000000
>>> ...
>>>
>>> I didn't think I had any audit stuff turned on, and indeed:
>>> # auditctl -l
>>> No rules
>>>
>>> But as soon as I'd run that auditctl command, the 32-bit
>>> seccomp_bpf binary started running fine!
>>>
>>> So now I'm confused, and I can no longer reproduce the
>>> problem. Which probably means this was a false alarm, in
>>> which case, my apologies.
>>
>> You might have triggered TIF_AUDIT or whatever it's called, which
>> causes a whole different path through the asm tangle, so you might
>> really have a problem.
>>
>> Try auditctl -a task,never. If that doesn't change anything, try
>> rebooting the guest.
>
> Aha, that seems to re-instate the problem -- with that auditctl setup
> I get the 32-bit seccomp failures on two different machines (one VM,
> one bare). So can anyone else repro?
>
> I guess the relevant steps are thus:
> - sudo auditctl -a task,never
> - cd tools/testing/selftests/seccomp
> - CFLAGS=-m32 make clean run_tests
That was it! I can reproduce this now on kvm (after adding the auditctl rule).
-Kees
--
Kees Cook
Chrome OS Security
On 08/13/2015 08:47 PM, Kees Cook wrote:
> On Thu, Aug 13, 2015 at 10:39 AM, David Drysdale <[email protected]> wrote:
>> On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <[email protected]> wrote:
>>>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <[email protected]> wrote:
>>>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>>>> Hi folks,
>>>>>>
>>>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>>>> else could reproduce it.
>>>>>>
>>>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>>>> filter, plus other oddities (selftest output below).
>>>>>>
>>>>>> The problem seems to need a combination of circumstances to occur:
>>>>>>
>>>>>> - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>>> 64-bit kernel -- I'm testing with seccomp_bpf from
>>>>>> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>>>
>>>>> Does it work correctly when built as 64-bit program?
>>>>
>>>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>>>
>>>>>>
>>>>>> - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>>> VMware Fusion host, but not if I run on bare metal. Kees tells me he
>>>>>> cannot repro with a kvm guest though.
>>>>>>
>>>>>> Bisecting indicates that the commit that induces the problem is
>>>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>>>
>>>>>> Apologies if I've just got something odd with my local setup, but the
>>>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>> seccomp_bpf failure outputs:
>>>>
>>>> [snip]
>>>>
>>>>> End result should be:
>>>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>>>> pt_regs->orig_ax = -1 ("skip syscall")
>>>>> and syscall_trace_enter_phase1() usually returns with 0,
>>>>> meaning "re-execute syscall at once, no phase2 needed".
>>>>>
>>>>> This, in turn, is called from .S files, and when it returns there,
>>>>> execution loops back to syscall dispatch.
>>>>>
>>>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>>>> So -E2BIG should survive and be returned...
>>>>
>>>> So I was just about to send:
>>>>
>>>> That makes sense, and given that exactly the same 32-bit binary
>>>> runs fine on a different machine, there's presumably something up
>>>> with my local setup. The failing machine is a VMware guest, but
>>>> maybe that's not the relevant interaction -- particularly if no-one
>>>> else can repro.
>>>>
>>>> But then I noticed some odd audit entries in the main log:
>>>>
>>>> Aug 13 16:52:56 ubuntu kernel: [ 20.687249] audit: type=1326
>>>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>>>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>>>> compat=1 ip=0xf773cc90 code=0x0
>>>> Aug 13 16:52:56 ubuntu kernel: [ 20.691157] audit: type=1326
>>>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>>>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>>>> compat=1 ip=0xf773cc90 code=0x10000000
>>>> ...
>>>>
>>>> I didn't think I had any audit stuff turned on, and indeed:
>>>> # auditctl -l
>>>> No rules
>>>>
>>>> But as soon as I'd run that auditctl command, the 32-bit
>>>> seccomp_bpf binary started running fine!
>>>>
>>>> So now I'm confused, and I can no longer reproduce the
>>>> problem. Which probably means this was a false alarm, in
>>>> which case, my apologies.
>>>
>>> You might have triggered TIF_AUDIT or whatever it's called, which
>>> causes a whole different path through the asm tangle, so you might
>>> really have a problem.
>>>
>>> Try auditctl -a task,never. If that doesn't change anything, try
>>> rebooting the guest.
>>
>> Aha, that seems to re-instate the problem -- with that auditctl setup
>> I get the 32-bit seccomp failures on two different machines (one VM,
>> one bare). So can anyone else repro?
>>
>> I guess the relevant steps are thus:
>> - sudo auditctl -a task,never
>> - cd tools/testing/selftests/seccomp
>> - CFLAGS=-m32 make clean run_tests
>
> That was it! I can reproduce this now on kvm (after adding the auditctl rule).
I suspect this change:
.macro auditsys_entry_common
...
movl %ebx,%esi /* 2nd arg: 1st syscall arg */
movl %eax,%edi /* 1st arg: syscall number */
call __audit_syscall_entry
- movl RAX(%rsp),%eax /* reload syscall number */
- cmpq $(IA32_NR_syscalls-1),%rax
- ja ia32_badsys
+ movl ORIG_RAX(%rsp),%eax /* reload syscall number */
movl %ebx,%edi /* reload 1st syscall arg */
movl RCX(%rsp),%esi /* reload 2nd syscall arg */
movl RDX(%rsp),%edx /* reload 3rd syscall arg */
We were reloading syscall# from pt_regs->ax.
After the patch, pt_regs->ax isn't equal to syscall# on entry,
instead it contains -ENOSYS. Therefore the change shown above
was made, to reload it from pt_regs->orig_ax.
Well. This still should work... in fact it is "more correct"
than it was before...
64-bit code has no call to __audit_syscall_entry, it uses
syscall_trace_enter_phase1/phase2 mechanism instead of
"only audit" shortcut. If the bug is here (though I don't see it),
it explains why 64-bit binary works.
Now, how do we reach this bit of code?
ia32_sysenter_target:
...
testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz sysenter_tracesys
...
sysenter_tracesys:
testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jz sysenter_auditsys
...
sysenter_auditsys:
auditsys_entry_common <== OUR MACRO
movl %ebp,%r9d /* reload 6th syscall arg */
jmp sysenter_dispatch
ia32_cstar_target:
...
testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz cstar_tracesys
...
cstar_tracesys:
testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jz cstar_auditsys
...
cstar_auditsys:
movl %r9d,R9(%rsp) /* register to be clobbered by call */
auditsys_entry_common <== OUR MACRO
movl R9(%rsp),%r9d /* reload 6th syscall arg */
jmp cstar_dispatch
On Thu, Aug 13, 2015 at 2:35 PM, Denys Vlasenko <[email protected]> wrote:
> On 08/13/2015 08:47 PM, Kees Cook wrote:
>> On Thu, Aug 13, 2015 at 10:39 AM, David Drysdale <[email protected]> wrote:
>>> On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <[email protected]> wrote:
>>>>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <[email protected]> wrote:
>>>>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>>>>> else could reproduce it.
>>>>>>>
>>>>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>>>>> filter, plus other oddities (selftest output below).
>>>>>>>
>>>>>>> The problem seems to need a combination of circumstances to occur:
>>>>>>>
>>>>>>> - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>>>> 64-bit kernel -- I'm testing with seccomp_bpf from
>>>>>>> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>>>>
>>>>>> Does it work correctly when built as 64-bit program?
>>>>>
>>>>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>>>>
>>>>>>>
>>>>>>> - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>>>> VMware Fusion host, but not if I run on bare metal. Kees tells me he
>>>>>>> cannot repro with a kvm guest though.
>>>>>>>
>>>>>>> Bisecting indicates that the commit that induces the problem is
>>>>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>>>>
>>>>>>> Apologies if I've just got something odd with my local setup, but the
>>>>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> seccomp_bpf failure outputs:
>>>>>
>>>>> [snip]
>>>>>
>>>>>> End result should be:
>>>>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>>>>> pt_regs->orig_ax = -1 ("skip syscall")
>>>>>> and syscall_trace_enter_phase1() usually returns with 0,
>>>>>> meaning "re-execute syscall at once, no phase2 needed".
>>>>>>
>>>>>> This, in turn, is called from .S files, and when it returns there,
>>>>>> execution loops back to syscall dispatch.
>>>>>>
>>>>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>>>>> So -E2BIG should survive and be returned...
>>>>>
>>>>> So I was just about to send:
>>>>>
>>>>> That makes sense, and given that exactly the same 32-bit binary
>>>>> runs fine on a different machine, there's presumably something up
>>>>> with my local setup. The failing machine is a VMware guest, but
>>>>> maybe that's not the relevant interaction -- particularly if no-one
>>>>> else can repro.
>>>>>
>>>>> But then I noticed some odd audit entries in the main log:
>>>>>
>>>>> Aug 13 16:52:56 ubuntu kernel: [ 20.687249] audit: type=1326
>>>>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>>>>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>>>>> compat=1 ip=0xf773cc90 code=0x0
>>>>> Aug 13 16:52:56 ubuntu kernel: [ 20.691157] audit: type=1326
>>>>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>>>>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>>>>> compat=1 ip=0xf773cc90 code=0x10000000
>>>>> ...
>>>>>
>>>>> I didn't think I had any audit stuff turned on, and indeed:
>>>>> # auditctl -l
>>>>> No rules
>>>>>
>>>>> But as soon as I'd run that auditctl command, the 32-bit
>>>>> seccomp_bpf binary started running fine!
>>>>>
>>>>> So now I'm confused, and I can no longer reproduce the
>>>>> problem. Which probably means this was a false alarm, in
>>>>> which case, my apologies.
>>>>
>>>> You might have triggered TIF_AUDIT or whatever it's called, which
>>>> causes a whole different path through the asm tangle, so you might
>>>> really have a problem.
>>>>
>>>> Try auditctl -a task,never. If that doesn't change anything, try
>>>> rebooting the guest.
>>>
>>> Aha, that seems to re-instate the problem -- with that auditctl setup
>>> I get the 32-bit seccomp failures on two different machines (one VM,
>>> one bare). So can anyone else repro?
>>>
>>> I guess the relevant steps are thus:
>>> - sudo auditctl -a task,never
>>> - cd tools/testing/selftests/seccomp
>>> - CFLAGS=-m32 make clean run_tests
>>
>> That was it! I can reproduce this now on kvm (after adding the auditctl rule).
>
> I suspect this change:
>
> .macro auditsys_entry_common
> ...
> movl %ebx,%esi /* 2nd arg: 1st syscall arg */
> movl %eax,%edi /* 1st arg: syscall number */
> call __audit_syscall_entry
> - movl RAX(%rsp),%eax /* reload syscall number */
> - cmpq $(IA32_NR_syscalls-1),%rax
> - ja ia32_badsys
> + movl ORIG_RAX(%rsp),%eax /* reload syscall number */
> movl %ebx,%edi /* reload 1st syscall arg */
> movl RCX(%rsp),%esi /* reload 2nd syscall arg */
> movl RDX(%rsp),%edx /* reload 3rd syscall arg */
>
> We were reloading syscall# from pt_regs->ax.
I am so glad that this code is gone in -tip. Good riddance!
>
> After the patch, pt_regs->ax isn't equal to syscall# on entry,
> instead it contains -ENOSYS. Therefore the change shown above
> was made, to reload it from pt_regs->orig_ax.
>
> Well. This still should work... in fact it is "more correct"
> than it was before...
>
> 64-bit code has no call to __audit_syscall_entry, it uses
> syscall_trace_enter_phase1/phase2 mechanism instead of
> "only audit" shortcut. If the bug is here (though I don't see it),
> it explains why 64-bit binary works.
>
>
> Now, how do we reach this bit of code?
>
> ia32_sysenter_target:
> ...
> testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jnz sysenter_tracesys
> ...
> sysenter_tracesys:
> testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jz sysenter_auditsys
> ...
> sysenter_auditsys:
> auditsys_entry_common <== OUR MACRO
> movl %ebp,%r9d /* reload 6th syscall arg */
> jmp sysenter_dispatch
>
>
> ia32_cstar_target:
> ...
> testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jnz cstar_tracesys
> ...
> cstar_tracesys:
> testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jz cstar_auditsys
> ...
> cstar_auditsys:
> movl %r9d,R9(%rsp) /* register to be clobbered by call */
> auditsys_entry_common <== OUR MACRO
> movl R9(%rsp),%r9d /* reload 6th syscall arg */
> jmp cstar_dispatch
>
TIF_SECCOMP had better be set, so that code should be unreachable.
syscall_trace_enter_phase1 returns 0 if we hit SECCOMP_RET_ERRNO (i.e.
SECCOMP_PHASE1_SKIP). syscall_trace_enter sees that and returns
regs->orig_ax, which is -1.
It seems to me that the bug is that sysexit_from_sys_call isn't
reloading RAX from regs->ax.
--Andy
On Thu, Aug 13, 2015 at 2:35 PM, Denys Vlasenko <[email protected]> wrote:
>
> I suspect this change:
>
> .macro auditsys_entry_common
> ...
> movl %ebx,%esi /* 2nd arg: 1st syscall arg */
> movl %eax,%edi /* 1st arg: syscall number */
> call __audit_syscall_entry
> - movl RAX(%rsp),%eax /* reload syscall number */
> - cmpq $(IA32_NR_syscalls-1),%rax
> - ja ia32_badsys
> + movl ORIG_RAX(%rsp),%eax /* reload syscall number */
>
> We were reloading syscall# from pt_regs->ax.
>
> After the patch, pt_regs->ax isn't equal to syscall# on entry,
> instead it contains -ENOSYS. Therefore the change shown above
> was made, to reload it from pt_regs->orig_ax.
>
> Well. This still should work... in fact it is "more correct"
> than it was before...
Well, since it doesn't work, that's clearly not the case.
Also, you do realize that ORIG_RAX can get changed by signal handling
and ptrace?
In fact, I think that whole "save -ENOSYS to pt_regs->ax" is BS.
Exactly because we use pt_regs->ax for ptrace etc, and you've changed
the register state we expose.
So, considering that we're in -rc6, and this has been bisected, I
would normally just revert the commit with extreme prejudice. However,
in this case it's unnecessarily painful, just because there's been a
ton of pointless churn in this area (cfi annotations, renaming, no end
of crap.
I'm also going to be a *lot* less inclined to take all these idiotic
low-level x86 changes from now on. It's been a total pain, for very
little gain. These "cleanups" have been buggy as hell, and test
coverage for the compat case is clearly lacking.
Denys, this was not the first "obvious cleanup" of yours that broke
things subtly, and where your reaction to the problem report was "that
cannot happen".
Linus
On Thu, Aug 13, 2015 at 2:47 PM, Andy Lutomirski <[email protected]> wrote:
>
> It seems to me that the bug is that sysexit_from_sys_call isn't
> reloading RAX from regs->ax.
Ugh. That code is confusing, and _most_ cases seem to have %rax already loaded.
There seems to be three cases:
- fallthrough from cstar_dispatch after a successful call to the system call
This has %rax as the correct return value (which also got saved to
RAX on stack)
- the 'auditsys_exit' macro 'exit' case.
This seems to have %rax reloaded inside the macro
- the out-of-range system call case for cstar_dispatch
This does *not* seem to load %rax with $ENOSYS, but keeps the bad
system call number in it.
so yeah, there seems to be a bug there, but if I read it right, that
bug seems to happen just for the out-of-range system call case, which
afaik isn't the case reported here.
I guess adding a re-load of %rax is ok, even though in the common
cases it is already loaded.
Oh, and sysexit_from_sys_call seems to have the exact same situation.
The "system call dispatch with %rax out of range" fallthrough case
doesn't set %rax to ENOSYS.
So I guess we could remove the reloading of system call return value
from auditsys_exit, and just do it unconditionally in the common path.
Which is sad, since the *really* common case already has the right
value, but whatever.
Does the attached patch make sense and work? Totally untested, just
looking at the code. But maybe it's right, because it's exactly that
ENOSYS case that the bad patch in question changed.
Btw, the old ENOSYS code also cleared ORIG_EAX. I'm not sure why, but
we used to have
ia32_badsys:
movq $0,ORIG_RAX(%rsp)
movq $-ENOSYS,%rax
jmp ia32_sysret
for that case.
Linus
On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
<[email protected]> wrote:
>
> Does the attached patch make sense and work?
Btw, I'm not all that happy with it anyway.
I still think Denys' patch also potentially changed what audit and
strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
is a good change.
But maybe that three-liner patch fixes the immediate problem that
David sees. David?
Linus
On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Does the attached patch make sense and work?
>
> Btw, I'm not all that happy with it anyway.
>
> I still think Denys' patch also potentially changed what audit and
> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
> is a good change.
>
> But maybe that three-liner patch fixes the immediate problem that
> David sees. David?
Your patch fixes it for me. The seccomp compat selftests pass again
with audit enabled.
-Kees
--
Kees Cook
Chrome OS Security
On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Does the attached patch make sense and work?
>
> Btw, I'm not all that happy with it anyway.
>
> I still think Denys' patch also potentially changed what audit and
> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
> is a good change.
For better for for worse, the native 64-bit path changed several
versions agi, and nothing broke that I'm aware of. The change was:
commit 54eea9957f5763dd1a2555d7e4cb53b4dd389cc6
Author: Andy Lutomirski <[email protected]>
Date: Fri Sep 5 15:13:55 2014 -0700
x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls
AFAIK, ptrace has always seen ax == -ENOSYS on syscall entry for
native 64-bit syscalls. My change just simplified the fast path
(which is invisible by ptrace for obvious reasons, unless someone
traces fork or something along those lines *without*) and made it less
different from the slow path. (IIRC it also simplified some stuff
down the road.)
Looking at 3.19's ia32entry.S, it has:
sysenter_tracesys:
#ifdef CONFIG_AUDITSYSCALL
testl $(_TIF_WORK_SYSCALL_ENTRY &
~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
jz sysenter_auditsys
#endif
SAVE_REST
CLEAR_RREGS
movq $-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall */
So I think it's always been the intent and practice that ptracers
would see ax == -ENOSYS on syscall entry.
IOW, whether this is good or bad, I don't think it's really a change.
--Andy
On Thu, Aug 13, 2015 at 3:56 PM, Kees Cook <[email protected]> wrote:
> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> Does the attached patch make sense and work?
>>
>> Btw, I'm not all that happy with it anyway.
>>
>> I still think Denys' patch also potentially changed what audit and
>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>> is a good change.
>>
>> But maybe that three-liner patch fixes the immediate problem that
>> David sees. David?
>
> Your patch fixes it for me. The seccomp compat selftests pass again
> with audit enabled.
Kees, would it be straightforward to rig up the seccomp tests to
automatically test compat? The x86 selftests automatically test both
native and compat, and that might be usable as a model. I did that
because it's extremely easy to regress one and not the other.
--Andy
On Thu, Aug 13, 2015 at 3:59 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Aug 13, 2015 at 3:56 PM, Kees Cook <[email protected]> wrote:
>> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>>> <[email protected]> wrote:
>>>>
>>>> Does the attached patch make sense and work?
>>>
>>> Btw, I'm not all that happy with it anyway.
>>>
>>> I still think Denys' patch also potentially changed what audit and
>>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>>> is a good change.
>>>
>>> But maybe that three-liner patch fixes the immediate problem that
>>> David sees. David?
>>
>> Your patch fixes it for me. The seccomp compat selftests pass again
>> with audit enabled.
>
> Kees, would it be straightforward to rig up the seccomp tests to
> automatically test compat? The x86 selftests automatically test both
> native and compat, and that might be usable as a model. I did that
> because it's extremely easy to regress one and not the other.
Yeah, I'll figure out how to get this working sanely. There are some
ugly behaviors on arm64 doing compat that seccomp found too, so I'll
need those targets for more than just x86.
-Kees
--
Kees Cook
Chrome OS Security
On Thu, Aug 13, 2015 at 3:58 PM, Andy Lutomirski <[email protected]> wrote:
>
> For better for for worse, the native 64-bit path changed several
> versions agi, and nothing broke that I'm aware of.
Ok, I'll take that as an ACK for the patch, and apply it, since it
seems to fix this regression.
Linus
On Thu, Aug 13, 2015 at 3:59 PM, Andy Lutomirski <[email protected]> wrote:
>
> Kees, would it be straightforward to rig up the seccomp tests to
> automatically test compat? The x86 selftests automatically test both
> native and compat, and that might be usable as a model. I did that
> because it's extremely easy to regress one and not the other.
Note that in this case, the bug was actually _hidden_ by audit (since
the audit path would end up reloading %rax, and is why doing "auditctl
-a task,never" actually enabled people to see it), so it would also be
good to try to make sure that the tests would try both with and
without audit involved too.
I'm very tired of these bugs, but I guess and hope that your patches
to move as much as possible of this to C will actually end up helping
in the long run. So while I'm not really looking forward to even
_more_ patches to the low-level asm, at least the C rewrite seems more
worthwhile than some of the noise that made this all so painful has
felt.
Linus
On Thu, Aug 13, 2015 at 11:56 PM, Kees Cook <[email protected]> wrote:
> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> Does the attached patch make sense and work?
>>
>> Btw, I'm not all that happy with it anyway.
>>
>> I still think Denys' patch also potentially changed what audit and
>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>> is a good change.
>>
>> But maybe that three-liner patch fixes the immediate problem that
>> David sees. David?
>
> Your patch fixes it for me. The seccomp compat selftests pass again
> with audit enabled.
>
> -Kees
Yep, that patch fixes it for me too.
On 08/14/2015 12:27 AM, Linus Torvalds wrote:
> On Thu, Aug 13, 2015 at 2:35 PM, Denys Vlasenko <[email protected]> wrote:
>>
>> I suspect this change:
>>
>> .macro auditsys_entry_common
>> ...
>> movl %ebx,%esi /* 2nd arg: 1st syscall arg */
>> movl %eax,%edi /* 1st arg: syscall number */
>> call __audit_syscall_entry
>> - movl RAX(%rsp),%eax /* reload syscall number */
>> - cmpq $(IA32_NR_syscalls-1),%rax
>> - ja ia32_badsys
>> + movl ORIG_RAX(%rsp),%eax /* reload syscall number */
>>
>> We were reloading syscall# from pt_regs->ax.
>>
>> After the patch, pt_regs->ax isn't equal to syscall# on entry,
>> instead it contains -ENOSYS. Therefore the change shown above
>> was made, to reload it from pt_regs->orig_ax.
>>
>> Well. This still should work... in fact it is "more correct"
>> than it was before...
>
> Well, since it doesn't work, that's clearly not the case.
>
> Also, you do realize that ORIG_RAX can get changed by signal handling
> and ptrace?
I am very aware of that, yes. If it changes, we should use *it*.
That's why new code in this part is "more correct" than old one.
> In fact, I think that whole "save -ENOSYS to pt_regs->ax" is BS.
> Exactly because we use pt_regs->ax for ptrace etc, and you've changed
> the register state we expose.
ptrace always sees pt_regs->ax = -ENOSYS on syscall entry.
That's part of the ABI. Syscall# is in pt_regs->orig_ax.
We used to do that _only_ on ptrace code path, the fast path
didn't store -ENOSYS in pt_regs->ax. This optimization ended up being
more pain than gain, and it was changed for 64-bit code by this commit:
commit 54eea9957f5763dd1a2555d7e4cb53b4dd389cc6
Author: Andy Lutomirski <[email protected]>
Date: Fri Sep 5 15:13:55 2014 -0700
x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls
I changed 32-bit compat code to do the same thing.
> I'm also going to be a *lot* less inclined to take all these idiotic
> low-level x86 changes from now on. It's been a total pain, for very
> little gain. These "cleanups" have been buggy as hell, and test
> coverage for the compat case is clearly lacking.
The code in question was an unholy mess, with about a half dozen
"clever optimizations" tangled together. Now it is much, much
less ugly.
It was nearly inevitable that something would break during untangling.
I understand the frustration of having things breaking
"because of the stupid cleanup".
On 08/14/2015 12:59 AM, Andy Lutomirski wrote:
> On Thu, Aug 13, 2015 at 3:56 PM, Kees Cook <[email protected]> wrote:
>> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>>> <[email protected]> wrote:
>>>>
>>>> Does the attached patch make sense and work?
>>>
>>> Btw, I'm not all that happy with it anyway.
>>>
>>> I still think Denys' patch also potentially changed what audit and
>>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>>> is a good change.
>>>
>>> But maybe that three-liner patch fixes the immediate problem that
>>> David sees. David?
>>
>> Your patch fixes it for me. The seccomp compat selftests pass again
>> with audit enabled.
>
> Kees, would it be straightforward to rig up the seccomp tests to
> automatically test compat? The x86 selftests automatically test both
> native and compat, and that might be usable as a model. I did that
> because it's extremely easy to regress one and not the other.
BTW, why 64-bt code doesn't need this RAX read-back?
On Fri, Aug 14, 2015 at 4:58 AM, Denys Vlasenko <[email protected]> wrote:
> On 08/14/2015 12:59 AM, Andy Lutomirski wrote:
>> On Thu, Aug 13, 2015 at 3:56 PM, Kees Cook <[email protected]> wrote:
>>> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
>>> <[email protected]> wrote:
>>>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>>>> <[email protected]> wrote:
>>>>>
>>>>> Does the attached patch make sense and work?
>>>>
>>>> Btw, I'm not all that happy with it anyway.
>>>>
>>>> I still think Denys' patch also potentially changed what audit and
>>>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>>>> is a good change.
>>>>
>>>> But maybe that three-liner patch fixes the immediate problem that
>>>> David sees. David?
>>>
>>> Your patch fixes it for me. The seccomp compat selftests pass again
>>> with audit enabled.
>>
>> Kees, would it be straightforward to rig up the seccomp tests to
>> automatically test compat? The x86 selftests automatically test both
>> native and compat, and that might be usable as a model. I did that
>> because it's extremely easy to regress one and not the other.
>
> BTW, why 64-bt code doesn't need this RAX read-back?
>
It's hiding inside of RESTORE_C_REGS_EXCEPT_RCX_R11.
--Andy
* Denys Vlasenko <[email protected]> wrote:
> It was nearly inevitable that something would break during untangling.
1)
So the 'chronic lack of compat, audit/noaudit and Wine testing' was certainly
avoidable.
The problem wasn't the fact that something was bound to break, but the latency of
finding these bugs. If we cannot reduce the latency so that bugs are caught early
enough (before they reach mainline) then we shouldn't be doing such changes.
We are slowly adding tests for that in the x86 self-tests, but IMHO we should be
more proactive than that.
2)
Another structural problem I saw occasionally was the attempt to characterise away
regressions.
That's a 100% no-no: if a change breaks any user-space program, it does not matter
how 'correct' a change is, how weird the user-space dependence and how rare the
user-space program: the regression needs to be fixed either by going forward with
a fix or by going backwards via reverting the change.
Thanks,
Ingo