Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752902AbbHMPRm (ORCPT ); Thu, 13 Aug 2015 11:17:42 -0400 Received: from mx1.redhat.com ([209.132.183.28]:59136 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752079AbbHMPRk (ORCPT ); Thu, 13 Aug 2015 11:17:40 -0400 Message-ID: <55CCB510.3060807@redhat.com> Date: Thu, 13 Aug 2015 17:17:36 +0200 From: Denys Vlasenko User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: David Drysdale , Kees Cook , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Will Drewry , Ingo Molnar CC: Alok Kataria , Linus Torvalds , Borislav Petkov , Alexei Starovoitov , Frederic Weisbecker , "H. Peter Anvin" , Oleg Nesterov , Steven Rostedt , X86 ML Subject: Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM? References: In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5050 Lines: 149 On 08/13/2015 10:30 AM, David Drysdale wrote: > Hi folks, > > I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone > else could reproduce it. > > The problem occurs with a seccomp-bpf filter program that's set up to return > an errno value -- an errno of 1 is always returned instead of what's in the > filter, plus other oddities (selftest output below). > > The problem seems to need a combination of circumstances to occur: > > - The seccomp-bpf userspace program needs to be 32-bit, running against a > 64-bit kernel -- I'm testing with seccomp_bpf from > tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'. Does it work correctly when built as 64-bit program? > > - The kernel needs to be running as a VM guest -- it occurs inside my > VMware Fusion host, but not if I run on bare metal. Kees tells me he > cannot repro with a kvm guest though. > > Bisecting indicates that the commit that induces the problem is > 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the > 64-bit logic", included in all the v4.2-rc* candidates. > > Apologies if I've just got something odd with my local setup, but the > bisection was unequivocal enough that I thought it worth reporting... > > Thanks, > David > > > seccomp_bpf failure outputs: > > seccomp_bpf.c:533:global.ERRNO_valid:Expected 7 (7) == > (*__errno_location ()) (1) Test source code: TEST(ERRNO_valid) { struct sock_filter filter[] = { BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_read, 0, 1), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | E2BIG), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; long ret; pid_t parent = getppid(); ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); ASSERT_EQ(0, ret); ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); ASSERT_EQ(0, ret); EXPECT_EQ(parent, syscall(__NR_getppid)); EXPECT_EQ(-1, read(0, NULL, 0)); EXPECT_EQ(E2BIG, errno); } The last EXPECT expects 7 (E2BIG) but sees 1. I'm trying to see how that happens. SECCOMP_RET_ERRNO action is processed as follows: static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd) { ... case SECCOMP_RET_ERRNO: /* Set low-order bits as an errno, capped at MAX_ERRNO. */ if (data > MAX_ERRNO) data = MAX_ERRNO; syscall_set_return_value(current, task_pt_regs(current), -data, 0); goto skip; ... skip: audit_seccomp(this_syscall, 0, action); return SECCOMP_PHASE1_SKIP; // "the syscall should not be invoked" } The above is called from: unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) { ... if (work & _TIF_SECCOMP) { ... ret = seccomp_phase1(&sd); if (ret == SECCOMP_PHASE1_SKIP) { regs->orig_ax = -1; ret = 0; } ... } /* Do our best to finish without phase 2. */ if (work == 0) return ret; /* seccomp and/or nohz only (ret == 0 here) */ #ifdef CONFIG_AUDITSYSCALL if (work == _TIF_SYSCALL_AUDIT) { /* * If there is no more work to be done except auditing, * then audit in phase 1. Phase 2 always audits, so, if * we audit here, then we can't go on to phase 2. */ do_audit_syscall_entry(regs, arch); return 0; } #endif return 1; /* Something is enabled that we can't handle in phase 1 */ } ... long syscall_trace_enter(struct pt_regs *regs) { u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64; unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch); if (phase1_result == 0) return regs->orig_ax; else return syscall_trace_enter_phase2(regs, arch, phase1_result); } End result should be: pt_regs->ax = -E2BIG (via syscall_set_return_value()) pt_regs->orig_ax = -1 ("skip syscall") and syscall_trace_enter_phase1() usually returns with 0, meaning "re-execute syscall at once, no phase2 needed". This, in turn, is called from .S files, and when it returns there, execution loops back to syscall dispatch. Because of orig_ax = -1, syscall dispatch should skip calling syscall. So -E2BIG should survive and be returned... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/