Received: by 2002:a25:ad19:0:0:0:0:0 with SMTP id y25csp8990952ybi; Tue, 23 Jul 2019 19:35:51 -0700 (PDT) X-Google-Smtp-Source: APXvYqwMvNKspTAyUkPxmF0/d5XfkooFQ1AZM+ASpeDmzavlP/OjZiEeBOekc0o320qna/yFB62J X-Received: by 2002:a65:44cb:: with SMTP id g11mr33832744pgs.288.1563935750931; Tue, 23 Jul 2019 19:35:50 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1563935750; cv=none; d=google.com; s=arc-20160816; b=A2MRYh96X90lrt4cQyS8f9uuuwMSbCWFM3WGH+NiphDq/Wzmi4T2rRxMDddsxCkTHq gFbaIqWQcjW+bRabEd4t5I0foP/Ce0cWWVPz31lklJrkYpT0MGcHqk4GQVL5K+/3JiJj kXwYwLFm7FH3GxvkCm+GRwQyiuQxYpihU5A/24OvF9BMFucYcYrq+8O2kOCCtL9iHDSp nj7RByL5gIP1VjUaFMXtOSjFMmHE877E15sGVJCdpY27+Yp4sdCFlgkweS2pWcoIkVo+ JtndMQpxnJ1IUrK0istmXfTKKo3ueyooh/YTbqRfImfU6WmjEh/uxUJbmkEJKhPpLdfw G2hg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=a7YBRcL9aDezjqcAKhzSVGxGEJP15hFtSDS2Yfbtco4=; b=m7KiNI/YyFTMawXBChvqb6YxNVSgBxdZjEyc//TKk7doHm477d2LwR7IcBgx3SW/o0 RpFdAQcS+/MJWzerI92sgqUFDHB4gMdyGUMY+ERSFGGdkzIcF/4cRRJR7q0ACixeSrB9 l2T2O2MesjMFyyLbTo2kCNOhKdkrf8N+xOGaWc/cV+l1qRzEDbKZv3D4ab2UMzcoQCdU o9MASN7NIOAW0D09wuJy6JDpakDttKxZ1E1DUx3dLiDi3M54KcVpZ/vqXnyQNaQ37aKO wBNxxNQKyjg3dy5TOxeqVgunqC3klT/r0mK3BN8kGKGcotghJhRG08tmilp9VpEg6EXt bkjw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=MYCOGZ5F; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 24si10598297pfp.107.2019.07.23.19.35.35; Tue, 23 Jul 2019 19:35:50 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=MYCOGZ5F; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2392456AbfGWWzc (ORCPT + 99 others); Tue, 23 Jul 2019 18:55:32 -0400 Received: from mail.kernel.org ([198.145.29.99]:57848 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726713AbfGWWzb (ORCPT ); Tue, 23 Jul 2019 18:55:31 -0400 Received: from mail-wm1-f49.google.com (mail-wm1-f49.google.com [209.85.128.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 8D1D52238C for ; Tue, 23 Jul 2019 22:55:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1563922530; bh=TbrU37cIgVknUDN3eT4rxXv1UjD9mrPPOVWac90XS54=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=MYCOGZ5FTwvoxAe6aG/9SzxjB/bWawNqHdRtx2LOH7i3cVwTA2pvnMd01kHkSs4b9 oNMg+2PRbD8lCYr6IiXMFzf5YNRwCSRAcgt0GAPoFWxCn8qUmNlGQ0Sj0C08HjIaDh cOLse67OCywv/lWK1EQKjtCsJTKcExqWV5IwFQfI= Received: by mail-wm1-f49.google.com with SMTP id s3so40000139wms.2 for ; Tue, 23 Jul 2019 15:55:30 -0700 (PDT) X-Gm-Message-State: APjAAAXVzeE0us4TfQf3yWvaqQyurPQjbL7dfMMfelPnjIF0kSlqA5Vh WNyTfZCdTA/jRXBqqI0Ayqmj58qWkea0bcrbNphwVg== X-Received: by 2002:a1c:9a53:: with SMTP id c80mr11669186wme.173.1563922529083; Tue, 23 Jul 2019 15:55:29 -0700 (PDT) MIME-Version: 1.0 References: <20190719170343.GA13680@linux.intel.com> <19EF7AC8-609A-4E86-B45E-98DFE965DAAB@amacapital.net> <201907221012.41504DCD@keescook> <201907221135.2C2D262D8@keescook> <201907221620.F31B9A082@keescook> <201907231437.DB20BEBD3@keescook> In-Reply-To: <201907231437.DB20BEBD3@keescook> From: Andy Lutomirski Date: Tue, 23 Jul 2019 15:55:17 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386 To: Kees Cook Cc: Andy Lutomirski , Thomas Gleixner , Sean Christopherson , Vincenzo Frascino , X86 ML , LKML Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 23, 2019 at 2:55 PM Kees Cook wrote: > > On Mon, Jul 22, 2019 at 04:47:36PM -0700, Andy Lutomirski wrote: > > On Mon, Jul 22, 2019 at 4:28 PM Kees Cook wrote: > > > I've built a straw-man for this idea... but I have to say I don't > > > like it. This can lead to really unexpected behaviors if someone > > > were to have differing filters for the two syscalls. For example, > > > let's say someone was doing a paranoid audit of 2038-unsafe clock usage > > > and marked clock_gettime() with RET_KILL and marked clock_gettime64() > > > with RET_LOG. This aliasing would make clock_gettime64() trigger with > > > RET_KILL... > > > > This particular issue is solvable: > > > > > + /* Handle syscall aliases when result is not SECCOMP_RET_ALLOW. */ > > > + if (unlikely(action != SECCOMP_RET_ALLOW)) { > > > + int alias; > > > + > > > + alias = seccomp_syscall_alias(sd->arch, sd->nr); > > > + if (unlikely(alias != -1)) { > > > + /* Use sd_local for an aliased syscall. */ > > > + if (sd != &sd_local) { > > > + sd_local = *sd; > > > + sd = &sd_local; > > > + } > > > + sd_local.nr = alias; > > > + > > > + /* Run again, with the alias, accepting the results. */ > > > + filter_ret = seccomp_run_filters(sd, &match); > > > + data = filter_ret & SECCOMP_RET_DATA; > > > + action = filter_ret & SECCOMP_RET_ACTION_FULL; > > > > How about: > > > > new_data = ...; > > new_action = ...; > > if (new_action == SECCOMP_RET_ALLOW) { > > data = new_data; > > action = new_action; > > } > > Spelling it out for myself: this means that if both syscalls have > non-RET_ALLOW results, the original result is kept. But if the alias > is allowed, allow it. That solves my particular example, but I don't > think it's enough. (And it might be just as bad.) What if someone wants > to RET_TRACE clock_gettime64 and other syscalls are RET_ALLOWed. Now > clock_gettime64 cannot be traced since clock_gettime gets RET_ALLOW and > replaces the results. Fair enough. > > > It might also be nice to allow a filter to say "hey, I want to set > > this result and I do *not* want compatibility aliases applied", but > > I'm not quite sure how to express that. > > Especially since we're working on "old" filters. > > > I don't love this whole concept, but I also don't have a better idea. > > How about we revert the vDSO change? :P > > I keep coming back to using the vDSO return address as an indicator. > Most vDSO calls don't make syscalls, yes? So they're normally > unfilterable by seccomp. > > What was the prior vDSO behavior? The prior vDSO behavior was identical except that it was clock_gettime, not clock_gettime64. It's a little bit more subtle: the syscall only happens sometimes, mostly depending on hypervisor details, hardware, and hardware configuration. In the preferred TSC clock mode, we will not do the syscall for most clock ids. The emulated vsyscall behavior is that, if the gettimeofday() vsyscall is issued, we fake a seccomp event and we make a somewhat credible effort to honor the result. See arch/x86/entry/vsyscall_64.c. In general, I consider the combination of PR_SET_TSC to PR_TSC_DISABLE + seccomp blocking timing syscalls to mean that fine-grained timing is genuinely disabled. So I don't love bypassing seccomp in this type of context. Also, there are a few ways that programs could fake the return address as coming from the vDSO, the simplest of which is to simply jump to that syscall instruction. This is horrible but: what if we added a new parameter passed when a seccomp filter is registered that indicates that the kernel should remap syscalls with numbers above the given parameter. The default value for legacy users would be clock_gettime64 - 1 :) The idea being that an old trace-everything filter will sometimes get clock_gettime64 wrong, but a new one can opt in to more accurate behavior. Ditto for a hypothetical restart_syscall_renumbered(). I admit I still don't love this either.