Date: Tue, 24 Nov 2020 18:44:14 +0100
From: Greg KH
To: Jann Horn
Cc: Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Florian Weimer, Christian Brauner, Linux API, "open list:DOCUMENTATION", kernel list, dev@opencontainers.org, Jonathan Corbet, Carlos O'Donell
Subject: Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround
References: <87lfer2c0b.fsf@oldenburg2.str.redhat.com> <20201124122639.x4zqtxwlpnvw7ycx@wittgenstein> <878saq3ofx.fsf@oldenburg2.str.redhat.com> <20201124164546.GA14094@infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Nov 24, 2020 at 06:30:28PM +0100, Jann Horn wrote:
> On Tue, Nov 24, 2020 at 6:15 PM Greg KH wrote:
> > On Tue, Nov 24, 2020 at 06:06:38PM +0100, Jann Horn wrote:
> > > +seccomp maintainers/reviewers
> > > [thread context is at
> > > https://lore.kernel.org/linux-api/87lfer2c0b.fsf@oldenburg2.str.redhat.com/
> > > ]
> > >
> > > On Tue, Nov 24, 2020 at 5:49 PM Christoph Hellwig wrote:
> > > > On Tue, Nov 24, 2020 at 03:08:05PM +0100, Mark Wielaard wrote:
> > > > > For valgrind the issue is statx which we try to use before falling back
> > > > > to stat64, fstatat or stat (depending on architecture, not all define
> > > > > all of these). The problem with these fallbacks is that under some
> > > > > containers (libseccomp versions) they might return EPERM instead of
> > > > > ENOSYS. This causes really obscure errors that are really hard to
> > > > > diagnose.
> > > >
> > > > So find a way to detect these completely broken container run times
> > > > and refuse to run under them at all. After all they've decided to
> > > > deliberately break the syscall ABI. (and yes, we gave the the rope
> > > > to do that with seccomp :().
> > >
> > > FWIW, if the consensus is that seccomp filters that return -EPERM by
> > > default are categorically wrong, I think it should be fairly easy to
> > > add a check to the seccomp core that detects whether the installed
> > > filter returns EPERM for some fixed unused syscall number and, if so,
> > > prints a warning to dmesg or something along those lines...
> >
> > Why?
> > seccomp is saying "this syscall is not permitted", so -EPERM seems
> > like the correct error to provide here. It's not -ENOSYS as the syscall
> > is present.
> >
> > As everyone knows, there are other ways to have -EPERM be returned from
> > a syscall if you don't have the correct permissions to do something.
> > Why is seccomp being singled out here? It's doing the correct thing.
>
> AFAIU from what the others have said, it's being singled out because
> it means that for two semantically equivalent operations (e.g.
> openat() vs open()), one can fail while the other works because the
> filter doesn't know about one of the syscalls. Normally semantically
> equivalent syscalls are supposed to be subject to the same checks, and
> if one of them fails, trying the other one won't help.

They aren't being subjected to the same checks. If the seccomp
permissions differ for the two syscalls, they will get different
answers.

Trying to use this to determine whether the syscall is present is not
ok and, as Christian just said, needs to be fixed in userspace.

We can't change the kernel ABI now: odds are someone else relies on the
API we have had in place, so it cannot be changed :)

thanks,

greg k-h