Received: by 2002:a05:7208:9594:b0:7e:5202:c8b4 with SMTP id gs20csp1385090rbb; Mon, 26 Feb 2024 07:42:24 -0800 (PST) X-Forwarded-Encrypted: i=3; AJvYcCXqYfqBGeOdJNWLjQcKlk/DRtOsX+EP4m7cA6E/r3dP1yAvK+ineWRJaGFxtYVdcjrd+FbsJDr3UJzpMGdhlT/kcWROAiiIFTOvbzk6GQ== X-Google-Smtp-Source: AGHT+IE5DHWzIzH3EZLfb1eLYMKt3t6DXkTWLu/iS/Om249XkSez66/rUzNWd2JoZUC1GQGLoA/Y X-Received: by 2002:a17:902:d355:b0:1db:c84c:3549 with SMTP id l21-20020a170902d35500b001dbc84c3549mr6994656plk.11.1708962144144; Mon, 26 Feb 2024 07:42:24 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1708962144; cv=pass; d=google.com; s=arc-20160816; b=viFUQM0K1ft9dbg16axAlIUBuYC4qiVDNFXHDQ/w6jTcGxLc/ua4UZNv2F7mDOFXxJ OLsINOj6IfSS6e+d0lZ5X7pLoJf8HmmTcz+DlklkP6frslTAr9VxaGxTs6U1wzk2/wWJ 5MEWxRPxcm5ZmFXNBbuJ8UQYXLlY0rCPvb4nwPuiYRIjD3I5Hbyy9nxNCg9vfYqUsaaQ nn6YfjAdvXsZhP9VQbS+1d7PgLW8N+IcDPX+pxnX/C1hjV/ytAegQTIafEFpfdE7VNkV s8Gx3u3kol6PBui83oHZvKkoI/WZaPnEDy0gQjJwKiZRwjvxlcwCpSc25NT5Dxea2jnC ORKw== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:list-unsubscribe:list-subscribe:list-id:precedence :references:message-id:subject:cc:to:from:date:dkim-signature; bh=WZkLk9E3XE+ojA635movCaJGU84eUyY+UsWknhV97Qg=; fh=jP6F9ye2+d5YugPCJomBHL40fwY7KTf6tiiowgIn23U=; b=bxEV0sGQcUdYXJTmKtG1ZVoBmPFR8aVHunX+hVD+WHO0obMfTLS+yykY7QJ4o6YHjs ikz9Vm0rmTJeKBLXCZmT2cUlqOYNhAyVKLxWv9HkVuVJScbhHhHwyKfZ/UKEzhw46/xi ES0ZFZpemYD1/34YwFe7lgKSWE0q6XSEKO11HW/UnS2LluoKu8gPms1OOaBCSBlFgG7a FNGWT8Gi9EMsWwEQglsvPlIhLJkNrjWvDpPIafMXnVPKO57qesOAOEexiyLT23QN26J9 B0sl6KLDMuyi8dZ3+o0wj7MYpe+8xpNxIhLAt7g8QlKx42lm03KaY/gBfZaEdiq2yLZe raSA==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=QAFrZ5js; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-81816-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-81816-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [2604:1380:45e3:2400::1]) by mx.google.com with ESMTPS id s13-20020a17090302cd00b001dc85510060si3715216plk.18.2024.02.26.07.42.23 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 26 Feb 2024 07:42:24 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-81816-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) client-ip=2604:1380:45e3:2400::1; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=QAFrZ5js; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-81816-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-81816-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 9FE652958A3 for ; Mon, 26 Feb 2024 15:35:45 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 67A6A12BE9C; Mon, 26 Feb 2024 15:35:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="QAFrZ5js" Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B3DDD12B15B; Mon, 26 Feb 2024 15:35:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708961738; cv=none; b=PQyaydspEX2IyJwlMtkxE7UTkyZQLf9/iWb9S8OtJBmC02Y43kehOIltF9EaSpXEE0a5Q6ggPxVlf3XQKF15tsmeWYsLyOedZxsshF7rrCRs8Eay8fft3lvmFLaFtoZuBdHSfC8LGbUkHeo7jAr4HRjTasMc1suZzUrVxuVZZfE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708961738; c=relaxed/simple; bh=XI8VRq7ucg6AHJVkerSid+ikMVTAXLYHHE043YHYtTY=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=ixuwxTp60E19lnNjvdKMxQ4wU0fCaEU6ONIJHg5Y+xbYjRLdl5D1fzE42PjYHH3GR2PCjcxNk4aYJoB0gKK9kBd1EF4X6vJ2z/ECwD8uXyRu9rvNSVgbj2ivRq/I6oauhBymDCqITAFghOv8FDDc6arqZ+6j+1CcuCqWJ7KIYJ4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=QAFrZ5js; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3A6FAC433F1; Mon, 26 Feb 2024 15:35:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1708961738; bh=XI8VRq7ucg6AHJVkerSid+ikMVTAXLYHHE043YHYtTY=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=QAFrZ5js6pTW5ejTIi62NrMQ/I+5wBQ0Q8LzEDJfDbhfKSjVZCto9kAObygW/58T+ s7l9DfyTlLqXxv+ITWetU3kO9hZ7AoijWG0ZLgENAOdTaTqbgHqDknrTLTJN1pr36H JsFwqDiv4p5Td0i5zBc6rv0+a/dHhCk3CrBu9qXSI+wcILeXVugZ0m9EIXuqm+D6kY Ybr0ShLv0drt0sz8H9xlkwNqMcHWv0O2ROxnEbxd4GTHG7tiAM4RoKt58nJqbynbCX 4FB90PDKPKn5U9Y3GRKSI40i++bvl0wOwu0hsfPnKd5OUruSG2U9vbqyoNODZ61N6Q It34GQkc4CHBQ== Date: Mon, 26 Feb 2024 16:35:30 +0100 From: Christian Brauner To: WANG Xuerui Cc: Arnd Bergmann , Xi Ruoyao , Icenowy Zheng , Huacai Chen , linux-api@vger.kernel.org, Kees Cook , Xuefeng Li , Jianmin Lv , Xiaotian Wu , WANG Rui , Miao Wang , "loongarch@lists.linux.dev" , Linux-Arch , Linux Kernel Mailing List Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? Message-ID: <20240226-altmodisch-gedeutet-91c5ba2f6071@brauner> References: <599df4a3-47a4-49be-9c81-8e21ea1f988a@xen0n.name> <24c47463f9b469bdc03e415d953d1ca926d83680.camel@xry111.site> <61c5b883762ba4f7fc5a89f539dcd6c8b13d8622.camel@icenowy.me> <3c396b7c-adec-4762-9584-5824f310bf7b@app.fastmail.com> <6f7a8e320f3c2bd5e9b704bb8d1f311714cd8644.camel@xry111.site> <20240226-graustufen-hinsehen-6c578a744806@brauner> <7641391a-b109-49b3-84c8-7e72053210d8@xen0n.name> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <7641391a-b109-49b3-84c8-7e72053210d8@xen0n.name> On Mon, Feb 26, 2024 at 10:00:05PM +0800, WANG Xuerui wrote: > On 2/26/24 21:32, Christian Brauner wrote: > > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: > > > On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: > > > > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: > > > > > On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: > > > > > > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: > > > > > > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > > > > > > > > My idea is this problem needs syscalls to be designed with deep > > > > > > > > argument inspection in mind; syscalls before this should be > > > > > > > > considered > > > > > > > > as historical error and get fixed by resotring old syscalls. > > > > > > > > > > > > > > I'd not consider fstat an error as using statx for fstat has a > > > > > > > performance impact (severe for some workflows), and Linus has > > > > > > > concluded > > > > > > > > > > > > Sorry for clearance, I mean statx is an error in ABI design, not fstat. > > > > > > > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of > > > > "AT_NULL_PATH"/nullptr in the first place? > > > > > > Not sure, but it's hard to change now since the libc > > > implementation won't easily know whether using the NULL > > > path is safe on a given kernel. It could check the kernel > > > version number, but that adds another bit of complexity in > > > the fast path and doesn't work on old kernels with the > > > feature backported. > > > > > > > But it's not irrational to pass a path to syscall, as long as we still > > > > have the concept of file system (maybe in 2371 or some year we'll use a > > > > 128-bit UUID instead of path). > > > > > > > > > The problem I see with the 'use use fstat' approach is that this > > > > > does not work on 32-bit architectures, unless we define a new > > > > > fstatat64_time64() syscall, which is one of the things that statx() > > > > > > > > "fstat64_time64". Using statx for fstatat should be just fine. > > > > > > Right. It does feel wrong to have only an fstat() variant but not > > > fstatat() if we go there. > > > > > > > Or maybe we can just introduce a new AT_something to make statx > > > > completely ignore pathname but behave like AT_EMPTY_PATH + "". > > > > > > I think this is better than going back to fstat64_time64(), but > > > it's still not great because > > > > > > - all the reserved flags on statx() are by definition incompatible > > > with existing kernels that return -EINVAL for any flag they do > > > not recognize. > > > > > > - you still need to convince libc developers to actually use > > > the flag despite the backwards compatibility problem, either > > > with a fallback to the current behavior or a version check. > > > > > > Using the NULL path as a fallback would solve the problem with > > > seccomp, but it would not make the normal case any faster. > > > > > > > > was trying to avoid. > > > > > > > > Oops. I thought "newstat" should be using 64-bit time but it seems the > > > > "new" is not what I'd expected... The "new" actually means "newer than > > > > Linux 0.9"! :( > > > > > > > > Let's not use "new" in future syscall names... > > > > > > Right, we definitely can't ever succeed. On some architectures > > > we even had "oldstat" and "stat" before "newstat" and "stat64", > > > and on some architectures we mix them up. E.g. x86_64 has fstat() > > > and fstatat64() with the same structure but doesn't define > > > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit > > > timestamps unlike all other 64-bit architectures. > > > > > > statx() was intended to solve these problems once and for all, > > > and it appears that we have failed again. > > > > New apis don't invalidate old apis necessarily. That's just not going to > > work in an age where you have containerized workloads. > > > > statx() is just the beginning of this. A container may have aritrary > > seccomp profiles that return ENOSYS or even EPERM for whatever reason > > for any new api that exists. So not implementing fstat() might already > > break container workloads. > > > > Another example: You can't just skip on implementing mount() and only > > implement the new mount api for example. Because tools that look for api > > simplicity and don't need complex setup will _always_ continue to use > > mount() and have a right to do so. > > > > And fwiw, mount() isn't fully inspectable by seccomp since forever. The > > list goes on and on. > > > > But let's look at the original mail. Why are they denying statx() and > > what's that claim about it not being able to be rewritten to something > > safe? Looking at: > > > > intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args, > > void* fs_denied_errno) { > > if (args.nr == __NR_fstatat_default) { > > if (*reinterpret_cast(args.args[1]) == '\0' && > > args.args[3] == static_cast(AT_EMPTY_PATH)) { > > return syscall(__NR_fstat_default, static_cast(args.args[0]), > > reinterpret_cast(args.args[2])); > > } > > return -reinterpret_cast(fs_denied_errno); > > } > > > > What this does it to rewrite fstatat() to fstat() if it was made with > > AT_EMPTY_PATH and the path argument was "". That is easily doable for > > statx() because it has the exact same AT_EMPTY_PATH semantics that > > fstatat() has. > > > > Plus, they can even filter on mask and rewrite that to something that > > they think is safe. For example, STATX_BASIC_STATS which is equivalent > > to what any fstat() call returns. So it's pretty difficult to understand > > what their actual gripe with statx() is. > > > > It can't be that statx() passes a struct because fstatat() and fstat() > > do that too. So what exactly is that problem there? > > From our investigation: > > For (new)fstatat calls that the sandboxed process may make, this SIGSYS > handler either: > > * turns allowed calls (those looking at fd's) into fstat's that only have > one argument (the fd) each, or > * denies the call, Yes, but look at the filtering that they do: if (args.nr == __NR_fstatat_default) { if (*reinterpret_cast(args.args[1]) == '\0' && args.args[3] == static_cast(AT_EMPTY_PATH)) { So if you have a statx() call instead of an fstatat() call this is trivially: if (args.nr == __NR_statx) { if (*reinterpret_cast(args.args[1]) == '\0' && args.args[2] == static_cast(AT_EMPTY_PATH)) { maybe if they care about it also simply check args.args[3] == STATX_BASIC_STATS. And then just as with fstatat() rewrite it to fstat(). > > so the sandbox only ever sees fstat calls and no (new)fstatat's, and the > guarantee that only open fds can ever been stat'ed trivially holds. > > With statx, however, there's no way of guaranteeing "only look at fd" > semantics without peeking into the path argument, because a non-empty path > makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to > use making it near-impossible to introduce new semantics in a > backwards-compatible manner. I don't understand. That's exactly the same thing as for fstatat(). My point is that you can turn statx() into fstat() just like you can turn fstatat() into fstat(). So if you add fstat()/fstat64() what's left to do? > > What this tells me without knowing the exact reason is that they thought > > "Oh, if we just return ENOSYS then the workload or glibc will just > > always be able to fallback to fstat() or fstatat()". Which ultimately is > > the exact same thing that containers often assume. > > > > So really, just skipping on various system calls isn't going to work. > > You can't just implement new system calls and forget about the rest > > unless you know exactly what workloads your architecure will run on. > > > > Please implement fstat() or fstatat() and stop inventing hacks for > > statx() to make weird sandboxing rules work, please. > > We have already provided fstat(at) on LoongArch for a while by > unconditionally doing statx and translating the returned structure -- see > the [glibc] and [golang] [golang-2] implementations for example -- without But you're doing that translation in userspace. I was talking about adding the fstat()/fstat64() system calls.