From: Gabriel C
Date: Sat, 22 Dec 2018 21:57:42 +0100
Subject: Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn
To: Ellie Reeves
Cc: LKML, Linus Torvalds, Al Viro, "Eric W. Biederman", Seth Forshee
X-Mailing-List: linux-kernel@vger.kernel.org

Added some people to CC that might want to see this.

On Sat, 22 Dec 2018 at 19:14, Ellie Reeves wrote:
>
> Hi,
> first off, allow me to say that this is my first time ever writing
> on such a mailing list, and that if something is unclear or you
> need more information, just let me know.
> I am writing to this list hoping to see this change reverted. The Linux
> kernel has always said it would avoid breaking userspace as much as
> possible, and yet this is what happened. I was hence very much surprised
> when my perfectly working containers on systemd-nspawn, which makes use
> of userns by default, stopped working from one day to the next, until I
> identified the problem as being kernel >= 4.18. This container is in
> production, hence the annoyance.
> From one day to the next the
> container started failing with strange problems:
>
> * nginx, dovecot, postgresql, and postfix complained about getting
>   permission denied on /dev/null even though it appeared perfectly normal
>   to me, with the correct permissions and all
> * /var was also acting very strangely, producing a lot of permission
>   denied or operation not supported messages
> * I could not delete a file in /var that my user had the right to
>   create, write to and read; I needed root
>
> Here is the pull request that was made to systemd, along with a small
> amount of discussion around the issue:
>
> https://github.com/systemd/systemd/pull/9483
>
> It was ultimately decided among the systemd folks to bail out of the
> issue, as shown in the news entry for systemd 240:
>
> * KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour regarding
>   mknod() handling in user namespaces. Previously mknod() would always
>   fail with EPERM in user namespaces. Since 4.18 mknod() will succeed
>   but device nodes generated that way cannot be opened, and attempts to
>   open them result in EPERM. This breaks the "graceful fallback" logic
>   in systemd's PrivateDevices= sand-boxing option. This option is
>   implemented defensively, so that when systemd detects it runs in a
>   restricted environment (such as a user namespace, or an environment
>   where mknod() is blocked through seccomp or absence of CAP_SYS_MKNOD)
>   where device nodes cannot be created the effect of PrivateDevices= is
>   bypassed (following the logic that 2nd-level sand-boxing is not
>   essential if the system systemd runs in is itself already sand-boxed
>   as a whole). This logic breaks with 4.18 in container managers where
>   user namespacing is used: suddenly PrivateDevices= succeeds setting
>   up a private /dev/ file system containing device nodes -- but when
>   these are opened they don't work.
>
>   At this point it is recommended that container managers utilizing
>   user namespaces that intend to run systemd in the payload explicitly
>   block mknod() with seccomp or similar, so that the graceful fallback
>   logic works again.
>
>   We are very sorry for the breakage and the requirement to change
>   container configurations for newer kernels. It's purely caused by an
>   incompatible kernel change. The relevant kernel developers have been
>   notified about this userspace breakage quickly, but they chose to
>   ignore it.
>
> Here's an email that was sent to lkml about the subject:
>
> https://lkml.org/lkml/2018/7/5/742
>
> I also link this, quoting the end of it:
>
> https://lkml.org/lkml/2018/7/5/701
>
> It has never been the case that mknod on a device node will guarantee
> that you even can open the device node. The applications that regress
> are broken. It doesn't mean we shouldn't be bug compatible, but we darn
> well should document very clearly the bugs we are being bug compatible with.
>
> I am of the opinion that it is a kernel bug, and I quote someone from the
> systemd irc channel:
>
> ewb said applications were broken. But the rule is, if userspace breaks,
> it's a bug. The kernel *has* to revert it. And honestly, this change
> doesn't make much sense. You can set nodev yourself but then you know
> mknod will not allow you to open the object. Here, the kernel does it
> without your knowledge
>
> Also, it seems that if this change is reverted, things that were fixed
> to work around the breakage will not be broken again;
> they should simply go back to their previous way of working. I
> understand there may be a security reason why this change was made in the
> first place, but it is not so big a problem, is it? I can mknod
> arbitrary devices in userns and open them as userns root. But my point
> is, several things broke. My *working* stuff was broken from one day to
> the next.
>
> I am not trying to pick a fight.
> I want to understand the reasoning
> behind this change in the first place, and I'm simply making an attempt
> at getting it reverted, because I don't much fancy
> blocking the mknod() syscall in every template unit on every machine we
> administer here, and staying on kernel < 4.18 is not a good
> solution either.
>
> I would also like to be personally CC'ed on the comments or answers posted
> to this mailing list in response to this message.
>
> Thanks
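
For anyone who wants to check which of the behaviours from the quoted NEWS
entry a given container shows, here is a minimal sketch of a probe. It is
not systemd's actual fallback code; the names classify() and
probe_device_nodes() and the verdict strings are made up for illustration.
It creates a character device node with the major/minor numbers of
/dev/null (1, 3) and then tries to open it.

```python
import errno
import os
import stat
import tempfile


def classify(mknod_errno, open_errno):
    """Map the errno of each step (None means success) to a verdict.

    The verdict names are hypothetical labels, not kernel or systemd terms.
    """
    if mknod_errno is not None:
        # mknod() itself was refused: pre-4.18 userns behaviour, a seccomp
        # filter, or a missing capability.
        return "mknod-blocked"
    if open_errno is not None:
        # mknod() worked but the node cannot be opened: the 4.18+ userns
        # behaviour described in the NEWS entry (or a nodev mount).
        return "node-unusable"
    return "usable"


def probe_device_nodes(directory):
    """Create a /dev/null look-alike in `directory` and try to open it."""
    path = os.path.join(directory, "null-probe")
    try:
        os.mknod(path, 0o600 | stat.S_IFCHR, os.makedev(1, 3))
    except PermissionError as exc:  # EPERM or EACCES
        return classify(exc.errno, None)
    try:
        fd = os.open(path, os.O_RDWR)
    except PermissionError as exc:
        return classify(None, exc.errno)
    else:
        os.close(fd)
        return classify(None, None)
    finally:
        os.unlink(path)  # always remove the probe node again


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(probe_device_nodes(tmp))
```

Run as root inside the container. A result of "node-unusable" would match
the situation described above, where a mknod()-based fallback check no
longer triggers. Errors other than EPERM/EACCES are deliberately not
caught, so they show up as a traceback instead of a misleading verdict.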