Subject: Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
To: Christian Brauner, containers@lists.linux-foundation.org
Cc: Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
 Kees Cook, linux-kernel@vger.kernel.org, Josh Triplett,
 Andy Lutomirski, "Eric W.
 Biederman", Mickaël Salaün, Wat Lim, Mrunal Patel, Pavel Tikhomirov,
 Geoffrey Thomas
References: <20200830143959.rhosiunyz5yqbr35@wittgenstein>
From: "Enrico Weigelt, metux IT consult"
Date: Thu, 15 Oct 2020 17:31:45 +0200
In-Reply-To: <20200830143959.rhosiunyz5yqbr35@wittgenstein>

On 30.08.20 16:39, Christian Brauner wrote:

Hi Christian,

> P1. Isolated id mappings can only be guaranteed to be locally isolated.
> A container runtime/daemon can only guarantee non-overlapping id mappings
> when no other users on the system create containers.

Indeed. But couldn't we just record the mappings in some standardized
place (e.g. a file) that all engines maintain? I'd guess other
solutions would need changes in the runtimes, too.

Please keep in mind that some scenarios actually need some overlap,
e.g. application containers that need direct access to home dirs.

> P2.
> Enforcing isolated id mappings in userspace is difficult.
> It is always possible to create other processes with overlapping id
> mappings. Coordinating id mappings in userspace will always remain
> optional. Quite a few tools nowadays (including systemd) don't care about
> /etc/sub{g,u}id and actively advise against using it. This is made even
> more problematic since sub{g,u}id delegation is done per-user rather than
> per-container-runtime.

I believe subusers aren't meant for typical containers (like docker or
lxc), but for unprivileged user programs that want further isolation
for subprocesses (e.g. a browser's renderer or JS engine). Correct me if
I'm wrong.

> P3. The range of the id mapping of a container can't be predetermined.
> While POSIX mandates that a standard system should use a range of 65536 ids
> reality is very different. Some programs allocate high ids for random
> processes or for network authentication. This means, in practice it is
> often necessary to assign a range of up to 10 million ids to a container.
> This limits a system to less than 500 containers total.

In 25+ years, I haven't seen such an application in the field. I'd
consider this a horrible and dangerous bug. Sane applications create
specific user entries (/etc/passwd) for that.

I'd say we're safe with max 2^16 users per container, which, with
32-bit ids, should give us space for about 2^16 containers.

> P4. Isolated id mappings severely restrict the number of containers that can be
> run on a system.
> This ties back to the point about pre-determining the id range of a
> container and how large range allocations tend to be on real systems. That
> becomes even more relevant when nesting containers.

IMHO, all we need is to maintain a list of active ranges (more precisely
the 16-bit prefixes, just like class B networks ;-)).

As said, I'd declare scenario P3 invalid and rather fix those few
broken applications.

> P5.
> Container runtimes cannot reuse overlayfs lower directories if each
> container uses isolated ID mappings, leading to either needless storage
> overhead (LXD -- though the LXD folks don’t really mind), completely
> ignoring the benefits of isolating containers from each other (Docker), or
> not using them at all (Kubernetes). (This is a more general issue but bears
> repeating since it is closely tied to most userns proposals.)

Indeed. That's IMHO the main problem. We somehow need to map the UIDs.
Maybe a synthetic filesystem that does exactly the same uid<->kuid
translations we're already doing in other places?

> P6. Rlimits pose a problem for containers that share the same id mapping.
> This means containers with overlapping id mappings can DOS each other by
> exhausting their rlimits. The reason for this lies with the current
> implementation of rlimits -- rlimits are currently tied to users and are
> not hierarchically limited like inotify limits are. This is a severe
> problem in unprivileged workloads. Eric and others identified that this
> issue can be fixed independently of the isolated user namespace proposal.

Is this really a practical issue when we're using uid namespaces?

> S2. Kernel-enforced user namespace isolation.
> This means, there is no need for different container runtimes to
> collaborate on id ranges with immediate benefits for everyone.
> This solves P1 and P2.

Okay, but how do we support scenarios where some of the UIDs should
overlap on purpose (e.g. mounting some of the host's user homedirs into
namespaces)?

> S5. The owning id concept of a user namespace makes monitoring and interacting
> with such containers way easier.

What exactly is the owning id? How is it created and managed? Some
magic id or a cryptographic token?

> 1. How are interactions across isolated user namespaces handled?

What kind of interaction do you have in mind? Data transfers? Process
manipulation? Namespace destruction?
Can you please illustrate some actual use cases?

> Proposal 1.1 seemed preferred since it would allow an unprivileged
> user creating an isolated user namespace to kill/ptrace all processes
> in the isolated namespace they spawned.

Don't we already have this if the user is mapped as root inside the
container?

> The first consensus reached seemed to be to decouple isolated user
> namespaces from shiftfs. The idea is to solely rely on tmpfs and fuse
> at the beginning as filesystems which can be mounted inside isolated
> user namespaces and so would have proper ownership.

So I'd essentially have to run the whole rootfs through fuse and a
userland fileserver, which probably has to track things like ownership
in its own db (when running as an unprivileged user)?

> For mount points
> that originate from outside the namespace, everything will show as
> the overflow ids and access would be restricted to the most
> restricted permission bit for any path that can be accessed.

So I can't just take a btrfs snapshot as rootfs anymore?

--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
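
A rough sketch of the range bookkeeping discussed under P3 and P4,
assuming a 32-bit id space split into 16-bit prefixes. The names
PrefixAllocator and max_containers are hypothetical and not taken from
any existing runtime; how the list would be shared between engines is
left open here.

```python
ID_SPACE_BITS = 32   # assumption: 32-bit uid/gid space
PREFIX_BITS = 16     # one 16-bit prefix per container
RANGE_SIZE = 1 << (ID_SPACE_BITS - PREFIX_BITS)  # 65536 ids per container


class PrefixAllocator:
    """Hypothetical in-memory list of active 16-bit id prefixes.

    Simplification: special values such as (uid_t)-1 are not excluded.
    """

    def __init__(self, reserved=(0,)):
        # Prefix 0 covers ids 0..65535 (the host's own range in this
        # sketch), so it is treated as taken from the start.
        self.active = set(reserved)

    def allocate(self):
        """Claim the lowest free prefix; return (prefix, first_id, last_id)."""
        for prefix in range(1 << PREFIX_BITS):
            if prefix not in self.active:
                self.active.add(prefix)
                first = prefix * RANGE_SIZE
                return prefix, first, first + RANGE_SIZE - 1
        raise RuntimeError("no free id prefix left")

    def release(self, prefix):
        """Mark a prefix as free again once its container is gone."""
        self.active.discard(prefix)


def max_containers(range_size, id_space_bits=ID_SPACE_BITS):
    """Number of non-overlapping ranges of range_size in the id space."""
    return (1 << id_space_bits) // range_size


if __name__ == "__main__":
    alloc = PrefixAllocator()
    print(alloc.allocate())
    print(max_containers(1 << 16))
    print(max_containers(10_000_000))
```

With 65536 ids per container the id space holds 65536 ranges; with
10 million ids per container it holds 429, which is consistent with the
"less than 500 containers" figure quoted in P3.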