Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7BA0DC433F5 for ; Tue, 4 Jan 2022 22:18:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235339AbiADWSH (ORCPT ); Tue, 4 Jan 2022 17:18:07 -0500 Received: from out01.mta.xmission.com ([166.70.13.231]:50056 "EHLO out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232763AbiADWSG (ORCPT ); Tue, 4 Jan 2022 17:18:06 -0500 Received: from in01.mta.xmission.com ([166.70.13.51]:40466) by out01.mta.xmission.com with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1n4s83-00At50-64; Tue, 04 Jan 2022 15:18:03 -0700 Received: from ip68-110-24-146.om.om.cox.net ([68.110.24.146]:40010 helo=email.froward.int.ebiederm.org.xmission.com) by in01.mta.xmission.com with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1n4s7x-00755i-Pa; Tue, 04 Jan 2022 15:18:00 -0700 From: ebiederm@xmission.com (Eric W. Biederman) To: Manfred Spraul Cc: Alexey Gladkov , LKML , Linux Containers , Alexander Mikhalitsyn , Andrew Morton , Christian Brauner , Daniel Walsh , Davidlohr Bueso , Kirill Tkhai , Serge Hallyn , Varad Gautam , Vasily Averin , kernel test robot References: <0f0408bb7fbf3187966a9bf19a98642a5d9669fe.1641225760.git.legion@kernel.org> <792dcae82bc228cd0bec1fa80ed4d2c9340b0f8f.1641296947.git.legion@kernel.org> <87v8yzfilp.fsf@email.froward.int.ebiederm.org> <40ca86a1-ea36-0185-1ba5-c69005e46d3f@colorfullife.com> Date: Tue, 04 Jan 2022 16:17:49 -0600 In-Reply-To: <40ca86a1-ea36-0185-1ba5-c69005e46d3f@colorfullife.com> (Manfred Spraul's message of "Tue, 4 Jan 2022 21:48:54 +0100") Message-ID: <87zgob87si.fsf@email.froward.int.ebiederm.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1n4s7x-00755i-Pa;;;mid=<87zgob87si.fsf@email.froward.int.ebiederm.org>;;;hst=in01.mta.xmission.com;;;ip=68.110.24.146;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX18vqjrRwY29fdIl10fLcjcB81RO3hYYVt8= X-SA-Exim-Connect-IP: 68.110.24.146 X-SA-Exim-Mail-From: ebiederm@xmission.com Subject: Re: [PATCH v2] ipc: Store mqueue sysctls in the ipc namespace X-SA-Exim-Version: 4.2.1 (built Sat, 08 Feb 2020 21:53:50 +0000) X-SA-Exim-Scanned: Yes (on in01.mta.xmission.com) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Manfred Spraul writes: > Hello Eric, > > On 1/4/22 19:42, Eric W. Biederman wrote: >> Manfred Spraul writes: >> Hi Alexey, >>> On 1/4/22 12:51, Alexey Gladkov wrote: >>>> Right now, the mqueue sysctls take ipc namespaces into account in a >>>> rather hacky way. This works in most cases, but does not respect the >>>> user namespace. >>>> >>>> Within the user namespace, the user cannot change the /proc/sys/fs/mqueue/* >>>> parametres. This poses a problem in the rootless containers. >>>> >>>> To solve this I changed the implementation of the mqueue sysctls just >>>> like some other sysctls. >>>> >>>> Before this change: >>>> >>>> $ echo 5 | unshare -r -U -i tee /proc/sys/fs/mqueue/msg_max >>>> tee: /proc/sys/fs/mqueue/msg_max: Permission denied >>>> 5 >>> Could you crosscheck that all (relevant) allocations in ipc/mqueue.c >>> use GFP_KERNEL_ACCOUNT? >> They are not. >> >>> We should not allow normal users to use up all memory. >>> >>> Otherwise: >>> The idea is good, the limits do not really prevent using up all >>> memory, _ACCOUNT is the better approach. >>> And with _ACCOUNT, it doesn't hurt that the namespace root is able to >>> set limits. >> Saying the cgroup kernel memory limit is the only thing that works, and >> that is always better is silly. >> >> >> First the cgroup kernel memory limits noted with ACCOUNT are not >> acceptable on several kernel hot paths because they are so expensive. > > I was not aware that ACCOUNT allocations are very expensive. I think it is a bit like refcount_t vs atomic_t. Usually it doesn't matter but there are cases where it does, and so you need to be aware before adding ACCOUNT. > OTHO adding ACCOUNT resolved various out of memory crashes for IIRC ipc/sem.c > and/or ipc/msg.c. But we also do not have an RLIMIT for ipc/sem.c or ipc/msg.c > > Let me rephrase my question: > > When we allow non-root users to write to /proc/sys/fs/mqueue/msg_max, are there > any _relevant_ allocations that bypass _all_ limits? With respect to writing msg_max. All of the allocations where the sysctl value is concerned go through mqueue_get_inode. The interesting call is mq_open. Where attributes can be specified today that override the sysctl. And nothing can override HARD_MSGMAX and HARD_MSGSIZEMAX. So I don't think we introduce anything new if we allow userspace to write to the limits. > As you write, we have RLIMIT_MSGQUEUE. > > And several allocations for ipc/mqueue already use ACCOUNT: > > - the messages themselves, via load_msg()/alloc_msg(). > > - the inodes, via mqueue_inode_cachep(). Yes. I don't think ACCOUNT is evil or bad, just different. >> Further the memory cgroup kernel memory limit is not always delegated to >> non-root users, which precludes using the memory cgroup kernel memory >> limit in many situations. >> >> >> The RLIMIT_MQUEUE limit definitely works, and as I read the kernel >> source correct it defaults to MQ_BYTES_MAX aka 819200. A limit of >> 800KiB should prevent using up all of system memory, except on very low >> memory machines. > > I'd agree that 800 kB is not relevant. But we need to be certain that there are > no loopholes. > > I do not see anything relevant, e.g. 0-byte messages should be covered by > mq_maxmsg. But perhaps I overlook something. > >> So please let's not confuse apples and oranges, and let's use the tools >> in the kernel where they work, and not set them up in contest with each >> other. >> >> Rlimits with generous but real limits in general are good at catching >> when a program misbehaves. The cgroups are better at setting a total >> memory cap. In this case the rlimit cap is low enough it simply should >> not matter. >> >> What has been fixed with the ucount rlimits is that (baring >> implementation bugs) it is now not possible to create a user namespace >> and escape your rlimits by using multiple users. > I'll try to check the patch in detail in the next few days. Thank you. It is always useful to look over things closely and ensure that unprivileged users can't do something unfortunate, before we find them doing something unfortunate. So far I have just skimmed the patch but from 10,000 feet it looks good. Eric