Date: Tue, 31 May 2022 09:16:47 +0200
From: Michal Hocko
To: Vasily Averin
Cc: Andrew Morton, kernel@openvz.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Shakeel Butt, Roman Gushchin, Michal Koutný,
 Vlastimil Babka, Muchun Song, cgroups@vger.kernel.org
Subject: Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by
 mkdir cgroup
References: <06505918-3b8a-0ad5-5951-89ecb510138e@openvz.org>
 <3e1d6eab-57c7-ba3d-67e1-c45aa0dfa2ab@openvz.org>
 <3a1d8554-755f-7976-1e00-a0e7fb62c86e@openvz.org>
In-Reply-To: <3a1d8554-755f-7976-1e00-a0e7fb62c86e@openvz.org>

On Mon 30-05-22 22:58:30, Vasily Averin wrote:
> On 5/30/22 17:22, Michal Hocko wrote:
> > On Mon 30-05-22 16:09:00, Vasily Averin wrote:
> >> On 5/30/22 14:55, Michal Hocko wrote:
> >>> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
> >>>> Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
> >>>> 4-CPU VM running Fedora and a self-compiled upstream kernel. The
> >>>> numbers are not precise: they depend on kernel config options, the
> >>>> number of CPUs, and the enabled controllers, and they ignore
> >>>> possible page allocations etc. However, they are enough to clarify
> >>>> the general situation. All allocations are split into:
> >>>> - a common part, always performed for each cgroup type
> >>>> - per-cgroup allocations
> >>>>
> >>>> In each group we consider two corner cases:
> >>>> - usual allocations, important for 1-2 CPU nodes/VMs
> >>>> - percpu allocations, important for 'big irons'
> >>>>
> >>>> common part:  ~11Kb  +  318 bytes percpu
> >>>> memcg:        ~17Kb  + 4692 bytes percpu
> >>>> cpu:          ~2.5Kb + 1036 bytes percpu
> >>>> cpuset:       ~3Kb   +   12 bytes percpu
> >>>> blkcg:        ~3Kb   +   12 bytes percpu
> >>>> pid:          ~1.5Kb +   12 bytes percpu
> >>>> perf:         ~320b  +   60 bytes percpu
> >>>> -------------------------------------------
> >>>> total:        ~38Kb  + 6142 bytes percpu
> >>>> currently accounted:  4668 bytes percpu
> >>>>
> >>>> - It is important to account the usual allocations performed in the
> >>>>   common part, because almost all cgroup-specific allocations are
> >>>>   small. One exception here is the memory cgroup, which allocates a
> >>>>   few huge objects that should be accounted.
> >>>> - The percpu allocations performed in the common part and in the
> >>>>   memcg and cpu cgroups should be accounted; the rest are small and
> >>>>   can be ignored.
> >>>> - Kernfs objects are allocated both in the common part and in most
> >>>>   cgroups.
> >>>>
> >>>> Details can be found here:
> >>>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
> >>>>
> >>>> I checked the other cgroup types and found that they can all be
> >>>> ignored. Additionally, I found an allocation of struct rt_rq in the
> >>>> cpu cgroup when CONFIG_RT_GROUP_SCHED is enabled; it is a huge
> >>>> (~1700 bytes) percpu structure and should be accounted too.
> >>>
> >>> One thing that the changelog is missing is an explanation of why we
> >>> need to account those objects. Users are usually not empowered to
> >>> create cgroups arbitrarily. Or at least they shouldn't be, because
> >>> we can expect more problems to happen.
> >>>
> >>> Could you clarify this please?
> >>
> >> The problem is relevant for OS-level containers: LXC or OpenVZ. They
> >> are widely used for hosting and allow untrusted end-users to run
> >> containers. Root inside such a container is able to create cgroups
> >> inside its own container and consume host memory without proper
> >> accounting.
> >
> > Is the unaccounted memory really the biggest problem here?
> > IIRC having really huge cgroup trees can hurt quite some controllers.
> > E.g. how does the cpu controller deal with too many or too deep
> > hierarchies?
>
> Could you please describe this in more detail?
> Maybe it passed me by, or maybe I missed or forgot something, but I
> cannot recall any other practical cgroup-related issues.
>
> Maybe deep hierarchies do not work well, but I have not heard that the
> internal configuration of a cgroup can affect the upper levels too.
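(For reference, the "accounting" discussed in the quoted thread is the
standard memcg kmem mechanism: an allocation is charged to the memory
cgroup of the allocating task when __GFP_ACCOUNT is passed. A minimal
sketch of the pattern; the helper functions below are illustrative and
not taken from the actual patches:)

	#include <linux/slab.h>
	#include <linux/percpu.h>

	/*
	 * GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT: the object
	 * is charged to the memcg of the task doing the allocation
	 * (here, the task doing the mkdir).
	 */
	static void *alloc_cgroup_object(size_t size)
	{
		return kzalloc(size, GFP_KERNEL_ACCOUNT);
	}

	/* The percpu allocator accepts the same flag via its _gfp variants. */
	static u64 __percpu *alloc_cgroup_counters(void)
	{
		return alloc_percpu_gfp(u64, GFP_KERNEL_ACCOUNT);
	}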
My first thought was any controller with fixed math constraints, like
the cpu controller. But I have to admit that I haven't really checked
whether the imprecision can accumulate and propagate outside of the
hierarchy.

Another concern I would have is id space depletion. At least the memory
controller depends on idr ids, which have a rather limited space:

	#define MEM_CGROUP_ID_MAX	USHRT_MAX

Also, the runtime overhead would increase with a large number of
cgroups. Take global memory reclaim as an example: all the cgroups have
to be iterated, so this will have an impact outside of the said
hierarchy.
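(To illustrate both concerns, a simplified sketch along the lines of
mm/memcontrol.c and mm/vmscan.c; not verbatim kernel code:)

	/*
	 * Each memcg gets an id from a global idr capped at USHRT_MAX,
	 * so at most ~64k memory cgroups can exist at once, regardless
	 * of who created them.
	 */
	memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, 1,
				 MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);

	/*
	 * Global reclaim walks every memcg in the hierarchy via
	 * mem_cgroup_iter(), so its cost grows with the total number of
	 * cgroups, wherever in the tree they were created.
	 */
	memcg = mem_cgroup_iter(root, NULL, NULL);
	do {
		/* scan this memcg's LRU lists, shrink its slab caches */
	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));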
One could argue that limiting untrusted top-level cgroups would be a
certain mitigation, but I can imagine this could get very non-trivial
easily.

Anyway, let me just be explicit. I am not against these patches. In
fact, I cannot really judge their overhead. But right now I am not
really sure they are going to help much against untrusted users.
-- 
Michal Hocko
SUSE Labs