References: <87k0lmryyp.fsf@disp2133> <56816a9d-c2e5-127d-4d90-5d7d17782c8a@virtuozzo.com>
In-Reply-To: <56816a9d-c2e5-127d-4d90-5d7d17782c8a@virtuozzo.com>
From: Shakeel Butt
Date: Tue, 20 Jul 2021 07:37:23 -0700
Subject: Re: [PATCH v5 13/16] memcg: enable accounting for signals
To: Vasily Averin
Cc: "Eric W. Biederman", Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe, Oleg Nesterov, LKML
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jul 20, 2021 at 1:35 AM Vasily Averin wrote:
>
> On 7/19/21 8:32 PM, Eric W. Biederman wrote:
> > Vasily Averin writes:
> >
> >> When a user sends a signal to any other process, it forces the kernel
> >> to allocate memory for 'struct sigqueue' objects. The number of signals
> >> is limited by the RLIMIT_SIGPENDING resource limit, but even the default
> >> settings allow each user to consume up to several megabytes of memory.
> >> Moreover, an untrusted admin inside a container can increase the limit or
> >> create new fake users and force them to send signals.
> >
> > Not any more. Currently the number of sigqueue objects is limited
> > by the rlimit of the creator of the user namespace of the container.
> >
> >> It makes sense to account for these allocations to restrict the host's
> >> memory consumption from inside the memcg-limited container.
> >
> > Does it? Why? The given justification appears to have bit-rotted
> > since -rc1.
>
> Could you please explain what was changed in rc1?
> From my POV, accounting is required to help the OOM killer select the proper target.
>
> > I know a lot of these things only really need a limit just to catch a
> > program that starts malfunctioning. If that is indeed the case,
> > reasonable per-resource limits are probably better than some great big
> > group limit that can be exhausted with any single resource in the group.
> >
> > Is there a reason I am not aware of where it makes sense to group
> > all of the resources together and only count the number of bytes
> > consumed?
>
> Any new limits:
> a) should be set properly, depending on a huge number of input parameters,
> b) should properly notify about hits,
> c) should be updated properly after b),
> d) should do a)-c) automatically if possible.
>
> In the past OpenVZ had its own accounting subsystem, user beancounters (UBC).
> It accounted and limited 20+ resources per container: numfiles, file locks,
> signals, netfilter rules, socket buffers and so on.
> I assume you want to do something similar, so let me share our experience.
>
> We had a lot of problems with UBC:
> - It's quite hard to set up the limits.
> Why is it fine to consume N entities of some resource but bad to consume N+1?
> Per-process? Per-user? Per-thread? Per-task? Per-namespace? What if nested? Per-container? Per-host?
> To answer these questions the host admin needs additional knowledge and skills.
>
> - OK, we have set all the limits. Some application hits one and fails.
> It's quite hard to understand that the application hit the limit and failed for this reason.
> From the user's point of view, if some application does not work (stably enough)
> inside a container => containers are guilty.
>
> - It's quite hard to understand that the failed application just wants limit X increased to N entities.
>
> As a result both host admins and container users were unhappy.
> So after years of such fights we decided to just limit accounted memory instead.
>
> Anyway, the OOM killer must know who consumed the memory to select the proper target.
>

Just to support Vasily's point further: for systems running multiple
workloads, it is much preferable to be able to set one limit for each
workload than to set many different limits. One concrete example is
described in commit ac7b79fd190b ("inotify, memcg: account inotify
instances to kmemcg"). Inotify instances can be limited through the
fs sysctl inotify/max_user_instances and further partitioned among users
through the per-user-namespace sysctl, but there is no sensible way to
set such a limit and partition it on a system that runs different workloads.
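
For a rough sense of the scale behind the "several megabytes" figure in the
quoted patch description, here is a minimal Python sketch, not taken from the
patch. It multiplies the caller's RLIMIT_SIGPENDING by an assumed per-object
size. The 80-byte value and the helper name worst_case_sigqueue_bytes are
illustrative assumptions; the real size of 'struct sigqueue' depends on the
kernel version and architecture, and RLIMIT_SIGPENDING is Linux-specific.

```python
import resource

# ASSUMPTION: placeholder size of one queued-signal object, in bytes.
# This number does not come from the thread; check your own kernel.
ASSUMED_SIGQUEUE_BYTES = 80


def worst_case_sigqueue_bytes(pending_limit, object_bytes=ASSUMED_SIGQUEUE_BYTES):
    """Upper bound on memory one user's pending signals could pin.

    Returns None when the limit is unlimited, since no finite bound exists.
    """
    if pending_limit == resource.RLIM_INFINITY:
        return None
    return pending_limit * object_bytes


def main():
    # Linux-only: query the soft and hard limits on queued signals.
    soft, hard = resource.getrlimit(resource.RLIMIT_SIGPENDING)
    for name, limit in (("soft", soft), ("hard", hard)):
        bound = worst_case_sigqueue_bytes(limit)
        if bound is None:
            print(f"{name}: unlimited, no finite bound")
        else:
            print(f"{name}: {limit} signals, about {bound / 2**20:.1f} MiB")


if __name__ == "__main__":
    main()
```

With memcg accounting, a bound like this would not need to be tuned as a
separate number: the same bytes would simply count against the workload's
single memory limit.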