From: Shakeel Butt
Date: Tue, 14 Jul 2020 08:32:09 -0700
Subject: Re: [PATCH] mm: memcontrol: avoid workload stalls when lowering memory.high
To: Michal Hocko
Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Linux MM, Kernel Team,
    LKML, Domas Mituzas, Tejun Heo, Chris Down
In-Reply-To: <20200714084123.GG24642@dhcp22.suse.cz>
References: <20200709194718.189231-1-guro@fb.com>
    <20200710122917.GB3022@dhcp22.suse.cz>
    <20200710184205.GB350256@carbon.dhcp.thefacebook.com>
    <20200714084123.GG24642@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jul 14, 2020 at 1:41 AM Michal Hocko wrote:
>
> On Fri 10-07-20 12:19:37, Shakeel Butt wrote:
> > On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin wrote:
> > >
> > > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko wrote:
> > > > >
> > > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > > The memory.high limit is implemented in a way such that the
> > > > > > kernel penalizes all threads which are allocating memory over
> > > > > > the limit. Forcing all threads into synchronous reclaim and
> > > > > > adding artificial delays slows down memory consumption and
> > > > > > potentially gives userspace oom handlers/resource control
> > > > > > agents some time to react.
> > > > > >
> > > > > > This works nicely if memory usage is hitting the limit from
> > > > > > below, but it works sub-optimally if a user adjusts memory.high
> > > > > > to a value way below the current memory usage. It basically
> > > > > > forces all workload threads (doing any memory allocations) into
> > > > > > synchronous reclaim and sleep. This makes the workload
> > > > > > completely unresponsive for a long period of time and can also
> > > > > > lead to system-wide contention on LRU locks. It can happen even
> > > > > > if the workload is not actually tight on memory and has, for
> > > > > > example, a ton of cold pagecache.
> > > > > >
> > > > > > In the current implementation, writing to memory.high causes an
> > > > > > atomic update of the page counter's high value, followed by an
> > > > > > attempt to reclaim enough memory to fit into the new limit. To
> > > > > > fix the problem described above, all we need is to change the
> > > > > > order of execution: try to push the memory usage under the
> > > > > > limit first, and only then set the new high limit.
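A minimal sketch of the reordering the commit message describes, assuming
a simplified version of the memory.high write handler in mm/memcontrol.c
(not the verbatim patch; the retry logic is abbreviated, and helper names
such as drain_all_stock() and MEM_CGROUP_RECLAIM_RETRIES follow the
kernel of that era):

  static ssize_t memory_high_write(struct kernfs_open_file *of,
                                   char *buf, size_t nbytes, loff_t off)
  {
          struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
          unsigned int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
          bool drained = false;
          unsigned long high;
          int err;

          buf = strstrip(buf);
          err = page_counter_memparse(buf, "max", &high);
          if (err)
                  return err;

          /*
           * Old order: the new limit was published right here, so every
           * allocating thread immediately saw usage far above memory.high
           * and went into synchronous reclaim and artificial sleeps:
           *
           *      page_counter_set_high(&memcg->memory, high);
           */

          for (;;) {
                  unsigned long nr_pages = page_counter_read(&memcg->memory);

                  if (nr_pages <= high || signal_pending(current))
                          break;

                  if (!drained) {
                          /* flush per-cpu charge caches before reclaiming */
                          drain_all_stock(memcg);
                          drained = true;
                          continue;
                  }

                  if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                                    GFP_KERNEL, true) &&
                      !nr_retries--)
                          break;
          }

          /* New order: expose the lowered limit only after the writer has
           * pushed usage under it, so the workload is not stalled. */
          page_counter_set_high(&memcg->memory, high);

          return nbytes;
  }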
> > > > >
> > > > > Shakeel, would this help with your pro-active reclaim use case? It
> > > > > would require resetting the high limit right after the reclaim
> > > > > returns, which is quite ugly, but it would at least not require a
> > > > > completely new interface. You would simply do
> > > > >
> > > > >     high = current - to_reclaim
> > > > >     echo $high > memory.high
> > > > >     echo infinity > memory.high # to prevent direct reclaim
> > > > >                                 # allocation stalls
> > > >
> > > > This will reduce the chance of stalls, but the interface is still
> > > > non-delegatable, i.e. applications cannot change their own
> > > > memory.high for use cases like application-controlled proactive
> > > > reclaim and uswapd.
> > >
> > > Can you, please, elaborate a bit more on this? I didn't understand
> > > why.
> >
> > Sure. Do we want memory.high to be a CFTYPE_NS_DELEGATABLE file? I
> > don't think so; otherwise any job on the system could change its own
> > memory.high and adversely impact the isolation and memory scheduling
> > of the system.
>
> Is this really the case? There should always be a parent cgroup that
> overrides the setting. Can you explain a bit more?

I don't see any requirement for a layer of cgroup between the root and
the job cgroup. Internally we schedule jobs as top-level cgroups. There
do exist jobs which are combinations of other jobs, and there we do use
an additional layer of cgroup (similar to pods running multiple
containers in Kubernetes). Surely we could add a layer for all jobs,
but it comes with an overhead, and at scale that overhead is not
negligible.

> Also you can always set the hard limit if you do not want to add
> another layer of cgroup in the hierarchy before delegation. Or am I
> missing something?

Yes, we can set memory.max, though it has different oom semantics and
is not really a replacement for memory.high.
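For reference, Michal's suggested sequence above as a self-contained
userspace sketch in C (the cgroup path and the 64 MiB reclaim target are
illustrative assumptions; note that cgroup v2 spells the "infinity"
value as "max"). The first write blocks in the synchronous reclaim
discussed in the thread until usage fits the temporary limit; the second
write lifts the limit again so allocating threads stop stalling:

  #include <stdio.h>
  #include <stdlib.h>

  /* Illustrative cgroup v2 path; adjust to the target cgroup. */
  #define CG "/sys/fs/cgroup/job"

  static unsigned long long read_counter(const char *path)
  {
          unsigned long long v = 0;
          FILE *f = fopen(path, "r");

          if (!f || fscanf(f, "%llu", &v) != 1) {
                  perror(path);
                  exit(1);
          }
          fclose(f);
          return v;
  }

  static void write_limit(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          /* The write may be flushed at fclose(); check its result too. */
          if (!f || fprintf(f, "%s", val) < 0 || fclose(f) != 0) {
                  perror(path);
                  exit(1);
          }
  }

  int main(void)
  {
          unsigned long long to_reclaim = 64ULL << 20; /* 64 MiB target */
          unsigned long long usage = read_counter(CG "/memory.current");
          char buf[32];

          if (usage <= to_reclaim)
                  return 0;

          /* high = current - to_reclaim: this write itself performs the
           * synchronous reclaim down to the temporary limit. */
          snprintf(buf, sizeof(buf), "%llu", usage - to_reclaim);
          write_limit(CG "/memory.high", buf);

          /* Reset right away so the workload does not keep hitting
           * direct-reclaim stalls against the lowered limit. */
          write_limit(CG "/memory.high", "max");
          return 0;
  }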