Subject: Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
 Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
 Tetsuo Handa, cgroups@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel@openvz.org
From: Vasily Averin
Date: Wed, 20 Oct 2021 18:46:56 +0300
On 20.10.2021 16:02, Michal Hocko wrote:
> On Wed 20-10-21 15:14:27, Vasily Averin wrote:
>> mem_cgroup_oom() can fail if current task was marked unkillable
>> and oom killer cannot find any victim.
>>
>> Currently we force memcg charge for such allocations,
>> however it allow memcg-limited userspace task in to overuse assigned limits
>> and potentially trigger the global memory shortage.
>
> You should really go into more details whether that is a practical
> problem to handle. OOM_FAILED means that the memcg oom killer couldn't
> find any oom victim so it cannot help with a forward progress. There are
> not that many situations when that can happen. Naming that would be
> really useful.

I pointed it out above: "if current task was marked unkillable and
oom killer cannot find any victim."

This may happen when the current task cannot be oom-killed because it
is marked unkillable, i.e. it has p->signal->oom_score_adj ==
OOM_SCORE_ADJ_MIN, and the other processes in the memcg are either
dying, are kernel threads, or are marked unkillable in the same way.
It can also happen when the memcg contains this process only.

If we always approve this kind of allocation, it can be misused: a
process can mmap a lot of memory and then touch it, generating page
faults and overcharged memory allocations. Eventually it can consume
all of the node's memory and trigger a global memory shortage on the
host.

>> Let's fail the memory charge in such cases.
>>
>> This failure should be somehow recognised in #PF context,
>
> explain why

When #PF cannot allocate memory (due to the reason described above),
handle_mm_fault() returns VM_FAULT_OOM, and then its caller executes
pagefault_out_of_memory().
If pagefault_out_of_memory() cannot recognize the real reason for the
failure, it assumes a global memory shortage and executes the global
out_of_memory(), which can kill a random process, or even crash the
node if sysctl vm.panic_on_oom is set to 1.

Currently pagefault_out_of_memory() knows about a possible async
memcg OOM and handles it correctly. However, it is not aware that
memcg can reject some other allocations, so it does not recognize the
fault as memcg-related and allows the global OOM to run.

>> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
>>
>> ToDo: what is the best way to notify pagefault_out_of_memory() about
>> mem_cgroup_out_of_memory failure ?
>
> why don't you simply remove out_of_memory from pagefault_out_of_memory
> and leave it only with the blocking memcg OOM handling? Wouldn't that be a
> more generic solution? Your first patch already goes that way partially.

I clearly understand that the global out_of_memory() should not be
triggered by memcg restrictions. I also understand that a dying task
will release some memory soon, so we need not run the global OOM if
the current task is dying. However, I'm not sure that I can remove
out_of_memory() entirely; at least I do not have good arguments for
doing so.

> This change is more risky than the first one. If somebody returns
> VM_FAULT_OOM without invoking allocator then it can loop for ever but
> invoking OOM killer in such a situation is equally wrong as the oom
> killer cannot really help, right?

I'm not ready to comment on this right now; let me take some time to
think about it.

Thank you,
	Vasily Averin