Date: Tue, 7 Sep 2021 10:44:32 +0200
From: Michal Hocko
To: Feng Tang
Cc: Andrew Morton, David Rientjes, Mel Gorman, Vlastimil Babka, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/page_alloc: detect allocation forbidden by cpuset and bail out early
References: <1631003150-96935-1-git-send-email-feng.tang@intel.com>
In-Reply-To: <1631003150-96935-1-git-send-email-feng.tang@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 07-09-21 16:25:50, Feng Tang wrote:
> There was report that starting an Ubuntu in docker while using cpuset
> to bind it to movlabe nodes (a node only has movable zone, like a node

s@movlabe@movable@

> for hotplug or a Persistent Memory node in normal usage) will fail
> due to memory allocation failure, and then OOM is involved and many
> other innocent processes got killed. It can be reproduced with command:
> $docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c
> "grep Mems_allowed /proc/self/status" (node 4 is a movable node)
>
> The reason is, in the case, the target cpuset nodes only have movable
> zone, while the creation of an OS in docker sometimes needs to allocate
> memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and
> the cpuset limit forbids the allocation, then out-of-memory killing is
> involved even when normal nodes and movable nodes both have many free
> memory.

It would be great to add an OOM report here as an example.

> The failure is reasonable, but still there is one problem, that when
> the usage fails as it's an mission impossible due to the cpuset limit,
> the allocation should just not trigger reclaim/compaction, and more
> importantly, not get any innocent process oom-killed.

I would reformulate to something like:
"
The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope. The only reasonable
measure to take is to fail the allocation right away and have the caller
deal with it.
"

> So add detection for cases like this in the slowpath of allocation,
> and bail out early returning NULL for the allocation.
>
> We've run some cases of malloc/mmap/page_fault/lru-shm/swap from
> will-it-scale and vm-scalability, and didn't see obvious performance
> change (all inside +/- 1%), test boxes are 2 socket Cascade Lake and
> Icelake servers.
>
> [thanks to Micho Hocko and David Rientjes for suggesting not handle
> it inside OOM code]

While this is a good fix from the functionality POV, I believe you can go
a step further. Please add a detection to the cpuset code and complain
to the kernel log if somebody tries to configure a movable-only cpuset.
Once you have that in place you can easily create a static branch for
cpuset_insane_setup() and have zero overhead for all reasonable
configurations. There shouldn't be any reason to pay a single CPU cycle
to check for something that almost nobody does.

What do you think?
-- 
Michal Hocko
SUSE Labs
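
The suggestion above (complain when a movable-only cpuset is configured,
then gate the slow-path check behind a static branch) can be modelled in
user space roughly as follows. This is only a sketch under stated
assumptions, not kernel code: every name in it (node_has_unmovable_zone,
movable_only_cpuset_present, cpuset_is_movable_only, cpuset_update_mems,
should_fail_early) is hypothetical, the node table is made up with node 4
as a movable-only node to match the reproducer, and a plain bool stands
in for a real static key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MAX_NODES 8

/* Made-up topology: nodes 0-3 have DMA/DMA32/Normal zones, nodes 4-7 are movable-only. */
static const bool node_has_unmovable_zone[MAX_NODES] = {
    true, true, true, true, false, false, false, false
};

/* Plain bool standing in for a static key; while false, the slow path does no extra work. */
static bool movable_only_cpuset_present = false;

/* True if mems_allowed (one bit per node) is non-empty and names only movable-only nodes. */
static bool cpuset_is_movable_only(unsigned int mems_allowed)
{
    assert(mems_allowed < (1u << MAX_NODES));
    if (mems_allowed == 0)
        return false;
    for (int node = 0; node < MAX_NODES; node++) {
        if ((mems_allowed & (1u << node)) && node_has_unmovable_zone[node])
            return false;
    }
    return true;
}

/* Configuration time: complain and enable the slow-path check. */
static void cpuset_update_mems(unsigned int mems_allowed)
{
    if (cpuset_is_movable_only(mems_allowed)) {
        fprintf(stderr, "cpuset: mems 0x%x contain only movable nodes\n", mems_allowed);
        movable_only_cpuset_present = true;
    }
}

/* Allocation slow path: true if the request can never succeed and should fail right away. */
static bool should_fail_early(unsigned int mems_allowed, bool needs_unmovable_zone)
{
    if (!movable_only_cpuset_present)
        return false;
    return needs_unmovable_zone && cpuset_is_movable_only(mems_allowed);
}
```

In this model the extra work in should_fail_early() is skipped entirely
until cpuset_update_mems() has seen a movable-only configuration, which
is the property the static branch is meant to provide for all reasonable
setups.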