Subject: Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Anshuman Khandual
To: Vlastimil Babka, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Li Zefan, Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, "Kirill A. Shutemov"
Date: Thu, 13 Apr 2017 11:12:16 +0530
In-Reply-To: <20170411140609.3787-2-vbabka@suse.cz>
Message-Id: <95469f35-56e9-7dc4-b7fd-a3e8c25bdff3@linux.vnet.ibm.com>

On 04/11/2017 07:36 PM, Vlastimil Babka wrote:
> Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with
> cpuset mems update") has fixed known recent
> regressions found by LTP's cpuset01 testcase. I have however found that by
> modifying the testcase to use per-vma mempolicies via mbind(2) instead of
> per-task mempolicies via set_mempolicy(2), the premature OOM still happens
> and the issue is much older.

Meanwhile, while we are discussing this RFC, would it be better to WARN in
these situations where we don't have any node in the intersection, hence no
usable zone during allocation? That might give the user a hint before a
premature OOM or allocation failure occurs.

>
> The root of the problem is that the cpuset's mems_allowed and mempolicy's
> nodemask can temporarily have no intersection, thus get_page_from_freelist()
> cannot find any usable zone. The current semantic for empty intersection is
> to ignore mempolicy's nodemask and honour cpuset restrictions. This is
> checked in node_zonelist(), but the racy update can happen after we already
> passed the check. Such races should be protected by the seqlock
> task->mems_allowed_seq, but it doesn't work here, because 1) mpol_rebind_mm()
> does not happen under seqlock for write, and doing so would lead to deadlock,
> as it takes mmap_sem for write, while the allocation can have mmap_sem for
> read when it's taking the seqlock for read. And 2) the seqlock cookie of
> callers of node_zonelist() (alloc_pages_vma() and alloc_pages_current()) is
> different than the one of __alloc_pages_slowpath(), so there's still a
> potential race window.
>
> This patch fixes the issue by having __alloc_pages_slowpath() check for
> empty intersection of cpuset and ac->nodemask before OOM or allocation
> failure. If it's indeed empty, the nodemask is ignored and allocation
> retried, which mimics node_zonelist(). This works fine, because almost all
> callers of __alloc_pages_nodemask are obtaining the nodemask via
> node_zonelist().
> The only exception is new_node_page() from hotplug, where the potential
> violation of nodemask isn't an issue, as there's already a fallback
> allocation attempt without any nodemask. If there's a future caller that
> needs to have its specific nodemask honoured over task's cpuset
> restrictions, we'll have to e.g. add a gfp flag for that.

Did you really mean node_zonelist() in both instances above? That function
just picks either the ZONELIST_FALLBACK or ZONELIST_NOFALLBACK zonelist
depending on the GFP flags passed with the allocation request; it does not
deal with ignoring the passed nodemask.
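For discussion's sake, the empty-intersection fallback the patch describes can
be sketched as a small standalone program. This is an illustrative
simplification, not the kernel's implementation: a nodemask is modelled as a
plain bitmask, and nodemask_sim_t, masks_intersect() and effective_mask() are
hypothetical names (the kernel uses nodemask_t and bitmap helpers):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for nodemask_t: one bit per NUMA node. */
typedef unsigned long nodemask_sim_t;

/* The check the patch adds before declaring OOM: do the cpuset's
 * mems_allowed and the allocation's nodemask still intersect? */
static bool masks_intersect(nodemask_sim_t cpuset_mems, nodemask_sim_t nodemask)
{
    return (cpuset_mems & nodemask) != 0;
}

/* If the intersection is empty (a transient state while a cpuset update
 * races with the allocation), ignore the nodemask and retry against the
 * cpuset restrictions alone, mimicking node_zonelist()'s semantics for
 * empty intersection instead of failing prematurely. */
static nodemask_sim_t effective_mask(nodemask_sim_t cpuset_mems, nodemask_sim_t nodemask)
{
    if (!masks_intersect(cpuset_mems, nodemask))
        return cpuset_mems;             /* nodemask ignored, retry */
    return cpuset_mems & nodemask;
}
```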
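To make the question concrete: the selection node_zonelist() actually performs
can be sketched as below. The SIM_* names are simplified stand-ins for the
kernel's __GFP_THISNODE flag and the ZONELIST_FALLBACK / ZONELIST_NOFALLBACK
indices, not the real definitions:

```c
/* Simplified stand-ins for kernel constants (illustrative only). */
#define SIM_GFP_THISNODE 0x1u
enum { SIM_ZONELIST_FALLBACK = 0, SIM_ZONELIST_NOFALLBACK = 1 };

/* node_zonelist() only chooses which of the node's zonelists to walk,
 * based on the GFP flags; it does not itself drop or honour a nodemask. */
static int sim_gfp_zonelist(unsigned int gfp_flags)
{
    return (gfp_flags & SIM_GFP_THISNODE) ? SIM_ZONELIST_NOFALLBACK
                                          : SIM_ZONELIST_FALLBACK;
}
```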