Subject: Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD
To: Waiman Long, Matthew Wilcox
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Alexander Viro,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, Davidlohr Bueso
References: <20190911150537.19527-1-longman@redhat.com>
    <20190911150537.19527-6-longman@redhat.com>
    <20190911151451.GH29434@bombadil.infradead.org>
    <19d9ea18-bd20-e02f-c1de-70e7322f5f22@redhat.com>
From: Mike Kravetz
Message-ID: <40a511a4-5771-f9a9-40b6-64e39478bbcb@oracle.com>
Date: Wed, 11 Sep 2019 10:03:16 -0700
In-Reply-To: <19d9ea18-bd20-e02f-c1de-70e7322f5f22@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 9/11/19 8:44 AM, Waiman Long wrote:
> On 9/11/19 4:14 PM, Matthew Wilcox wrote:
>> On Wed, Sep 11, 2019 at 04:05:37PM +0100, Waiman Long wrote:
>>> When allocating a large amount of static hugepages (~500-1500GB) on a
>>> system with large number of CPUs (4, 8 or even 16 sockets), performance
>>> degradation (random multi-second delays) was observed when thousands
>>> of processes are trying to fault in the data into the huge pages. The
>>> likelihood of the delay increases with the number of sockets and hence
>>> the CPUs a system has. This only happens in the initial setup phase
>>> and will be gone after all the necessary data are faulted in.
>> Can't the application just specify MAP_POPULATE?
>
> Originally, I thought that this happened in the startup phase when the
> pages were faulted in. The problem persists after steady state had been
> reached though. Every time you have a new user process created, it will
> have its own page table.

This is still at fault time. Although, for the particular application it
may be after the 'startup phase'.

> It is the sharing of the huge page shared
> memory that is causing problem. Of course, it depends on how the
> application is written.

It may be the case that some applications would find the delays acceptable
for the benefit of shared pmds once they reach steady state. As you say, of
course this depends on how the application is written.

I know that Oracle DB would not like it if PMD sharing is disabled for them.
Based on what I know of their model, all processes which share PMDs perform
faults (write or read) during the startup phase. This is in environments as
big or bigger than you describe above. I have never looked at/for delays in
these environments around pmd sharing (page faults), but that does not mean
they do not exist. I will try to get the DB group to give me access to one
of their large environments for analysis.

We may want to consider making the timeout value and disable threshold user
configurable.
-- 
Mike Kravetz