Subject: Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD
To: Mike Kravetz, Matthew Wilcox
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Alexander Viro,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, Davidlohr Bueso
References: <20190911150537.19527-1-longman@redhat.com>
 <20190911150537.19527-6-longman@redhat.com>
 <20190911151451.GH29434@bombadil.infradead.org>
 <19d9ea18-bd20-e02f-c1de-70e7322f5f22@redhat.com>
 <40a511a4-5771-f9a9-40b6-64e39478bbcb@oracle.com>
From: Waiman Long
Organization: Red Hat
Message-ID: <5229662c-d709-7aca-be4c-53dea1a49fda@redhat.com>
Date: Wed, 11 Sep 2019 18:15:26 +0100
In-Reply-To: <40a511a4-5771-f9a9-40b6-64e39478bbcb@oracle.com>

On 9/11/19 6:03 PM, Mike Kravetz wrote:
> On 9/11/19 8:44 AM, Waiman Long wrote:
>> On 9/11/19 4:14 PM, Matthew Wilcox wrote:
>>> On Wed, Sep 11, 2019 at 04:05:37PM +0100, Waiman Long wrote:
>>>> When allocating a large number of static hugepages (~500-1500GB) on a
>>>> system with a large number of CPUs (4, 8 or even 16 sockets),
>>>> performance degradation (random multi-second delays) was observed when
>>>> thousands of processes were trying to fault the data into the huge
>>>> pages. The likelihood of the delay increases with the number of
>>>> sockets, and hence the number of CPUs, a system has. This only happens
>>>> in the initial setup phase and goes away once all the necessary data
>>>> have been faulted in.
>>> Can't the application just specify MAP_POPULATE?
>> Originally, I thought that this happened in the startup phase when the
>> pages were faulted in. The problem persists after steady state has been
>> reached, though. Every time a new user process is created, it gets its
>> own page table.
> This is still at fault time. Although, for the particular application, it
> may be after the 'startup phase'.
>
>> It is the sharing of the huge-page shared memory that is causing the
>> problem. Of course, it depends on how the application is written.
> It may be the case that some applications would find the delays acceptable
> for the benefit of shared PMDs once they reach steady state. As you say, of
> course, this depends on how the application is written.
>
> I know that Oracle DB would not like it if PMD sharing were disabled for
> it. Based on what I know of their model, all processes which share PMDs
> perform faults (write or read) during the startup phase. This is in
> environments as big as or bigger than the ones you describe above. I have
> never looked at/for delays around PMD sharing (page faults) in these
> environments, but that does not mean they do not exist. I will try to get
> the DB group to give me access to one of their large environments for
> analysis.
>
> We may want to consider making the timeout value and disable threshold user
> configurable.

Making them configurable is certainly doable. They can be sysctl parameters
so that users can re-enable PMD sharing by making those parameters larger.
A rough sketch of what that could look like follows below my signature.

Cheers,
Longman
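
P.S. To make the sysctl idea concrete, here is an untested sketch of what
the two tunables could look like under /proc/sys/vm/. All identifiers and
default values below are made up for illustration; this is not code from
the posted patch.

#include <linux/init.h>
#include <linux/sysctl.h>

/* Hypothetical knobs consulted by the PMD-sharing slow path. */
int sysctl_hugetlb_pmd_share_timeout_ms = 10;      /* lock wait limit */
int sysctl_hugetlb_pmd_share_disable_thresh = 256; /* timeouts tolerated
                                                      before sharing is
                                                      disabled */

static struct ctl_table hugetlb_pmd_share_table[] = {
	{
		.procname	= "hugetlb_pmd_share_timeout_ms",
		.data		= &sysctl_hugetlb_pmd_share_timeout_ms,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,	/* no negative timeouts */
	},
	{
		.procname	= "hugetlb_pmd_share_disable_thresh",
		.data		= &sysctl_hugetlb_pmd_share_disable_thresh,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
	},
	{ }
};

static int __init hugetlb_pmd_share_sysctl_init(void)
{
	register_sysctl("vm", hugetlb_pmd_share_table);
	return 0;
}
late_initcall(hugetlb_pmd_share_sysctl_init);

An admin could then raise the limits (or effectively keep PMD sharing on)
with, e.g., "sysctl -w vm.hugetlb_pmd_share_timeout_ms=1000".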
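
P.P.S. For anyone following along, Matthew's MAP_POPULATE suggestion above
means asking the kernel to prefault the whole mapping at mmap() time so the
cost is paid once up front. A minimal user-space illustration (untested;
the hugetlbfs path is made up and error handling is trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 16UL << 30;	/* e.g. 16 GB of huge pages */
	int fd = open("/dev/hugepages/shmfile", O_CREAT | O_RDWR, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * MAP_POPULATE faults the whole range in during mmap() instead of
	 * on first touch.  As the thread above notes, this only moves the
	 * cost to mmap() time; each newly created process still has to
	 * populate its own page tables for the shared memory, so it does
	 * not remove the steady-state contention being discussed.
	 */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_POPULATE, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ... use the mapping ... */

	munmap(addr, len);
	close(fd);
	return 0;
}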