From: Zi Yan
To:
Cc: Michal Hocko, Mel Gorman, Matthew Wilcox, Andrew Morton, "Kirill A. Shutemov",
Shutemov" , Hugh Dickins , Mike Kravetz , Anshuman Khandual , John Hubbard , Mark Hairgrove , Nitin Gupta , David Nellans Subject: [LSF/MM TOPIC] Generating physically contiguous memory Date: Fri, 15 Feb 2019 14:20:37 -0800 X-Mailer: MailMate (1.12.4r5594) Message-ID: MIME-Version: 1.0 X-Originating-IP: [10.124.1.5] X-ClientProxiedBy: HQMAIL101.nvidia.com (172.20.187.10) To HQMAIL101.nvidia.com (172.20.187.10) Content-Type: text/plain; format=flowed Content-Transfer-Encoding: quoted-printable DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1550269243; bh=PVlYGjyWaFVWJioBZgBkG+3PLaDi4XJfqvkkL2DVWxY=; h=X-PGP-Universal:From:To:CC:Subject:Date:X-Mailer:Message-ID: MIME-Version:X-Originating-IP:X-ClientProxiedBy:Content-Type: Content-Transfer-Encoding; b=ogWR0BCuJXewZzQEbJpYlL6oQOPHs3TDFPbVoaDaQfS985aojrx16wa56IBizh27Q W650rtKEZTBrMGR3NmV59/2JI9QFULJ8lNlkvmq/GvvhmDe5oMNGjiEsk/Gs0B4xcB 0tvVG0B1FBs/z0ynvpN1lT1F8jz2MPYNDa4qM+hFaGG0ZiQuCkq+aiJz8i3WerXaJM 91rderrFOSmUTTdB9kXVGlKcGeqsseaHFRhyP49yPEzT4LNVH2Jm1v1KuAFrna9S/4 7bepyHG4HtcPv0/PbMEYRmU3+He3k/+6BVAM/2SOB+KTg9g0NTwahvWZUi0Dz/gvVg C3Fr5BaFr5tgQ== Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The Problem ---- Large pages and physically contiguous memory are important to devices, = such as GPUs, FPGAs, NICs and RDMA controllers, because they can often = reduce address translation overheads and hence achieve better = performance when operating on large pages (2MB and beyond). The same can = be said of CPU performance, of course, but there is an important = difference: GPUs and high-throughput devices often take a more severe = performance hit, in the event of a TLB miss, as compared to a CPU, = because larger volume of in-flight work is stalled due to the TLB miss = and the induced page table walks. The effect is sufficiently large that = such devices *really* want highly reliable ways to allocate large pages = to minimize TLB misses and reduce the duration of page table walks. Due to the lack of flexibility, Approaches using memory reservation at = boot time (such as hugetlbfs) are a compromise that would be nice to = avoid. THPs, in general, seems to be a proper way to go because it is = transparent to userspace and provides large pages, but it is not perfect = yet. The community is still working on it since 1) THP size is limited = by the page allocation system and 2) THP creation requires a lot of = effort (e.g., memory compaction and page reclamation on the critical = path of page allocations). Possible solutions ---- 1. I recently posted an RFC [1] about actively generating physically = contiguous memory from in-use pages after page allocation. This RFC = moves pages around and make them physically contiguous when possible. It = is different from existing approaches, since it does not rely on page = allocation. On the other hand, this approach is still affected by = non-moveable pages scattered across the memory, which is highly related = but orthogonal and one of whose possible solutions is proposed by Mel = Gorman recently [2]. 2. THPs could be a solution as it provide large pages. THP avoids memory = reservation at boot time, but to meet the needs, i.e., a lot of large = pages, of some of these high-throughput accelerators, we need to make it = easier to produce large pages, namely increasing the successful rate of = allocating THPs and decreasing the overheads of allocating them. Mel = Gorman has posted a related patchset [3]. 
3. A more restricted but more reliable way might be to use libhugetlbfs.
It reserves memory dedicated to large page allocations and hence requires
less effort to obtain large pages. It also supports page sizes larger than
2MB, which further reduces address translation overheads. But AFAIK device
drivers are not able to directly grab large pages from libhugetlbfs, which
is something devices want.

4. Recently Matthew Wilcox mentioned that his XArray is going to support
arbitrary-sized pages [4], which would help maintain physically contiguous
ranges once they are created (e.g., by my RFC). Once my RFC generates
physically contiguous memory, XArrays would maintain the page size and
prevent reclaim/compaction from breaking the ranges apart. Arbitrary-sized
pages can still be beneficial to devices when pages larger than 2MB become
very difficult to get.

Feel free to provide your comments. Thanks.

[1] https://lore.kernel.org/lkml/20190215220856.29749-1-zi.yan@sent.com/
[2] https://lore.kernel.org/lkml/20181123114528.28802-1-mgorman@techsingularity.net/
[3] https://lore.kernel.org/lkml/20190118175136.31341-1-mgorman@techsingularity.net/
[4] https://lore.kernel.org/lkml/20190208042448.GB21860@bombadil.infradead.org/

--
Best Regards,
Yan Zi