Date: Thu, 25 Jan 2018 21:13:03 +0000
From: Mel Gorman
To: Nitin Gupta
Cc: Zi Yan, Michal Hocko, Nitin Gupta, steven.sistare@oracle.com,
    Andrew Morton, Ingo Molnar, Nadav Amit, Minchan Kim,
    "Kirill A. Shutemov", Peter Zijlstra, Vegard Nossum,
    "Levin, Alexander", Mike Rapoport, Hillf Danton, Shaohua Li,
    Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel,
    Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler,
    Hugh Dickins, Tobin C Harding, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP
Message-ID: <20180125211303.rbfeg7ultwr6hpd3@suse.de>
References: <1516318444-30868-1-git-send-email-nitingupta910@gmail.com>
    <20180119124957.GA6584@dhcp22.suse.cz>
    <59F98618-C49F-48A8-BCA1-A8F717888BAA@cs.rutgers.edu>
    <4d7ce874-9771-ad5f-c064-52a46fc37689@oracle.com>
In-Reply-To: <4d7ce874-9771-ad5f-c064-52a46fc37689@oracle.com>

On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> >> It's not really about memory scarcity but a more efficient use of it.
> >> Applications may want hugepage benefits without requiring any changes to
> >> app code which is what THP is supposed to provide, while still avoiding
> >> memory bloat.
> >>
> > I read these links and find that there are mainly two complaints:
> > 1. THP causes latency spikes, because direct compaction slows down
> >    THP allocation,
> > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to
> >    return memory ranges smaller than THP size and fails because of THP.
> >
> > The first complaint is not related to this patch.
>
> I'm trying to address many different THP issues and memory bloat is
> first among them.

Expecting userspace to get this right is probably going to go sideways.
It'll be screwed up and be sub-optimal or have odd semantics for existing
madvise flags.
The fact is that an application may not even know in advance if it's
going to be sparsely using memory if it's a computational load modelling
from unknown input data.

I suggest you read the old Talluri paper "Surpassing the TLB Performance
of Superpages with Less Operating System Support" and pay attention to
Section 4. There it discusses a page reservation scheme whereby on fault
a naturally aligned set of base pages is reserved and only one correctly
placed base page is inserted at the faulting address. It was tied into a
hypothetical piece of hardware that doesn't exist to give best-effort
support for superpages so it does not directly help you but the initial
idea is sound. There are holes in the paper from today's perspective but
it was written in the 90s.

From there, read "Transparent operating system support for superpages"
by Navarro, particularly chapter 4, paying attention to the parts where
it talks about opportunism and promotion threshold.

Superficially, it goes like this

1. On fault, reserve a THP in the allocator and use one base page that
   is correctly-aligned for the faulting address. By correctly-aligned,
   I mean that you use the base page whose offset would be naturally
   contiguous if it ever was part of a huge page.
2. On subsequent faults, attempt to use a base page that is naturally
   aligned to be a THP
3. When a "threshold" of base pages is inserted, allocate the remaining
   pages and promote it to a THP
4. If there is memory pressure, spill "reserved" pages into the main
   allocation pool and lose the opportunity to promote (which will need
   khugepaged to recover)

By definition, a promotion threshold of 1 would be the existing scheme
of allocating a THP on the first fault and some users will want that. It
also should be the default to avoid unexpected overhead. For workloads
where memory is being sparsely addressed and the increased overhead of
THP is unwelcome then the threshold should be tuned higher with a maximum
possible value of HPAGE_PMD_NR.
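
As a rough illustration of steps 1-4 above, here is a userspace toy model
in Python. It is only a sketch under stated assumptions, not kernel code
and not taken from either paper: `Region`, `Reservation`, `fault`,
`memory_pressure` and `slot_in_huge_page` are hypothetical names, and the
4 KiB base page and 2 MiB huge page sizes assume x86-64.

```python
from dataclasses import dataclass, field

# Assumed x86-64 geometry: 4 KiB base pages, 2 MiB PMD-sized huge pages.
PAGE_SHIFT = 12
HPAGE_PMD_SHIFT = 21
HPAGE_PMD_NR = 1 << (HPAGE_PMD_SHIFT - PAGE_SHIFT)  # 512 base pages


def slot_in_huge_page(addr):
    """Index of the base page that addr would occupy inside its huge page."""
    return (addr >> PAGE_SHIFT) & (HPAGE_PMD_NR - 1)


@dataclass
class Reservation:
    """Hypothetical record of one reserved, naturally aligned huge page."""
    mapped: set = field(default_factory=set)  # slots already inserted
    promoted: bool = False


class Region:
    """Toy model of reservation-based promotion with a tunable threshold."""

    def __init__(self, threshold=1):
        if not 1 <= threshold <= HPAGE_PMD_NR:
            raise ValueError("threshold must be in 1..HPAGE_PMD_NR")
        self.threshold = threshold
        self.reservations = {}  # huge page index -> Reservation
        self.spilled = set()    # huge page indices that lost their reservation

    def fault(self, addr):
        hidx = addr >> HPAGE_PMD_SHIFT
        if hidx in self.spilled:
            # Reservation was given up; only khugepaged could recover this.
            return "base"
        # Step 1: the first fault in the range creates the reservation.
        res = self.reservations.setdefault(hidx, Reservation())
        if res.promoted:
            return "huge"
        # Steps 1 and 2: insert the correctly-aligned base page.
        res.mapped.add(slot_in_huge_page(addr))
        # Step 3: promote once enough base pages have been inserted.
        if len(res.mapped) >= self.threshold:
            res.mapped = set(range(HPAGE_PMD_NR))
            res.promoted = True
            return "promoted"
        return "reserved"

    def memory_pressure(self):
        """Step 4: spill unpromoted reservations, return base pages released."""
        released = 0
        for hidx, res in list(self.reservations.items()):
            if not res.promoted:
                released += HPAGE_PMD_NR - len(res.mapped)
                self.spilled.add(hidx)
                del self.reservations[hidx]
        return released
```

With the threshold left at 1 the first fault promotes immediately, which
is the existing behaviour. With the threshold at HPAGE_PMD_NR, promotion
only happens once every base page in the range has been touched.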

It's non-trivial to do this because, at minimum, a page fault has to
check if there is a potential promotion candidate by checking the PTEs
around the faulting address, searching for a correctly-aligned base page
that is already inserted. If there is, then check if the correctly-aligned
base page for the current faulting address is free and, if so, use it.
It'll also then need to check the remaining PTEs to see if the promotion
threshold has been reached and, if so, promote it to a THP (or else teach
khugepaged to do an in-place promotion if possible). In other words,
implementing the promotion threshold is both hard and not free.

However, if it did exist then the only tunable would be the "promotion
threshold" and applications would not need any special awareness of their
address space.

-- 
Mel Gorman
SUSE Labs