From: "Huang, Ying"
To: Naoya Horiguchi
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH -mm -V3 00/21] mm, THP, swap: Swapout/swapin THP in one piece
References: <20180523082625.6897-1-ying.huang@intel.com> <20180601061116.GA4813@hori1.linux.bs1.fc.nec.co.jp>
Date: Fri, 01 Jun 2018 15:03:50 +0800
In-Reply-To: <20180601061116.GA4813@hori1.linux.bs1.fc.nec.co.jp> (Naoya Horiguchi's message of "Fri, 1 Jun 2018 06:11:16 +0000")
Message-ID: <87efhryomh.fsf@yhuang-dev.intel.com>

Naoya Horiguchi writes:

> On Wed, May 23, 2018 at 04:26:04PM +0800, Huang, Ying wrote:
>> From: Huang Ying
>>
>> Hi, Andrew, could you help me to check whether the overall design is
>> reasonable?
>>
>> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
>> swap part of the patchset? Especially [02/21], [03/21], [04/21],
>> [05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
>> [12/21], [20/21].
>>
>> Hi, Andrea and Kirill, could you help me to review the THP part of
>> the patchset? Especially [01/21], [07/21], [09/21], [11/21], [13/21],
>> [15/21], [16/21], [17/21], [18/21], [19/21], [20/21], [21/21].
>>
>> Hi, Johannes and Michal, could you help me to review the cgroup part
>> of the patchset? Especially [14/21].
>>
>> And for all, any comment is welcome!
>
> Hi Ying Huang,
> I've read through this series and found no issues.

Thanks a lot for your review!

> It seems that THP swapout never happens if the swap devices are backed
> by rotational storage. I guess that's because this feature depends on
> the swap cluster searching algorithm, which only supports
> non-rotational storage.
>
> I think this limitation is OK, because non-rotational storage is
> better suited for swap devices (most future users will use it). But I
> think it's better to document the limitation somewhere, because the
> swap cluster is an in-kernel concept and we can't assume that end
> users know about it.

Yes. I will try to document it somewhere.
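As a stop-gap before that documentation exists, end users can at least
check how the kernel classifies a device via the sysfs "rotational"
queue attribute. Below is a minimal userspace check in C; the device
name "sda" is hard-coded purely for illustration and should be replaced
with the actual swap device:

  #include <stdio.h>

  int main(void)
  {
          /* "sda" is only an example; substitute your swap device. */
          const char *path = "/sys/block/sda/queue/rotational";
          FILE *f = fopen(path, "r");
          int rot;

          if (!f) {
                  perror(path);
                  return 1;
          }
          if (fscanf(f, "%d", &rot) != 1) {
                  fprintf(stderr, "unexpected format in %s\n", path);
                  fclose(f);
                  return 1;
          }
          fclose(f);

          /* 0: non-rotational (SSD/PMEM), so swap clusters, and with
           * this patchset THP swapout, are available on the device;
           * 1: rotational, so THP will be split before swapout. */
          printf("%s: %s\n", path, rot ? "rotational" : "non-rotational");
          return 0;
  }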
Best Regards,
Huang, Ying

> Thanks,
> Naoya Horiguchi
>
>> This patchset is based on the 2018-05-18 head of mmotm/master.
>>
>> This is the final step of the THP (Transparent Huge Page) swap
>> optimization. After the first and second steps, the splitting of the
>> huge page is delayed from almost the first step of swapout until
>> after swapout has finished. In this step, we avoid splitting the THP
>> for swapout and swap the THP out/in in one piece.
>>
>> We tested the patchset with the vm-scalability benchmark's swap-w-seq
>> test case, which forks 16 processes. Each process allocates a large
>> anonymous memory range and writes it from beginning to end for 8
>> rounds. The first round will swap out, while the remaining rounds
>> will swap in and swap out. The test was done on a Xeon E5 v3 system;
>> the swap device used is a RAM-simulated PMEM (persistent memory)
>> device. The test result is as follows:
>>
>>                base               optimized
>>            ----------------  --------------------------
>>                %stddev          %change        %stddev
>>                    \               |               \
>>   1417897 ±  2%    +992.8%   15494673         vm-scalability.throughput
>>   1020489 ±  4%   +1091.2%   12156349         vmstat.swap.si
>>   1255093 ±  3%    +940.3%   13056114         vmstat.swap.so
>>   1259769 ±  7%   +1818.3%   24166779         meminfo.AnonHugePages
>>  28021761          -10.7%    25018848 ±  2%   meminfo.AnonPages
>>  64080064 ±  4%    -95.6%     2787565 ± 33%   interrupts.CAL:Function_call_interrupts
>>     13.91 ±  5%    -13.8          0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>>
>> Here, the benchmark score (bytes written per second) improved by
>> 992.8%, and the swapout/swapin throughput improved by 1008% (from
>> about 2.17GB/s to 24.04GB/s). The performance difference is huge.
>>
>> In the base kernel, the THP is swapped out and split during the first
>> round of writing, so only normal-page swapin and swapout occur in the
>> remaining rounds. In the optimized kernel, the THP is kept after the
>> first swapout, so THP swapin and swapout are used in the remaining
>> rounds. This shows the key benefit of swapping the THP out/in in one
>> piece: the THP is kept instead of being split. The meminfo data
>> verifies this: in the base kernel only 4.5% of anonymous pages are
>> THP during the test, while in the optimized kernel 96.6% are. The
>> TLB-flush IPIs (shown as interrupts.CAL:Function_call_interrupts)
>> were reduced by 95.6%, and the cycles spent in spinlocks dropped from
>> 13.9% to 0.1%. These are performance benefits of THP swapout/swapin
>> too.
>>
>> Below is the description of all steps of the THP swap optimization.
>>
>> Recently, the performance of storage devices has improved so fast
>> that we cannot saturate the disk bandwidth with a single logical CPU
>> when doing page swapping, even on a high-end server machine, because
>> storage performance has improved faster than that of a single logical
>> CPU. It seems that this trend will not change in the near future. On
>> the other hand, THP is becoming more and more popular because of
>> increased memory sizes. So it has become necessary to optimize THP
>> swap performance.
>>
>> The advantages of swapping a THP out/in in one piece include:
>>
>> - Batching of various swap operations for the THP. Many operations
>>   need to be done once per THP instead of once per normal page, for
>>   example allocating/freeing the swap space, writing/reading the swap
>>   space, flushing the TLB, handling page faults, etc. This will
>>   improve the performance of THP swap greatly.
>>
>> - THP swap-space reads/writes will be large sequential IO (2MB on
>>   x86_64). This is particularly helpful for swapin, which is usually
>>   4KB random IO. This will improve the performance of THP swap too.
>>
>> - It will help with memory fragmentation, especially when THP is
>>   heavily used by applications, because THP-order pages will be freed
>>   up after THP swapout.
>>
>> - It will improve THP utilization on systems with swap turned on,
>>   because the speed at which khugepaged collapses normal pages into
>>   THPs is quite slow. After a THP is split during swapout, it takes
>>   quite a long time for the normal pages to be collapsed back into a
>>   THP after being swapped in. High THP utilization also helps the
>>   efficiency of page-based memory management.
>>
>> There are some concerns regarding THP swapin, mainly because the
>> possibly enlarged read/write IO size (for swapout/swapin) may put
>> more overhead on the storage device. To deal with that, THP swapin is
>> turned on only when necessary. A new sysfs interface,
>> /sys/kernel/mm/transparent_hugepage/swapin_enabled, is added to
>> configure it. It uses the "always/never/madvise" logic, to be turned
>> on globally, turned off globally, or turned on only for VMAs with
>> MADV_HUGEPAGE, etc.
>>
>> Changelog
>> ---------
>>
>> v3:
>>
>> - Rebased on the 2018-05-18 HEAD of mmotm/master
>>
>> - Fixed a build bug. Thanks, 0-Day!
>>
>> v2:
>>
>> - Fixed several build bugs. Thanks, 0-Day!
>>
>> - Improved documentation as suggested by Randy Dunlap.
>>
>> - Fixed several bugs in reading huge swap clusters
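To illustrate the swapin_enabled interface described in the cover
letter above: assuming this patchset is applied and "madvise" has been
written to /sys/kernel/mm/transparent_hugepage/swapin_enabled, only
VMAs marked with MADV_HUGEPAGE are eligible for THP swapin. A minimal
userspace sketch (the mapping size and fill pattern are arbitrary):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  #define LEN (4UL << 20)  /* 4MB: contains at least one 2MB-aligned extent */

  int main(void)
  {
          void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED) {
                  perror("mmap");
                  return EXIT_FAILURE;
          }

          /* Mark the VMA huge-page friendly.  Under the "madvise"
           * setting, THP swapin applies only to such VMAs. */
          if (madvise(buf, LEN, MADV_HUGEPAGE) != 0) {
                  perror("madvise(MADV_HUGEPAGE)");
                  return EXIT_FAILURE;
          }

          /* Touch the memory so THPs can actually be allocated; the
           * kernel only uses THPs for 2MB-aligned extents within the
           * mapping. */
          memset(buf, 0x5a, LEN);

          munmap(buf, LEN);
          return EXIT_SUCCESS;
  }

Under the "always" setting the madvise() call would not be needed, and
under "never" THP swapin stays disabled regardless of the VMA flags.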