From: "Huang, Ying"
To: Dave Hansen
Cc: Andrew Morton, "Kirill A. Shutemov", Andrea Arcangeli, Michal Hocko,
    Johannes Weiner, Shaohua Li, Hugh Dickins, Minchan Kim,
    Rik van Riel, Naoya Horiguchi, Zi Yan, Daniel Jordan
Subject: Re: [PATCH -mm -v4 03/21] mm, THP, swap: Support PMD swap mapping in swap_duplicate()
References: <20180622035151.6676-1-ying.huang@intel.com>
	<20180622035151.6676-4-ying.huang@intel.com>
	<92b86ab6-6f51-97b0-337c-b7e98a30b6cb@linux.intel.com>
Date: Tue, 10 Jul 2018 14:44:50 +0800
In-Reply-To: <92b86ab6-6f51-97b0-337c-b7e98a30b6cb@linux.intel.com>
	(Dave Hansen's message of "Mon, 9 Jul 2018 09:51:42 -0700")
Message-ID: <878t6jio7x.fsf@yhuang-dev.intel.com>

Dave Hansen writes:

>> +static inline bool thp_swap_supported(void)
>> +{
>> +	return IS_ENABLED(CONFIG_THP_SWAP);
>> +}
>
> This seems like rather useless abstraction.  Why do we need it?

I just wanted to make it shorter: 19 vs. 27 characters.  But if you
think IS_ENABLED(CONFIG_THP_SWAP) is much better, I can use that
instead.

> ...
>> -static inline int swap_duplicate(swp_entry_t swp)
>> +static inline int swap_duplicate(swp_entry_t *swp, bool cluster)
>> {
>> 	return 0;
>> }
>
> FWIW, I despise true/false function arguments like this.  When I see
> this in code:
>
> 	swap_duplicate(&entry, false);
>
> I have no idea what false does.  I'd much rather see:
>
> 	enum do_swap_cluster {
> 		SWP_DO_CLUSTER,
> 		SWP_NO_CLUSTER
> 	};
>
> So you see:
>
> 	swap_duplicate(&entry, SWP_NO_CLUSTER);
>
> vs.
>
> 	swap_duplicate(&entry, SWP_DO_CLUSTER);

Yes, a bare boolean parameter is rarely a good interface.  Matthew
Wilcox suggested using

	swap_duplicate(&entry, HPAGE_PMD_NR);

vs.

	swap_duplicate(&entry, 1);

instead.  He thinks this makes the interface more flexible, so it can
support other swap entry sizes in the future.  What do you think about
that?
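To illustrate the nr-based interface (a self-contained user-space toy,
not the kernel code; the HPAGE_PMD_NR name and the flat reference-count
map are just borrowed for the example), the same function covers the
PTE case with nr == 1 and the PMD case with nr == HPAGE_PMD_NR, and a
future larger entry size would just pass a different nr:

```c
#include <assert.h>

/* Borrowed name: number of 4KB subpages in one 2MB PMD-mapped THP. */
#define HPAGE_PMD_NR 512

/* Toy swap map: one reference count per swap slot. */
static unsigned char toy_swap_map[2 * HPAGE_PMD_NR];

/*
 * Toy analog of the proposed swap_duplicate(&entry, nr): bump the
 * count of nr consecutive slots starting at offset.  A PTE-mapped
 * page passes nr == 1, a PMD-mapped THP passes nr == HPAGE_PMD_NR.
 */
static int toy_swap_duplicate(unsigned long offset, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		toy_swap_map[offset + i]++;
	return 0;
}
```

At the call site, toy_swap_duplicate(offset, HPAGE_PMD_NR) reads just
as clearly as an enum would, without fixing the interface to exactly
two sizes.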
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e9cac1c4fa69..f3900282e3da 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -951,7 +951,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> 		swp_entry_t entry = pte_to_swp_entry(pte);
>>
>> 		if (likely(!non_swap_entry(entry))) {
>> -			if (swap_duplicate(entry) < 0)
>> +			if (swap_duplicate(&entry, false) < 0)
>> 				return entry.val;
>>
>> 			/* make sure dst_mm is on swapoff's mmlist. */
>
> I'll also point out that in a multi-hundred-line patch, adding arguments
> to an existing function would not be something I'd try to include in the
> patch.  I'd break it out separately unless absolutely necessary.

You mean adding another patch, which only adds arguments to the
function, but does not change the body of the function?

>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index f42b1b0cdc58..48e2c54385ee 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -49,6 +49,9 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
>> 					unsigned char);
>> static void free_swap_count_continuations(struct swap_info_struct *);
>> static sector_t map_swap_entry(swp_entry_t, struct block_device**);
>> +static int add_swap_count_continuation_locked(struct swap_info_struct *si,
>> +					      unsigned long offset,
>> +					      struct page *page);
>>
>> DEFINE_SPINLOCK(swap_lock);
>> static unsigned int nr_swapfiles;
>> @@ -319,6 +322,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
>> 		spin_unlock(&si->lock);
>> }
>>
>> +static inline bool is_cluster_offset(unsigned long offset)
>> +{
>> +	return !(offset % SWAPFILE_CLUSTER);
>> +}
>> +
>> static inline bool cluster_list_empty(struct swap_cluster_list *list)
>> {
>> 	return cluster_is_null(&list->head);
>> @@ -1166,16 +1174,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> 	return NULL;
>> }
>>
>> -static unsigned char __swap_entry_free(struct swap_info_struct *p,
>> -				       swp_entry_t entry, unsigned char usage)
>> +static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
>> +					      struct swap_cluster_info *ci,
>> +					      unsigned long offset,
>> +					      unsigned char usage)
>> {
>> -	struct swap_cluster_info *ci;
>> -	unsigned long offset = swp_offset(entry);
>> 	unsigned char count;
>> 	unsigned char has_cache;
>>
>> -	ci = lock_cluster_or_swap_info(p, offset);
>> -
>> 	count = p->swap_map[offset];
>>
>> 	has_cache = count & SWAP_HAS_CACHE;
>> @@ -1203,6 +1209,17 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
>> 		usage = count | has_cache;
>> 	p->swap_map[offset] = usage ? : SWAP_HAS_CACHE;
>>
>> +	return usage;
>> +}
>> +
>> +static unsigned char __swap_entry_free(struct swap_info_struct *p,
>> +				       swp_entry_t entry, unsigned char usage)
>> +{
>> +	struct swap_cluster_info *ci;
>> +	unsigned long offset = swp_offset(entry);
>> +
>> +	ci = lock_cluster_or_swap_info(p, offset);
>> +	usage = __swap_entry_free_locked(p, ci, offset, usage);
>> 	unlock_cluster_or_swap_info(p, ci);
>>
>> 	return usage;
>> @@ -3450,32 +3467,12 @@ void si_swapinfo(struct sysinfo *val)
>> 	spin_unlock(&swap_lock);
>> }
>>
>> -/*
>> - * Verify that a swap entry is valid and increment its swap map count.
>> - *
>> - * Returns error code in following case.
>> - * - success -> 0
>> - * - swp_entry is invalid -> EINVAL
>> - * - swp_entry is migration entry -> EINVAL
>> - * - swap-cache reference is requested but there is already one. -> EEXIST
>> - * - swap-cache reference is requested but the entry is not used. -> ENOENT
>> - * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
>> - */
>> -static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>> +static int __swap_duplicate_locked(struct swap_info_struct *p,
>> +				   unsigned long offset, unsigned char usage)
>> {
>> -	struct swap_info_struct *p;
>> -	struct swap_cluster_info *ci;
>> -	unsigned long offset;
>> 	unsigned char count;
>> 	unsigned char has_cache;
>> -	int err = -EINVAL;
>> -
>> -	p = get_swap_device(entry);
>> -	if (!p)
>> -		goto out;
>> -
>> -	offset = swp_offset(entry);
>> -	ci = lock_cluster_or_swap_info(p, offset);
>> +	int err = 0;
>>
>> 	count = p->swap_map[offset];
>>
>> @@ -3485,12 +3482,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>> 	 */
>> 	if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
>> 		err = -ENOENT;
>> -		goto unlock_out;
>> +		goto out;
>> 	}
>>
>> 	has_cache = count & SWAP_HAS_CACHE;
>> 	count &= ~SWAP_HAS_CACHE;
>> -	err = 0;
>>
>> 	if (usage == SWAP_HAS_CACHE) {
>>
>> @@ -3517,11 +3513,39 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>>
>> 	p->swap_map[offset] = count | has_cache;
>>
>> -unlock_out:
>> +out:
>> +	return err;
>> +}
>
> ... and that all looks like refactoring, not actively implementing PMD
> swap support.  That's unfortunate.
>
>> +/*
>> + * Verify that a swap entry is valid and increment its swap map count.
>> + *
>> + * Returns error code in following case.
>> + * - success -> 0
>> + * - swp_entry is invalid -> EINVAL
>> + * - swp_entry is migration entry -> EINVAL
>> + * - swap-cache reference is requested but there is already one. -> EEXIST
>> + * - swap-cache reference is requested but the entry is not used. -> ENOENT
>> + * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
>> + */
>> +static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>> +{
>> +	struct swap_info_struct *p;
>> +	struct swap_cluster_info *ci;
>> +	unsigned long offset;
>> +	int err = -EINVAL;
>> +
>> +	p = get_swap_device(entry);
>> +	if (!p)
>> +		goto out;
>
> Is this an error, or just for running into something like a migration
> entry?  Comments please.

__swap_duplicate() may be called with an invalid swap entry, because
the swap device may be swapped off after we get the swap entry during a
page fault.  Yes, I will add some comments here.

>> +	offset = swp_offset(entry);
>> +	ci = lock_cluster_or_swap_info(p, offset);
>> +	err = __swap_duplicate_locked(p, offset, usage);
>> 	unlock_cluster_or_swap_info(p, ci);
>> +
>> +	put_swap_device(p);
>> out:
>> -	if (p)
>> -		put_swap_device(p);
>> 	return err;
>> }
>
> Not a comment on this patch, but lock_cluster_or_swap_info() is woefully
> uncommented.

OK.  Will add some comments for that.

>> @@ -3534,6 +3558,81 @@ void swap_shmem_alloc(swp_entry_t entry)
>> 	__swap_duplicate(entry, SWAP_MAP_SHMEM);
>> }
>>
>> +#ifdef CONFIG_THP_SWAP
>> +static int __swap_duplicate_cluster(swp_entry_t *entry, unsigned char usage)
>> +{
>> +	struct swap_info_struct *si;
>> +	struct swap_cluster_info *ci;
>> +	unsigned long offset;
>> +	unsigned char *map;
>> +	int i, err = 0;
>
> Instead of an #ifdef, is there a reason we can't just do:
>
> 	if (!IS_ENABLED(THP_SWAP))
> 		return 0;
>
> ?

Good idea.  Will do this for the whole patchset.

>> +	si = get_swap_device(*entry);
>> +	if (!si) {
>> +		err = -EINVAL;
>> +		goto out;
>> +	}
>> +	offset = swp_offset(*entry);
>> +	ci = lock_cluster(si, offset);
>
> Could you explain a bit why we do lock_cluster() and not
> lock_cluster_or_swap_info() here?

The code size of lock_cluster() is a little smaller, and I think it is
a little easier to read.  But I know lock_cluster_or_swap_info() could
be used here without any functional problem.  If we try to merge the
code for huge and normal swap entries, that could be used.

>> +	if (cluster_is_free(ci)) {
>> +		err = -ENOENT;
>> +		goto unlock;
>> +	}
>
> Needs comments on how this could happen.  We just took the lock, so I
> assume this is some kind of race, but can you elaborate?

Sure.  Will add some comments for this.

>> +	if (!cluster_is_huge(ci)) {
>> +		err = -ENOTDIR;
>> +		goto unlock;
>> +	}
>
> Yikes!  This function is the core of the new functionality and its
> comment count is exactly 0.  There was quite a long patch description,
> which will be surely lost to the ages, but nothing in the code that
> folks _will_ be looking at for decades to come.
>
> Can we fix that?

Sure.  Will add more comments.

>> +	VM_BUG_ON(!is_cluster_offset(offset));
>> +	VM_BUG_ON(cluster_count(ci) < SWAPFILE_CLUSTER);
>
> So, by this point, we know we are looking at (or supposed to be looking
> at) a cluster on the device?

Yes.

>> +	map = si->swap_map + offset;
>> +	if (usage == SWAP_HAS_CACHE) {
>> +		if (map[0] & SWAP_HAS_CACHE) {
>> +			err = -EEXIST;
>> +			goto unlock;
>> +		}
>> +		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> +			VM_BUG_ON(map[i] & SWAP_HAS_CACHE);
>> +			map[i] |= SWAP_HAS_CACHE;
>> +		}
>
> So, it's OK to race with the first entry, but after that it's a bug
> because the tail pages should agree with the head page's state?

Yes.  Will add some comments about this.

>> +	} else {
>> +		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> +retry:
>> +			err = __swap_duplicate_locked(si, offset + i, usage);
>> +			if (err == -ENOMEM) {
>> +				struct page *page;
>> +
>> +				page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM);
>
> I noticed that the non-clustering analog of this function takes a GFP
> mask.  Why not this one?

The gfp_mask in swap_duplicate() is always GFP_ATOMIC, so they are
exactly the same.
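By the way, stripped of the locking, the swap-count encoding and the
continuation pages, the shape of that loop is just duplicate-all-or-
roll-back.  A self-contained user-space sketch (toy names, not the
kernel code) of the same "undup" pattern:

```c
#include <assert.h>

#define TOY_CLUSTER 4	/* stands in for SWAPFILE_CLUSTER */
#define TOY_MAX     3	/* stands in for SWAP_MAP_MAX */

static unsigned char toy_map[TOY_CLUSTER];

/* Bump one slot's count; fail when it would overflow. */
static int toy_dup_one(int i)
{
	if (toy_map[i] >= TOY_MAX)
		return -1;	/* would need a "continuation" */
	toy_map[i]++;
	return 0;
}

/*
 * Duplicate every slot of the cluster.  On failure, undo the slots
 * that were already duplicated, so the map is left unchanged -- the
 * same all-or-nothing roll-back as the undup label in
 * __swap_duplicate_cluster().
 */
static int toy_dup_cluster(void)
{
	int i, err = 0;

	for (i = 0; i < TOY_CLUSTER; i++) {
		err = toy_dup_one(i);
		if (err)
			goto undup;
	}
	return 0;
undup:
	for (i--; i >= 0; i--)
		toy_map[i]--;
	return err;
}
```

The point of the backward loop from i - 1 is that slot i itself was
never incremented when toy_dup_one() failed, so only the earlier slots
need to be released.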
>> +				err = add_swap_count_continuation_locked(
>> +					si, offset + i, page);
>> +				if (err) {
>> +					*entry = swp_entry(si->type, offset+i);
>> +					goto undup;
>> +				}
>> +				goto retry;
>> +			} else if (err)
>> +				goto undup;
>> +		}
>> +		cluster_set_count(ci, cluster_count(ci) + usage);
>> +	}
>> +unlock:
>> 	unlock_cluster(ci);
>> +	put_swap_device(si);
>> +out:
>> 	return err;
>> +undup:
>> +	for (i--; i >= 0; i--)
>> +		__swap_entry_free_locked(
>> +			si, ci, offset + i, usage);
>> +	goto unlock;
>> +}
>
> So, we've basically created a fork of the __swap_duplicate() code for
> huge pages, along with a presumably new set of bugs and a second code
> path to update.  Was this unavoidable?  Can we unify this any more with
> the small pages path?

Will discuss this in another thread.

>> /*
>>  * Increase reference count of swap entry by 1.
>>  * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
>> @@ -3541,12 +3640,15 @@ void swap_shmem_alloc(swp_entry_t entry)
>>  * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
>>  * might occur if a page table entry has got corrupted.
>>  */
>> -int swap_duplicate(swp_entry_t entry)
>> +int swap_duplicate(swp_entry_t *entry, bool cluster)
>> {
>> 	int err = 0;
>>
>> -	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
>> -		err = add_swap_count_continuation(entry, GFP_ATOMIC);
>> +	if (thp_swap_supported() && cluster)
>> +		return __swap_duplicate_cluster(entry, 1);
>> +
>> +	while (!err && __swap_duplicate(*entry, 1) == -ENOMEM)
>> +		err = add_swap_count_continuation(*entry, GFP_ATOMIC);
>> 	return err;
>> }
>
> Reading this, I wonder whether this has been refactored as much as
> possible.  Both add_swap_count_continuation() and
> __swap_duplicate_cluster() start off with the same get_swap_device()
> dance.

Yes.  There is some duplicated code logic.  Will think about how to
improve it.

>> @@ -3558,9 +3660,12 @@ int swap_duplicate(swp_entry_t entry)
>>  * -EBUSY means there is a swap cache.
>>  * Note: return code is different from swap_duplicate().
>>  */
>> -int swapcache_prepare(swp_entry_t entry)
>> +int swapcache_prepare(swp_entry_t entry, bool cluster)
>> {
>> -	return __swap_duplicate(entry, SWAP_HAS_CACHE);
>> +	if (thp_swap_supported() && cluster)
>> +		return __swap_duplicate_cluster(&entry, SWAP_HAS_CACHE);
>> +	else
>> +		return __swap_duplicate(entry, SWAP_HAS_CACHE);
>> }
>>
>> struct swap_info_struct *swp_swap_info(swp_entry_t entry)
>> @@ -3590,51 +3695,13 @@ pgoff_t __page_file_index(struct page *page)
>> }
>> EXPORT_SYMBOL_GPL(__page_file_index);
>>
>> -/*
>> - * add_swap_count_continuation - called when a swap count is duplicated
>> - * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
>> - * page of the original vmalloc'ed swap_map, to hold the continuation count
>> - * (for that entry and for its neighbouring PAGE_SIZE swap entries).  Called
>> - * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
>
> This closes out with a lot of refactoring noise.  Any chance that can be
> isolated into another patch?

Sure.  Will do that.

Best Regards,
Huang, Ying