Date: Mon, 31 Dec 2018 16:44:24 -0800 (PST)
From: Hugh Dickins
To: Vineeth Remanan Pillai
Cc: Matthew Wilcox, Hugh Dickins, Andrew Morton, Huang Ying,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Kelley Nielsen, Rik van Riel
Subject: Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity
In-Reply-To: <20181203170934.16512-2-vpillai@digitalocean.com>
References: <20181203170934.16512-1-vpillai@digitalocean.com>
 <20181203170934.16512-2-vpillai@digitalocean.com>

On Mon, 3 Dec 2018, Vineeth Remanan Pillai wrote:

> This patch was initially posted by Kelley (kelleynnn@gmail.com).
> Reposting the patch with all review comments addressed and with minor
> modifications and optimizations. Tests were rerun and the commit
> message updated with new results.
>
> The function try_to_unuse() is of quadratic complexity, with a lot of
> wasted effort. It unuses swap entries one by one, potentially
> iterating over all the page tables for all the processes in the
> system for each one.
>
> This new proposed implementation of try_to_unuse simplifies its
> complexity to linear. It iterates over the system's mms once, unusing
> all the affected entries as it walks each set of page tables. It also
> makes similar changes to shmem_unuse.

Hi Vineeth, please fold in the fixes below before reposting your
"mm,swap: rid swapoff of quadratic complexity" patch - or ask for more
detail if anything is unclear. I could split them up, of course, but
since they should all (except perhaps one) just be merged into the
base patch before going any further, it'll save me time to keep them
together here and just explain:

shmem_unuse_swap_entries(): If a user fault races with swapoff, it's
very normal for shmem_swapin_page() to return -EEXIST, and the old
code was careful not to pass back any error but -ENOMEM; whereas on
mmotm, /usr/sbin/swapoff often failed silently because it got that
EEXIST.

shmem_unuse(): A couple of crashing bugs there: a list_del_init done
without holding the mutex, and too much faith in the "safe" of
list_for_each_entry_safe(): that does assume the mutex is held
throughout; you (very nicely!) drop it, but that then requires "next"
to be re-evaluated once the mutex is retaken.

shmem_writepage(): Not a bug fix, this is the "except perhaps one": a
minor optimization, which could be left out. But if shmem_unuse() is
going through the list in the forward direction, and may completely
unswap a file and delete it from the list, then pages from that file
can be swapped out to *other* swap areas after that, and the file
reinserted in the list: better to reinsert it behind shmem_unuse()'s
cursor than in front of it, which would entail a second pointless pass
over that file.

try_to_unuse(): Moved up the assignment of "oldi = i" (and changed the
test to "i <= oldi"), so as not to get trapped in that
find_next_to_unuse() loop when find_get_page() does not find the page.
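
(To spell out the trap: here is the shape of that loop with the fixes
below applied - a simplified sketch only, with the surrounding
declarations, the retry: label and the rest of the loop body elided:)

	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
		if (i <= oldi) {
			/* find_next_to_unuse() wrapped past the cursor */
			if (++retries > MAX_RETRIES) {
				retval = -EBUSY;
				goto out;
			}
			goto retry;
		}
		oldi = i;	/* record progress before any "continue" */
		entry = swp_entry(type, i);
		page = find_get_page(swap_address_space(entry), i);
		if (!page)
			continue;	/* with the late "oldi = i", this
					 * skip left oldi stale, so the
					 * wrap went undetected and the
					 * scan spun on such entries */
		...
	}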
try_to_unuse(): But the main problem was passing entry.val to
find_get_page() there: that used to be correct, but since f6ab1f7f6b2d
we need to pass just the offset - as it stood, it could only find the
pages when swapping off area 0 (a similar issue was fixed in
shmem_replace_page() recently). That (together with the late oldi
assignment) was why my swapoffs were hanging on SWAP_HAS_CACHE
swap_map entries.

With those changes, it all seems to work rather well, and is a nice
simplification of the source, in addition to removing the quadratic
complexity. To my great surprise, the KSM pages are already handled
fairly well - the ksm_might_need_to_copy() that has long been in
unuse_pte() turns out to do (almost) a good enough job already, so
most users of KSM and swapoff would never see any problem. And I'd
been afraid of swapin readahead causing spurious -ENOMEMs, but have
seen nothing of that in practice (though something else in mmotm does
appear to use up more memory than before).

My remaining criticisms would be:

As Huang Ying pointed out in other mail, there is a danger of livelock
(or rather, of hitting the MAX_RETRIES limit) when a multiply mapped
page (most especially a KSM page, whose mappings are not likely to be
nearby in the mmlist) is swapped out, then partially swapped off, then
some ptes swapped back out - as indeed the "Under global memory
pressure" comment admits. I have hit the MAX_RETRIES 3 limit several
times in load testing, not investigated, but I presume due to such a
multiply mapped page: so at present we do have a regression there.

A very simple answer would be to remove the retries limiting - perhaps
you just added that to get around the find_get_page() failure before
it was understood? That does then tend towards the livelock
alternative, but you've kept the signal_pending() check, so there's
still the same way out as the old technique had (though a greater
likelihood of needing it with the new technique). The right fix will
be to do an rmap walk to unuse all the swap entries of a single
anon_vma while holding page lock (with KSM needing that page
force-deleted from swap cache before moving on); but none of us have
written that code yet, so maybe just removing the retries limit is
good enough.

Two dislikes on the code structure, probably with one solution: the
"goto retry", up two levels from inside the lower loop, is easy to
misunderstand; and the oldi business is ugly - find_next_to_unuse()
was written to wrap around continuously to suit the old loop, but now
it's left with its "++i >= max" code to achieve that, then your
"i <= oldi" code to detect when it did, only to undo it again: please
delete code from both ends to make that all simpler.

I'd expect to see checks on inuse_pages in some places, to terminate
the scans as soon as possible (swapoff of an unused swapfile should be
very quick, shouldn't it? not requiring any scans at all); but it
looks like the old code did not have those either - was inuse_pages
unreliable once upon a time? is it unreliable now?
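
(To make concrete the kind of check I mean - hypothetical, not
included in the fixes below: si->inuse_pages counts the swap entries
still in use, so a test on it at the top of each scan would let
swapoff of an unused area return without walking anything - if that
count can be trusted, which is exactly the open question:)

	/* hypothetical early exit, before the mmlist walk and again
	 * before the swap_map scan */
	if (!READ_ONCE(si->inuse_pages))
		goto out;	/* nothing left to unuse */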
Hugh

---

 mm/shmem.c    | 12 ++++++++----
 mm/swapfile.c |  8 ++++----
 2 files changed, 12 insertions(+), 8 deletions(-)

--- mmotm/mm/shmem.c	2018-12-22 13:32:51.339584848 -0800
+++ linux/mm/shmem.c	2018-12-31 12:30:55.822407154 -0800
@@ -1149,6 +1149,7 @@ static int shmem_unuse_swap_entries(stru
 		}
 		if (error == -ENOMEM)
 			break;
+		error = 0;
 	}
 	return error;
 }
@@ -1216,12 +1217,15 @@ int shmem_unuse(unsigned int type)
 		mutex_unlock(&shmem_swaplist_mutex);
 		if (prev_inode)
 			iput(prev_inode);
+		prev_inode = inode;
+
 		error = shmem_unuse_inode(inode, type);
-		if (!info->swapped)
-			list_del_init(&info->swaplist);
 		cond_resched();
-		prev_inode = inode;
 
+		mutex_lock(&shmem_swaplist_mutex);
+		next = list_next_entry(info, swaplist);
+		if (!info->swapped)
+			list_del_init(&info->swaplist);
 		if (error)
 			break;
 	}
@@ -1313,7 +1317,7 @@ static int shmem_writepage(struct page *
 	 */
 	mutex_lock(&shmem_swaplist_mutex);
 	if (list_empty(&info->swaplist))
-		list_add_tail(&info->swaplist, &shmem_swaplist);
+		list_add(&info->swaplist, &shmem_swaplist);
 
 	if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
 		spin_lock_irq(&info->lock);
diff -purN mmotm/mm/swapfile.c linux/mm/swapfile.c
--- mmotm/mm/swapfile.c	2018-12-22 13:32:51.347584880 -0800
+++ linux/mm/swapfile.c	2018-12-31 12:30:55.822407154 -0800
@@ -2156,7 +2156,7 @@ retry:
 
 	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
 		/*
-		 * under global memory pressure, swap entries
+		 * Under global memory pressure, swap entries
 		 * can be reinserted back into process space
 		 * after the mmlist loop above passes over them.
 		 * This loop will then repeat fruitlessly,
@@ -2164,7 +2164,7 @@ retry:
 		 * but doing nothing to actually free up the swap.
 		 * In this case, go over the mmlist loop again.
 		 */
-		if (i < oldi) {
+		if (i <= oldi) {
 			retries++;
 			if (retries > MAX_RETRIES) {
 				retval = -EBUSY;
@@ -2172,8 +2172,9 @@ retry:
 			}
 			goto retry;
 		}
+		oldi = i;
 		entry = swp_entry(type, i);
-		page = find_get_page(swap_address_space(entry), entry.val);
+		page = find_get_page(swap_address_space(entry), i);
 		if (!page)
 			continue;
 
@@ -2188,7 +2189,6 @@ retry:
 		try_to_free_swap(page);
 		unlock_page(page);
 		put_page(page);
-		oldi = i;
 	}
 out:
 	return retval;
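
P.S. To spell out what the second shmem.c hunk amounts to: the walk
then looks like this - a simplified sketch, with the inode pinning and
error details elided. The point is that the "next" saved by
list_for_each_entry_safe() is only good while the mutex is held; once
the mutex has been dropped and retaken, "next" must be re-read:

	mutex_lock(&shmem_swaplist_mutex);
	list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
		if (!info->swapped) {
			list_del_init(&info->swaplist);
			continue;
		}
		/* pin the inode, then drop the mutex for the real
		 * work, which may sleep */
		mutex_unlock(&shmem_swaplist_mutex);

		error = shmem_unuse_inode(inode, type);
		cond_resched();

		mutex_lock(&shmem_swaplist_mutex);
		next = list_next_entry(info, swaplist);	/* re-evaluate */
		if (!info->swapped)
			list_del_init(&info->swaplist);
		if (error)
			break;
	}
	mutex_unlock(&shmem_swaplist_mutex);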