Date: Mon, 15 Mar 2021 22:45:18 -0600
From: Yu Zhao
To: "Huang, Ying"
Cc: linux-mm@kvack.org, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
    Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman, Michal Hocko,
    Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi,
    linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Re: [PATCH v1 10/14] mm: multigenerational lru: core
References: <87im5rsvd8.fsf@yhuang6-desk1.ccr.corp.intel.com>
In-Reply-To: <87im5rsvd8.fsf@yhuang6-desk1.ccr.corp.intel.com>
On Tue, Mar 16, 2021 at 10:08:51AM +0800, Huang, Ying wrote:
> Yu Zhao writes:
> [snip]
>
> > +/* Main function used by foreground, background and user-triggered aging. */
> > +static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
> > +			 struct scan_control *sc, int swappiness)
> > +{
> > +	bool last;
> > +	struct mm_struct *mm = NULL;
> > +	int nid = lruvec_pgdat(lruvec)->node_id;
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
> > +
> > +	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
> > +
> > +	/*
> > +	 * For each walk of the mm list of a memcg, we decrement the priority
> > +	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
> > +	 * priorities of all lruvecs.
> > +	 *
> > +	 * So if this lruvec has a higher priority (smaller value), it means
> > +	 * other concurrent reclaimers (global or memcg reclaim) have walked
> > +	 * its mm list. Skip it for this priority to balance the pressure on
> > +	 * all memcgs.
> > +	 */
> > +#ifdef CONFIG_MEMCG
> > +	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
> > +	    sc->priority > atomic_read(&lruvec->evictable.priority))
> > +		return false;
> > +#endif
> > +
> > +	do {
> > +		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
> > +		if (mm)
> > +			walk_mm(lruvec, mm, swappiness);
> > +
> > +		cond_resched();
> > +	} while (mm);
>
> It appears that we need to scan the whole address space of multiple
> processes in this loop?
>
> If so, I have some concerns about the duration of the function. Do you
> have some number of the distribution of the duration of the function?
> And may be the number of mm_struct and the number of pages scanned.
>
> In comparison, in the traditional LRU algorithm, for each round, only a
> small subset of the whole physical memory is scanned.

Reasonable concerns, and insightful too. We are sensitive to direct
reclaim latency, and we tuned another path carefully so that direct
reclaims virtually don't hit this path :)

Some numbers from the cover letter first:

  In addition, direct reclaim latency is reduced by 22% at the 99th
  percentile and the number of refaults is reduced by 7%. These metrics
  are important to phones and laptops as they are correlated to user
  experience.

And "another path" is the background aging in kswapd:

  age_active_anon()
    age_lru_gens()
      try_walk_mm_list()
        /* try to spread pages out across spread+1 generations */
        if (old_and_young[0] >= old_and_young[1] * spread &&
            min_nr_gens(max_seq, min_seq, swappiness) > max(spread, MIN_NR_GENS))
          return;

        walk_mm_list(lruvec, max_seq, sc, swappiness);

By default, spread = 2, which makes kswapd slightly more aggressive than
direct reclaim for our use cases. This can be disabled entirely by
setting spread to 0 for workloads that don't care about direct reclaim
latency, or set to larger values for workloads that are more sensitive
to it than ours.

It's worth noting that walk_mm_list() is multithreaded -- reclaiming
threads can work on different mm_structs on the same list concurrently.

We do occasionally see this function in direct reclaims on heavily
overcommitted systems, i.e., when kswapd CPU usage is at 100%. Under the
same conditions, we have seen the current page reclaim livelock and
trigger hardware watchdog timeouts (our hardware watchdog is set to
2 hours) many times.
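
To illustrate the concurrency model described above, here is a minimal
user-space sketch of the shared-cursor pattern: several "walker" threads
pull items off one shared list and process them in parallel, the way
reclaiming threads each grab the next mm_struct and walk it. All names
here (get_next_item, walk_item, etc.) are hypothetical stand-ins; this
is not the kernel implementation, just the shape of the idea.

	/* Build with: cc -pthread sketch.c */
	#include <pthread.h>
	#include <stdio.h>

	#define NR_ITEMS	8	/* stand-in for mm_structs on the list */
	#define NR_WALKERS	3	/* stand-in for concurrent reclaimers */

	struct item {
		int id;			/* stand-in for an mm_struct */
	};

	static struct item items[NR_ITEMS];
	static int cursor;		/* next unclaimed item */
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

	/* Analogous to get_next_mm(): hand out the next unclaimed item. */
	static struct item *get_next_item(void)
	{
		struct item *it = NULL;

		pthread_mutex_lock(&lock);
		if (cursor < NR_ITEMS)
			it = &items[cursor++];
		pthread_mutex_unlock(&lock);

		return it;
	}

	/* Analogous to walk_mm(): the per-item work, done outside the lock. */
	static void walk_item(struct item *it, long walker)
	{
		printf("walker %ld: walking item %d\n", walker, it->id);
	}

	static void *walker_fn(void *arg)
	{
		long walker = (long)arg;
		struct item *it;

		/* Mirrors the do { get_next_mm(); walk_mm(); } while (mm) loop. */
		while ((it = get_next_item()))
			walk_item(it, walker);

		return NULL;
	}

	int main(void)
	{
		pthread_t threads[NR_WALKERS];
		long i;

		for (i = 0; i < NR_ITEMS; i++)
			items[i].id = i;

		for (i = 0; i < NR_WALKERS; i++)
			pthread_create(&threads[i], NULL, walker_fn, (void *)i);
		for (i = 0; i < NR_WALKERS; i++)
			pthread_join(threads[i], NULL);

		return 0;
	}

The point of the pattern is that only claiming an item is serialized;
the expensive part (walking an address space) proceeds concurrently, so
adding reclaimers shortens the wall-clock time of a full mm-list walk.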