Date: Fri, 2 Feb 2024 09:19:13 -0500
From: Mike Snitzer
To: Ming Lei
Cc: Andrew Morton, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, David Hildenbrand, Matthew Wilcox,
    Alexander Viro, Christian Brauner, Don Dutile, Rafael Aquini,
    Dave Chinner
Subject: Re: mm/madvise: set ra_pages as device max request size during ADV_POPULATE_READ
References: <20240202022029.1903629-1-ming.lei@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Feb 02 2024 at 5:52P -0500, Ming Lei wrote:

> On Thu, Feb 01, 2024 at 11:43:11PM -0500, Mike Snitzer wrote:
> > On Thu, Feb 01 2024 at 9:20P -0500,
> > Ming Lei wrote:
> >
> > > madvise(MADV_POPULATE_READ) tries to populate all page tables in the
> > > specific range, so it is usually sequential IO if VMA is backed by
> > > file.
> > >
> > > Set ra_pages as device max request size for the involved readahead in
> > > the ADV_POPULATE_READ, this way reduces latency of madvise(MADV_POPULATE_READ)
> > > to 1/10 when running madvise(MADV_POPULATE_READ) over one 1GB file with
> > > usual(default) 128KB of read_ahead_kb.
> > >
> > > Cc: David Hildenbrand
> > > Cc: Matthew Wilcox
> > > Cc: Alexander Viro
> > > Cc: Christian Brauner
> > > Cc: Don Dutile
> > > Cc: Rafael Aquini
> > > Cc: Dave Chinner
> > > Cc: Mike Snitzer
> > > Cc: Andrew Morton
> > > Signed-off-by: Ming Lei
> > > ---
> > >  mm/madvise.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 51 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 912155a94ed5..db5452c8abdd 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -900,6 +900,37 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > >  		return -EINVAL;
> > >  }
> > >
> > > +static void madvise_restore_ra_win(struct file **file, unsigned int ra_pages)
> > > +{
> > > +	if (*file) {
> > > +		struct file *f = *file;
> > > +
> > > +		f->f_ra.ra_pages = ra_pages;
> > > +		fput(f);
> > > +		*file = NULL;
> > > +	}
> > > +}
> > > +
> > > +static struct file *madvise_override_ra_win(struct file *f,
> > > +		unsigned long start, unsigned long end,
> > > +		unsigned int *old_ra_pages)
> > > +{
> > > +	unsigned int io_pages;
> > > +
> > > +	if (!f || !f->f_mapping || !f->f_mapping->host)
> > > +		return NULL;
> > > +
> > > +	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
> > > +	if (((end - start) >> PAGE_SHIFT) < io_pages)
> > > +		return NULL;
> > > +
> > > +	f = get_file(f);
> > > +	*old_ra_pages = f->f_ra.ra_pages;
> > > +	f->f_ra.ra_pages = io_pages;
> > > +
> > > +	return f;
> > > +}
> > > +
> >
> > Does this override imply that madvise_populate resorts to calling
> > filemap_fault() and here you're just arming it to use the larger
> > ->io_pages for the duration of all associated faulting?
> > Yes. > > > > > Wouldn't it be better to avoid faulting and build up larger page > > How can we avoid the fault handling? which is needed to build VA->PA mapping. I was wondering if it made sense to add fadvise_populate -- but given my lack of experience with MM I then get handwavvy quick -- I have more work ahead to round out my MM understanding so that I'm more informed. > > vectors that get sent down to the block layer in one go and let the > > filemap_fault() already tries to allocate folio in big size(max order > is MAX_PAGECACHE_ORDER), see page_cache_ra_order() and ra_alloc_folio(). > > > block layer split using the device's limits? (like happens with > > force_page_cache_ra) > > Here filemap code won't deal with block directly because there is VFS & > FS and io mapping is required, and it just calls aops->readahead() or > aops->read_folio(), but block plug & readahead_control are applied for > handling everything in batch. > > > > > I'm concerned that madvise_populate isn't so efficient with filemap > > That is why this patch increases readahead window, then > madvise_populate() performance can be improved by X10 in big file-backed > popluate read. Right, as you know I've tested your patch, the larger readahead window certainly did provide the much more desirable performance. I'll reply to your v2 (with reduced negative checks) with my Reviewed-by and Tested-by. I was just wondering if there an opportunity to plumb in more a specific (and potentially better) fadvise_populate for dealing with file backed pages. > > due to excessive faulting (*BUT* I haven't traced to know, I'm just > > inferring that is why twiddling f->f_ra.ra_pages helps improve > > madvise_populate by having it issue larger IO. Apologies if I'm way > > off base) > > As mentioned, fault handling can't be avoided, but we can improve > involved readahead IO perf. Thanks, and sorry for asking such a naive question (put more pressure on you to educate than I should have). Mike