Date: Mon, 29 Jan 2024 15:56:18 +1100
From: Dave Chinner
To: Mike Snitzer
Cc: Mike Snitzer, Matthew Wilcox, Ming Lei, Andrew Morton, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Don Dutile, Raghavendra K T, Alexander Viro, Christian Brauner
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range
References: <20240128142522.1524741-1-ming.lei@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Jan 28, 2024 at 09:12:12PM -0500, Mike Snitzer wrote:
> On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner wrote:
> >
> > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox wrote:
> > > >
> > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > > Matthew Wilcox wrote:
> > > > Understood. But ... the application is asking for as much readahead as
> > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > a time". So why will we not get a bug report in 1-15 years time saying
> > > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > > typically we allow the sysadmin to override application requests,
> > > > don't we?
> > >
> > > The application isn't knowingly asking for readahead.
> > > It is asking to mmap the file (and reporter wants it done as quickly as
> > > possible.. like occurred before).
> >
> > .. which we do within the constraints of the given configuration.
> >
> > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > request size based on read-ahead setting") -- same logic, just applied
> > > to callchain that ends up using madvise(MADV_WILLNEED).
> >
> > Not really. There is a difference between performing a synchronous
> > read IO here that we must complete, compared to optimistic
> > asynchronous read-ahead which we can fail or toss away without the
> > user ever seeing the data the IO returned.
> >
> > We want required IO to be done in as few, larger IOs as possible,
> > and not be limited by constraints placed on background optimistic
> > IOs.
> >
> > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > complete the data reads successfully. If the data is actually
> > required, we'll guarantee completion when the user accesses it, not
> > when madvise() is called. IOWs, madvise is async readahead, and so
> > really should be constrained by readahead bounds and not user IO
> > bounds.
> >
> > We could change this behaviour for madvise of large ranges that we
> > force into the page cache by ignoring device readahead bounds, but
> > I'm not sure we want to do this in general.
> >
> > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > value in this situation to override the device limit for large
> > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > restore it once the readahead operation is done. This would make it
> > behave less like readahead and more like a user read from an IO
> > perspective...
>
> I'm not going to pretend like I'm an expert in this code or all the
> distinctions that are in play.
> BUT, if you look at the high-level
> java reproducer: it is requesting mmap of a finite size, starting from
> the beginning of the file:
> FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

Mapping an entire file does not mean "we are going to access the
entire file". Lots of code will do this, especially code that does
random accesses within the file.

> Yet you're talking about the application like it is stabbingly
> triggering unbounded async reads that can get dropped, etc, etc. I

I don't know what the application actually does. All I see is a
microbenchmark that mmaps() a file and walks it sequentially, on a
system where readahead has been tuned to de-prioritise sequential IO
performance.

> just want to make sure the subtlety of (ab)using madvise(WILLNEED)
> like this app does isn't incorrectly attributed to something it isn't.
> The app really is effectively requesting a user read of a particular
> extent in terms of mmap, right?

madvise() is an -advisory- interface that does not guarantee any
specific behaviour. The man page says:

       MADV_WILLNEED
              Expect access in the near future. (Hence, it might be a good
              idea to read some pages ahead.)

It says nothing about guaranteeing that all the data is brought into
memory, or that if it does get brought into memory that it will
remain there until the application accesses it. It doesn't even
imply that IO will be done immediately. Any application relying on
madvise() to fully populate the page cache range before returning is
expecting behaviour that is neither documented nor guaranteed.

Similarly, the fadvise64() man page does not say that WILLNEED will
bring the entire file into cache:

       POSIX_FADV_WILLNEED
              The specified data will be accessed in the near future.

              POSIX_FADV_WILLNEED initiates a nonblocking read of the
              specified region into the page cache.
              The amount of data read may be decreased by the kernel
              depending on virtual memory load. (A few megabytes will
              usually be fully satisfied, and more is rarely useful.)

> BTW, your suggestion to have this app fiddle with ra_pages and then

No, I did not suggest that the app fiddle with anything. I was
talking about the in-kernel FADV_WILLNEED implementation changing
file->f_ra.ra_pages, similar to how FADV_RANDOM and FADV_SEQUENTIAL
change readahead IO behaviour. That then allows subsequent
readahead on that vma->file to use a larger value than the default
value pulled in off the device.

Largely, I think the problem is that the application has set a
readahead limit too low for optimal sequential performance.
Complaining that readahead is slow when it has been explicitly tuned
to be slow doesn't really seem like a problem we can fix with code.

-Dave.
-- 
Dave Chinner
david@fromorbit.com
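
A rough userspace illustration of the advisory semantics described in
the message, assuming Linux and Python 3.8 or later (needed for
mmap.madvise). The names hint_and_read and demo are made up for this
sketch and do not come from the thread or the reporter's reproducer.
Both WILLNEED calls are only hints that may start readahead; the loop
that touches each page is the only step that forces the data to be read.

```python
import mmap
import os
import tempfile


def hint_and_read(path: str) -> int:
    """Hint that the whole file will be needed, then read it via mmap.

    Returns the number of bytes actually read through the mapping.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return 0  # an empty file cannot be mapped

        # Advisory: asks for a nonblocking read of the range into the
        # page cache. The kernel may read less than was asked for.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)

        with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as mm:
            # Also advisory: a successful return does not mean the
            # pages are resident, or that any IO has been issued yet.
            mm.madvise(mmap.MADV_WILLNEED)

            # Accessing the data is what guarantees it arrives. Any
            # page not already cached is faulted in synchronously here.
            total = 0
            for off in range(0, size, mmap.PAGESIZE):
                total += len(mm[off:off + mmap.PAGESIZE])
            return total
    finally:
        os.close(fd)


def demo() -> bool:
    """Write a small scratch file and check every byte is readable."""
    expected = 3 * mmap.PAGESIZE + 17
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * expected)
    try:
        return hint_and_read(tmp.name) == expected
    finally:
        os.unlink(tmp.name)
```

The sketch gives the same result whether or not either hint does
anything, which is the point: a program written this way cannot depend
on WILLNEED having populated the page cache before it returns.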