On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
> > diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> > index 182b24a14804..5f74e2585f50 100644
> > --- a/fs/fuse/dax.c
> > +++ b/fs/fuse/dax.c
> > @@ -781,6 +781,9 @@ static int fuse_dax_writepages(struct address_space *mapping,
> > struct inode *inode = mapping->host;
> > struct fuse_conn *fc = get_fuse_conn(inode);
> >
> > + if (wbc->sync_mode == WB_SYNC_NONE &&
> > + fc->num_background >= fc->congestion_threshold)
> > + return 0;
> > return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc);
>
> This makes no sense. Doing writeback for DAX means flushing the
> CPU cache (in a terribly inefficient way), but it's not going to
> be doing anything in the background; it's a sync operation.
Fair enough ... I was just being consistent. I hadn't thought about
whether dax might be a bit special, and figured the change couldn't hurt.
>
> > +++ b/fs/fuse/file.c
> > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> >
> > if (fuse_is_bad(inode))
> > return;
> > + if (fc->num_background >= fc->congestion_threshold)
> > + return;
>
> This seems like a bad idea to me. If we don't even start reads on
> readahead pages, they'll get ->readpage called on them one at a time
> and the reading thread will block. It's going to lead to some nasty
> performance problems, exactly when you don't want them. Better to
> queue the reads internally and wait for congestion to ease before
> submitting the read.
>
Isn't that exactly what happens now? page_cache_async_ra() sees that
inode_read_congested() returns true, so it doesn't start readahead.
???
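
For reference, the check in question in page_cache_async_ra()
(mm/readahead.c) reads roughly like this - abridged, and the exact
shape varies between kernel versions:

void page_cache_async_ra(struct readahead_control *ractl, ...)
{
	...
	/*
	 * Defer asynchronous read-ahead on IO congestion.
	 */
	if (inode_read_congested(ractl->mapping->host))
		return;

	if (blk_cgroup_congested())
		return;

	/* do read-ahead */
	ondemand_readahead(ractl, true, req_count);
}

So when the bdi reports congestion, the async read-ahead is simply
skipped rather than queued.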
NeilBrown
On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > +++ b/fs/fuse/file.c
> > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > >
> > > if (fuse_is_bad(inode))
> > > return;
> > > + if (fc->num_background >= fc->congestion_threshold)
> > > + return;
> >
> > This seems like a bad idea to me. If we don't even start reads on
> > readahead pages, they'll get ->readpage called on them one at a time
> > and the reading thread will block. It's going to lead to some nasty
> > performance problems, exactly when you don't want them. Better to
> > queue the reads internally and wait for congestion to ease before
> > submitting the read.
> >
>
> Isn't that exactly what happens now? page_cache_async_ra() sees that
> inode_read_congested() returns true, so it doesn't start readahead.
> ???
It's rather different. Imagine the readahead window has expanded to
256kB (64 pages). Today, we see congestion and don't do anything.
That means we miss the async readahead opportunity, find a missing
page and end up calling into page_cache_sync_ra(), by which time
we may or may not be congested.
If the inode_read_congested() in page_cache_async_ra() is removed and
the patch above is added to replace it, we'll allocate those 64 pages and
add them to the page cache. But then we'll return without starting IO.
When we hit one of those !uptodate pages, we'll call ->readpage on it,
but we won't do anything to the other 63 pages. So we'll go through a
protracted slow period of sending 64 reads, one at a time, whether or
not congestion has eased. Then we'll hit a missing page and proceed
to the sync ra case as above.
(I'm assuming this is a workload which does a linear scan and so
readahead is actually effective)
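
For completeness: the fallback behaviour comes from read_pages() in
mm/readahead.c, which - roughly, and abridged - does this after calling
into the filesystem:

static void read_pages(struct readahead_control *rac, ...)
{
	const struct address_space_operations *aops = rac->mapping->a_ops;
	struct page *page;

	...
	if (aops->readahead) {
		aops->readahead(rac);
		/*
		 * Clean up whatever the filesystem did not consume:
		 * unlock each remaining page and drop the readahead
		 * reference.  The pages stay in the page cache but are
		 * still !uptodate, so later accesses fall back to
		 * single-page ->readpage calls.
		 */
		while ((page = readahead_page(rac))) {
			unlock_page(page);
			put_page(page);
		}
	}
	...
}

Nothing ever re-submits that batch; each leftover page is read
individually when it is actually touched.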
On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > +++ b/fs/fuse/file.c
> > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > >
> > > > if (fuse_is_bad(inode))
> > > > return;
> > > > + if (fc->num_background >= fc->congestion_threshold)
> > > > + return;
> > >
> > > This seems like a bad idea to me. If we don't even start reads on
> > > readahead pages, they'll get ->readpage called on them one at a time
> > > and the reading thread will block. It's going to lead to some nasty
> > > performance problems, exactly when you don't want them. Better to
> > > queue the reads internally and wait for congestion to ease before
> > > submitting the read.
> > >
> >
> > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > inode_read_congested() returns true, so it doesn't start readahead.
> > ???
>
> It's rather different. Imagine the readahead window has expanded to
> 256kB (64 pages). Today, we see congestion and don't do anything.
> That means we miss the async readahead opportunity, find a missing
> page and end up calling into page_cache_sync_ra(), by which time
> we may or may not be congested.
>
> If the inode_read_congested() in page_cache_async_ra() is removed and
> the patch above is added to replace it, we'll allocate those 64 pages and
> add them to the page cache. But then we'll return without starting IO.
> When we hit one of those !uptodate pages, we'll call ->readpage on it,
> but we won't do anything to the other 63 pages. So we'll go through a
> protracted slow period of sending 64 reads, one at a time, whether or
> not congestion has eased. Then we'll hit a missing page and proceed
> to the sync ra case as above.
Hmmm... where is all this documented?
The entry for readahead in vfs.rst says:

    If the filesystem decides to stop attempting I/O before reaching the
    end of the readahead window, it can simply return.

but you are saying that if it simply returns, the remaining pages will
most likely just be read one at a time via ->readpage.  So maybe it
shouldn't say that?
What do other filesystems do?
ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all
requests. btrfs seems much the same.
xfs uses iomap_readahead, which again sets REQ_RAHEAD but otherwise
just does a normal read.
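
For example, the iomap read path just tags the bio when it is driven
from ->readahead - roughly this, from fs/iomap/buffered-io.c (abridged,
quoting from memory):

	ctx->bio = bio_alloc(gfp, bio_max_segs(nr_vecs));
	ctx->bio->bi_opf = REQ_OP_READ;
	if (ctx->rac)	/* we were called via ->readahead */
		ctx->bio->bi_opf |= REQ_RAHEAD;

and then the I/O is submitted exactly as a normal read would be.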
The effect of REQ_RAHEAD seems to be primarily to avoid retries on
failure.
So it seems that the core read-ahead code is not set up to expect
readahead to fail, though it is (begrudgingly) permitted.
The current inode_read_congested() test in page_cache_async_ra() seems
to be just delaying the inevitable (and in fairness, the comment does
say "Defer...."). Maybe just blocking on the congestion is an equally
good way to delay it...
I note that ->readahead isn't told if the read-ahead is async or not, so
my patch will drop sync read-ahead on congestion, which the current code
doesn't do.
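
For the record, struct readahead_control (include/linux/pagemap.h) is
roughly:

struct readahead_control {
	struct file *file;
	struct address_space *mapping;
	struct file_ra_state *ra;
/* private: use the readahead_* accessors instead */
	pgoff_t _index;
	unsigned int _nr_pages;
	unsigned int _batch_count;
};

so there is nothing there a filesystem could use to tell the two apart.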
So maybe this congestion tracking really is useful, and we really want
to keep it.
I really would like to see that high-level documentation!!
Thanks,
NeilBrown