2020-08-14 01:31:38

by Zhaoyang Huang

Subject: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

file->f_ra.ra_pages keeps the value it was initialized with when the file was
opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
(e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
the updated value on a sync read.

Signed-off-by: Zhaoyang Huang <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index d78f577..5c2d7cc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2470,6 +2470,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
 
+	if (ra->ra_pages != inode_to_bdi(mapping->host)->ra_pages)
+		ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
 		return fpin;
--
1.9.1


2020-08-14 01:45:13

by Matthew Wilcox

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> file->f_ra.ra_pages keeps the value it was initialized with when the file was
> opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> the updated value on a sync read.

It still ignores the work done by shrink_readahead_size_eio()
and fadvise(POSIX_FADV_SEQUENTIAL).
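
The objection above can be illustrated with a small userspace model. This is only a sketch: the structs and the `model_*` names are hypothetical stand-ins, and the arithmetic assumes that shrink_readahead_size_eio() quarters ra_pages and that POSIX_FADV_SEQUENTIAL sets it to twice the bdi default, which is how the kernel source of that period appears to behave.

```c
#include <assert.h>

/* Userspace model only: these structs mimic the two fields discussed in
 * the thread. They are NOT the real kernel definitions. */
struct bdi_model { unsigned int ra_pages; };      /* per-device default */
struct file_ra_model { unsigned int ra_pages; };  /* per-open-file state */

/* Assumed to mirror shrink_readahead_size_eio(): quarter the window. */
void model_shrink_eio(struct file_ra_model *ra)
{
	ra->ra_pages /= 4;
}

/* Assumed to mirror POSIX_FADV_SEQUENTIAL: twice the device default. */
void model_fadv_sequential(struct file_ra_model *ra,
			   const struct bdi_model *bdi)
{
	ra->ra_pages = bdi->ra_pages * 2;
}

/* The resync proposed by the patch, applied on every sync mmap readahead. */
void model_patch_resync(struct file_ra_model *ra,
			const struct bdi_model *bdi)
{
	if (ra->ra_pages != bdi->ra_pages)
		ra->ra_pages = bdi->ra_pages;
}

/* Window size after an EIO shrink followed by the patch's resync. */
unsigned int model_eio_then_resync(unsigned int bdi_pages)
{
	struct bdi_model bdi = { bdi_pages };
	struct file_ra_model ra = { bdi_pages };  /* snapshot taken at open */

	model_shrink_eio(&ra);
	assert(ra.ra_pages == bdi_pages / 4);
	model_patch_resync(&ra, &bdi);            /* the shrink is undone here */
	return ra.ra_pages;
}

/* Window size after FADV_SEQUENTIAL followed by the patch's resync. */
unsigned int model_sequential_then_resync(unsigned int bdi_pages)
{
	struct bdi_model bdi = { bdi_pages };
	struct file_ra_model ra = { bdi_pages };

	model_fadv_sequential(&ra, &bdi);
	assert(ra.ra_pages == bdi_pages * 2);
	model_patch_resync(&ra, &bdi);            /* the doubling is undone here */
	return ra.ra_pages;
}
```

In this model both helpers return the plain bdi value, so any per-file adjustment made earlier is lost the next time the resync runs.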

2020-08-14 02:09:03

by Matthew Wilcox

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, Aug 14, 2020 at 02:43:55AM +0100, Matthew Wilcox wrote:
> On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> > file->f_ra.ra_pages keeps the value it was initialized with when the file was
> > opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> > (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> > the updated value on a sync read.
>
> It still ignores the work done by shrink_readahead_size_eio()
> and fadvise(POSIX_FADV_SEQUENTIAL).

... by the way, if you're trying to update one particular file's readahead
state, you can just call fadvise(POSIX_FADV_NORMAL) on it.

If you want to update every open file's ra_pages by writing to sysfs,
then just no. We don't do that.

You haven't said what problem you're facing, so I really can't be more
helpful.
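
From userspace, the per-file reset mentioned above might look like the following minimal sketch. The helper name `reset_readahead_to_default` is hypothetical; `posix_fadvise()` and `POSIX_FADV_NORMAL` are the standard POSIX interface. That the call refreshes the file's readahead state from the device default is taken from the statement above, not verified here.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>

/*
 * Ask the kernel to treat this open file with default access-pattern
 * advice. An offset and length of 0 cover the whole file.
 * Returns 0 on success or an errno value, as posix_fadvise() does
 * (it does not set errno itself).
 */
int reset_readahead_to_default(int fd)
{
	return posix_fadvise(fd, 0, 0, POSIX_FADV_NORMAL);
}
```

The call has to be made by the process that holds the file descriptor, which is why it only helps when that process knows the device setting has changed.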

2020-08-14 02:21:35

by Zhaoyang Huang

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, Aug 14, 2020 at 10:07 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Aug 14, 2020 at 02:43:55AM +0100, Matthew Wilcox wrote:
> > On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> > > file->f_ra.ra_pages keeps the value it was initialized with when the file was
> > > opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> > > (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> > > the updated value on a sync read.
> >
> > It still ignores the work done by shrink_readahead_size_eio()
> > and fadvise(POSIX_FADV_SEQUENTIAL).
>
> ... by the way, if you're trying to update one particular file's readahead
> state, you can just call fadvise(POSIX_FADV_NORMAL) on it.
>
> If you want to update every open file's ra_pages by writing to sysfs,
> then just no. We don't do that.
No. What I want to fix is that a file opened in one process's context keeps
using the value it was initialized with at open time and is not synced with
the new value when bdi->ra_pages changes.
>
> You haven't said what problem you're facing, so I really can't be more
> helpful.

2020-08-14 02:29:34

by Zhaoyang Huang

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, Aug 14, 2020 at 10:20 AM Zhaoyang Huang <[email protected]> wrote:
>
> On Fri, Aug 14, 2020 at 10:07 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Aug 14, 2020 at 02:43:55AM +0100, Matthew Wilcox wrote:
> > > On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> > > > file->f_ra.ra_pages keeps the value it was initialized with when the file was
> > > > opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> > > > (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> > > > the updated value on a sync read.
> > >
> > > It still ignores the work done by shrink_readahead_size_eio()
> > > and fadvise(POSIX_FADV_SEQUENTIAL).
> >
> > ... by the way, if you're trying to update one particular file's readahead
> > state, you can just call fadvise(POSIX_FADV_NORMAL) on it.
> >
> > If you want to update every open file's ra_pages by writing to sysfs,
> > then just no. We don't do that.
> No. What I want to fix is that a file opened in one process's context keeps
> using the value it was initialized with at open time and is not synced with
> the new value when bdi->ra_pages changes.
So you mean it is the desired behavior for an opened file to keep using its
initialized value even if bdi->ra_pages is changed via sysfs?
> >
> > You haven't said what problem you're facing, so I really can't be more
> > helpful.

2020-08-14 02:34:07

by Matthew Wilcox

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, Aug 14, 2020 at 10:26:23AM +0800, Zhaoyang Huang wrote:
> On Fri, Aug 14, 2020 at 10:20 AM Zhaoyang Huang <[email protected]> wrote:
> >
> > On Fri, Aug 14, 2020 at 10:07 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Fri, Aug 14, 2020 at 02:43:55AM +0100, Matthew Wilcox wrote:
> > > > On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> > > > > file->f_ra.ra_pages keeps the value it was initialized with when the file was
> > > > > opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> > > > > (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> > > > > the updated value on a sync read.
> > > >
> > > > It still ignores the work done by shrink_readahead_size_eio()
> > > > and fadvise(POSIX_FADV_SEQUENTIAL).
> > >
> > > ... by the way, if you're trying to update one particular file's readahead
> > > state, you can just call fadvise(POSIX_FADV_NORMAL) on it.
> > >
> > > If you want to update every open file's ra_pages by writing to sysfs,
> > > then just no. We don't do that.
> > No. What I want to fix is that a file opened in one process's context keeps
> > using the value it was initialized with at open time and is not synced with
> > the new value when bdi->ra_pages changes.
> So you mean it is the desired behavior for an opened file to keep using its
> initialized value even if bdi->ra_pages is changed via sysfs?

That's right. If that's not the behaviour you want, call
fadvise(POSIX_FADV_NORMAL).

> > >
> > > You haven't said what problem you're facing, so I really can't be more
> > > helpful.

2020-08-14 02:34:45

by Andrew Morton

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, 14 Aug 2020 10:20:11 +0800 Zhaoyang Huang <[email protected]> wrote:

> On Fri, Aug 14, 2020 at 10:07 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Aug 14, 2020 at 02:43:55AM +0100, Matthew Wilcox wrote:
> > > On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> > > > file->f_ra.ra_pages keeps the value it was initialized with when the file was
> > > > opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> > > > (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> > > > the updated value on a sync read.
> > >
> > > It still ignores the work done by shrink_readahead_size_eio()
> > > and fadvise(POSIX_FADV_SEQUENTIAL).
> >
> > ... by the way, if you're trying to update one particular file's readahead
> > state, you can just call fadvise(POSIX_FADV_NORMAL) on it.
> >
> > If you want to update every open file's ra_pages by writing to sysfs,
> > then just no. We don't do that.
> No. What I want to fix is that a file opened in one process's context keeps
> using the value it was initialized with at open time and is not synced with
> the new value when bdi->ra_pages changes.

So you're saying that

echo xxx > /sys/block/dm/queue/read_ahead_kb

does not affect presently-open files, and you believe that it should do
so?

I guess that could be a reasonable thing to want - it's reasonable for
a user to expect that writing to a global tunable will take immediate
global effect. I guess.

But as Matthew says, it would help if you were to explain why this is
needed. In full detail. What operational problems is the present
implementation causing?

2020-08-14 02:48:35

by Zhaoyang Huang

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, Aug 14, 2020 at 10:33 AM Andrew Morton
<[email protected]> wrote:
>
> On Fri, 14 Aug 2020 10:20:11 +0800 Zhaoyang Huang <[email protected]> wrote:
>
> > On Fri, Aug 14, 2020 at 10:07 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Fri, Aug 14, 2020 at 02:43:55AM +0100, Matthew Wilcox wrote:
> > > > On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> > > > > file->f_ra.ra_pages keeps the value it was initialized with when the file was
> > > > > opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> > > > > (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> > > > > the updated value on a sync read.
> > > >
> > > > It still ignores the work done by shrink_readahead_size_eio()
> > > > and fadvise(POSIX_FADV_SEQUENTIAL).
> > >
> > > ... by the way, if you're trying to update one particular file's readahead
> > > state, you can just call fadvise(POSIX_FADV_NORMAL) on it.
> > >
> > > If you want to update every open file's ra_pages by writing to sysfs,
> > > then just no. We don't do that.
> > No. What I want to fix is that a file opened in one process's context keeps
> > using the value it was initialized with at open time and is not synced with
> > the new value when bdi->ra_pages changes.
>
> So you're saying that
>
> echo xxx > /sys/block/dm/queue/read_ahead_kb
>
> does not affect presently-open files, and you believe that it should do
> so?
>
> I guess that could be a reasonable thing to want - it's reasonable for
> a user to expect that writing to a global tunable will take immediate
> global effect. I guess.
>
> But as Matthew says, it would help if you were to explain why this is
> needed. In full detail. What operational problems is the present
> implementation causing?
The real scenario is that some systems (like Android) speed up reads during
startup by expanding the readahead window and then set it back to normal
(usually 128KB). However, some files in the system process context stay
open from startup onward and have no chance to sync with the updated value,
as it is almost impossible to change the files attached to the inode
(processes are unaware of these things). We have to fix it from the kernel
side.

2020-08-14 03:20:46

by Matthew Wilcox

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, Aug 14, 2020 at 10:45:37AM +0800, Zhaoyang Huang wrote:
> On Fri, Aug 14, 2020 at 10:33 AM Andrew Morton
> <[email protected]> wrote:
> >
> > On Fri, 14 Aug 2020 10:20:11 +0800 Zhaoyang Huang <[email protected]> wrote:
> >
> > > On Fri, Aug 14, 2020 at 10:07 AM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > On Fri, Aug 14, 2020 at 02:43:55AM +0100, Matthew Wilcox wrote:
> > > > > On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> > > > > > file->f_ra.ra_pages keeps the value it was initialized with when the file was
> > > > > > opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> > > > > > (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> > > > > > the updated value on a sync read.
> > > > >
> > > > > It still ignores the work done by shrink_readahead_size_eio()
> > > > > and fadvise(POSIX_FADV_SEQUENTIAL).
> > > >
> > > > ... by the way, if you're trying to update one particular file's readahead
> > > > state, you can just call fadvise(POSIX_FADV_NORMAL) on it.
> > > >
> > > > If you want to update every open file's ra_pages by writing to sysfs,
> > > > then just no. We don't do that.
> > > No. What I want to fix is that a file opened in one process's context keeps
> > > using the value it was initialized with at open time and is not synced with
> > > the new value when bdi->ra_pages changes.
> >
> > So you're saying that
> >
> > echo xxx > /sys/block/dm/queue/read_ahead_kb
> >
> > does not affect presently-open files, and you believe that it should do
> > so?
> >
> > I guess that could be a reasonable thing to want - it's reasonable for
> > a user to expect that writing to a global tunable will take immediate
> > global effect. I guess.
> >
> > But as Matthew says, it would help if you were to explain why this is
> > needed. In full detail. What operational problems is the present
> > implementation causing?
> The real scenario is that some systems (like Android) speed up reads during
> startup by expanding the readahead window and then set it back to normal
> (usually 128KB). However, some files in the system process context stay
> open from startup onward and have no chance to sync with the updated value,
> as it is almost impossible to change the files attached to the inode
> (processes are unaware of these things). We have to fix it from the kernel
> side.

OK, this is a much more useful description of the problem, thank you!

I can think of two possibilities here. One is that maybe our readahead
heuristics just don't work on modern phone hardware. Perhaps we need
to ramp up more aggressively by default.

The other is that maybe it really is just a "boost at startup" kind
of situation and so we should support _that_. Some interface where
we can set a ra_boost, and then do:

	if (ra_boost)
		newsize *= 2;

in get_init_ra_size().
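
As a rough illustration of the second idea, here is a userspace model of an initial-window calculation with a boost flag. It is a sketch only: the `model_*` function names are hypothetical, `ra_boost` does not exist in the kernel, the ramp-up thresholds follow get_init_ra_size() as it appears to have looked at the time, and clamping the boosted value to `max` is an added assumption that the proposal above does not specify.

```c
/* Userspace model of the sizing logic; not kernel code. */

/* Smallest power of two >= n (n is assumed small enough not to overflow). */
unsigned long model_roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * size:     pages requested by the read
 * max:      maximum readahead window in pages
 * ra_boost: hypothetical flag, nonzero while the startup boost is active
 */
unsigned long model_init_ra_size(unsigned long size, unsigned long max,
				 int ra_boost)
{
	unsigned long newsize = model_roundup_pow_of_two(size);

	/* Ramp-up thresholds assumed from get_init_ra_size(). */
	if (newsize <= max / 32)
		newsize *= 4;
	else if (newsize <= max / 4)
		newsize *= 2;
	else
		newsize = max;

	/* The boost from the proposal: double the initial window. */
	if (ra_boost)
		newsize *= 2;

	/* Assumption: never exceed the configured maximum. */
	if (newsize > max)
		newsize = max;

	return newsize;
}
```

Because the boost only affects how the initial window is chosen, turning it off later would not leave already-open files holding an enlarged ra_pages, which is the problem the original patch tried to address.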

2020-08-14 19:42:34

by Matthew Wilcox

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Thu, Aug 13, 2020 at 07:33:07PM -0700, Andrew Morton wrote:
> On Fri, 14 Aug 2020 10:20:11 +0800 Zhaoyang Huang <[email protected]> wrote:
> > No. What I want to fix is that a file opened in one process's context keeps
> > using the value it was initialized with at open time and is not synced with
> > the new value when bdi->ra_pages changes.
>
> So you're saying that
>
> echo xxx > /sys/block/dm/queue/read_ahead_kb
>
> does not affect presently-open files, and you believe that it should do
> so?
>
> I guess that could be a reasonable thing to want - it's reasonable for
> a user to expect that writing to a global tunable will take immediate
> global effect. I guess.

But it's also reasonable for someone to have written an application
assuming that the current behaviour won't change.

As I understand it, if we change net.ipv4.tcp_window_scaling, that will
take effect only for new connections, and not for existing ones.

I think the _real_ problem is that readahead never scales down, except
for EIO.

I don't have time to take on another project right now, but I think this
patch is too simplistic and has too many downsides. Someone needs to
really think the readahead situation through properly.

2020-08-14 23:21:22

by Minchan Kim

Subject: Re: [PATCH] mm : update ra->ra_pages if it's NOT equal to bdi->ra_pages

On Fri, Aug 14, 2020 at 04:19:29AM +0100, Matthew Wilcox wrote:
> On Fri, Aug 14, 2020 at 10:45:37AM +0800, Zhaoyang Huang wrote:
> > On Fri, Aug 14, 2020 at 10:33 AM Andrew Morton
> > <[email protected]> wrote:
> > >
> > > On Fri, 14 Aug 2020 10:20:11 +0800 Zhaoyang Huang <[email protected]> wrote:
> > >
> > > > On Fri, Aug 14, 2020 at 10:07 AM Matthew Wilcox <[email protected]> wrote:
> > > > >
> > > > > On Fri, Aug 14, 2020 at 02:43:55AM +0100, Matthew Wilcox wrote:
> > > > > > On Fri, Aug 14, 2020 at 09:30:11AM +0800, Zhaoyang Huang wrote:
> > > > > > > file->f_ra.ra_pages keeps the value it was initialized with when the file was
> > > > > > > opened, which may NOT be equal to bdi->ra_pages if the latter is updated later
> > > > > > > (e.g., echo xxx > /sys/block/dm/queue/read_ahead_kb). So sync ra->ra_pages to
> > > > > > > the updated value on a sync read.
> > > > > >
> > > > > > It still ignores the work done by shrink_readahead_size_eio()
> > > > > > and fadvise(POSIX_FADV_SEQUENTIAL).
> > > > >
> > > > > ... by the way, if you're trying to update one particular file's readahead
> > > > > state, you can just call fadvise(POSIX_FADV_NORMAL) on it.
> > > > >
> > > > > If you want to update every open file's ra_pages by writing to sysfs,
> > > > > then just no. We don't do that.
> > > > No. What I want to fix is that a file opened in one process's context keeps
> > > > using the value it was initialized with at open time and is not synced with
> > > > the new value when bdi->ra_pages changes.
> > >
> > > So you're saying that
> > >
> > > echo xxx > /sys/block/dm/queue/read_ahead_kb
> > >
> > > does not affect presently-open files, and you believe that it should do
> > > so?
> > >
> > > I guess that could be a reasonable thing to want - it's reasonable for
> > > a user to expect that writing to a global tunable will take immediate
> > > global effect. I guess.
> > >
> > > But as Matthew says, it would help if you were to explain why this is
> > > needed. In full detail. What operational problems is the present
> > > implementation causing?
> > The real scenario is that some systems (like Android) speed up reads during
> > startup by expanding the readahead window and then set it back to normal
> > (usually 128KB). However, some files in the system process context stay
> > open from startup onward and have no chance to sync with the updated value,
> > as it is almost impossible to change the files attached to the inode
> > (processes are unaware of these things). We have to fix it from the kernel
> > side.
>
> OK, this is a much more useful description of the problem, thank you!

This is not the first time the issue has been brought up:
https://patchwork.kernel.org/patch/10866161/
Hopefully we can reach a solution this time.

>
> I can think of two possibilities here. One is that maybe our readahead
> heuristics just don't work on modern phone hardware. Perhaps we need
> to ramp up more aggressively by default.
>
> The other is that maybe it really is just a "boost at startup" kind
> of situation and so we should support _that_. Some interface where
> we can set a ra_boost, and then do:
>
> 	if (ra_boost)
> 		newsize *= 2;
>
> in get_init_ra_size().

With a kernel boot parameter, that sounds like a good idea to me.