2012-10-23 12:47:11

by YingHang Zhu

Subject: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

Hi,
Recently we ran into a bug where an opened file's ra_pages does not
synchronize with its backing device's when the latter is changed
with blockdev --setra; the application needs to reopen the file
to notice the change, which is inappropriate under our circumstances.
This bug is also mentioned in the README file of scst (the generic
SCSI target subsystem for Linux).
This patch unifies the ra_pages in struct file_ra_state
and struct backing_dev_info. The current readahead algorithm
ramps file_ra_state.ra_pages up to bdi.ra_pages once it detects that the
read mode is sequential, so all files sharing the same backing device
end up with the same max value, bdi.ra_pages, in file_ra_state.
With this applied, the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL
in fadvise only set the file's reading mode without signifying the
file's max readahead size. This approach adds no additional
overhead in the read IO path and is, IMHO, the simplest solution.
Any comments are welcome; thanks in advance.
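[Editor's note: to make the stale-copy behaviour concrete, here is a toy
userspace model. This is NOT kernel code; the struct and function names
are invented purely for illustration of the snapshot-at-open problem the
patch describes.]

```c
#include <assert.h>

/* Toy model (not kernel code) of the behaviour this patch removes:
 * each open file snapshots the backing device's max readahead once,
 * so a later "blockdev --setra" is not seen by already-open files. */
struct toy_bdi  { unsigned int ra_pages; };
struct toy_file { unsigned int ra_pages; };   /* per-file cached copy */

static void toy_file_open(struct toy_file *f, const struct toy_bdi *b)
{
    f->ra_pages = b->ra_pages;   /* snapshot taken only at open time */
}
```

Reading bdi->ra_pages directly at each readahead decision, as the patch
does, removes the stale snapshot by construction.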

Thanks,
Ying Zhu

Signed-off-by: Ying Zhu <[email protected]>
---
include/linux/fs.h | 1 -
mm/fadvise.c | 2 --
mm/filemap.c | 17 +++++++++++------
mm/readahead.c | 8 ++++----
4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
unsigned int async_size; /* do asynchronous readahead when
there are only # of pages ahead */

- unsigned int ra_pages; /* Maximum readahead window */
unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
loff_t prev_pos; /* Cache last read() position */
};
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)

switch (advice) {
case POSIX_FADV_NORMAL:
- file->f_ra.ra_pages = bdi->ra_pages;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
- file->f_ra.ra_pages = bdi->ra_pages * 2;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
* readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
*
* It is going insane. Fix it by quickly scaling down the readahead size.
+ * It's hard to estimate how the bad sectors are laid out, so to be
+ * conservative, set the file's read mode to random.
*/
static void shrink_readahead_size_eio(struct file *filp,
struct file_ra_state *ra)
{
- ra->ra_pages /= 4;
+ spin_lock(&filp->f_lock);
+ filp->f_mode |= FMODE_RANDOM;
+ spin_unlock(&filp->f_lock);
}

/**
@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
/* If we don't want any read-ahead, don't bother */
if (VM_RandomReadHint(vma))
return;
- if (!ra->ra_pages)
+ if (!mapping->backing_dev_info->ra_pages)
return;

if (VM_SequentialReadHint(vma)) {
- page_cache_sync_readahead(mapping, ra, file, offset,
- ra->ra_pages);
+ page_cache_sync_readahead(mapping, ra, file, offset,
+ mapping->backing_dev_info->ra_pages);
return;
}

@@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
/*
* mmap read-around
*/
- ra_pages = max_sane_readahead(ra->ra_pages);
+ ra_pages = max_sane_readahead(mapping->backing_dev_info->ra_pages);
ra->start = max_t(long, 0, offset - ra_pages / 2);
ra->size = ra_pages;
ra->async_size = ra_pages / 4;
@@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
ra->mmap_miss--;
if (PageReadahead(page))
page_cache_async_readahead(mapping, ra, file,
- page, offset, ra->ra_pages);
+ page, offset,
+ mapping->backing_dev_info->ra_pages);
}

/**
diff --git a/mm/readahead.c b/mm/readahead.c
index ea8f8fa..6ea5999 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -27,7 +27,6 @@
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
- ra->ra_pages = mapping->backing_dev_info->ra_pages;
ra->prev_pos = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);
@@ -400,7 +399,8 @@ ondemand_readahead(struct address_space *mapping,
bool hit_readahead_marker, pgoff_t offset,
unsigned long req_size)
{
- unsigned long max = max_sane_readahead(ra->ra_pages);
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
+ unsigned long max = max_sane_readahead(bdi->ra_pages);

/*
* start of file
@@ -507,7 +507,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
pgoff_t offset, unsigned long req_size)
{
/* no read-ahead */
- if (!ra->ra_pages)
+ if (!mapping->backing_dev_info->ra_pages)
return;

/* be dumb */
@@ -543,7 +543,7 @@ page_cache_async_readahead(struct address_space *mapping,
unsigned long req_size)
{
/* no read-ahead */
- if (!ra->ra_pages)
+ if (!mapping->backing_dev_info->ra_pages)
return;

/*
--
1.7.1


2012-10-23 13:21:33

by Ni zhan Chen

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/23/2012 08:46 PM, Ying Zhu wrote:
> Hi,
> Recently we ran into the bug that an opened file's ra_pages does not
> synchronize with it's backing device's when the latter is changed
> with blockdev --setra, the application needs to reopen the file
> to know the change, which is inappropriate under our circumstances.

Could you tell me which function does this synchronization?

> This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s
> README file.
> This patch tries to unify the ra_pages in struct file_ra_state
> and struct backing_dev_info. Basically current readahead algorithm
> will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the

You mean the ondemand readahead algorithm will do this? I don't think so.
file_ra_state_init is only called in the btrfs path, correct?

> read mode is sequential. Then all files sharing the same backing device
> have the same max value bdi.ra_pages set in file_ra_state.

Why remove it from file_ra_state? If one file is read sequentially and
another file is read randomly, how can the global bdi.ra_pages indicate
the max readahead window of each file?

> Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL
> in fadivse will only set file reading mode without signifying the
> max readahead size of the file. The current apporach adds no additional
> overhead in read IO path, IMHO is the simplest solution.
> Any comments are welcome, thanks in advance.

Could you show me how you test this patch?

>
> Thanks,
> Ying Zhu
>
> Signed-off-by: Ying Zhu <[email protected]>
> ---
> include/linux/fs.h | 1 -
> mm/fadvise.c | 2 --
> mm/filemap.c | 17 +++++++++++------
> mm/readahead.c | 8 ++++----
> 4 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 17fd887..36303a5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -991,7 +991,6 @@ struct file_ra_state {
> unsigned int async_size; /* do asynchronous readahead when
> there are only # of pages ahead */
>
> - unsigned int ra_pages; /* Maximum readahead window */
> unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
> loff_t prev_pos; /* Cache last read() position */
> };
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 469491e..75e2378 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>
> switch (advice) {
> case POSIX_FADV_NORMAL:
> - file->f_ra.ra_pages = bdi->ra_pages;
> spin_lock(&file->f_lock);
> file->f_mode &= ~FMODE_RANDOM;
> spin_unlock(&file->f_lock);
> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> spin_unlock(&file->f_lock);
> break;
> case POSIX_FADV_SEQUENTIAL:
> - file->f_ra.ra_pages = bdi->ra_pages * 2;
> spin_lock(&file->f_lock);
> file->f_mode &= ~FMODE_RANDOM;
> spin_unlock(&file->f_lock);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a4a5260..e7e4409 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
> * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
> *
> * It is going insane. Fix it by quickly scaling down the readahead size.
> + * It's hard to estimate how the bad sectors lay out, so to be conservative,
> + * set the read mode in random.
> */
> static void shrink_readahead_size_eio(struct file *filp,
> struct file_ra_state *ra)
> {
> - ra->ra_pages /= 4;
> + spin_lock(&filp->f_lock);
> + filp->f_mode |= FMODE_RANDOM;
> + spin_unlock(&filp->f_lock);
> }
>
> /**
> @@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
> /* If we don't want any read-ahead, don't bother */
> if (VM_RandomReadHint(vma))
> return;
> - if (!ra->ra_pages)
> + if (!mapping->backing_dev_info->ra_pages)
> return;
>
> if (VM_SequentialReadHint(vma)) {
> - page_cache_sync_readahead(mapping, ra, file, offset,
> - ra->ra_pages);
> + page_cache_sync_readahead(mapping, ra, file, offset,
> + mapping->backing_dev_info->ra_pages);
> return;
> }
>
> @@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
> /*
> * mmap read-around
> */
> - ra_pages = max_sane_readahead(ra->ra_pages);
> + ra_pages = max_sane_readahead(mapping->backing_dev_info->ra_pages);
> ra->start = max_t(long, 0, offset - ra_pages / 2);
> ra->size = ra_pages;
> ra->async_size = ra_pages / 4;
> @@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
> ra->mmap_miss--;
> if (PageReadahead(page))
> page_cache_async_readahead(mapping, ra, file,
> - page, offset, ra->ra_pages);
> + page, offset,
> + mapping->backing_dev_info->ra_pages);
> }
>
> /**
> diff --git a/mm/readahead.c b/mm/readahead.c
> index ea8f8fa..6ea5999 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -27,7 +27,6 @@
> void
> file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
> {
> - ra->ra_pages = mapping->backing_dev_info->ra_pages;
> ra->prev_pos = -1;
> }
> EXPORT_SYMBOL_GPL(file_ra_state_init);
> @@ -400,7 +399,8 @@ ondemand_readahead(struct address_space *mapping,
> bool hit_readahead_marker, pgoff_t offset,
> unsigned long req_size)
> {
> - unsigned long max = max_sane_readahead(ra->ra_pages);
> + struct backing_dev_info *bdi = mapping->backing_dev_info;
> + unsigned long max = max_sane_readahead(bdi->ra_pages);
>
> /*
> * start of file
> @@ -507,7 +507,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
> pgoff_t offset, unsigned long req_size)
> {
> /* no read-ahead */
> - if (!ra->ra_pages)
> + if (!mapping->backing_dev_info->ra_pages)
> return;
>
> /* be dumb */
> @@ -543,7 +543,7 @@ page_cache_async_readahead(struct address_space *mapping,
> unsigned long req_size)
> {
> /* no read-ahead */
> - if (!ra->ra_pages)
> + if (!mapping->backing_dev_info->ra_pages)
> return;
>
> /*

2012-10-23 13:41:55

by YingHang Zhu

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

Sorry for the noise, I forgot the CCs in the previous mail.
Thanks,
Ying Zhu
Hi Chen,

On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen <[email protected]> wrote:
> On 10/23/2012 08:46 PM, Ying Zhu wrote:
>>
>> Hi,
>> Recently we ran into the bug that an opened file's ra_pages does not
>> synchronize with it's backing device's when the latter is changed
>> with blockdev --setra, the application needs to reopen the file
>> to know the change, which is inappropriate under our circumstances.
>
>
> Could you tell me in which function do this synchronize stuff?
With this patch we use bdi.ra_pages directly, so changing bdi.ra_pages
also changes an opened file's ra_pages.
>
>
>> This bug is also mentioned in scst (generic SCSI target subsystem for
>> Linux)'s
>> README file.
>> This patch tries to unify the ra_pages in struct file_ra_state
>> and struct backing_dev_info. Basically current readahead algorithm
>> will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the
>
>
> You mean ondemand readahead algorithm will do this? I don't think so.
> file_ra_state_init only called in btrfs path, correct?
No, it's also called in do_dentry_open.
>
>
>> read mode is sequential. Then all files sharing the same backing device
>> have the same max value bdi.ra_pages set in file_ra_state.
>
>
> why remove file_ra_state? If one file is read sequential and another file is
> read ramdom, how can use the global bdi.ra_pages to indicate the max
> readahead window of each file?
This patch does not remove file_ra_state; a file's readahead window
is determined by its backing device.
>
>
>> Applying this means the flags POSIX_FADV_NORMAL and
>> POSIX_FADV_SEQUENTIAL
>> in fadivse will only set file reading mode without signifying the
>> max readahead size of the file. The current apporach adds no additional
>> overhead in read IO path, IMHO is the simplest solution.
>> Any comments are welcome, thanks in advance.
>
>
> Could you show me how you test this patch?
This patch brings no performance gain, it just fixes some functional
bugs. I tested it by reading a 500MB file: the default max readahead
size of the backing device was 128KB, and after applying this patch
the opened file's max ra_pages changed when I tuned the device's
readahead size with blockdev.
>
>
>>
>> Thanks,
>> Ying Zhu
>>
>> Signed-off-by: Ying Zhu <[email protected]>
>> ---
>> include/linux/fs.h | 1 -
>> mm/fadvise.c | 2 --
>> mm/filemap.c | 17 +++++++++++------
>> mm/readahead.c | 8 ++++----
>> 4 files changed, 15 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 17fd887..36303a5 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -991,7 +991,6 @@ struct file_ra_state {
>> unsigned int async_size; /* do asynchronous readahead when
>> there are only # of pages ahead
>> */
>> - unsigned int ra_pages; /* Maximum readahead window */
>> unsigned int mmap_miss; /* Cache miss stat for mmap
>> accesses */
>> loff_t prev_pos; /* Cache last read() position */
>> };
>> diff --git a/mm/fadvise.c b/mm/fadvise.c
>> index 469491e..75e2378 100644
>> --- a/mm/fadvise.c
>> +++ b/mm/fadvise.c
>> @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>> loff_t len, int advice)
>> switch (advice) {
>> case POSIX_FADV_NORMAL:
>> - file->f_ra.ra_pages = bdi->ra_pages;
>> spin_lock(&file->f_lock);
>> file->f_mode &= ~FMODE_RANDOM;
>> spin_unlock(&file->f_lock);
>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>> loff_t len, int advice)
>> spin_unlock(&file->f_lock);
>> break;
>> case POSIX_FADV_SEQUENTIAL:
>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>> spin_lock(&file->f_lock);
>> file->f_mode &= ~FMODE_RANDOM;
>> spin_unlock(&file->f_lock);
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a4a5260..e7e4409 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
>> * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
>> *
>> * It is going insane. Fix it by quickly scaling down the readahead
>> size.
>> + * It's hard to estimate how the bad sectors lay out, so to be
>> conservative,
>> + * set the read mode in random.
>> */
>> static void shrink_readahead_size_eio(struct file *filp,
>> struct file_ra_state *ra)
>> {
>> - ra->ra_pages /= 4;
>> + spin_lock(&filp->f_lock);
>> + filp->f_mode |= FMODE_RANDOM;
>> + spin_unlock(&filp->f_lock);
>> }
>> /**
>> @@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct
>> vm_area_struct *vma,
>> /* If we don't want any read-ahead, don't bother */
>> if (VM_RandomReadHint(vma))
>> return;
>> - if (!ra->ra_pages)
>> + if (!mapping->backing_dev_info->ra_pages)
>> return;
>> if (VM_SequentialReadHint(vma)) {
>> - page_cache_sync_readahead(mapping, ra, file, offset,
>> - ra->ra_pages);
>> + page_cache_sync_readahead(mapping, ra, file, offset,
>> +
>> mapping->backing_dev_info->ra_pages);
>> return;
>> }
>> @@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct
>> vm_area_struct *vma,
>> /*
>> * mmap read-around
>> */
>> - ra_pages = max_sane_readahead(ra->ra_pages);
>> + ra_pages =
>> max_sane_readahead(mapping->backing_dev_info->ra_pages);
>> ra->start = max_t(long, 0, offset - ra_pages / 2);
>> ra->size = ra_pages;
>> ra->async_size = ra_pages / 4;
>> @@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct
>> vm_area_struct *vma,
>> ra->mmap_miss--;
>> if (PageReadahead(page))
>> page_cache_async_readahead(mapping, ra, file,
>> - page, offset, ra->ra_pages);
>> + page, offset,
>> +
>> mapping->backing_dev_info->ra_pages);
>> }
>> /**
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index ea8f8fa..6ea5999 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -27,7 +27,6 @@
>> void
>> file_ra_state_init(struct file_ra_state *ra, struct address_space
>> *mapping)
>> {
>> - ra->ra_pages = mapping->backing_dev_info->ra_pages;
>> ra->prev_pos = -1;
>> }
>> EXPORT_SYMBOL_GPL(file_ra_state_init);
>> @@ -400,7 +399,8 @@ ondemand_readahead(struct address_space *mapping,
>> bool hit_readahead_marker, pgoff_t offset,
>> unsigned long req_size)
>> {
>> - unsigned long max = max_sane_readahead(ra->ra_pages);
>> + struct backing_dev_info *bdi = mapping->backing_dev_info;
>> + unsigned long max = max_sane_readahead(bdi->ra_pages);
>> /*
>> * start of file
>> @@ -507,7 +507,7 @@ void page_cache_sync_readahead(struct address_space
>> *mapping,
>> pgoff_t offset, unsigned long req_size)
>> {
>> /* no read-ahead */
>> - if (!ra->ra_pages)
>> + if (!mapping->backing_dev_info->ra_pages)
>> return;
>> /* be dumb */
>> @@ -543,7 +543,7 @@ page_cache_async_readahead(struct address_space
>> *mapping,
>> unsigned long req_size)
>> {
>> /* no read-ahead */
>> - if (!ra->ra_pages)
>> + if (!mapping->backing_dev_info->ra_pages)
>> return;
>> /*
>
>

2012-10-23 22:47:11

by Dave Chinner

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
> Hi,
> Recently we ran into the bug that an opened file's ra_pages does not
> synchronize with it's backing device's when the latter is changed
> with blockdev --setra, the application needs to reopen the file
> to know the change,

or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readahead
window to the (new) bdi default.

> which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us
why you need this and why existing methods of handling such changes
are insufficient...

Optimal readahead windows tend to be a physical property of the
storage and that does not tend to change dynamically. Hence block
device readahead should only need to be set up once, and generally
that can be done before the filesystem is mounted and files are
opened (e.g. via udev rules). Hence you need to explain why you need
to change the default block device readahead on the fly, and why
fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
windows to the new defaults.
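[Editor's note: a sketch of the udev approach Dave mentions. The rule
file name, device match, and value are assumptions to adapt for the
actual storage; read_ahead_kb is the sysfs knob under each block
device's bdi directory.]

```
# /etc/udev/rules.d/60-readahead.rules (example)
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
    ATTR{bdi/read_ahead_kb}="1024"
```

This sets the bdi default before any file system is mounted, so every
subsequently opened file inherits the tuned window.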

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-23 23:54:01

by YingHang Zhu

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

Hi Dave,
On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner <[email protected]> wrote:
> On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
>> Hi,
>> Recently we ran into the bug that an opened file's ra_pages does not
>> synchronize with it's backing device's when the latter is changed
>> with blockdev --setra, the application needs to reopen the file
>> to know the change,
>
> or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
> window to the (new) bdi default.
>
>> which is inappropriate under our circumstances.
>
> Which are? We don't know your circumstances, so you need to tell us
> why you need this and why existing methods of handling such changes
> are insufficient...
>
> Optimal readahead windows tend to be a physical property of the
> storage and that does not tend to change dynamically. Hence block
> device readahead should only need to be set up once, and generally
> that can be done before the filesystem is mounted and files are
> opened (e.g. via udev rules). Hence you need to explain why you need
> to change the default block device readahead on the fly, and why
> fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
> windows to the new defaults.
Our system is a FUSE-based file system; FUSE creates a pseudo backing
device for user space file systems whose default readahead size is
128KB, which can't fully utilize the backing storage's read ability,
so we need to tune it.
The third-party application using our file system maintains some
long-opened files, and we don't have any chance to force it to call
fadvise(POSIX_FADV_NORMAL). :(
Thanks,
Ying Zhu
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2012-10-24 01:02:33

by Ni zhan Chen

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/23/2012 09:41 PM, YingHang Zhu wrote:
> Sorry for the annoying, I forgot ccs in the previous mail.
> Thanks,
> Ying Zhu
> Hi Chen,
>
> On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen <[email protected]> wrote:
>> On 10/23/2012 08:46 PM, Ying Zhu wrote:
>>> Hi,
>>> Recently we ran into the bug that an opened file's ra_pages does not
>>> synchronize with it's backing device's when the latter is changed
>>> with blockdev --setra, the application needs to reopen the file
>>> to know the change, which is inappropriate under our circumstances.
>>
>> Could you tell me in which function do this synchronize stuff?
> With this patch we use bdi.ra_pages directly, so change bdi.ra_pages also
> change an opened file's ra_pages.
>>
>>> This bug is also mentioned in scst (generic SCSI target subsystem for
>>> Linux)'s
>>> README file.
>>> This patch tries to unify the ra_pages in struct file_ra_state
>>> and struct backing_dev_info. Basically current readahead algorithm
>>> will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the
>>
>> You mean ondemand readahead algorithm will do this? I don't think so.
>> file_ra_state_init only called in btrfs path, correct?
> No, it's also called in do_dentry_open.
>>
>>> read mode is sequential. Then all files sharing the same backing device
>>> have the same max value bdi.ra_pages set in file_ra_state.
>>
>> why remove file_ra_state? If one file is read sequential and another file is
>> read ramdom, how can use the global bdi.ra_pages to indicate the max
>> readahead window of each file?
> This patch does not remove file_ra_state, an file's readahead window
> is determined
> by it's backing device.

As Dave said, the backing device readahead window doesn't tend to
change dynamically, but the file readahead window does: it changes
when sequential reads, random reads, thrashing, interleaved reads
and so on occur.
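[Editor's note: a rough userspace sketch of the dynamic per-file
behaviour Chen describes — the window ramping up toward the bdi
maximum on sequential hits and collapsing on random access. This is
an illustrative model, not the kernel's actual get_next_ra_size
logic.]

```c
#include <assert.h>

/* Illustrative model: per-file window grows on sequential hits,
 * capped at the bdi maximum, and drops back on a random access. */
struct ra_model { unsigned int size; };

static void ra_sequential_hit(struct ra_model *ra, unsigned int max)
{
    unsigned int next = ra->size ? ra->size * 2 : 4;  /* ramp up */
    ra->size = next > max ? max : next;               /* cap at bdi max */
}

static void ra_random_access(struct ra_model *ra)
{
    ra->size = 0;   /* drop back; the next sequential run re-ramps */
}
```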

>>
>>> Applying this means the flags POSIX_FADV_NORMAL and
>>> POSIX_FADV_SEQUENTIAL
>>> in fadivse will only set file reading mode without signifying the
>>> max readahead size of the file. The current apporach adds no additional
>>> overhead in read IO path, IMHO is the simplest solution.
>>> Any comments are welcome, thanks in advance.
>>
>> Could you show me how you test this patch?
> This patch brings no perfmance gain, just fixs some functional bugs.
> By reading a 500MB file, the default max readahead size of the
> backing device was 128KB, after applying this patch, the read file's
> max ra_pages
> changed when I tuned the device's read ahead size with blockdev.
>>
>>> Thanks,
>>> Ying Zhu
>>>
>>> Signed-off-by: Ying Zhu <[email protected]>
>>> ---
>>> include/linux/fs.h | 1 -
>>> mm/fadvise.c | 2 --
>>> mm/filemap.c | 17 +++++++++++------
>>> mm/readahead.c | 8 ++++----
>>> 4 files changed, 15 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 17fd887..36303a5 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -991,7 +991,6 @@ struct file_ra_state {
>>> unsigned int async_size; /* do asynchronous readahead when
>>> there are only # of pages ahead
>>> */
>>> - unsigned int ra_pages; /* Maximum readahead window */
>>> unsigned int mmap_miss; /* Cache miss stat for mmap
>>> accesses */
>>> loff_t prev_pos; /* Cache last read() position */
>>> };
>>> diff --git a/mm/fadvise.c b/mm/fadvise.c
>>> index 469491e..75e2378 100644
>>> --- a/mm/fadvise.c
>>> +++ b/mm/fadvise.c
>>> @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>>> loff_t len, int advice)
>>> switch (advice) {
>>> case POSIX_FADV_NORMAL:
>>> - file->f_ra.ra_pages = bdi->ra_pages;
>>> spin_lock(&file->f_lock);
>>> file->f_mode &= ~FMODE_RANDOM;
>>> spin_unlock(&file->f_lock);
>>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>>> loff_t len, int advice)
>>> spin_unlock(&file->f_lock);
>>> break;
>>> case POSIX_FADV_SEQUENTIAL:
>>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>>> spin_lock(&file->f_lock);
>>> file->f_mode &= ~FMODE_RANDOM;
>>> spin_unlock(&file->f_lock);
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index a4a5260..e7e4409 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
>>> * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
>>> *
>>> * It is going insane. Fix it by quickly scaling down the readahead
>>> size.
>>> + * It's hard to estimate how the bad sectors lay out, so to be
>>> conservative,
>>> + * set the read mode in random.
>>> */
>>> static void shrink_readahead_size_eio(struct file *filp,
>>> struct file_ra_state *ra)
>>> {
>>> - ra->ra_pages /= 4;
>>> + spin_lock(&filp->f_lock);
>>> + filp->f_mode |= FMODE_RANDOM;
>>> + spin_unlock(&filp->f_lock);
>>> }
>>> /**
>>> @@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct
>>> vm_area_struct *vma,
>>> /* If we don't want any read-ahead, don't bother */
>>> if (VM_RandomReadHint(vma))
>>> return;
>>> - if (!ra->ra_pages)
>>> + if (!mapping->backing_dev_info->ra_pages)
>>> return;
>>> if (VM_SequentialReadHint(vma)) {
>>> - page_cache_sync_readahead(mapping, ra, file, offset,
>>> - ra->ra_pages);
>>> + page_cache_sync_readahead(mapping, ra, file, offset,
>>> +
>>> mapping->backing_dev_info->ra_pages);
>>> return;
>>> }
>>> @@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct
>>> vm_area_struct *vma,
>>> /*
>>> * mmap read-around
>>> */
>>> - ra_pages = max_sane_readahead(ra->ra_pages);
>>> + ra_pages =
>>> max_sane_readahead(mapping->backing_dev_info->ra_pages);
>>> ra->start = max_t(long, 0, offset - ra_pages / 2);
>>> ra->size = ra_pages;
>>> ra->async_size = ra_pages / 4;
>>> @@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct
>>> vm_area_struct *vma,
>>> ra->mmap_miss--;
>>> if (PageReadahead(page))
>>> page_cache_async_readahead(mapping, ra, file,
>>> - page, offset, ra->ra_pages);
>>> + page, offset,
>>> +
>>> mapping->backing_dev_info->ra_pages);
>>> }
>>> /**
>>> diff --git a/mm/readahead.c b/mm/readahead.c
>>> index ea8f8fa..6ea5999 100644
>>> --- a/mm/readahead.c
>>> +++ b/mm/readahead.c
>>> @@ -27,7 +27,6 @@
>>> void
>>> file_ra_state_init(struct file_ra_state *ra, struct address_space
>>> *mapping)
>>> {
>>> - ra->ra_pages = mapping->backing_dev_info->ra_pages;
>>> ra->prev_pos = -1;
>>> }
>>> EXPORT_SYMBOL_GPL(file_ra_state_init);
>>> @@ -400,7 +399,8 @@ ondemand_readahead(struct address_space *mapping,
>>> bool hit_readahead_marker, pgoff_t offset,
>>> unsigned long req_size)
>>> {
>>> - unsigned long max = max_sane_readahead(ra->ra_pages);
>>> + struct backing_dev_info *bdi = mapping->backing_dev_info;
>>> + unsigned long max = max_sane_readahead(bdi->ra_pages);
>>> /*
>>> * start of file
>>> @@ -507,7 +507,7 @@ void page_cache_sync_readahead(struct address_space
>>> *mapping,
>>> pgoff_t offset, unsigned long req_size)
>>> {
>>> /* no read-ahead */
>>> - if (!ra->ra_pages)
>>> + if (!mapping->backing_dev_info->ra_pages)
>>> return;
>>> /* be dumb */
>>> @@ -543,7 +543,7 @@ page_cache_async_readahead(struct address_space
>>> *mapping,
>>> unsigned long req_size)
>>> {
>>> /* no read-ahead */
>>> - if (!ra->ra_pages)
>>> + if (!mapping->backing_dev_info->ra_pages)
>>> return;
>>> /*
>>

2012-10-24 01:33:52

by YingHang Zhu

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

Hi Chen,

On Wed, Oct 24, 2012 at 9:02 AM, Ni zhan Chen <[email protected]> wrote:
> On 10/23/2012 09:41 PM, YingHang Zhu wrote:
>>
>> Sorry for the annoying, I forgot ccs in the previous mail.
>> Thanks,
>> Ying Zhu
>> Hi Chen,
>>
>> On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen <[email protected]>
>> wrote:
>>>
>>> On 10/23/2012 08:46 PM, Ying Zhu wrote:
>>>>
>>>> Hi,
>>>> Recently we ran into the bug that an opened file's ra_pages does not
>>>> synchronize with it's backing device's when the latter is changed
>>>> with blockdev --setra, the application needs to reopen the file
>>>> to know the change, which is inappropriate under our circumstances.
>>>
>>>
>>> Could you tell me in which function do this synchronize stuff?
>>
>> With this patch we use bdi.ra_pages directly, so change bdi.ra_pages also
>> change an opened file's ra_pages.
>>>
>>>
>>>> This bug is also mentioned in scst (generic SCSI target subsystem for
>>>> Linux)'s
>>>> README file.
>>>> This patch tries to unify the ra_pages in struct file_ra_state
>>>> and struct backing_dev_info. Basically current readahead algorithm
>>>> will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the
>>>
>>>
>>> You mean ondemand readahead algorithm will do this? I don't think so.
>>> file_ra_state_init only called in btrfs path, correct?
>>
>> No, it's also called in do_dentry_open.
>>>
>>>
>>>> read mode is sequential. Then all files sharing the same backing device
>>>> have the same max value bdi.ra_pages set in file_ra_state.
>>>
>>>
>>> why remove file_ra_state? If one file is read sequential and another file
>>> is
>>> read ramdom, how can use the global bdi.ra_pages to indicate the max
>>> readahead window of each file?
>>
>> This patch does not remove file_ra_state, an file's readahead window
>> is determined
>> by it's backing device.
>
>
> As Dave said, backing device readahead window doesn't tend to change
> dynamically, but file readahead window does, it will change when sequential
> read, random read, thrash, interleaved read and so on occur.
>
I totally agree with what you and Dave said, and the point here is
not about how the readahead algorithm behaves. It's about file
systems that use an abstract bdi instead of the actual devices, so
bdi.ra_pages does not really reflect the backing storage's read IO
bandwidth.
AFAIK btrfs also has an abstract bdi to manage its many backing
devices, so I think btrfs has this problem too. As a further comment,
btrfs may spread a file's contents across the managed backing
devices, so offsets (0, x) and (x+1, y) may land on different disks
with different readahead abilities; in that case the file's max
readahead pages should change accordingly.
Perhaps in reality we seldom meet such a heterogeneous storage
architecture.

Thanks,
Ying Zhu
>>>
>>>> Applying this means the flags POSIX_FADV_NORMAL and
>>>> POSIX_FADV_SEQUENTIAL
>>>> in fadvise will only set the file reading mode without signifying the
>>>> max readahead size of the file. The current approach adds no additional
>>>> overhead in the read IO path; IMHO it is the simplest solution.
>>>> Any comments are welcome, thanks in advance.
>>>
>>>
>>> Could you show me how you tested this patch?
>>
>> This patch brings no performance gain, it just fixes some functional
>> bugs. Reading a 500MB file: the default max readahead size of the
>> backing device was 128KB; after applying this patch, the opened
>> file's max ra_pages changed when I tuned the device's readahead size
>> with blockdev.
>>>
>>>
>>>> Thanks,
>>>> Ying Zhu
>>>>
>>>> Signed-off-by: Ying Zhu <[email protected]>
>>>> ---
>>>> include/linux/fs.h | 1 -
>>>> mm/fadvise.c | 2 --
>>>> mm/filemap.c | 17 +++++++++++------
>>>> mm/readahead.c | 8 ++++----
>>>> 4 files changed, 15 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>> index 17fd887..36303a5 100644
>>>> --- a/include/linux/fs.h
>>>> +++ b/include/linux/fs.h
>>>> @@ -991,7 +991,6 @@ struct file_ra_state {
>>>> unsigned int async_size; /* do asynchronous readahead
>>>> when
>>>> there are only # of pages
>>>> ahead
>>>> */
>>>> - unsigned int ra_pages; /* Maximum readahead window */
>>>> unsigned int mmap_miss; /* Cache miss stat for mmap
>>>> accesses */
>>>> loff_t prev_pos; /* Cache last read() position
>>>> */
>>>> };
>>>> diff --git a/mm/fadvise.c b/mm/fadvise.c
>>>> index 469491e..75e2378 100644
>>>> --- a/mm/fadvise.c
>>>> +++ b/mm/fadvise.c
>>>> @@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>>>> loff_t len, int advice)
>>>> switch (advice) {
>>>> case POSIX_FADV_NORMAL:
>>>> - file->f_ra.ra_pages = bdi->ra_pages;
>>>> spin_lock(&file->f_lock);
>>>> file->f_mode &= ~FMODE_RANDOM;
>>>> spin_unlock(&file->f_lock);
>>>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>>>> loff_t len, int advice)
>>>> spin_unlock(&file->f_lock);
>>>> break;
>>>> case POSIX_FADV_SEQUENTIAL:
>>>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>>>> spin_lock(&file->f_lock);
>>>> file->f_mode &= ~FMODE_RANDOM;
>>>> spin_unlock(&file->f_lock);
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index a4a5260..e7e4409 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
>>>> * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
>>>> *
>>>> * It is going insane. Fix it by quickly scaling down the readahead
>>>> size.
>>>> + * It's hard to estimate how the bad sectors lay out, so to be
>>>> conservative,
>>>> + * set the read mode in random.
>>>> */
>>>> static void shrink_readahead_size_eio(struct file *filp,
>>>> struct file_ra_state *ra)
>>>> {
>>>> - ra->ra_pages /= 4;
>>>> + spin_lock(&filp->f_lock);
>>>> + filp->f_mode |= FMODE_RANDOM;
>>>> + spin_unlock(&filp->f_lock);
>>>> }
>>>> /**
>>>> @@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct
>>>> vm_area_struct *vma,
>>>> /* If we don't want any read-ahead, don't bother */
>>>> if (VM_RandomReadHint(vma))
>>>> return;
>>>> - if (!ra->ra_pages)
>>>> + if (!mapping->backing_dev_info->ra_pages)
>>>> return;
>>>> if (VM_SequentialReadHint(vma)) {
>>>> - page_cache_sync_readahead(mapping, ra, file, offset,
>>>> - ra->ra_pages);
>>>> + page_cache_sync_readahead(mapping, ra, file, offset,
>>>> +
>>>> mapping->backing_dev_info->ra_pages);
>>>> return;
>>>> }
>>>> @@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct
>>>> vm_area_struct *vma,
>>>> /*
>>>> * mmap read-around
>>>> */
>>>> - ra_pages = max_sane_readahead(ra->ra_pages);
>>>> + ra_pages =
>>>> max_sane_readahead(mapping->backing_dev_info->ra_pages);
>>>> ra->start = max_t(long, 0, offset - ra_pages / 2);
>>>> ra->size = ra_pages;
>>>> ra->async_size = ra_pages / 4;
>>>> @@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct
>>>> vm_area_struct *vma,
>>>> ra->mmap_miss--;
>>>> if (PageReadahead(page))
>>>> page_cache_async_readahead(mapping, ra, file,
>>>> - page, offset, ra->ra_pages);
>>>> + page, offset,
>>>> +
>>>> mapping->backing_dev_info->ra_pages);
>>>> }
>>>> /**
>>>> diff --git a/mm/readahead.c b/mm/readahead.c
>>>> index ea8f8fa..6ea5999 100644
>>>> --- a/mm/readahead.c
>>>> +++ b/mm/readahead.c
>>>> @@ -27,7 +27,6 @@
>>>> void
>>>> file_ra_state_init(struct file_ra_state *ra, struct address_space
>>>> *mapping)
>>>> {
>>>> - ra->ra_pages = mapping->backing_dev_info->ra_pages;
>>>> ra->prev_pos = -1;
>>>> }
>>>> EXPORT_SYMBOL_GPL(file_ra_state_init);
>>>> @@ -400,7 +399,8 @@ ondemand_readahead(struct address_space *mapping,
>>>> bool hit_readahead_marker, pgoff_t offset,
>>>> unsigned long req_size)
>>>> {
>>>> - unsigned long max = max_sane_readahead(ra->ra_pages);
>>>> + struct backing_dev_info *bdi = mapping->backing_dev_info;
>>>> + unsigned long max = max_sane_readahead(bdi->ra_pages);
>>>> /*
>>>> * start of file
>>>> @@ -507,7 +507,7 @@ void page_cache_sync_readahead(struct address_space
>>>> *mapping,
>>>> pgoff_t offset, unsigned long req_size)
>>>> {
>>>> /* no read-ahead */
>>>> - if (!ra->ra_pages)
>>>> + if (!mapping->backing_dev_info->ra_pages)
>>>> return;
>>>> /* be dumb */
>>>> @@ -543,7 +543,7 @@ page_cache_async_readahead(struct address_space
>>>> *mapping,
>>>> unsigned long req_size)
>>>> {
>>>> /* no read-ahead */
>>>> - if (!ra->ra_pages)
>>>> + if (!mapping->backing_dev_info->ra_pages)
>>>> return;
>>>> /*
>>>
>>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>>
>

2012-10-24 20:19:27

by Dave Chinner

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
> Hi Dave,
> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner <[email protected]> wrote:
> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
> >> Hi,
> >> Recently we ran into the bug that an opened file's ra_pages does not
> >> synchronize with it's backing device's when the latter is changed
> >> with blockdev --setra, the application needs to reopen the file
> >> to know the change,
> >
> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
> > window to the (new) bdi default.
> >
> >> which is inappropriate under our circumstances.
> >
> > Which are? We don't know your circumstances, so you need to tell us
> > why you need this and why existing methods of handling such changes
> > are insufficient...
> >
> > Optimal readahead windows tend to be a physical property of the
> > storage and that does not tend to change dynamically. Hence block
> > device readahead should only need to be set up once, and generally
> > that can be done before the filesystem is mounted and files are
> > opened (e.g. via udev rules). Hence you need to explain why you need
> > to change the default block device readahead on the fly, and why
> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
> > windows to the new defaults.
> Our system is a fuse-based file system, fuse creates a
> pseudo backing device for the user space file systems, the default readahead
> size is 128KB and it can't fully utilize the backing storage's read ability,
> so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this
at mount time before the application opens any files. i.e. you've
simply stated the reason why readahead is tunable, not why you need
to be fully dynamic.....

> The above third-party application using our file system maintains
> some long-opened files, we does not have any chances
> to force them to call fadvise(POSIX_FADV_NORMAL). :(

So raise a bug/feature request with the third party. Modifying
kernel code because you can't directly modify the application isn't
the best solution for anyone. This really is an application problem
- the kernel already provides the mechanisms to solve this
problem... :/

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-25 00:17:10

by YingHang Zhu

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner <[email protected]> wrote:
> On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
>> Hi Dave,
>> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner <[email protected]> wrote:
>> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
>> >> Hi,
>> >> Recently we ran into the bug that an opened file's ra_pages does not
>> >> synchronize with it's backing device's when the latter is changed
>> >> with blockdev --setra, the application needs to reopen the file
>> >> to know the change,
>> >
>> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
>> > window to the (new) bdi default.
>> >
>> >> which is inappropriate under our circumstances.
>> >
>> > Which are? We don't know your circumstances, so you need to tell us
>> > why you need this and why existing methods of handling such changes
>> > are insufficient...
>> >
>> > Optimal readahead windows tend to be a physical property of the
>> > storage and that does not tend to change dynamically. Hence block
>> > device readahead should only need to be set up once, and generally
>> > that can be done before the filesystem is mounted and files are
>> > opened (e.g. via udev rules). Hence you need to explain why you need
>> > to change the default block device readahead on the fly, and why
>> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
>> > windows to the new defaults.
>> Our system is a fuse-based file system, fuse creates a
>> pseudo backing device for the user space file systems, the default readahead
>> size is 128KB and it can't fully utilize the backing storage's read ability,
>> so we should tune it.
>
> Sure, but that doesn't tell me anything about why you can't do this
> at mount time before the application opens any files. i.e. you've
> simply stated the reason why readahead is tunable, not why you need
> to be fully dynamic.....
We store our file system's data on different disks, so we need to
change ra_pages dynamically according to where the data resides; it
can't be fixed at mount time or when we open files.
The abstract bdi of fuse and btrfs provides a dynamically changing
bdi.ra_pages based on the real backing devices. IMHO this should not
be ignored.
>
>> The above third-party application using our file system maintains
>> some long-opened files, we does not have any chances
>> to force them to call fadvise(POSIX_FADV_NORMAL). :(
>
> So raise a bug/feature request with the third party. Modifying
> kernel code because you can't directly modify the application isn't
> the best solution for anyone. This really is an application problem
> - the kernel already provides the mechanisms to solve this
> problem... :/
Thanks for the advice, I will consult the above application's
developers for more information.
Now, from the code itself, should we close the gap between the real
device's ra_pages and the file's?
Obviously the ra_pages is duplicated; otherwise, each time we run into
this problem, someone will do the same work as I have done here.

Thanks,
Ying Zhu
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2012-10-25 01:49:00

by Ni zhan Chen

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/25/2012 08:17 AM, YingHang Zhu wrote:
> On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner <[email protected]> wrote:
>> On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
>>> Hi Dave,
>>> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner <[email protected]> wrote:
>>>> On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
>>>>> Hi,
>>>>> Recently we ran into the bug that an opened file's ra_pages does not
>>>>> synchronize with it's backing device's when the latter is changed
>>>>> with blockdev --setra, the application needs to reopen the file
>>>>> to know the change,
>>>> or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
>>>> window to the (new) bdi default.
>>>>
>>>>> which is inappropriate under our circumstances.
>>>> Which are? We don't know your circumstances, so you need to tell us
>>>> why you need this and why existing methods of handling such changes
>>>> are insufficient...
>>>>
>>>> Optimal readahead windows tend to be a physical property of the
>>>> storage and that does not tend to change dynamically. Hence block
>>>> device readahead should only need to be set up once, and generally
>>>> that can be done before the filesystem is mounted and files are
>>>> opened (e.g. via udev rules). Hence you need to explain why you need
>>>> to change the default block device readahead on the fly, and why
>>>> fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
>>>> windows to the new defaults.
>>> Our system is a fuse-based file system, fuse creates a
>>> pseudo backing device for the user space file systems, the default readahead
>>> size is 128KB and it can't fully utilize the backing storage's read ability,
>>> so we should tune it.
>> Sure, but that doesn't tell me anything about why you can't do this
>> at mount time before the application opens any files. i.e. you've
>> simply stated the reason why readahead is tunable, not why you need
>> to be fully dynamic.....
> We store our file system's data on different disks so we need to change ra_pages
> dynamically according to where the data resides, it can't be fixed at mount time
> or when we open files.
> The abstract bdi of fuse and btrfs provides some dynamically changing
> bdi.ra_pages
> based on the real backing device. IMHO this should not be ignored.

And how do we tune ra_pages if one big file is distributed across
different disks? I think Fengguang Wu can answer these questions.

Hi Fengguang,

>>> The above third-party application using our file system maintains
>>> some long-opened files, we does not have any chances
>>> to force them to call fadvise(POSIX_FADV_NORMAL). :(
>> So raise a bug/feature request with the third party. Modifying
>> kernel code because you can't directly modify the application isn't
>> the best solution for anyone. This really is an application problem
>> - the kernel already provides the mechanisms to solve this
>> problem... :/
> Thanks for advice, I will consult the above application's developers
> for more information.
> Now from the code itself should we merge the gap between the real
> device's ra_pages and the file's?
> Obviously the ra_pages is duplicated, otherwise each time we run into this
> problem, someone will do the same work as I have done here.
>
> Thanks,
> Ying Zhu
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> [email protected]

2012-10-25 01:50:20

by Dave Chinner

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote:
> On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner <[email protected]> wrote:
> > On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
> >> Hi Dave,
> >> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner <[email protected]> wrote:
> >> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
> >> >> Hi,
> >> >> Recently we ran into the bug that an opened file's ra_pages does not
> >> >> synchronize with it's backing device's when the latter is changed
> >> >> with blockdev --setra, the application needs to reopen the file
> >> >> to know the change,
> >> >
> >> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
> >> > window to the (new) bdi default.
> >> >
> >> >> which is inappropriate under our circumstances.
> >> >
> >> > Which are? We don't know your circumstances, so you need to tell us
> >> > why you need this and why existing methods of handling such changes
> >> > are insufficient...
> >> >
> >> > Optimal readahead windows tend to be a physical property of the
> >> > storage and that does not tend to change dynamically. Hence block
> >> > device readahead should only need to be set up once, and generally
> >> > that can be done before the filesystem is mounted and files are
> >> > opened (e.g. via udev rules). Hence you need to explain why you need
> >> > to change the default block device readahead on the fly, and why
> >> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
> >> > windows to the new defaults.
> >> Our system is a fuse-based file system, fuse creates a
> >> pseudo backing device for the user space file systems, the default readahead
> >> size is 128KB and it can't fully utilize the backing storage's read ability,
> >> so we should tune it.
> >
> > Sure, but that doesn't tell me anything about why you can't do this
> > at mount time before the application opens any files. i.e. you've
> > simply stated the reason why readahead is tunable, not why you need
> > to be fully dynamic.....
> We store our file system's data on different disks so we need to change ra_pages
> dynamically according to where the data resides, it can't be fixed at mount time
> or when we open files.

That doesn't make a whole lot of sense to me. Let me try to get this
straight.

There is data that resides on two devices (A + B), and a fuse
filesystem to access that data. There is a single file in the fuse
fs that has data on both devices. An app has the file open, and when the
data it is accessing is on device A you need to set the readahead to
what is best for device A? And when the app tries to access data for
that file that is on device B, you need to set the readahead to what
is best for device B? And you are changing the fuse BDI readahead
settings according to where the data in the back end lies?

It seems to me that you should be setting the fuse readahead to the
maximum of the readahead windows the data devices have configured at
mount time and leaving it at that....

> The abstract bdi of fuse and btrfs provides some dynamically changing
> bdi.ra_pages
> based on the real backing device. IMHO this should not be ignored.

btrfs simply takes into account the number of disks it has for a
given storage pool when setting up the default bdi ra_pages during
mount. This is basically doing what I suggested above. Same with
the generic fuse code - it's simply setting a sensible default value
for the given fuse configuration.

Neither are dynamic in the sense you are talking about, though.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-25 02:04:13

by YingHang Zhu

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner <[email protected]> wrote:
> On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote:
>> On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner <[email protected]> wrote:
>> > On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
>> >> Hi Dave,
>> >> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner <[email protected]> wrote:
>> >> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
>> >> >> Hi,
>> >> >> Recently we ran into the bug that an opened file's ra_pages does not
>> >> >> synchronize with it's backing device's when the latter is changed
>> >> >> with blockdev --setra, the application needs to reopen the file
>> >> >> to know the change,
>> >> >
>> >> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
>> >> > window to the (new) bdi default.
>> >> >
>> >> >> which is inappropriate under our circumstances.
>> >> >
>> >> > Which are? We don't know your circumstances, so you need to tell us
>> >> > why you need this and why existing methods of handling such changes
>> >> > are insufficient...
>> >> >
>> >> > Optimal readahead windows tend to be a physical property of the
>> >> > storage and that does not tend to change dynamically. Hence block
>> >> > device readahead should only need to be set up once, and generally
>> >> > that can be done before the filesystem is mounted and files are
>> >> > opened (e.g. via udev rules). Hence you need to explain why you need
>> >> > to change the default block device readahead on the fly, and why
>> >> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
>> >> > windows to the new defaults.
>> >> Our system is a fuse-based file system, fuse creates a
>> >> pseudo backing device for the user space file systems, the default readahead
>> >> size is 128KB and it can't fully utilize the backing storage's read ability,
>> >> so we should tune it.
>> >
>> > Sure, but that doesn't tell me anything about why you can't do this
>> > at mount time before the application opens any files. i.e. you've
>> > simply stated the reason why readahead is tunable, not why you need
>> > to be fully dynamic.....
>> We store our file system's data on different disks so we need to change ra_pages
>> dynamically according to where the data resides, it can't be fixed at mount time
>> or when we open files.
>
> That doesn't make a whole lot of sense to me. let me try to get this
> straight.
>
> There is data that resides on two devices (A + B), and a fuse
> filesystem to access that data. There is a single file in the fuse
> fs has data on both devices. An app has the file open, and when the
> data it is accessing is on device A you need to set the readahead to
> what is best for device A? And when the app tries to access data for
> that file that is on device B, you need to set the readahead to what
> is best for device B? And you are changing the fuse BDI readahead
> settings according to where the data in the back end lies?
>
> It seems to me that you should be setting the fuse readahead to the
> maximum of the readahead windows the data devices have configured at
> mount time and leaving it at that....
Then it may not fully utilize some devices' read IO bandwidth and may
put too much burden on other devices.
>
>> The abstract bdi of fuse and btrfs provides some dynamically changing
>> bdi.ra_pages
>> based on the real backing device. IMHO this should not be ignored.
>
> btrfs simply takes into account the number of disks it has for a
> given storage pool when setting up the default bdi ra_pages during
> mount. This is basically doing what I suggested above. Same with
> the generic fuse code - it's simply setting a sensible default value
> for the given fuse configuration.
>
> Neither are dynamic in the sense you are talking about, though.
Actually I've talked about it with Fengguang; he advised that we
should unify the ra_pages in struct bdi and file_ra_state and leave
the issue of data spreading across disks as it is.
Fengguang, what's your opinion about this?

Thanks,
Ying Zhu
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2012-10-25 02:12:23

by Ni zhan Chen

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/25/2012 10:04 AM, YingHang Zhu wrote:
> On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner <[email protected]> wrote:
>> On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote:
>>> On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner <[email protected]> wrote:
>>>> On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
>>>>> Hi Dave,
>>>>> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner <[email protected]> wrote:
>>>>>> On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
>>>>>>> Hi,
>>>>>>> Recently we ran into the bug that an opened file's ra_pages does not
>>>>>>> synchronize with it's backing device's when the latter is changed
>>>>>>> with blockdev --setra, the application needs to reopen the file
>>>>>>> to know the change,
>>>>>> or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
>>>>>> window to the (new) bdi default.
>>>>>>
>>>>>>> which is inappropriate under our circumstances.
>>>>>> Which are? We don't know your circumstances, so you need to tell us
>>>>>> why you need this and why existing methods of handling such changes
>>>>>> are insufficient...
>>>>>>
>>>>>> Optimal readahead windows tend to be a physical property of the
>>>>>> storage and that does not tend to change dynamically. Hence block
>>>>>> device readahead should only need to be set up once, and generally
>>>>>> that can be done before the filesystem is mounted and files are
>>>>>> opened (e.g. via udev rules). Hence you need to explain why you need
>>>>>> to change the default block device readahead on the fly, and why
>>>>>> fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
>>>>>> windows to the new defaults.
>>>>> Our system is a fuse-based file system, fuse creates a
>>>>> pseudo backing device for the user space file systems, the default readahead
>>>>> size is 128KB and it can't fully utilize the backing storage's read ability,
>>>>> so we should tune it.
>>>> Sure, but that doesn't tell me anything about why you can't do this
>>>> at mount time before the application opens any files. i.e. you've
>>>> simply stated the reason why readahead is tunable, not why you need
>>>> to be fully dynamic.....
>>> We store our file system's data on different disks so we need to change ra_pages
>>> dynamically according to where the data resides, it can't be fixed at mount time
>>> or when we open files.
>> That doesn't make a whole lot of sense to me. let me try to get this
>> straight.
>>
>> There is data that resides on two devices (A + B), and a fuse
>> filesystem to access that data. There is a single file in the fuse
>> fs has data on both devices. An app has the file open, and when the
>> data it is accessing is on device A you need to set the readahead to
>> what is best for device A? And when the app tries to access data for
>> that file that is on device B, you need to set the readahead to what
>> is best for device B? And you are changing the fuse BDI readahead
>> settings according to where the data in the back end lies?
>>
>> It seems to me that you should be setting the fuse readahead to the
>> maximum of the readahead windows the data devices have configured at
>> mount time and leaving it at that....
> Then it may not fully utilize some device's read IO bandwidth and put too much
> burden on other devices.
>>> The abstract bdi of fuse and btrfs provides some dynamically changing
>>> bdi.ra_pages
>>> based on the real backing device. IMHO this should not be ignored.
>> btrfs simply takes into account the number of disks it has for a
>> given storage pool when setting up the default bdi ra_pages during
>> mount. This is basically doing what I suggested above. Same with
>> the generic fuse code - it's simply setting a sensible default value
>> for the given fuse configuration.
>>
>> Neither are dynamic in the sense you are talking about, though.
> Actually I've talked about it with Fengguang, he advised we should unify the

But how can a bdi-wide ra_pages reflect different files' readahead
windows? These different files may be sequential reads, random reads
and so on.

> ra_pages in struct bdi and file_ra_state and leave the issue that
> spreading data
> across disks as it is.
> Fengguang, what's you opinion about this?
>
> Thanks,
> Ying Zhu
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> [email protected]

2012-10-25 02:31:34

by YingHang Zhu

Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Thu, Oct 25, 2012 at 10:12 AM, Ni zhan Chen <[email protected]> wrote:
> On 10/25/2012 10:04 AM, YingHang Zhu wrote:
>>
>> On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner <[email protected]> wrote:
>>>
>>> On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote:
>>>>
>>>> On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner <[email protected]>
>>>> wrote:
>>>>>
>>>>> On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
>>>>>>
>>>>>> Hi Dave,
>>>>>> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> Recently we ran into the bug that an opened file's ra_pages does
>>>>>>>> not
>>>>>>>> synchronize with it's backing device's when the latter is changed
>>>>>>>> with blockdev --setra, the application needs to reopen the file
>>>>>>>> to know the change,
>>>>>>>
>>>>>>> or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
>>>>>>> window to the (new) bdi default.
>>>>>>>
>>>>>>>> which is inappropriate under our circumstances.
>>>>>>>
>>>>>>> Which are? We don't know your circumstances, so you need to tell us
>>>>>>> why you need this and why existing methods of handling such changes
>>>>>>> are insufficient...
>>>>>>>
>>>>>>> Optimal readahead windows tend to be a physical property of the
>>>>>>> storage and that does not tend to change dynamically. Hence block
>>>>>>> device readahead should only need to be set up once, and generally
>>>>>>> that can be done before the filesystem is mounted and files are
>>>>>>> opened (e.g. via udev rules). Hence you need to explain why you need
>>>>>>> to change the default block device readahead on the fly, and why
>>>>>>> fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
>>>>>>> windows to the new defaults.
>>>>>>
>>>>>> Our system is a fuse-based file system, fuse creates a
>>>>>> pseudo backing device for the user space file systems, the default
>>>>>> readahead
>>>>>> size is 128KB and it can't fully utilize the backing storage's read
>>>>>> ability,
>>>>>> so we should tune it.
>>>>>
>>>>> Sure, but that doesn't tell me anything about why you can't do this
>>>>> at mount time before the application opens any files. i.e. you've
>>>>> simply stated the reason why readahead is tunable, not why you need
>>>>> to be fully dynamic.....
>>>>
>>>> We store our file system's data on different disks so we need to change
>>>> ra_pages
>>>> dynamically according to where the data resides, it can't be fixed at
>>>> mount time
>>>> or when we open files.
>>>
>>> That doesn't make a whole lot of sense to me. let me try to get this
>>> straight.
>>>
>>> There is data that resides on two devices (A + B), and a fuse
>>> filesystem to access that data. There is a single file in the fuse
>>> fs has data on both devices. An app has the file open, and when the
>>> data it is accessing is on device A you need to set the readahead to
>>> what is best for device A? And when the app tries to access data for
>>> that file that is on device B, you need to set the readahead to what
>>> is best for device B? And you are changing the fuse BDI readahead
>>> settings according to where the data in the back end lies?
>>>
>>> It seems to me that you should be setting the fuse readahead to the
>>> maximum of the readahead windows the data devices have configured at
>>> mount time and leaving it at that....
>>
>> Then it may not fully utilize some device's read IO bandwidth and put too
>> much
>> burden on other devices.
>>>>
>>>> The abstract bdi of fuse and btrfs provides some dynamically changing
>>>> bdi.ra_pages
>>>> based on the real backing device. IMHO this should not be ignored.
>>>
>>> btrfs simply takes into account the number of disks it has for a
>>> given storage pool when setting up the default bdi ra_pages during
>>> mount. This is basically doing what I suggested above. Same with
>>> the generic fuse code - it's simply setting a sensible default value
>>> for the given fuse configuration.
>>>
>>> Neither are dynamic in the sense you are talking about, though.
>>
>> Actually I've talked about it with Fengguang, he advised we should unify
>> the
>
>
> But how can bdi related ra_pages reflect different files' readahead window?
> Maybe these different files are sequential read, random read and so on.
I think you mean the dynamic tuning of the readahead window; that's exactly the
job of the readahead algorithm, and it's reflected in file_ra_state.size and
file_ra_state.async_size.
The ra_pages in struct file_ra_state only sets the maximum readahead size.

Thanks,
Ying Zhu
>
>> ra_pages in struct bdi and file_ra_state and leave the issue that
>> spreading data
>> across disks as it is.
>> Fengguang, what's you opinion about this?
>>
>> Thanks,
>> Ying Zhu
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> [email protected]
>>
>> --
>>
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>>
>

2012-10-25 02:38:13

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

Hi YingHang,

> Actually I've talked about it with Fengguang, he advised we should unify the
> ra_pages in struct bdi and file_ra_state and leave the issue that
> spreading data
> across disks as it is.
> Fengguang, what's you opinion about this?

Yeah the two ra_pages may run out of sync for already opened files,
which could be a problem for long-open files. However, as Dave put
it, a device's max readahead size is typically a static value that can
be set at mount time. So the question is: does the old behavior really
hurt you enough to deserve this code change?

I agree with Dave that the multi-disk case is not a valid concern. In
fact, how can the patch help that case? I mean, if it's two fuse files
lying on two disks, it *was* not a problem at all. If it's one big
file spread across two disks, that's too complex a scheme to be
practically manageable, and I doubt you have such a setup.

Thanks,
Fengguang

2012-10-25 03:08:20

by YingHang Zhu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Thu, Oct 25, 2012 at 10:38 AM, Fengguang Wu <[email protected]> wrote:
> Hi YingHang,
>
>> Actually I've talked about it with Fengguang, he advised we should unify the
>> ra_pages in struct bdi and file_ra_state and leave the issue that
>> spreading data
>> across disks as it is.
>> Fengguang, what's you opinion about this?
>
> Yeah the two ra_pages may run out of sync for already opened files,
> which could be a problem for long opened files. However as Dave put
> it, a device's max readahead size is typically a static value that can
> be set at mount time. So, the question is: do you really hurt from the
> old behavior that deserves this code change?
We could advise the above application to reopen its files.
As I mentioned previously, many scst users also hit this problem:
[quote]
Note2: you need to restart SCST after you changed read-ahead settings
on the target. It is a limitation of the Linux read ahead
implementation. It reads RA values for each file only when the file
is open and not updates them when the global RA parameters changed.
Hence, the need for vdisk to reopen all its files/devices.
[/quote]
So IMHO it's a functional bug in the kernel that inconveniences
application developers.
>
> I agree with Dave that the multi-disk case is not a valid concern. In
> fact, how can the patch help that case? I mean, if it's two fuse files
> lying in two disks, it *was* not a problem at all. If it's one big
> file spreading to two disks, it's a too complex scheme to be
> practically manageable which I doubt if you have such a setup.
Yes, this patch does not solve that issue. I'm just pushing the discussion
a little further; in reality we may never encounter such a setup.

Thanks,
Ying Hang
>
> Thanks,
> Fengguang

2012-10-25 03:12:36

by YingHang Zhu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Thu, Oct 25, 2012 at 10:58 AM, Fengguang Wu <[email protected]> wrote:
> Hi Chen,
>
>> But how can bdi related ra_pages reflect different files' readahead
>> window? Maybe these different files are sequential read, random read
>> and so on.
>
> It's simple: sequential reads will get ra_pages readahead size while
> random reads will not get readahead at all.
>
> Talking about the below chunk, it might hurt someone that explicitly
> takes advantage of the behavior, however the ra_pages*2 seems more
> like a hack than general solution to me: if the user will need
> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
> improving IO performance, then why not just increase bdi->ra_pages and
> benefit all reads? One may argue that it offers some differential
> behavior to specific applications, however it may also present as a
> counter-optimization: if the root already tuned bdi->ra_pages to the
> optimal size, the doubled readahead size will only cost more memory
> and perhaps IO latency.
I agree, we should choose the reasonable solution here.

Thanks,
Ying Zhu
>
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> spin_unlock(&file->f_lock);
> break;
> case POSIX_FADV_SEQUENTIAL:
> - file->f_ra.ra_pages = bdi->ra_pages * 2;
> spin_lock(&file->f_lock);
> file->f_mode &= ~FMODE_RANDOM;
> spin_unlock(&file->f_lock);
>
> Thanks,
> Fengguang

2012-10-25 04:29:15

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

Hi Chen,

> But how can bdi related ra_pages reflect different files' readahead
> window? Maybe these different files are sequential read, random read
> and so on.

It's simple: sequential reads will get ra_pages readahead size while
random reads will not get readahead at all.

Talking about the below chunk, it might hurt someone that explicitly
takes advantage of the behavior, however the ra_pages*2 seems more
like a hack than general solution to me: if the user will need
POSIX_FADV_SEQUENTIAL to double the max readahead window size for
improving IO performance, then why not just increase bdi->ra_pages and
benefit all reads? One may argue that it offers some differential
behavior to specific applications, however it may also present as a
counter-optimization: if the root already tuned bdi->ra_pages to the
optimal size, the doubled readahead size will only cost more memory
and perhaps IO latency.

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
- file->f_ra.ra_pages = bdi->ra_pages * 2;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);

Thanks,
Fengguang

2012-10-26 00:26:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
> Hi Chen,
>
> > But how can bdi related ra_pages reflect different files' readahead
> > window? Maybe these different files are sequential read, random read
> > and so on.
>
> It's simple: sequential reads will get ra_pages readahead size while
> random reads will not get readahead at all.
>
> Talking about the below chunk, it might hurt someone that explicitly
> takes advantage of the behavior, however the ra_pages*2 seems more
> like a hack than general solution to me: if the user will need
> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
> improving IO performance, then why not just increase bdi->ra_pages and
> benefit all reads? One may argue that it offers some differential
> behavior to specific applications, however it may also present as a
> counter-optimization: if the root already tuned bdi->ra_pages to the
> optimal size, the doubled readahead size will only cost more memory
> and perhaps IO latency.
>
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> spin_unlock(&file->f_lock);
> break;
> case POSIX_FADV_SEQUENTIAL:
> - file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is
not a set-and-forget value. e.g. shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do sequential
IO should reset the readahead to the maximum ra_pages value
supported....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-26 01:38:07

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:
> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
> > Hi Chen,
> >
> > > But how can bdi related ra_pages reflect different files' readahead
> > > window? Maybe these different files are sequential read, random read
> > > and so on.
> >
> > It's simple: sequential reads will get ra_pages readahead size while
> > random reads will not get readahead at all.
> >
> > Talking about the below chunk, it might hurt someone that explicitly
> > takes advantage of the behavior, however the ra_pages*2 seems more
> > like a hack than general solution to me: if the user will need
> > POSIX_FADV_SEQUENTIAL to double the max readahead window size for
> > improving IO performance, then why not just increase bdi->ra_pages and
> > benefit all reads? One may argue that it offers some differential
> > behavior to specific applications, however it may also present as a
> > counter-optimization: if the root already tuned bdi->ra_pages to the
> > optimal size, the doubled readahead size will only cost more memory
> > and perhaps IO latency.
> >
> > --- a/mm/fadvise.c
> > +++ b/mm/fadvise.c
> > @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> > spin_unlock(&file->f_lock);
> > break;
> > case POSIX_FADV_SEQUENTIAL:
> > - file->f_ra.ra_pages = bdi->ra_pages * 2;
>
> I think we really have to reset file->f_ra.ra_pages here as it is
> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
> reduce ra_pages as a result of IO errors. Hence if you have had io
> errors, telling the kernel that you are now going to do sequential
> IO should reset the readahead to the maximum ra_pages value
> supported....

Good point!

.... but wait .... this patch removes file->f_ra.ra_pages in all other
places too, so there will be no file->f_ra.ra_pages to be reset here...

Thanks,
Fengguang

2012-10-26 02:25:11

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/26/2012 08:25 AM, Dave Chinner wrote:
> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>> Hi Chen,
>>
>>> But how can bdi related ra_pages reflect different files' readahead
>>> window? Maybe these different files are sequential read, random read
>>> and so on.
>> It's simple: sequential reads will get ra_pages readahead size while
>> random reads will not get readahead at all.
>>
>> Talking about the below chunk, it might hurt someone that explicitly
>> takes advantage of the behavior, however the ra_pages*2 seems more
>> like a hack than general solution to me: if the user will need
>> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
>> improving IO performance, then why not just increase bdi->ra_pages and
>> benefit all reads? One may argue that it offers some differential
>> behavior to specific applications, however it may also present as a
>> counter-optimization: if the root already tuned bdi->ra_pages to the
>> optimal size, the doubled readahead size will only cost more memory
>> and perhaps IO latency.
>>
>> --- a/mm/fadvise.c
>> +++ b/mm/fadvise.c
>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>> spin_unlock(&file->f_lock);
>> break;
>> case POSIX_FADV_SEQUENTIAL:
>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
> I think we really have to reset file->f_ra.ra_pages here as it is
> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
> reduce ra_pages as a result of IO errors. Hence if you have had io
> errors, telling the kernel that you are now going to do sequential
> IO should reset the readahead to the maximum ra_pages value
> supported....

Good catch!

>
> Cheers,
>
> Dave.

2012-10-26 02:30:20

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/26/2012 09:27 AM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:
>> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>>> Hi Chen,
>>>
>>>> But how can bdi related ra_pages reflect different files' readahead
>>>> window? Maybe these different files are sequential read, random read
>>>> and so on.
>>> It's simple: sequential reads will get ra_pages readahead size while
>>> random reads will not get readahead at all.
>>>
>>> Talking about the below chunk, it might hurt someone that explicitly
>>> takes advantage of the behavior, however the ra_pages*2 seems more
>>> like a hack than general solution to me: if the user will need
>>> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
>>> improving IO performance, then why not just increase bdi->ra_pages and
>>> benefit all reads? One may argue that it offers some differential
>>> behavior to specific applications, however it may also present as a
>>> counter-optimization: if the root already tuned bdi->ra_pages to the
>>> optimal size, the doubled readahead size will only cost more memory
>>> and perhaps IO latency.
>>>
>>> --- a/mm/fadvise.c
>>> +++ b/mm/fadvise.c
>>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>> spin_unlock(&file->f_lock);
>>> break;
>>> case POSIX_FADV_SEQUENTIAL:
>>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>> I think we really have to reset file->f_ra.ra_pages here as it is
>> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
>> reduce ra_pages as a result of IO errors. Hence if you have had io
>> errors, telling the kernel that you are now going to do sequential
>> IO should reset the readahead to the maximum ra_pages value
>> supported....
> Good point!
>
> .... but wait .... this patch removes file->f_ra.ra_pages in all other
> places too, so there will be no file->f_ra.ra_pages to be reset here...

In his patch,

static void shrink_readahead_size_eio(struct file *filp,
struct file_ra_state *ra)
{
- ra->ra_pages /= 4;
+ spin_lock(&filp->f_lock);
+ filp->f_mode |= FMODE_RANDOM;
+ spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still
be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM
directly.

>
> Thanks,
> Fengguang
>

2012-10-26 03:28:12

by YingHang Zhu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen <[email protected]> wrote:
> On 10/26/2012 09:27 AM, Fengguang Wu wrote:
>>
>> On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:
>>>
>>> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>>>>
>>>> Hi Chen,
>>>>
>>>>> But how can bdi related ra_pages reflect different files' readahead
>>>>> window? Maybe these different files are sequential read, random read
>>>>> and so on.
>>>>
>>>> It's simple: sequential reads will get ra_pages readahead size while
>>>> random reads will not get readahead at all.
>>>>
>>>> Talking about the below chunk, it might hurt someone that explicitly
>>>> takes advantage of the behavior, however the ra_pages*2 seems more
>>>> like a hack than general solution to me: if the user will need
>>>> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
>>>> improving IO performance, then why not just increase bdi->ra_pages and
>>>> benefit all reads? One may argue that it offers some differential
>>>> behavior to specific applications, however it may also present as a
>>>> counter-optimization: if the root already tuned bdi->ra_pages to the
>>>> optimal size, the doubled readahead size will only cost more memory
>>>> and perhaps IO latency.
>>>>
>>>> --- a/mm/fadvise.c
>>>> +++ b/mm/fadvise.c
>>>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>>>> loff_t len, int advice)
>>>> spin_unlock(&file->f_lock);
>>>> break;
>>>> case POSIX_FADV_SEQUENTIAL:
>>>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>>>
>>> I think we really have to reset file->f_ra.ra_pages here as it is
>>> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
>>> reduce ra_pages as a result of IO errors. Hence if you have had io
>>> errors, telling the kernel that you are now going to do sequential
>>> IO should reset the readahead to the maximum ra_pages value
>>> supported....
>>
>> Good point!
>>
>> .... but wait .... this patch removes file->f_ra.ra_pages in all other
>> places too, so there will be no file->f_ra.ra_pages to be reset here...
>
>
> In his patch,
>
>
> static void shrink_readahead_size_eio(struct file *filp,
> struct file_ra_state *ra)
> {
> - ra->ra_pages /= 4;
> + spin_lock(&filp->f_lock);
> + filp->f_mode |= FMODE_RANDOM;
> + spin_unlock(&filp->f_lock);
>
> As the example in comment above this function, the read maybe still
> sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
> directly.
I've considered this. On the first try I modified file_ra_state.size and
file_ra_state.async_size directly, like

file_ra_state.async_size = 0;
file_ra_state.size /= 4;

but as I commented here, we cannot predict whether the bad sectors will
trash the readahead window; maybe the sectors following the current one
are fine for normal readahead, it's hard to know. The current approach
gives us a chance to slow down softly.

Thanks,
Ying Zhu
>
>>
>> Thanks,
>> Fengguang
>>
>

2012-10-26 03:38:13

by YingHang Zhu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 8:25 AM, Dave Chinner <[email protected]> wrote:
> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>> Hi Chen,
>>
>> > But how can bdi related ra_pages reflect different files' readahead
>> > window? Maybe these different files are sequential read, random read
>> > and so on.
>>
>> It's simple: sequential reads will get ra_pages readahead size while
>> random reads will not get readahead at all.
>>
>> Talking about the below chunk, it might hurt someone that explicitly
>> takes advantage of the behavior, however the ra_pages*2 seems more
>> like a hack than general solution to me: if the user will need
>> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
>> improving IO performance, then why not just increase bdi->ra_pages and
>> benefit all reads? One may argue that it offers some differential
>> behavior to specific applications, however it may also present as a
>> counter-optimization: if the root already tuned bdi->ra_pages to the
>> optimal size, the doubled readahead size will only cost more memory
>> and perhaps IO latency.
>>
>> --- a/mm/fadvise.c
>> +++ b/mm/fadvise.c
>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>> spin_unlock(&file->f_lock);
>> break;
>> case POSIX_FADV_SEQUENTIAL:
>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>
> I think we really have to reset file->f_ra.ra_pages here as it is
> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
> reduce ra_pages as a result of IO errors. Hence if you have had io
> errors, telling the kernel that you are now going to do sequential
> IO should reset the readahead to the maximum ra_pages value
> supported....
If we unify file->f_ra.ra_pages and its bdi->ra_pages, then an error-prone
device's readahead can be directly tuned or turned off with blockdev,
thus affecting all files using the device, without bringing in more
complexity...

Thanks,
Ying Zhu
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2012-10-26 03:52:07

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/26/2012 11:28 AM, YingHang Zhu wrote:
> On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen <[email protected]> wrote:
>> On 10/26/2012 09:27 AM, Fengguang Wu wrote:
>>> On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:
>>>> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>>>>> Hi Chen,
>>>>>
>>>>>> But how can bdi related ra_pages reflect different files' readahead
>>>>>> window? Maybe these different files are sequential read, random read
>>>>>> and so on.
>>>>> It's simple: sequential reads will get ra_pages readahead size while
>>>>> random reads will not get readahead at all.
>>>>>
>>>>> Talking about the below chunk, it might hurt someone that explicitly
>>>>> takes advantage of the behavior, however the ra_pages*2 seems more
>>>>> like a hack than general solution to me: if the user will need
>>>>> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
>>>>> improving IO performance, then why not just increase bdi->ra_pages and
>>>>> benefit all reads? One may argue that it offers some differential
>>>>> behavior to specific applications, however it may also present as a
>>>>> counter-optimization: if the root already tuned bdi->ra_pages to the
>>>>> optimal size, the doubled readahead size will only cost more memory
>>>>> and perhaps IO latency.
>>>>>
>>>>> --- a/mm/fadvise.c
>>>>> +++ b/mm/fadvise.c
>>>>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>>>>> loff_t len, int advice)
>>>>> spin_unlock(&file->f_lock);
>>>>> break;
>>>>> case POSIX_FADV_SEQUENTIAL:
>>>>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>>>> I think we really have to reset file->f_ra.ra_pages here as it is
>>>> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
>>>> reduce ra_pages as a result of IO errors. Hence if you have had io
>>>> errors, telling the kernel that you are now going to do sequential
>>>> IO should reset the readahead to the maximum ra_pages value
>>>> supported....
>>> Good point!
>>>
>>> .... but wait .... this patch removes file->f_ra.ra_pages in all other
>>> places too, so there will be no file->f_ra.ra_pages to be reset here...
>>
>> In his patch,
>>
>>
>> static void shrink_readahead_size_eio(struct file *filp,
>> struct file_ra_state *ra)
>> {
>> - ra->ra_pages /= 4;
>> + spin_lock(&filp->f_lock);
>> + filp->f_mode |= FMODE_RANDOM;
>> + spin_unlock(&filp->f_lock);
>>
>> As the example in comment above this function, the read maybe still
>> sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
>> directly.
> I've considered about this. On the first try I modified file_ra_state.size and
> file_ra_state.async_size directly, like
>
> file_ra_state.async_size = 0;
> file_ra_state.size /= 4;
>
> but as what I comment here, we can not
> predict whether the bad sectors will trash the readahead window, maybe the
> following sectors after current one are ok to go in normal readahead,
> it's hard to know,
> the current approach gives us a chance to slow down softly.

Then when will the FMODE_RANDOM flag in filp->f_mode be checked? Will it
influence ra->ra_pages?

>
> Thanks,
> Ying Zhu
>>> Thanks,
>>> Fengguang
>>>

2012-10-26 03:55:55

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 11:38:11AM +0800, YingHang Zhu wrote:
> On Fri, Oct 26, 2012 at 8:25 AM, Dave Chinner <[email protected]> wrote:
> > On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
> >> Hi Chen,
> >>
> >> > But how can bdi related ra_pages reflect different files' readahead
> >> > window? Maybe these different files are sequential read, random read
> >> > and so on.
> >>
> >> It's simple: sequential reads will get ra_pages readahead size while
> >> random reads will not get readahead at all.
> >>
> >> Talking about the below chunk, it might hurt someone that explicitly
> >> takes advantage of the behavior, however the ra_pages*2 seems more
> >> like a hack than general solution to me: if the user will need
> >> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
> >> improving IO performance, then why not just increase bdi->ra_pages and
> >> benefit all reads? One may argue that it offers some differential
> >> behavior to specific applications, however it may also present as a
> >> counter-optimization: if the root already tuned bdi->ra_pages to the
> >> optimal size, the doubled readahead size will only cost more memory
> >> and perhaps IO latency.
> >>
> >> --- a/mm/fadvise.c
> >> +++ b/mm/fadvise.c
> >> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> >> spin_unlock(&file->f_lock);
> >> break;
> >> case POSIX_FADV_SEQUENTIAL:
> >> - file->f_ra.ra_pages = bdi->ra_pages * 2;
> >
> > I think we really have to reset file->f_ra.ra_pages here as it is
> > not a set-and-forget value. e.g. shrink_readahead_size_eio() can
> > reduce ra_pages as a result of IO errors. Hence if you have had io
> > errors, telling the kernel that you are now going to do sequential
> > IO should reset the readahead to the maximum ra_pages value
> > supported....
> If we unify file->f_ra.ra_pages and its' bdi->ra_pages, then the error-prone
> device's readahead can be directly tuned or turned off with blockdev
> thus affect all files
> using the device and without bring more complexity...

It's not really feasible/convenient for the end users to hand tune
blockdev readahead size on IO errors. Even many administrators are
totally unaware of the readahead size parameter.

Thanks,
Fengguang

2012-10-26 04:35:29

by YingHang Zhu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 11:51 AM, Ni zhan Chen <[email protected]> wrote:
> On 10/26/2012 11:28 AM, YingHang Zhu wrote:
>>
>> On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen <[email protected]>
>> wrote:
>>>
>>> On 10/26/2012 09:27 AM, Fengguang Wu wrote:
>>>>
>>>> On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:
>>>>>
>>>>> On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>>>>>>
>>>>>> Hi Chen,
>>>>>>
>>>>>>> But how can bdi related ra_pages reflect different files' readahead
>>>>>>> window? Maybe these different files are sequential read, random read
>>>>>>> and so on.
>>>>>>
>>>>>> It's simple: sequential reads will get ra_pages readahead size while
>>>>>> random reads will not get readahead at all.
>>>>>>
>>>>>> Talking about the below chunk, it might hurt someone that explicitly
>>>>>> takes advantage of the behavior, however the ra_pages*2 seems more
>>>>>> like a hack than general solution to me: if the user will need
>>>>>> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
>>>>>> improving IO performance, then why not just increase bdi->ra_pages and
>>>>>> benefit all reads? One may argue that it offers some differential
>>>>>> behavior to specific applications, however it may also present as a
>>>>>> counter-optimization: if the root already tuned bdi->ra_pages to the
>>>>>> optimal size, the doubled readahead size will only cost more memory
>>>>>> and perhaps IO latency.
>>>>>>
>>>>>> --- a/mm/fadvise.c
>>>>>> +++ b/mm/fadvise.c
>>>>>> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
>>>>>> loff_t len, int advice)
>>>>>> spin_unlock(&file->f_lock);
>>>>>> break;
>>>>>> case POSIX_FADV_SEQUENTIAL:
>>>>>> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>>>>>
>>>>> I think we really have to reset file->f_ra.ra_pages here as it is
>>>>> not a set-and-forget value. e.g. shrink_readahead_size_eio() can
>>>>> reduce ra_pages as a result of IO errors. Hence if you have had io
>>>>> errors, telling the kernel that you are now going to do sequential
>>>>> IO should reset the readahead to the maximum ra_pages value
>>>>> supported....
>>>>
>>>> Good point!
>>>>
>>>> .... but wait .... this patch removes file->f_ra.ra_pages in all other
>>>> places too, so there will be no file->f_ra.ra_pages to be reset here...
>>>
>>>
>>> In his patch,
>>>
>>>
>>> static void shrink_readahead_size_eio(struct file *filp,
>>> struct file_ra_state *ra)
>>> {
>>> - ra->ra_pages /= 4;
>>> + spin_lock(&filp->f_lock);
>>> + filp->f_mode |= FMODE_RANDOM;
>>> + spin_unlock(&filp->f_lock);
>>>
>>> As the example in comment above this function, the read maybe still
>>> sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
>>> directly.
>>
>> I've considered about this. On the first try I modified file_ra_state.size
>> and
>> file_ra_state.async_size directly, like
>>
>> file_ra_state.async_size = 0;
>> file_ra_state.size /= 4;
>>
>> but as what I comment here, we can not
>> predict whether the bad sectors will trash the readahead window, maybe the
>> following sectors after current one are ok to go in normal readahead,
>> it's hard to know,
>> the current approach gives us a chance to slow down softly.
>
>
> Then when will check filp->f_mode |= FMODE_RANDOM; ? Does it will influence
> ra->ra_pages?
You can find the relevant information in function page_cache_sync_readahead.

Thanks,
Ying Zhu
>
>
>>
>> Thanks,
>> Ying Zhu
>>>>
>>>> Thanks,
>>>> Fengguang
>>>>
>

2012-10-26 05:00:45

by YingHang Zhu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 11:55 AM, Fengguang Wu <[email protected]> wrote:
> On Fri, Oct 26, 2012 at 11:38:11AM +0800, YingHang Zhu wrote:
>> On Fri, Oct 26, 2012 at 8:25 AM, Dave Chinner <[email protected]> wrote:
>> > On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
>> >> Hi Chen,
>> >>
>> >> > But how can bdi related ra_pages reflect different files' readahead
>> >> > window? Maybe these different files are sequential read, random read
>> >> > and so on.
>> >>
>> >> It's simple: sequential reads will get ra_pages readahead size while
>> >> random reads will not get readahead at all.
>> >>
>> >> Talking about the below chunk, it might hurt someone that explicitly
>> >> takes advantage of the behavior, however the ra_pages*2 seems more
>> >> like a hack than general solution to me: if the user will need
>> >> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
>> >> improving IO performance, then why not just increase bdi->ra_pages and
>> >> benefit all reads? One may argue that it offers some differential
>> >> behavior to specific applications, however it may also present as a
>> >> counter-optimization: if the root already tuned bdi->ra_pages to the
>> >> optimal size, the doubled readahead size will only cost more memory
>> >> and perhaps IO latency.
>> >>
>> >> --- a/mm/fadvise.c
>> >> +++ b/mm/fadvise.c
>> >> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>> >> spin_unlock(&file->f_lock);
>> >> break;
>> >> case POSIX_FADV_SEQUENTIAL:
>> >> - file->f_ra.ra_pages = bdi->ra_pages * 2;
>> >
>> > I think we really have to reset file->f_ra.ra_pages here as it is
>> > not a set-and-forget value. e.g. shrink_readahead_size_eio() can
>> > reduce ra_pages as a result of IO errors. Hence if you have had io
>> > errors, telling the kernel that you are now going to do sequential
>> > IO should reset the readahead to the maximum ra_pages value
>> > supported....
>> If we unify file->f_ra.ra_pages and its' bdi->ra_pages, then the error-prone
>> device's readahead can be directly tuned or turned off with blockdev
>> thus affect all files
>> using the device and without bring more complexity...
>
> It's not really feasible/convenient for the end users to hand tune
> blockdev readahead size on IO errors. Even many administrators are
> totally unaware of the readahead size parameter.
You are right, so the problem comes down to this:
will one file's read failure affect other files? For rotating disks
and discs, a file's read failure is usually due to bad sectors, which
tend to be consecutive and won't affect other files' reads. For a tape
drive, however, a read failure usually indicates data corruption, and
other files' reads may fail as well.
In other words, should we treat the number of files that failed to read,
and where they failed, as an indicator of the backing device's status,
or should we treat these files independently?
If we choose the former, we can accumulate the statistics and
change bdi.ra_pages;
otherwise we may do some check for FMODE_RANDOM before changing the
readahead window.
I may have missed something, please point it out.
Thanks,
Ying Zhu
>
> Thanks,
> Fengguang

2012-10-26 06:59:19

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

> static void shrink_readahead_size_eio(struct file *filp,
> struct file_ra_state *ra)
> {
> - ra->ra_pages /= 4;
> + spin_lock(&filp->f_lock);
> + filp->f_mode |= FMODE_RANDOM;
> + spin_unlock(&filp->f_lock);
>
> As the example in the comment above this function shows, the read may
> still be sequential, and it will waste IO bandwidth if we switch to
> FMODE_RANDOM directly.

Yes, immediately disabling readahead may hurt IO performance; the
original '/ 4' may perform better when only 1-3 IO errors are
encountered.

Thanks,
Fengguang

2012-10-26 07:03:24

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>> static void shrink_readahead_size_eio(struct file *filp,
>> struct file_ra_state *ra)
>> {
>> - ra->ra_pages /= 4;
>> + spin_lock(&filp->f_lock);
>> + filp->f_mode |= FMODE_RANDOM;
>> + spin_unlock(&filp->f_lock);
>>
>> As the example in comment above this function, the read maybe still
>> sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
>> directly.
> Yes immediately disabling readahead may hurt IO performance, the
> original '/ 4' may perform better when there are only 1-3 IO errors
> encountered.

Hi Fengguang,

Why should the number be 1-3?

Regards,
Chen

>
> Thanks,
> Fengguang
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]
>

2012-10-26 07:09:42

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
> On 10/26/2012 02:58 PM, Fengguang Wu wrote:
> >> static void shrink_readahead_size_eio(struct file *filp,
> >> struct file_ra_state *ra)
> >> {
> >>- ra->ra_pages /= 4;
> >>+ spin_lock(&filp->f_lock);
> >>+ filp->f_mode |= FMODE_RANDOM;
> >>+ spin_unlock(&filp->f_lock);
> >>
> >>As the example in comment above this function, the read maybe still
> >>sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
> >>directly.
> >Yes immediately disabling readahead may hurt IO performance, the
> >original '/ 4' may perform better when there are only 1-3 IO errors
> >encountered.
>
> Hi Fengguang,
>
> Why the number should be 1-3?

The original behavior is '/= 4' on each error.

After 1 error, the readahead size will be shrunk to 1/4
After 2 errors, the readahead size will be shrunk to 1/16
After 3 errors, the readahead size will be shrunk to 1/64
After 4 errors, the readahead size will be effectively 0 (disabled)

Thanks,
Fengguang

2012-10-26 07:20:14

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/26/2012 03:09 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
>> On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>>>> static void shrink_readahead_size_eio(struct file *filp,
>>>> struct file_ra_state *ra)
>>>> {
>>>> - ra->ra_pages /= 4;
>>>> + spin_lock(&filp->f_lock);
>>>> + filp->f_mode |= FMODE_RANDOM;
>>>> + spin_unlock(&filp->f_lock);
>>>>
>>>> As the example in comment above this function, the read maybe still
>>>> sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
>>>> directly.
>>> Yes immediately disabling readahead may hurt IO performance, the
>>> original '/ 4' may perform better when there are only 1-3 IO errors
>>> encountered.
>> Hi Fengguang,
>>
>> Why the number should be 1-3?
> The original behavior is '/= 4' on each error.
>
> After 1 errors, readahead size will be shrinked by 1/4
> After 2 errors, readahead size will be shrinked by 1/16
> After 3 errors, readahead size will be shrinked by 1/64
> After 4 errors, readahead size will be effectively 0 (disabled)

But from the function shrink_readahead_size_eio and its caller filemap_fault
I can't find the behavior you mentioned. How did you figure it out?

Regards,
Chen

>
> Thanks,
> Fengguang
>

2012-10-26 07:36:35

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
> On 10/26/2012 03:09 PM, Fengguang Wu wrote:
> >On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
> >>On 10/26/2012 02:58 PM, Fengguang Wu wrote:
> >>>> static void shrink_readahead_size_eio(struct file *filp,
> >>>> struct file_ra_state *ra)
> >>>> {
> >>>>- ra->ra_pages /= 4;
> >>>>+ spin_lock(&filp->f_lock);
> >>>>+ filp->f_mode |= FMODE_RANDOM;
> >>>>+ spin_unlock(&filp->f_lock);
> >>>>
> >>>>As the example in comment above this function, the read maybe still
> >>>>sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
> >>>>directly.
> >>>Yes immediately disabling readahead may hurt IO performance, the
> >>>original '/ 4' may perform better when there are only 1-3 IO errors
> >>>encountered.
> >>Hi Fengguang,
> >>
> >>Why the number should be 1-3?
> >The original behavior is '/= 4' on each error.
> >
> >After 1 errors, readahead size will be shrinked by 1/4
> >After 2 errors, readahead size will be shrinked by 1/16
> >After 3 errors, readahead size will be shrinked by 1/64
> >After 4 errors, readahead size will be effectively 0 (disabled)
>
> But from function shrink_readahead_size_eio and its caller
> filemap_fault I can't find the behavior you mentioned. How you
> figure out it?

It's this line in shrink_readahead_size_eio():

ra->ra_pages /= 4;

That ra_pages will keep shrinking by a factor of 4 on each error. The only
way to restore it is to reopen the file, or to call POSIX_FADV_SEQUENTIAL.

Thanks,
Fengguang

2012-10-26 07:47:36

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/26/2012 03:36 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
>> On 10/26/2012 03:09 PM, Fengguang Wu wrote:
>>> On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
>>>> On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>>>>>> static void shrink_readahead_size_eio(struct file *filp,
>>>>>> struct file_ra_state *ra)
>>>>>> {
>>>>>> - ra->ra_pages /= 4;
>>>>>> + spin_lock(&filp->f_lock);
>>>>>> + filp->f_mode |= FMODE_RANDOM;
>>>>>> + spin_unlock(&filp->f_lock);
>>>>>>
>>>>>> As the example in comment above this function, the read maybe still
>>>>>> sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
>>>>>> directly.
>>>>> Yes immediately disabling readahead may hurt IO performance, the
>>>>> original '/ 4' may perform better when there are only 1-3 IO errors
>>>>> encountered.
>>>> Hi Fengguang,
>>>>
>>>> Why the number should be 1-3?
>>> The original behavior is '/= 4' on each error.
>>>
>>> After 1 errors, readahead size will be shrinked by 1/4
>>> After 2 errors, readahead size will be shrinked by 1/16
>>> After 3 errors, readahead size will be shrinked by 1/64
>>> After 4 errors, readahead size will be effectively 0 (disabled)
>> But from function shrink_readahead_size_eio and its caller
>> filemap_fault I can't find the behavior you mentioned. How you
>> figure out it?
> It's this line in shrink_readahead_size_eio():
>
> ra->ra_pages /= 4;

Yeah, I mean why will the readahead size be 0 (disabled) after the 4th
error? What's the original value of ra->ra_pages? How can you guarantee
the readahead size will be 0 after the 4th shrink?

Regards,
Chen

>
> That ra_pages will keep shrinking by 4 on each error. The only way to
> restore it is to reopen the file, or POSIX_FADV_SEQUENTIAL.
>
> Thanks,
> Fengguang
>

2012-10-26 08:02:46

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On Fri, Oct 26, 2012 at 03:47:19PM +0800, Ni zhan Chen wrote:
> On 10/26/2012 03:36 PM, Fengguang Wu wrote:
> >On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
> >>On 10/26/2012 03:09 PM, Fengguang Wu wrote:
> >>>On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
> >>>>On 10/26/2012 02:58 PM, Fengguang Wu wrote:
> >>>>>> static void shrink_readahead_size_eio(struct file *filp,
> >>>>>> struct file_ra_state *ra)
> >>>>>> {
> >>>>>>- ra->ra_pages /= 4;
> >>>>>>+ spin_lock(&filp->f_lock);
> >>>>>>+ filp->f_mode |= FMODE_RANDOM;
> >>>>>>+ spin_unlock(&filp->f_lock);
> >>>>>>
> >>>>>>As the example in comment above this function, the read maybe still
> >>>>>>sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
> >>>>>>directly.
> >>>>>Yes immediately disabling readahead may hurt IO performance, the
> >>>>>original '/ 4' may perform better when there are only 1-3 IO errors
> >>>>>encountered.
> >>>>Hi Fengguang,
> >>>>
> >>>>Why the number should be 1-3?
> >>>The original behavior is '/= 4' on each error.
> >>>
> >>>After 1 errors, readahead size will be shrinked by 1/4
> >>>After 2 errors, readahead size will be shrinked by 1/16
> >>>After 3 errors, readahead size will be shrinked by 1/64
> >>>After 4 errors, readahead size will be effectively 0 (disabled)
> >>But from function shrink_readahead_size_eio and its caller
> >>filemap_fault I can't find the behavior you mentioned. How you
> >>figure out it?
> >It's this line in shrink_readahead_size_eio():
> >
> > ra->ra_pages /= 4;
>
> Yeah, I mean why the 4th readahead size will be 0(disabled)? What's
> the original value of ra->ra_pages? How can guarantee the 4th shrink
> readahead size can be 0?

Ah OK, I'm talking about the typical case. The default readahead size is
128k (32 pages), which becomes 0 after being divided by 256. A reasonably
good readahead size for hard disks is 1MB = 256 pages, which also becomes
1 page after 4 errors.

Thanks,
Fengguang

2012-10-26 08:08:31

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

On 10/26/2012 04:02 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:47:19PM +0800, Ni zhan Chen wrote:
>> On 10/26/2012 03:36 PM, Fengguang Wu wrote:
>>> On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
>>>> On 10/26/2012 03:09 PM, Fengguang Wu wrote:
>>>>> On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
>>>>>> On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>>>>>>>> static void shrink_readahead_size_eio(struct file *filp,
>>>>>>>> struct file_ra_state *ra)
>>>>>>>> {
>>>>>>>> - ra->ra_pages /= 4;
>>>>>>>> + spin_lock(&filp->f_lock);
>>>>>>>> + filp->f_mode |= FMODE_RANDOM;
>>>>>>>> + spin_unlock(&filp->f_lock);
>>>>>>>>
>>>>>>>> As the example in comment above this function, the read maybe still
>>>>>>>> sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
>>>>>>>> directly.
>>>>>>> Yes immediately disabling readahead may hurt IO performance, the
>>>>>>> original '/ 4' may perform better when there are only 1-3 IO errors
>>>>>>> encountered.
>>>>>> Hi Fengguang,
>>>>>>
>>>>>> Why the number should be 1-3?
>>>>> The original behavior is '/= 4' on each error.
>>>>>
>>>>> After 1 errors, readahead size will be shrinked by 1/4
>>>>> After 2 errors, readahead size will be shrinked by 1/16
>>>>> After 3 errors, readahead size will be shrinked by 1/64
>>>>> After 4 errors, readahead size will be effectively 0 (disabled)
>>>> But from function shrink_readahead_size_eio and its caller
>>>> filemap_fault I can't find the behavior you mentioned. How you
>>>> figure out it?
>>> It's this line in shrink_readahead_size_eio():
>>>
>>> ra->ra_pages /= 4;
>> Yeah, I mean why the 4th readahead size will be 0(disabled)? What's
>> the original value of ra->ra_pages? How can guarantee the 4th shrink
>> readahead size can be 0?
> Ah OK, I'm talking about the typical case. The default readahead size
> is 128k, which will become 0 after / 256. The reasonable good ra size
> for hard disks is 1MB=256pages, which also becomes 1page after 4 errors.

Then why isn't the default size set to a reasonable size?

Regards,
Chen

>
> Thanks,
> Fengguang
>

2012-10-26 08:13:38

by YingHang Zhu

[permalink] [raw]
Subject: Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

Hi Fengguang,

On Fri, Oct 26, 2012 at 4:02 PM, Fengguang Wu <[email protected]> wrote:
> On Fri, Oct 26, 2012 at 03:47:19PM +0800, Ni zhan Chen wrote:
>> On 10/26/2012 03:36 PM, Fengguang Wu wrote:
>> >On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
>> >>On 10/26/2012 03:09 PM, Fengguang Wu wrote:
>> >>>On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
>> >>>>On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>> >>>>>> static void shrink_readahead_size_eio(struct file *filp,
>> >>>>>> struct file_ra_state *ra)
>> >>>>>> {
>> >>>>>>- ra->ra_pages /= 4;
>> >>>>>>+ spin_lock(&filp->f_lock);
>> >>>>>>+ filp->f_mode |= FMODE_RANDOM;
>> >>>>>>+ spin_unlock(&filp->f_lock);
>> >>>>>>
>> >>>>>>As the example in comment above this function, the read maybe still
>> >>>>>>sequential, and it will waste IO bandwith if modify to FMODE_RANDOM
>> >>>>>>directly.
>> >>>>>Yes immediately disabling readahead may hurt IO performance, the
>> >>>>>original '/ 4' may perform better when there are only 1-3 IO errors
>> >>>>>encountered.
>> >>>>Hi Fengguang,
>> >>>>
>> >>>>Why the number should be 1-3?
>> >>>The original behavior is '/= 4' on each error.
>> >>>
>> >>>After 1 errors, readahead size will be shrinked by 1/4
>> >>>After 2 errors, readahead size will be shrinked by 1/16
>> >>>After 3 errors, readahead size will be shrinked by 1/64
>> >>>After 4 errors, readahead size will be effectively 0 (disabled)
>> >>But from function shrink_readahead_size_eio and its caller
>> >>filemap_fault I can't find the behavior you mentioned. How you
>> >>figure out it?
>> >It's this line in shrink_readahead_size_eio():
>> >
>> > ra->ra_pages /= 4;
>>
>> Yeah, I mean why the 4th readahead size will be 0(disabled)? What's
>> the original value of ra->ra_pages? How can guarantee the 4th shrink
>> readahead size can be 0?
>
> Ah OK, I'm talking about the typical case. The default readahead size
> is 128k, which will become 0 after / 256. The reasonable good ra size
> for hard disks is 1MB=256pages, which also becomes 1page after 4 errors.

How do you feel about my previous mail on error statistics? In fact I
prefer treating these files independently, i.e. doing a check for
FMODE_RANDOM before we change the readahead window, to avoid thrashing it.
If a user application calls fadvise but the media turns out to be
error-prone, after one failed read we know the situation and set the file
to FMODE_RANDOM; this should solve the issue Dave raised.
Did I miss something? Thanks in advance.

>
> Thanks,
> Fengguang



--
Thanks,
Ying Zhu