2011-04-27 18:13:58

by Andrea Righi

[permalink] [raw]
Subject: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

Introduce a new fadvise flag to drop page cache pages of a single
filesystem.

At the moment it is possible to drop page cache pages via
/proc/sys/vm/drop_pagecache or via posix_fadvise(POSIX_FADV_DONTNEED).

The first method drops the whole page cache while the second can be used
to drop page cache pages of a single file descriptor. However, there's
not a simple way to drop all the pages of a filesystem (we could scan
all the file descriptors and use posix_fadvise(POSIX_FADV_DONTNEED), but
this solution obviously doesn't scale well).

This functionality requires root privilege to avoid potential DoS in the
system (i.e., a hard loop of posix_fadvise(POSIX_FADV_DONTNEED_FS) on
the root filesystem).

A practical example:

# ls -lh /mnt/sda/zero /mnt/sdb/zero
-rw-r--r-- 1 root root 16M 2011-04-20 10:20 /mnt/sda/zero
-rw-r--r-- 1 root root 16M 2011-04-20 10:20 /mnt/sdb/zero

$ grep ^Cached /proc/meminfo
Cached: 5660 kB
$ md5sum /mnt/sda/zero /mnt/sdb/zero
2c7ab85a893283e98c931e9511add182 /mnt/sda/zero
2c7ab85a893283e98c931e9511add182 /mnt/sdb/zero
$ grep ^Cached /proc/meminfo
Cached: 38544 kB
$ sudo ./drop-pagecache /mnt/sda/
$ grep ^Cached /proc/meminfo
Cached: 22440 kB
$ sudo ./drop-pagecache /mnt/sdb/
$ grep ^Cached /proc/meminfo
Cached: 5056 kB

A previous RFC about this topic can be found here:
http://marc.info/?l=linux-kernel&m=130385374902114&w=2

ChangeLog (v1 -> v2):

* use the same value for POSIX_FADV_DONTNEED_FS on all architectures
* check CAP_SYS_ADMIN capability instead of checking the EUID value

Signed-off-by: Andrea Righi <[email protected]>
---
fs/drop_caches.c | 2 +-
include/linux/fadvise.h | 1 +
include/linux/mm.h | 2 ++
mm/fadvise.c | 7 +++++++
4 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 98b77c8..59d6caa 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -13,7 +13,7 @@
/* A global variable is a bit ugly, but it keeps the code simple */
int sysctl_drop_caches;

-static void drop_pagecache_sb(struct super_block *sb, void *unused)
+void drop_pagecache_sb(struct super_block *sb, void *unused)
{
struct inode *inode, *toput_inode = NULL;

diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h
index e8e7471..ab39117 100644
--- a/include/linux/fadvise.h
+++ b/include/linux/fadvise.h
@@ -17,5 +17,6 @@
#define POSIX_FADV_DONTNEED 4 /* Don't need these pages. */
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif
+#define POSIX_FADV_DONTNEED_FS 8 /* Don't need these filesystem pages. */

#endif /* FADVISE_H_INCLUDED */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 692dbae..004cdbc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -21,6 +21,7 @@ struct anon_vma;
struct file_ra_state;
struct user_struct;
struct writeback_control;
+struct super_block;

#ifndef CONFIG_DISCONTIGMEM /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
@@ -1602,6 +1603,7 @@ int in_gate_area_no_mm(unsigned long addr);
#define in_gate_area(mm, addr) ({(void)mm; in_gate_area_no_mm(addr);})
#endif /* __HAVE_ARCH_GATE_AREA */

+void drop_pagecache_sb(struct super_block *sb, void *unused);
int drop_caches_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 8d723c9..15155e7 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -57,6 +57,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
case POSIX_FADV_WILLNEED:
case POSIX_FADV_NOREUSE:
case POSIX_FADV_DONTNEED:
+ case POSIX_FADV_DONTNEED_FS:
/* no bad return value, but ignore advice */
break;
default:
@@ -127,6 +128,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
invalidate_mapping_pages(mapping, start_index,
end_index);
break;
+ case POSIX_FADV_DONTNEED_FS:
+ if (capable(CAP_SYS_ADMIN))
+ drop_pagecache_sb(file->f_dentry->d_sb, NULL);
+ else
+ ret = -EPERM;
+ break;
default:
ret = -EINVAL;
}
--
1.7.1


2011-04-27 18:25:40

by Mike Frysinger

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, Apr 27, 2011 at 14:13, Andrea Righi wrote:
> Introduce a new fadvise flag to drop page cache pages of a single
> filesystem.
>
> At the moment it is possible to drop page cache pages via
> /proc/sys/vm/drop_pagecache or via posix_fadvise(POSIX_FADV_DONTNEED).
>
> The first method drops the whole page cache while the second can be used
> to drop page cache pages of a single file descriptor. However, there's
> not a simple way to drop all the pages of a filesystem (we could scan
> all the file descriptors and use posix_fadvise(POSIX_FADV_DONTNEED), but
> this solution obviously doesn't scale well).

what if you open the mount point and use POSIX_FADV_DONTNEED on that
dir handle ? if you required write access for that level, it'd also
implicitly take care of the permission issue. but maybe this is just
trying to fit existing code in the wrong way.
-mike

2011-04-27 18:33:12

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, Apr 27, 2011 at 08:13:47PM +0200, Andrea Righi wrote:
> @@ -127,6 +128,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> invalidate_mapping_pages(mapping, start_index,
> end_index);
> break;
> + case POSIX_FADV_DONTNEED_FS:
> + if (capable(CAP_SYS_ADMIN))
> + drop_pagecache_sb(file->f_dentry->d_sb, NULL);
> + else
> + ret = -EPERM;
> + break;
> default:
> ret = -EINVAL;
> }

Mmm ... what if I open /dev/sdxyz and call fadvise() on it? I think
you end up flushing /dev's page cache entries, instead of the filesystem
which is on /dev/sdxyz.

If I understand correctly, you want mapping->host->i_sb instead of
file->f_dentry->d_sb.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2011-04-27 18:40:17

by Mike Frysinger

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, Apr 27, 2011 at 14:33, Matthew Wilcox wrote:
> On Wed, Apr 27, 2011 at 08:13:47PM +0200, Andrea Righi wrote:
>> @@ -127,6 +128,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>                       invalidate_mapping_pages(mapping, start_index,
>>                                               end_index);
>>               break;
>> +     case POSIX_FADV_DONTNEED_FS:
>> +             if (capable(CAP_SYS_ADMIN))
>> +                     drop_pagecache_sb(file->f_dentry->d_sb, NULL);
>> +             else
>> +                     ret = -EPERM;
>> +             break;
>>       default:
>>               ret = -EINVAL;
>>       }
>
> Mmm ... what if I open /dev/sdxyz and call fadvise() on it?  I think
> you end up flushing /dev's page cache entries, instead of the filesystem
> which is on /dev/sdxyz.

i was thinking of that, but was trying to come up with situations
where there might not have a node to work on. fs's in a file go
through loop devs, dm/lvm have ones created, and flash fs's still have
a mtd block. how about network based fs's ? how you going to signal
dropping of pages for nfs or cifs or fuse ones ?
-mike

2011-04-27 18:48:00

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, Apr 27, 2011 at 02:39:53PM -0400, Mike Frysinger wrote:
> > Mmm ... what if I open /dev/sdxyz and call fadvise() on it? ??I think
> > you end up flushing /dev's page cache entries, instead of the filesystem
> > which is on /dev/sdxyz.
>
> i was thinking of that, but was trying to come up with situations
> where there might not have a node to work on. fs's in a file go
> through loop devs, dm/lvm have ones created, and flash fs's still have
> a mtd block. how about network based fs's ? how you going to signal
> dropping of pages for nfs or cifs or fuse ones ?

For a regular file, mapping->host->i_sb points to the superblock this
file is on. For a device, mapping->host->i_sb points to the superblock
corresponding to this device. So it's always what we want.

(hm, what about block devices not currently mounted? do we need to check
whether mapping->host is NULL?)

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2011-04-27 18:49:48

by Mike Frysinger

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, Apr 27, 2011 at 14:47, Matthew Wilcox wrote:
> On Wed, Apr 27, 2011 at 02:39:53PM -0400, Mike Frysinger wrote:
>> > Mmm ... what if I open /dev/sdxyz and call fadvise() on it? ??I think
>> > you end up flushing /dev's page cache entries, instead of the filesystem
>> > which is on /dev/sdxyz.
>>
>> i was thinking of that, but was trying to come up with situations
>> where there might not have a node to work on. ?fs's in a file go
>> through loop devs, dm/lvm have ones created, and flash fs's still have
>> a mtd block. ?how about network based fs's ? ?how you going to signal
>> dropping of pages for nfs or cifs or fuse ones ?
>
> For a regular file, mapping->host->i_sb points to the superblock this
> file is on. ?For a device, mapping->host->i_sb points to the superblock
> corresponding to this device. ?So it's always what we want.

sorry, wrong question. i misread your original post (suggesting we
should be calling fadvise on the block instead of an arbitrary dir
handle).
-mike

2011-04-28 09:35:48

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, Apr 27, 2011 at 02:25:17PM -0400, Mike Frysinger wrote:
> On Wed, Apr 27, 2011 at 14:13, Andrea Righi wrote:
> > Introduce a new fadvise flag to drop page cache pages of a single
> > filesystem.
> >
> > At the moment it is possible to drop page cache pages via
> > /proc/sys/vm/drop_pagecache or via posix_fadvise(POSIX_FADV_DONTNEED).
> >
> > The first method drops the whole page cache while the second can be used
> > to drop page cache pages of a single file descriptor. However, there's
> > not a simple way to drop all the pages of a filesystem (we could scan
> > all the file descriptors and use posix_fadvise(POSIX_FADV_DONTNEED), but
> > this solution obviously doesn't scale well).
>
> what if you open the mount point and use POSIX_FADV_DONTNEED on that
> dir handle ? if you required write access for that level, it'd also
> implicitly take care of the permission issue. but maybe this is just
> trying to fit existing code in the wrong way.
> -mike

I still prefer the capability check. I think it's much more simple from
the userspace point of view to be able to specify any file or directory
instead of being forced to retrieve the mountpoint.

However, an advantage with the approach you're proposing is that a
non-privileged user can drop the page cache of a filesystem if it has
write permission in the root of that filesystem.

mmmh.. I don't see big problems also with the interface you propose, if
you all think it's better I can implement this in the next version.

Thanks,
-Andrea

2011-04-28 09:35:49

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, Apr 27, 2011 at 12:33:08PM -0600, Matthew Wilcox wrote:
> On Wed, Apr 27, 2011 at 08:13:47PM +0200, Andrea Righi wrote:
> > @@ -127,6 +128,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> > invalidate_mapping_pages(mapping, start_index,
> > end_index);
> > break;
> > + case POSIX_FADV_DONTNEED_FS:
> > + if (capable(CAP_SYS_ADMIN))
> > + drop_pagecache_sb(file->f_dentry->d_sb, NULL);
> > + else
> > + ret = -EPERM;
> > + break;
> > default:
> > ret = -EINVAL;
> > }
>
> Mmm ... what if I open /dev/sdxyz and call fadvise() on it? I think
> you end up flushing /dev's page cache entries, instead of the filesystem
> which is on /dev/sdxyz.
>
> If I understand correctly, you want mapping->host->i_sb instead of
> file->f_dentry->d_sb.

I did some tests, but I don't get the expected behaviour. In all cases
both if I use mapping->host->i_sb or file->f_dentry->d_sb when I call
fadvise() on any block device all the blockdev pages are dropped
("Buffers" from /proc/meminfo), but page cache pages are not touched:

# df -hT /
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda1 ext4 29G 20G 7.4G 73% /
# grep "^Cached\|Buffers" /proc/meminfo
Buffers: 79772 kB
Cached: 32440 kB
# sudo drop-pagecache /dev/sda1
# grep "^Cached\|Buffers" /proc/meminfo
Buffers: 228 kB
Cached: 32440 kB
# sudo drop-pagecache /
# grep "^Cached\|Buffers" /proc/meminfo
Buffers: 228 kB
Cached: 4884 kB

Thanks,
-Andrea

2011-05-04 21:45:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, 27 Apr 2011 20:13:47 +0200
Andrea Righi <[email protected]> wrote:

> Introduce a new fadvise flag to drop page cache pages of a single
> filesystem.

I'm going to object to this on general principle. We shouldn't toss
new features into the kernel API just because we can. Each feature
needs a really good argument to justify its inclusion, and I don't
believe that the case has been made for this feature.

IOW, who's going to die if we don't merge it?

2011-05-04 22:09:30

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v2] fadvise: introduce POSIX_FADV_DONTNEED_FS

On Wed, May 04, 2011 at 02:44:11PM -0700, Andrew Morton wrote:
> On Wed, 27 Apr 2011 20:13:47 +0200
> Andrea Righi <[email protected]> wrote:
>
> > Introduce a new fadvise flag to drop page cache pages of a single
> > filesystem.
>
> I'm going to object to this on general principle. We shouldn't toss
> new features into the kernel API just because we can. Each feature
> needs a really good argument to justify its inclusion, and I don't
> believe that the case has been made for this feature.
>
> IOW, who's going to die if we don't merge it?

OK. :)

-Andrea