Some file systems (including ext4, xfs, ramfs ...) have the following
problem as I've described in the commit message of the 1/4 patch.
The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
removed almost all locks in llseek() including SEEK_END. It based on the
idea that write() updates size atomically. But in fact, write() can be
divided into two or more parts in generic_perform_write() when pos
straddles over the PAGE_SIZE, which results in updating size multiple
times in one write(). It means that llseek() can see the size being
updated during write().
This race changes behavior of some applications. 'tail' is one of those
applications. It reads range [pos, pos_end] where pos_end is obtained
via llseek() SEEK_END. Sometimes, a read line could be broken.
reproducer:
$ while true; do echo 123456 >> out; done
$ while true; do tail out | grep -v 123456 ; done
example output(take 30 secs):
12345
1
1234
1
12
1234
Note: Some file systems which indivisually implements llseek() and hold
inode mutex lock in it are not affected. ex:) btrfs, ocfs2
Patch 1: re-implements inode lock for SEEK_END in generic_file_llseek()
Patch 2 to 4: implement inode lock for SEEK_END in each file systems
Eiichi Tsukata (4):
vfs: fix race between llseek SEEK_END and write
ext4: fix race between llseek SEEK_END and write
f2fs: fix race between llseek SEEK_END and write
overlayfs: fix race between llseek SEEK_END and write
fs/btrfs/file.c | 2 +-
fs/ext4/file.c | 10 ++++++++++
fs/f2fs/file.c | 6 +-----
fs/fuse/file.c | 5 +++--
fs/gfs2/file.c | 3 ++-
fs/overlayfs/file.c | 23 ++++++++++++++++++++---
fs/read_write.c | 37 ++++++++++++++++++++++++++++++++++---
include/linux/fs.h | 2 ++
8 files changed, 73 insertions(+), 15 deletions(-)
--
2.19.1
This patch itself seems to be just a cleanup but with the
commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write")
it fixes race.
Signed-off-by: Eiichi Tsukata <[email protected]>
---
fs/f2fs/file.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 88b124677189..14a923607eff 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -447,15 +447,11 @@ static loff_t f2fs_seek_block(struct file *file, loff_t offset, int whence)
static loff_t f2fs_llseek(struct file *file, loff_t offset, int whence)
{
- struct inode *inode = file->f_mapping->host;
- loff_t maxbytes = inode->i_sb->s_maxbytes;
-
switch (whence) {
case SEEK_SET:
case SEEK_CUR:
case SEEK_END:
- return generic_file_llseek_size(file, offset, whence,
- maxbytes, i_size_read(inode));
+ return generic_file_llseek(file, offset, whence);
case SEEK_DATA:
case SEEK_HOLE:
if (offset < 0)
--
2.19.1
On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> Some file systems (including ext4, xfs, ramfs ...) have the following
> problem as I've described in the commit message of the 1/4 patch.
>
> The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
> removed almost all locks in llseek() including SEEK_END. It based on the
> idea that write() updates size atomically. But in fact, write() can be
> divided into two or more parts in generic_perform_write() when pos
> straddles over the PAGE_SIZE, which results in updating size multiple
> times in one write(). It means that llseek() can see the size being
> updated during write().
And? Who has ever promised anything that insane? write(2) can take an arbitrary
amount of time; another process doing lseek() on independently opened descriptor
is *not* going to wait for that (e.g. page-in of the buffer being written, which
just happens to be mmapped from a file on NFS over RFC1149 link).
The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
removed almost all locks in llseek() including SEEK_END. It based on the
idea that write() updates size atomically. But in fact, write() can be
divided into two or more parts in generic_perform_write() when pos
straddles over the PAGE_SIZE, which results in updating size multiple
times in one write(). It means that llseek() can see the size being
updated during write().
This race changes behavior of some applications. 'tail' is one of those
applications. It reads range [pos, pos_end] where pos_end is obtained
via llseek() SEEK_END. Sometimes, a read line could be broken.
reproducer:
$ while true; do echo 123456 >> out; done
$ while true; do tail out | grep -v 123456 ; done
example output(take 30 secs):
12345
1
1234
1
12
1234
This patch re-introduces generic_file_llseek_unlocked() and implements a
lock for SEEK_END/DATA/HOLE in generic_file_llseek(). I replaced all
generic_file_llseek() callers with _unlocked() if they are called with a
inode lock.
All file systems which call generic_file_llseek_size() directly
are fixed in the later commits.
Fixes: ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
Signed-off-by: Eiichi Tsukata <[email protected]>
---
fs/btrfs/file.c | 2 +-
fs/fuse/file.c | 5 +++--
fs/gfs2/file.c | 3 ++-
fs/read_write.c | 37 ++++++++++++++++++++++++++++++++++---
include/linux/fs.h | 2 ++
5 files changed, 42 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a3c22e16509b..ec932fa0f8a9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3256,7 +3256,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
switch (whence) {
case SEEK_END:
case SEEK_CUR:
- offset = generic_file_llseek(file, offset, whence);
+ offset = generic_file_llseek_unlocked(file, offset, whence);
goto out;
case SEEK_DATA:
case SEEK_HOLE:
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b52f9baaa3e7..e220b848929b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2336,13 +2336,14 @@ static loff_t fuse_file_llseek(struct file *file, loff_t offset, int whence)
case SEEK_SET:
case SEEK_CUR:
/* No i_mutex protection necessary for SEEK_CUR and SEEK_SET */
- retval = generic_file_llseek(file, offset, whence);
+ retval = generic_file_llseek_unlocked(file, offset, whence);
break;
case SEEK_END:
inode_lock(inode);
retval = fuse_update_attributes(inode, file);
if (!retval)
- retval = generic_file_llseek(file, offset, whence);
+ retval = generic_file_llseek_unlocked(file, offset,
+ whence);
inode_unlock(inode);
break;
case SEEK_HOLE:
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 45a17b770d97..171df9550c27 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -66,7 +66,8 @@ static loff_t gfs2_llseek(struct file *file, loff_t offset, int whence)
error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY,
&i_gh);
if (!error) {
- error = generic_file_llseek(file, offset, whence);
+ error = generic_file_llseek_unlocked(file, offset,
+ whence);
gfs2_glock_dq_uninit(&i_gh);
}
break;
diff --git a/fs/read_write.c b/fs/read_write.c
index bfcb4ced5664..859dbac5b2f6 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -131,6 +131,24 @@ generic_file_llseek_size(struct file *file, loff_t offset, int whence,
}
EXPORT_SYMBOL(generic_file_llseek_size);
+/**
+ * generic_file_llseek_unlocked - lockless generic llseek implementation
+ * @file: file structure to seek on
+ * @offset: file offset to seek to
+ * @whence: type of seek
+ *
+ */
+loff_t generic_file_llseek_unlocked(struct file *file, loff_t offset,
+ int whence)
+{
+ struct inode *inode = file->f_mapping->host;
+
+ return generic_file_llseek_size(file, offset, whence,
+ inode->i_sb->s_maxbytes,
+ i_size_read(inode));
+}
+EXPORT_SYMBOL(generic_file_llseek_unlocked);
+
/**
* generic_file_llseek - generic llseek implementation for regular files
* @file: file structure to seek on
@@ -144,10 +162,23 @@ EXPORT_SYMBOL(generic_file_llseek_size);
loff_t generic_file_llseek(struct file *file, loff_t offset, int whence)
{
struct inode *inode = file->f_mapping->host;
+ loff_t retval;
- return generic_file_llseek_size(file, offset, whence,
- inode->i_sb->s_maxbytes,
- i_size_read(inode));
+ switch (whence) {
+ default:
+ return generic_file_llseek_unlocked(file, offset, whence);
+ case SEEK_END:
+ case SEEK_DATA:
+ case SEEK_HOLE:
+ /*
+ * protects against inode size race with write so that llseek
+ * doesn't see inode size being updated in write.
+ */
+ inode_lock_shared(inode);
+ retval = generic_file_llseek_unlocked(file, offset, whence);
+ inode_unlock_shared(inode);
+ return retval;
+ }
}
EXPORT_SYMBOL(generic_file_llseek);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c95c0807471f..ee35d7c013cb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3054,6 +3054,8 @@ extern loff_t noop_llseek(struct file *file, loff_t offset, int whence);
extern loff_t no_llseek(struct file *file, loff_t offset, int whence);
extern loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize);
extern loff_t generic_file_llseek(struct file *file, loff_t offset, int whence);
+extern loff_t generic_file_llseek_unlocked(struct file *file, loff_t offset,
+ int whence);
extern loff_t generic_file_llseek_size(struct file *file, loff_t offset,
int whence, loff_t maxsize, loff_t eof);
extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
--
2.19.1
2018年11月21日(水) 13:54 Al Viro <[email protected]>:
>
> On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > Some file systems (including ext4, xfs, ramfs ...) have the following
> > problem as I've described in the commit message of the 1/4 patch.
> >
> > The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
> > removed almost all locks in llseek() including SEEK_END. It based on the
> > idea that write() updates size atomically. But in fact, write() can be
> > divided into two or more parts in generic_perform_write() when pos
> > straddles over the PAGE_SIZE, which results in updating size multiple
> > times in one write(). It means that llseek() can see the size being
> > updated during write().
>
> And? Who has ever promised anything that insane? write(2) can take an arbitrary
> amount of time; another process doing lseek() on independently opened descriptor
> is *not* going to wait for that (e.g. page-in of the buffer being written, which
> just happens to be mmapped from a file on NFS over RFC1149 link).
Thanks.
The lock I added in NFS was nothing but slow down lseek() because a file size is
updated atomically. Even `spin_lock(&inode->i_lock)` is unnecessary.
I'll fix the commit message which only refers to specific local file
systems that use
generic_perform_write() and remove unnecessary locks in some
distributed file systems
(e.g. nfs, cifs, or more) by replacing generic_file_llseek() with
generic_file_llseek_unlocked()
so that `tail` don't have to wait for avian carriers.
Implement individual lock for SEEK_END for ext4 which directly calls
generic_file_llseek_size().
Signed-off-by: Eiichi Tsukata <[email protected]>
---
fs/ext4/file.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 69d65d49837b..6479f3066043 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -477,6 +477,16 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
default:
return generic_file_llseek_size(file, offset, whence,
maxbytes, i_size_read(inode));
+ case SEEK_END:
+ /*
+ * protects against inode size race with write so that llseek
+ * doesn't see inode size being updated in generic_perform_write
+ */
+ inode_lock_shared(inode);
+ offset = generic_file_llseek_size(file, offset, whence,
+ maxbytes, i_size_read(inode));
+ inode_unlock_shared(inode);
+ return offset;
case SEEK_HOLE:
inode_lock_shared(inode);
offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
--
2.19.1
On Thu, Nov 22, 2018 at 02:40:50PM +0900, Eiichi Tsukata wrote:
> 2018年11月21日(水) 13:54 Al Viro <[email protected]>:
> >
> > On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > > Some file systems (including ext4, xfs, ramfs ...) have the following
> > > problem as I've described in the commit message of the 1/4 patch.
> > >
> > > The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
> > > removed almost all locks in llseek() including SEEK_END. It based on the
> > > idea that write() updates size atomically. But in fact, write() can be
> > > divided into two or more parts in generic_perform_write() when pos
> > > straddles over the PAGE_SIZE, which results in updating size multiple
> > > times in one write(). It means that llseek() can see the size being
> > > updated during write().
> >
> > And? Who has ever promised anything that insane? write(2) can take an arbitrary
> > amount of time; another process doing lseek() on independently opened descriptor
> > is *not* going to wait for that (e.g. page-in of the buffer being written, which
> > just happens to be mmapped from a file on NFS over RFC1149 link).
>
> Thanks.
>
> The lock I added in NFS was nothing but slow down lseek() because a file size is
> updated atomically. Even `spin_lock(&inode->i_lock)` is unnecessary.
>
> I'll fix the commit message which only refers to specific local file
> systems that use
> generic_perform_write() and remove unnecessary locks in some
> distributed file systems
> (e.g. nfs, cifs, or more) by replacing generic_file_llseek() with
> generic_file_llseek_unlocked()
> so that `tail` don't have to wait for avian carriers.
fd = open("/mnt/sloooow/foo", O_RDONLY);
p = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);
local_fd = open("/tmp/foo", O_RDWR);
write(local_fd, p, 8192);
and there you go - extremely slow write(2). To file on a local filesystem.
Can you show me where does POSIX/SuS/whatever it's called these days promise
that kind of atomicity? TBH, I would be very surprised if any Unix promised
to have file size updated no more than once per write(2). Any userland code
that relies on such property is, as minimum, non-portable and I would argue
that it is outright broken.
Note, BTW, that your example depends upon rather non-obvious details of echo
implementation, starting with the bufferization logics used by particular
shell. And AFAICS ash(1) ends up with possibility of more than one write(2)
anyway - get SIGSTOP delivered, followed by SIGCONT and you've got just that.
Transparently for echo(1). I'm fairly sure that the same holds for anything
that isn't outright broken - write(2) *IS* interruptible (must be, for obvious
reasons) and that pair of signals will lead to short write if it comes at the
right time. The only thing userland can do (and does) is to issue further
write(2) on anything that hadn't been written the last time around.
2018年11月22日(木) 16:06 Al Viro <[email protected]>:
>
> Can you show me where does POSIX/SuS/whatever it's called these days promise
> that kind of atomicity?
No. I couldn't found it.
That's why I previously posted RFC Patch:
https://marc.info/?t=154237277900001&r=1&w=2
I wasn't sure this is a bug in the kernel or not.
> that kind of atomicity? TBH, I would be very surprised if any Unix promised
> to have file size updated no more than once per write(2). Any userland code
> that relies on such property is, as minimum, non-portable and I would argue
> that it is outright broken.
Thanks. Now It's clear. It is not a bug in the kernel, but in
userspace if `tail` assumes such
atomicity.
> Note, BTW, that your example depends upon rather non-obvious details of echo
> implementation, starting with the bufferization logics used by particular
> shell. And AFAICS ash(1) ends up with possibility of more than one write(2)
I've never imagined such a difference in echo implementation, thanks.
Implement individual lock for SEEK_END for overlayfs which directly calls
generic_file_llseek_size().
Signed-off-by: Eiichi Tsukata <[email protected]>
---
fs/overlayfs/file.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 84dd957efa24..57bc6538eea8 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -146,10 +146,27 @@ static int ovl_release(struct inode *inode, struct file *file)
static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
{
struct inode *realinode = ovl_inode_real(file_inode(file));
+ loff_t ret;
- return generic_file_llseek_size(file, offset, whence,
- realinode->i_sb->s_maxbytes,
- i_size_read(realinode));
+ switch (whence) {
+ default:
+ return generic_file_llseek_size(file, offset, whence,
+ realinode->i_sb->s_maxbytes,
+ i_size_read(realinode));
+ case SEEK_END:
+ case SEEK_DATA:
+ case SEEK_HOLE:
+ /*
+ * protects against inode size race with write so that llseek
+ * doesn't see inode size being updated in individual fs write
+ */
+ inode_lock(realinode);
+ ret = generic_file_llseek_size(file, offset, whence,
+ realinode->i_sb->s_maxbytes,
+ i_size_read(realinode));
+ inode_unlock(realinode);
+ return ret;
+ }
}
static void ovl_file_accessed(struct file *file)
--
2.19.1
2018年11月21日(水) 18:23 Christoph Hellwig <[email protected]>:
>
> On Wed, Nov 21, 2018 at 11:43:59AM +0900, Eiichi Tsukata wrote:
> > This patch itself seems to be just a cleanup but with the
> > commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write")
> > it fixes race.
>
> Please move this patch to the beginning of the series and replace
> the commit log with something like the one below. Note that your
> commit id is different from the one that will appear once applied
> upstream, so the aboe isn't too helpful.
>
> ---
> f2fs: use generic_file_llseek
>
> f2fs always passes inode->i_sb->s_maxbytes to generic_file_llseek_size,
> and thus should simply use generic_file_llseek. For now this is a just
> a cleanup, but it will allow f2fs to pick up a race fix in
> generic_file_llseek for free.
Thanks, I'll fix it in v2.
On Wed, Nov 21, 2018 at 11:43:59AM +0900, Eiichi Tsukata wrote:
> This patch itself seems to be just a cleanup but with the
> commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write")
> it fixes race.
Please move this patch to the beginning of the series and replace
the commit log with something like the one below. Note that your
commit id is different from the one that will appear once applied
upstream, so the aboe isn't too helpful.
---
f2fs: use generic_file_llseek
f2fs always passes inode->i_sb->s_maxbytes to generic_file_llseek_size,
and thus should simply use generic_file_llseek. For now this is a just
a cleanup, but it will allow f2fs to pick up a race fix in
generic_file_llseek for free.