2009-11-18 16:07:37

by Jan Blunck

[permalink] [raw]
Subject: [PATCH 1/2] BKL: Remove BKL from default_llseek()

Using the BKL in llseek() does not protect the inode's i_size from
modification since the i_size is protected by a seqlock nowadays. Since
default_llseek() is already using the i_size_read() wrapper it is not the
BKL which is serializing the access here.
The access to file->f_pos is not protected by the BKL either since its
access in vfs_write()/vfs_read() is not protected by any lock. If the BKL
is not protecting anything here it can clearly get removed.

Signed-off-by: Jan Blunck <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Frédéric Weisbecker <[email protected]>
Cc: John Kacur <[email protected]>
---
fs/read_write.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 3ac2898..0e491cc 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -107,7 +107,6 @@ loff_t default_llseek(struct file *file, loff_t offset, int origin)
{
loff_t retval;

- lock_kernel();
switch (origin) {
case SEEK_END:
offset += i_size_read(file->f_path.dentry->d_inode);
@@ -128,7 +127,6 @@ loff_t default_llseek(struct file *file, loff_t offset, int origin)
retval = offset;
}
out:
- unlock_kernel();
return retval;
}
EXPORT_SYMBOL(default_llseek);
--
1.6.4.2


2009-11-18 16:07:46

by Jan Blunck

[permalink] [raw]
Subject: [PATCH 2/2] BKL: Update documentation on llseek(  )

The inode's i_size is not protected by the big kernel lock. Therefore it
does not make sense to recommend taking the BKL in filesystems llseek
operations. Instead it should use the inode's mutex or use just use
i_size_read() instead. Add a note that this is not protecting file->f_pos.

Signed-off-by: Jan Blunck <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Frédéric Weisbecker <[email protected]>
Cc: John Kacur <[email protected]>
---
Documentation/filesystems/Locking | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 18b9d0c..25159d4 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -429,8 +429,9 @@ check_flags: no
implementations. If your fs is not using generic_file_llseek, you
need to acquire and release the appropriate locks in your ->llseek().
For many filesystems, it is probably safe to acquire the inode
-semaphore. Note some filesystems (i.e. remote ones) provide no
-protection for i_size so you will need to use the BKL.
+mutex or just to use i_size_read() instead.
+Note: this does not protect the file->f_pos against concurrent modifications
+since this is something the userspace has to take care about.

Note: ext2_release() was *the* source of contention on fs-intensive
loads and dropping BKL on ->release() helps to get rid of that (we still
--
1.6.4.2

2009-11-18 17:14:13

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

> Using the BKL in llseek() does not protect the inode's i_size from
> modification since the i_size is protected by a seqlock nowadays. Since
> default_llseek() is already using the i_size_read() wrapper it is not the
> BKL which is serializing the access here.
> The access to file->f_pos is not protected by the BKL either since its
> access in vfs_write()/vfs_read() is not protected by any lock. If the BKL
> is not protecting anything here it can clearly get removed.

No. Your logic is flawed

The BKL is protected something here - it protects the change of offset
with respect to other BKL users within drivers. The question is what if
anything in any other driver code depends upon the BKL and uses it to
protect f_pos. Probably very little if anything but a grep for f_pos
through the drivers might not be a bad idea before assuming this. Very
few touch f_pos except in their own llseek method.

2009-11-18 17:27:31

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

Alan Cox wrote:
> > Using the BKL in llseek() does not protect the inode's i_size from
> > modification since the i_size is protected by a seqlock nowadays. Since
> > default_llseek() is already using the i_size_read() wrapper it is not the
> > BKL which is serializing the access here.
> > The access to file->f_pos is not protected by the BKL either since its
> > access in vfs_write()/vfs_read() is not protected by any lock. If the BKL
> > is not protecting anything here it can clearly get removed.
>
> No. Your logic is flawed
>
> The BKL is protected something here - it protects the change of offset
> with respect to other BKL users within drivers. The question is what if
> anything in any other driver code depends upon the BKL and uses it to
> protect f_pos. Probably very little if anything but a grep for f_pos
> through the drivers might not be a bad idea before assuming this. Very
> few touch f_pos except in their own llseek method.

Of course, drivers shouldn't be using f_pos outside their llseek
method, as they should all behave the same with pread/pwrite as with
llseek+read/write.

Is that mistaken?

-- Jamie

2009-11-18 17:34:03

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

On Wednesday 18 November 2009, Jamie Lokier wrote:
> Alan Cox wrote:
> > > Using the BKL in llseek() does not protect the inode's i_size from
> > > modification since the i_size is protected by a seqlock nowadays. Since
> > > default_llseek() is already using the i_size_read() wrapper it is not the
> > > BKL which is serializing the access here.
> > > The access to file->f_pos is not protected by the BKL either since its
> > > access in vfs_write()/vfs_read() is not protected by any lock. If the BKL
> > > is not protecting anything here it can clearly get removed.
> >
> > No. Your logic is flawed
> >
> > The BKL is protected something here - it protects the change of offset
> > with respect to other BKL users within drivers. The question is what if
> > anything in any other driver code depends upon the BKL and uses it to
> > protect f_pos. Probably very little if anything but a grep for f_pos
> > through the drivers might not be a bad idea before assuming this. Very
> > few touch f_pos except in their own llseek method.
>
> Of course, drivers shouldn't be using f_pos outside their llseek
> method, as they should all behave the same with pread/pwrite as with
> llseek+read/write.
>
> Is that mistaken?

There are drivers touching f_pos in ioctl() methods, which is vaguely
reasonable. There are also driver touching it in their read()/write()
methods, which has no effect whatsoever.

I started grepping through the kernel trying to find any instances
of the first case that uses the BKL, but I only found three instances
of the second case and got heavily demotivated by that.

Arnd <><

2009-11-18 17:35:52

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

Am Mittwoch, 18. November 2009 18:27:30 schrieb Jamie Lokier:
> > No. Your logic is flawed
> >
> > The BKL is protected something here - it protects the change of offset
> > with respect to other BKL users within drivers. The question is what if
> > anything in any other driver code depends upon the BKL and uses it to
> > protect f_pos. Probably very little if anything but a grep for f_pos
> > through the drivers might not be a bad idea before assuming this. Very
> > few touch f_pos except in their own llseek method.
>
> Of course, drivers shouldn't be using f_pos outside their llseek
> method, as they should all behave the same with pread/pwrite as with
> llseek+read/write.

Might not a driver update f_pos after read/write?

Regards
Oliver

2009-11-18 17:50:05

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

Oliver Neukum wrote:
> Am Mittwoch, 18. November 2009 18:27:30 schrieb Jamie Lokier:
> > > No. Your logic is flawed
> > >
> > > The BKL is protected something here - it protects the change of offset
> > > with respect to other BKL users within drivers. The question is what if
> > > anything in any other driver code depends upon the BKL and uses it to
> > > protect f_pos. Probably very little if anything but a grep for f_pos
> > > through the drivers might not be a bad idea before assuming this. Very
> > > few touch f_pos except in their own llseek method.
> >
> > Of course, drivers shouldn't be using f_pos outside their llseek
> > method, as they should all behave the same with pread/pwrite as with
> > llseek+read/write.
>
> Might not a driver update f_pos after read/write?

It could indirectly, through *ppos.

There should be no direct accesses to f_pos outseek llseek. If there
are still, those might indicate driver bugs. (I'm not 100% sure about
this - hence asking).

Drivers used to update f_pos indirectly through *ppos, and for this,
Alan's observation about BKL protecting the value from changing does apply.

But nowadays, even that doesn't happen. sys_read() and sys_write()
make a copy of f_pos using file_pos_read(), so drivers cannot see the
value change during the call - except for their own change.

I find myself wondering why the VFS isn't responsible for the position
update instead of the driver... Would it be a valid cleanup to move
it from the driver to VFS?

-- Jamie

2009-11-18 17:53:08

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

On Wed, Nov 18, Alan Cox wrote:

> > Using the BKL in llseek() does not protect the inode's i_size from
> > modification since the i_size is protected by a seqlock nowadays. Since
> > default_llseek() is already using the i_size_read() wrapper it is not the
> > BKL which is serializing the access here.
> > The access to file->f_pos is not protected by the BKL either since its
> > access in vfs_write()/vfs_read() is not protected by any lock. If the BKL
> > is not protecting anything here it can clearly get removed.
>
> No. Your logic is flawed
>
> The BKL is protected something here - it protects the change of offset
> with respect to other BKL users within drivers. The question is what if
> anything in any other driver code depends upon the BKL and uses it to
> protect f_pos. Probably very little if anything but a grep for f_pos
> through the drivers might not be a bad idea before assuming this. Very
> few touch f_pos except in their own llseek method.

As I said, f_pos is changed without holding BKL in the VFS already. Therefore
even if the driver tries to protect f_pos by holding the BKL it is racing
against concurrent read()/write() anyway on f_pos.

Regards,
Jan

--
Jan Blunck <[email protected]>

2009-11-18 17:54:30

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

> > Of course, drivers shouldn't be using f_pos outside their llseek
> > method, as they should all behave the same with pread/pwrite as with
> > llseek+read/write.
>
> Might not a driver update f_pos after read/write?

The driver updates the passed pointer to an offset. It has no idea how to
lock that and that is isolated and handled higher up the stack.

There are no obvious reasons to peer at f_pos. I've so far checked all of
drivers/char and that is clean (as well as being the most likely suspect
for old code). A sweep of the other driver subsystems should be
enough to find any offenders

Alan

2009-11-18 17:55:38

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

On Wed, Nov 18, Oliver Neukum wrote:

> Am Mittwoch, 18. November 2009 18:27:30 schrieb Jamie Lokier:
> > > No. Your logic is flawed
> > >
> > > The BKL is protected something here - it protects the change of offset
> > > with respect to other BKL users within drivers. The question is what if
> > > anything in any other driver code depends upon the BKL and uses it to
> > > protect f_pos. Probably very little if anything but a grep for f_pos
> > > through the drivers might not be a bad idea before assuming this. Very
> > > few touch f_pos except in their own llseek method.
> >
> > Of course, drivers shouldn't be using f_pos outside their llseek
> > method, as they should all behave the same with pread/pwrite as with
> > llseek+read/write.
>
> Might not a driver update f_pos after read/write?

Actually the VFS is doing that for you in read/write syscall. So anything you
change in your driver gets corrected (or overwritten anyway).

Regards,
Jan

--
Jan Blunck <[email protected]>

2009-11-18 17:59:24

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

> There are drivers touching f_pos in ioctl() methods, which is vaguely
> reasonable. There are also driver touching it in their read()/write()
> methods, which has no effect whatsoever.

osst is the obvious offender. The ioctl ones like mtdchar seem to be
broken but they have their own locking (or lack thereof) in there own
lseek so its an internal proble,.

>
> I started grepping through the kernel trying to find any instances
> of the first case that uses the BKL, but I only found three instances
> of the second case and got heavily demotivated by that.

osst probably wants to get an && BROKEN

2009-11-18 18:07:01

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

On Wednesday 18 November 2009, Alan Cox wrote:
> > There are drivers touching f_pos in ioctl() methods, which is vaguely
> > reasonable. There are also driver touching it in their read()/write()
> > methods, which has no effect whatsoever.
>
> osst is the obvious offender. The ioctl ones like mtdchar seem to be
> broken but they have their own locking (or lack thereof) in there own
> lseek so its an internal proble,.
>

The other ones I found before giving up are drivers/s390/char/tape_char.c
and drivers/sbus/char/flash.c.

Arnd <><

2009-11-18 18:13:13

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

Complete list of offenders I can find

Stuff that just doesn't look like it can work and ought to be marked
BROKEN as it must have been non-working for ages.

drivers/scsi/osst
drives/s390/char/s390tape
arch/frv/kernel/sysctl.c
arch/cris/arch-v10/drivers/eeprom.c

Stuff that looks trivial to salvage and works if you don't pread/write I
imagine

drivers/sbus/char/flash.c

Bits in staging

rt2860
vt6656

plus some obvious mega abusers that will need cleaning up anyway in that
area.

2009-11-18 18:14:49

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

O> But nowadays, even that doesn't happen. sys_read() and sys_write()
> make a copy of f_pos using file_pos_read(), so drivers cannot see the
> value change during the call - except for their own change.
>
> I find myself wondering why the VFS isn't responsible for the position
> update instead of the driver... Would it be a valid cleanup to move
> it from the driver to VFS?

And how would you adjust it. Not all devices have a bytes read == offset
relationship. The VFS doesn't know enough.

2009-11-18 18:15:50

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

> As I said, f_pos is changed without holding BKL in the VFS already. Therefore

IFF the VFS paths that change it can be invoked in parallel.

> even if the driver tries to protect f_pos by holding the BKL it is racing
> against concurrent read()/write() anyway on f_pos.

It may simply be a matter of internal consistency. Anyway I've now made a
list of the offenders so we can simply fix/delete them and continue.

2009-11-18 19:18:19

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

On Wednesday 18 November 2009, Alan Cox wrote:
> Stuff that looks trivial to salvage and works if you don't pread/write I
> imagine
>
> drivers/sbus/char/flash.c

It also breaks if you do consecutive read() operations without an lseek
between them, because read() does not update *pos. It should work if you
read the whole thing with a single read() operation though, like
the other drivers you mentioned.

Arnd <><

2009-11-18 20:02:25

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

On Wed, Nov 18, Alan Cox wrote:

> > As I said, f_pos is changed without holding BKL in the VFS already. Therefore
>
> IFF the VFS paths that change it can be invoked in parallel.
>
> > even if the driver tries to protect f_pos by holding the BKL it is racing
> > against concurrent read()/write() anyway on f_pos.
>
> It may simply be a matter of internal consistency. Anyway I've now made a
> list of the offenders so we can simply fix/delete them and continue.

Right. Thanks for that btw. I'll work on fixing them and followup with patches
tomorrow.

Thanks,
Jan

--
Jan Blunck <[email protected]>

2009-11-19 02:40:24

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

Alan Cox wrote:
> O> But nowadays, even that doesn't happen. sys_read() and sys_write()
> > make a copy of f_pos using file_pos_read(), so drivers cannot see the
> > value change during the call - except for their own change.
> >
> > I find myself wondering why the VFS isn't responsible for the position
> > update instead of the driver... Would it be a valid cleanup to move
> > it from the driver to VFS?
>
> And how would you adjust it. Not all devices have a bytes read == offset
> relationship. The VFS doesn't know enough.

That was implicit in my question: Are there any seekable devices where
bytes read != offset delta, and if yes, is that correct behaviour, a
bug, or a silly interface that should go away?

-- Jamie

2009-11-19 10:12:41

by Alan

[permalink] [raw]
Subject: Re: [PATCH 1/2] BKL: Remove BKL from default_llseek()

> That was implicit in my question: Are there any seekable devices where
> bytes read != offset delta, and if yes, is that correct behaviour, a
> bug, or a silly interface that should go away?

It's neither a bug nor silly, although some of our users of it are a bit
strange.

Consider a block based device. It makes complete sense then that a short
read (eg user space passing a small buffer) causes the seek position to
move on one block. Ditto a block based system with EOF markers might do
that (eg some tape systems).

In the weirdness (but API) category we have things like the MSR driver
which returns an entire MSR for each byte offset.

Alan