2019-11-25 21:48:57

by Kenneth Crudup

[permalink] [raw]
Subject: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace


FYI. Don't know if the fault lies with these subsystems or not, but I know
it's been a goal to keep things working. In both cases, reverting this commit
fixes both these issues. If you need additional information from me, please
let me know.

First issue manifests itself as a timeout in KDE's upower daemon:

----
Nov 25 13:18:51 hp-x360n systemd[1]: upower.service: Start operation timed out. Terminating.
Nov 25 13:18:51 hp-x360n systemd[1]: upower.service: Main process exited, code=killed, status=15/TERM
Nov 25 13:18:51 hp-x360n systemd[1]: upower.service: Failed with result 'timeout'.
Nov 25 13:18:51 hp-x360n systemd[1]: Failed to start Daemon for power management.
Nov 25 13:18:51 hp-x360n systemd[1]: upower.service: Service RestartSec=100ms expired, scheduling restart.
Nov 25 13:18:51 hp-x360n systemd[1]: upower.service: Scheduled restart job, restart counter is at 9.
Nov 25 13:18:51 hp-x360n systemd[1]: Stopped Daemon for power management.
Nov 25 13:18:51 hp-x360n systemd[1]: Starting Daemon for power management...
----

Second issue:

----
Nov 25 13:19:22 hp-x360n kernel: [ 861.213047] INFO: task vmware-vmblock-:1179 blocked for more than 737 seconds.
Nov 25 13:19:22 hp-x360n kernel: [ 861.213053] Tainted: G U 5.4.0-Kenny+ #8
Nov 25 13:19:22 hp-x360n kernel: [ 861.213054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 25 13:19:22 hp-x360n kernel: [ 861.213056] vmware-vmblock- D 0 1179 1 0x00000000
Nov 25 13:19:22 hp-x360n kernel: [ 861.213060] Call Trace:
Nov 25 13:19:22 hp-x360n kernel: [ 861.213068] ? __schedule+0x2b8/0x580
Nov 25 13:19:22 hp-x360n kernel: [ 861.213071] schedule+0x36/0xc0
Nov 25 13:19:22 hp-x360n kernel: [ 861.213074] schedule_preempt_disabled+0x11/0x20
Nov 25 13:19:22 hp-x360n kernel: [ 861.213077] __mutex_lock.isra.11+0x2f0/0x500
Nov 25 13:19:22 hp-x360n kernel: [ 861.213082] __fdget_pos+0x42/0x50
Nov 25 13:19:22 hp-x360n kernel: [ 861.213085] do_writev+0x2f/0x110
Nov 25 13:19:22 hp-x360n kernel: [ 861.213088] do_syscall_64+0x46/0x170
Nov 25 13:19:22 hp-x360n kernel: [ 861.213091] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 25 13:19:22 hp-x360n kernel: [ 861.213094] RIP: 0033:0x7f4c6da1d527
Nov 25 13:19:22 hp-x360n kernel: [ 861.213098] Code: Bad RIP value.
Nov 25 13:19:22 hp-x360n kernel: [ 861.213100] RSP: 002b:00007f4c6d6aebc0 EFLAGS: 00000297 ORIG_RAX: 0000000000000014
Nov 25 13:19:22 hp-x360n kernel: [ 861.213102] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4c6da1d527
Nov 25 13:19:22 hp-x360n kernel: [ 861.213103] RDX: 0000000000000002 RSI: 00007f4c6d6aec50 RDI: 0000000000000003
Nov 25 13:19:22 hp-x360n kernel: [ 861.213104] RBP: 00007f4c6d6aec50 R08: 0000000000000001 R09: 00007f4c6ce8c700
Nov 25 13:19:22 hp-x360n kernel: [ 861.213106] R10: 00007f4c68000080 R11: 0000000000000297 R12: 0000000000000002
Nov 25 13:19:22 hp-x360n kernel: [ 861.213107] R13: 0000000000020000 R14: 0000000000000000 R15: 0000000000000000
----

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Silicon Valley


2019-11-25 22:57:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace

On Mon, Nov 25, 2019 at 1:30 PM Kenneth R. Crudup <[email protected]> wrote:
>
> FYI. Don't know if the fault lies with these subsystems or not, but I know
> it's been a goal to keep things working. In both cases, reverting this commit
> fixes both these issues. If you need additional information from me, please
> let me know.

Do you have any way to figure out what file descriptor we have in
those two cases?

It's almost certainly trivial to fix - if I were to know just what
kind of file we're blocking on that just needs to be marked
FMODE_STREAM.

The commit itself is also trivial enough to just revert if we can't
figure it out, but that's why I applied it first thing in the merge
window - to see the problems quickly and hopefully get them fixed. It
obviously showed no problems on my machines, but I have neither vmware
nor KDE upower...

Anyway, here's a TOTALLYT UNTESTED patch that may help pinpoint which
thing it is that causes issues.

It might also be so noisy as to be useless, I didn't think it through
a lot. Mind trying it out if you can't immediately see "ahh, vmware fd
#N is a file of type XYZ"?

Linus


Attachments:
patch.diff (1.06 kB)

2019-11-26 02:13:57

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace


On Mon, 25 Nov 2019, Linus Torvalds wrote:

> So to get rid of at least _that_ endless noise, add this to the patch:

Thanks for the update; I'll get to test this later tonight (US Pacific time).

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Silicon Valley

2019-11-26 02:16:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace

On Mon, Nov 25, 2019 at 2:55 PM Linus Torvalds
<[email protected]> wrote:
>
> Anyway, here's a TOTALLY UNTESTED patch that may help pinpoint which
> thing it is that causes issues.
>
> It might also be so noisy as to be useless, I didn't think it through
> a lot.

Yeah, I ended up testing it between merges, and it points out a lot of
files. Too many to usefully narrow down which one then might cause
problems for you.

The main cause of that is that it will complain for _every_ O_PATH
open, because those use the 'empty_fops' and don't actually allow any
operations at all (neither read/write nor llseek). We could have
marked those O_STREAM, since they can't be used for any seeking
operations.

So to get rid of at least _that_ endless noise, add this to the patch:

--- a/fs/open.c
+++ b/fs/open.c
@@ -748,7 +748,7 @@ static int do_dentry_open(struct file *f,
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

if (unlikely(f->f_flags & O_PATH)) {
- f->f_mode = FMODE_PATH | FMODE_OPENED;
+ f->f_mode = FMODE_PATH | FMODE_OPENED | FMODE_STREAM;
f->f_op = &empty_fops;
return 0;
}

the above is entirely whitespace-damaged, but you can see what it's
doing. That should cut down on all the noise generated by O_PATH
opens.

It might still be a bit noisy even with the above, but I think it will
at least be better.

Linus

2019-11-26 03:27:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace

On Mon, Nov 25, 2019 at 5:58 PM Linus Torvalds
<[email protected]> wrote:
>
> It might still be a bit noisy even with the above, but I think it will
> at least be better.

Yeah, I think I see what's going on.

I for some reason entirely missed the tty case. Oops. That was just
stupid of me, I should have thought of it.

There are other things that trigger the new informational line in that
hacky patch, and they _might_ matter, but I suspect it's the tty and
sound cases that causes the worst problems.

I suspect Kirill Smelkov even might have mentioned the tty case at one
point, and I just spaced out.

There are other things too that trigger that debug check, like the
sound file descriptors, and they might well matter too.

Anyway, I think the thing to do (for now) is to just say "character
devices are FMODE_STREAM files if they have no llseek operations".
That should take care of both tty's and the sound devices.

You can certainly have a character device that can do llseek, but it
sounds like a reasonable base rule.

Of course, this may fix the f_pos locking issue, but replace it with a
"oops, the character device driver tried to look at *pos anyway", and
that will give you a nice OOPS instead.

So this patch might just replace the failure mode with another failure
mode instead. At which point I think I'd have to revert that "get rid
of FMODE_ATOMIC_POS" trial balloon, but let's see if this patch solves
your problem and is sufficient..

I'd suggest using it _together_ with that "pr_info()" debug patch I
sent, to see what else might be going on..

Linus


Attachments:
patch.diff (756.00 B)

2019-11-26 03:59:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace

On Mon, Nov 25, 2019 at 7:21 PM Linus Torvalds
<[email protected]> wrote:
>
> Anyway, I think the thing to do (for now) is to just say "character
> devices are FMODE_STREAM files if they have no llseek operations".
> That should take care of both tty's and the sound devices.

A cleaner thing might be to add an explicit field to 'struct
file_operations' to show that it is a stream operation. That would
make it much easier for drivers to say "mark me as a stream" without
having to change their open routines (not all cases might even have
open routines).

The file operations already have a history of this kind of "this is
what I support" flags with the "mmap_supported_flags" mask, which is a
different (but at the same time somewhat similar) set of "this is the
set of operations I support" thing. FMODE_STREAM wouldn't be that
different.

Anyway, I was clearly too optimistic as to how painless this would be.
I tested it on my desktop and laptop, but they have very similar
setups other than their form factor, so the fact that neither showed
any issues was perhaps not all that meaningful.

Linus

2019-11-26 04:05:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace

On Mon, Nov 25, 2019 at 7:21 PM Linus Torvalds
<[email protected]> wrote:
>
> Of course, this may fix the f_pos locking issue, but replace it with a
> "oops, the character device driver tried to look at *pos anyway", and
> that will give you a nice OOPS instead.

Confirmed. At least the x86 firmware update code uses
"simple_read_from_buffer()", which does use the file position, but
doesn't actually allow llseek().

So no, "it's a character device no llseek" does not mean that it acts
as a pure streaming device with no file position, and we'd actually
have to mark individual drivers (either by adding 'stream_open()' in
their open routines, or adding the extra field to 'struct
file_operations' that I mentioned).

I think I'll have to revert that trial commit. I'll give it another
day in case somebody has a better idea, but it looks like it's too
early to do that nice cleanup as things are now.

Linus

2019-11-26 10:17:22

by Kirill Smelkov

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace

On Mon, Nov 25, 2019 at 07:39:34PM -0800, Linus Torvalds wrote:
> On Mon, Nov 25, 2019 at 7:21 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > Of course, this may fix the f_pos locking issue, but replace it with a
> > "oops, the character device driver tried to look at *pos anyway", and
> > that will give you a nice OOPS instead.
>
> Confirmed. At least the x86 firmware update code uses
> "simple_read_from_buffer()", which does use the file position, but
> doesn't actually allow llseek().
>
> So no, "it's a character device no llseek" does not mean that it acts
> as a pure streaming device with no file position, and we'd actually
> have to mark individual drivers (either by adding 'stream_open()' in
> their open routines, or adding the extra field to 'struct
> file_operations' that I mentioned).
>
> I think I'll have to revert that trial commit. I'll give it another
> day in case somebody has a better idea, but it looks like it's too
> early to do that nice cleanup as things are now.
>
> Linus

Linus, thanks for heads up and for pushing FMODE_STREAM conversion
forward.

I still believe the most practical approach to do it is via automation -
via extending and teaching stream_open.cocci to find other places that
actually don't use ppos / ki->ki_pos at all and at least warn that
stream_open() should be used for that file. Else it will be a lot of
pain.

Of course we can manually convert the core cases like pipes and sockets.
But those things are easy to do. On the other hand by manually
converting them, we lesser the extent to where the automation is applied
and tested, and that leaves us with just many small corner cases that
will be left in potentially racy/deadlocked state until someone actually
hit that problem and cares to debug it to understand what is going on.
Those will be many different places and so likely many different people,
and since debugging concurrency issues is not easy, it will likely last for
many years. Of course things like KCSAN helps a lot, but, if I
understand correctly, they do not provide full coverage for the whole
kernel, especially they are likely not providing coverage for leaf
kernel bits in drivers.

Of course, I might be missing something, and there is a way to e.g.
combine automation+pain approaches. Then that would be good to make
combined progress.

I still keep in mind extending stream_open.cocci myself, but for now I
cannot delve into it before finishing my transactional filesystem...

Kirill


P.S. even though I was put into cc there, somehow I did not received any
notification email for commits d8e464ecc17b (vfs: mark pipes and sockets as
stream-like file descriptors) and 0be0ee71816b (vfs: properly and
reliably lock f_pos in fdget_pos()).


P.P.S completely untested, but looks sane:

---- 8< ----
From 634e1af8775aa27799b4879fdb527e4a3b8b31ef Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <[email protected]>
Date: Tue, 26 Nov 2019 12:40:56 +0300
Subject: [PATCH] socket: No need to check for ki_pos != 0 anymore

Commit d8e464ecc17b (vfs: mark pipes and sockets as stream-like file
descriptors) converted sockets to use stream_open() which forbids llseek
and sets FMODE_STREAM for which VFS ksys_read/ksys_write routines make
sure not to change file->f_pos at all and always pass either ppos=NULL,
or ki->ki_pos=0 to fops->read/write, because there is no file position
for stream-like files.

Drop unnecessary checks.

Signed-off-by: Kirill Smelkov <[email protected]>
---
net/socket.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 17bc1eee198a..d8c583c755ae 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -959,9 +959,6 @@ static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to)
if (file->f_flags & O_NONBLOCK)
msg.msg_flags = MSG_DONTWAIT;

- if (iocb->ki_pos != 0)
- return -ESPIPE;
-
if (!iov_iter_count(to)) /* Match SYS5 behaviour */
return 0;

@@ -978,9 +975,6 @@ static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from)
.msg_iocb = iocb};
ssize_t res;

- if (iocb->ki_pos != 0)
- return -ESPIPE;
-
if (file->f_flags & O_NONBLOCK)
msg.msg_flags = MSG_DONTWAIT;

--
2.20.1

2019-11-26 13:03:18

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace


On Tue, 26 Nov 2019, Kirill Smelkov wrote:

> P.S. even though I was put into cc there, somehow I did not received any
> notification email for commits d8e464ecc17b (vfs: mark pipes and sockets as
> stream-like file descriptors) and 0be0ee71816b (vfs: properly and
> reliably lock f_pos in fdget_pos()).

That's my fault; the CC: list for those commits was pretty long and I was
worried about it appearing like SPAMming anyone who'd signed off on it, so
I'd punted and sent it to Linus and the LKML only.

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Silicon Valley

2019-11-26 20:06:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace

On Mon, Nov 25, 2019 at 7:39 PM Linus Torvalds
<[email protected]> wrote:
>
> I think I'll have to revert that trial commit. I'll give it another
> day in case somebody has a better idea, but it looks like it's too
> early to do that nice cleanup as things are now.

I've reverted it for now,

I don't want this problem to cause people to report known bugs during
the merge window. The fact that I saw no issues obviously is just a
matter of my workloads being too simple.

Linus

2019-11-27 08:31:47

by Kirill Smelkov

[permalink] [raw]
Subject: Re: Commit 0be0ee71 ("fs: properly and reliably lock f_pos in fdget_pos()") breaking userspace

On Tue, Nov 26, 2019 at 12:03:36PM -0800, Linus Torvalds wrote:
> On Mon, Nov 25, 2019 at 7:39 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > I think I'll have to revert that trial commit. I'll give it another
> > day in case somebody has a better idea, but it looks like it's too
> > early to do that nice cleanup as things are now.
>
> I've reverted it for now,
>
> I don't want this problem to cause people to report known bugs during
> the merge window. The fact that I saw no issues obviously is just a
> matter of my workloads being too simple.

I see, ok; let's move on incrementally. What is the fate of small
net/socket.c cleanup patch? Should I resend it to netdev or something?


On Tue, Nov 26, 2019 at 04:46:39AM -0800, Kenneth R. Crudup wrote:
>
> On Tue, 26 Nov 2019, Kirill Smelkov wrote:
>
> > P.S. even though I was put into cc there, somehow I did not received any
> > notification email for commits d8e464ecc17b (vfs: mark pipes and sockets as
> > stream-like file descriptors) and 0be0ee71816b (vfs: properly and
> > reliably lock f_pos in fdget_pos()).
>
> That's my fault; the CC: list for those commits was pretty long and I was
> worried about it appearing like SPAMming anyone who'd signed off on it, so
> I'd punted and sent it to Linus and the LKML only.

I did not mean it was your fault. I was trying to say that there was no
automatic notification just as those commits were pushed to master. And
neither there is for 2be7d348fe92 (Revert "vfs: properly and reliably
lock f_pos in fdget_pos()").

Kirill