2012-08-31 04:25:38

by Andreas Bombe

[permalink] [raw]
Subject: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"

I have recently started to get problems with X simply shutting itself
down and returning to the login screen. In the X logs I find:

> [ 1492.936]
> Fatal server error:
> [ 1492.936] WaitForSomething(): select: Invalid argument

No messages whatsoever is found in the kernel logs. This error happens
randomly without any correlation to user input, but with a high
likelihood (within a few minutes at most) when a video is playing. It
doesn't matter if the video is in Flash in a browser window or in a
video player playing a local file.

With that somewhat easy test I bisected it down to 4e8b14526 "time:
Improve sanity checking of timekeeping inputs". The latest Linus git
(155e36d40) with a revert of the bisected commit does not show the
problem.

Video is Radeon HD 6950 with open source drivers. Xorg version is the
one currently in Debian unstable (xserver-xorg-core: 2:1.12.3.902-1,
xserver-xorg-video-radeon: 1:6.14.4-5, libdrm: 2.4.33-3).

--
Andreas Bombe


2012-08-31 04:26:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"

On Thu, Aug 30, 2012 at 9:05 PM, Andreas Bombe <[email protected]> wrote:
>
> With that somewhat easy test I bisected it down to 4e8b14526 "time:
> Improve sanity checking of timekeeping inputs". The latest Linus git
> (155e36d40) with a revert of the bisected commit does not show the
> problem.

Ok, I guess we need to revert it. Although it might be interesting to
add a WARN_ON_ONCE() for the case of timespec_valid() returning false,
to just see exactly *where* that thing triggers. Could you do that? In
fact, do it with separate WARN_ON_ONCE's for each of the reasons that
function returns false, so that we also see which check it is that
triggers. Ok?

Linus

2012-08-31 17:41:28

by Andreas Bombe

[permalink] [raw]
Subject: Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"

On Thu, Aug 30, 2012 at 09:25:52PM -0700, Linus Torvalds wrote:
> On Thu, Aug 30, 2012 at 9:05 PM, Andreas Bombe <[email protected]> wrote:
> >
> > With that somewhat easy test I bisected it down to 4e8b14526 "time:
> > Improve sanity checking of timekeeping inputs". The latest Linus git
> > (155e36d40) with a revert of the bisected commit does not show the
> > problem.
>
> Ok, I guess we need to revert it. Although it might be interesting to
> add a WARN_ON_ONCE() for the case of timespec_valid() returning false,
> to just see exactly *where* that thing triggers. Could you do that? In
> fact, do it with separate WARN_ON_ONCE's for each of the reasons that
> function returns false, so that we also see which check it is that
> triggers. Ok?

It triggers on ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX).
Looking at some straces (I could have thought of that earlier…) X does
in fact call select with unreasonable timeouts:

| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 20000}) = 1 (in [24], left {0, 19988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 19000}) = 1 (in [24], left {0, 18988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 19000}) = 1 (in [24], left {0, 16804})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 16000}) = 1 (in [24], left {0, 15988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 16000}) = 1 (in [9], left {0, 3649})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 3000}) = 1 (in [24], left {0, 2988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 2000}) = 1 (in [24], left {0, 1988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 2000}) = 0 (Timeout)
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {18446744073709551, 615000}) = -1 EINVAL (Invalid argument)

The time values are actually decreasing from 90 seconds to this. That
seconds value is actually (0ULL - 1) / 1000, so something is
decrementing the timeout beyond zero. I don't see how it could happen
directly in WaitForSomething in the X server sources[1], it's probably
in the BlockHandler callbacks somewhere.

Have to dig deeper to see if that is a long standing issue.

[1] http://cgit.freedesktop.org/xorg/xserver/tree/os/WaitFor.c?h=mpx&id=xorg-server-1.12.3.902#n145

--
Andreas Bombe

2012-08-31 17:43:52

by john stultz

[permalink] [raw]
Subject: Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"

On 08/30/2012 09:05 PM, Andreas Bombe wrote:
> I have recently started to get problems with X simply shutting itself
> down and returning to the login screen. In the X logs I find:
>
>> [ 1492.936]
>> Fatal server error:
>> [ 1492.936] WaitForSomething(): select: Invalid argument
> No messages whatsoever is found in the kernel logs. This error happens
> randomly without any correlation to user input, but with a high
> likelihood (within a few minutes at most) when a video is playing. It
> doesn't matter if the video is in Flash in a browser window or in a
> video player playing a local file.
>
> With that somewhat easy test I bisected it down to 4e8b14526 "time:
> Improve sanity checking of timekeeping inputs". The latest Linus git
> (155e36d40) with a revert of the bisected commit does not show the
> problem.
>
> Video is Radeon HD 6950 with open source drivers. Xorg version is the
> one currently in Debian unstable (xserver-xorg-core: 2:1.12.3.902-1,
> xserver-xorg-video-radeon: 1:6.14.4-5, libdrm: 2.4.33-3).

Thanks so much for bisecting this down!
I'm guessing X is passing crazy large timespecs into select (via
WaitForSomething()) values that are catching on the ktime_t overflow
check in timespec_valid(). Previously these would be clamped to
KTIME_MAX (which basically is infinity) in the timer subsystem before.

So the issue is the patch in question is too strict in its validation.
We want to be strict on things like timekeeping inputs, but for timers
wait to infinity is still valid.

The attached (sorry not inline, on the road) patch should fix this, but
could you verify it? (I'm running my testing concurrently)

Linus: The issue the patch in question addresses has only been reported
from trinity stress testing and a system with a crazy CMOS clock value,
so I'm ok with the revert if you'd prefer that.

thanks
-john


Attachments:
0001-time-Move-ktime_t-overflow-checking-into-timespec_va.patch (3.39 kB)

2012-08-31 17:56:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"

On Fri, Aug 31, 2012 at 10:41 AM, Andreas Bombe <[email protected]> wrote:
>
> It triggers on ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX).
> Looking at some straces (I could have thought of that earlier?) X does
> in fact call select with unreasonable timeouts:
>
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 20000}) = 1 (in [24], left {0, 19988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 19000}) = 1 (in [24], left {0, 18988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 19000}) = 1 (in [24], left {0, 16804})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 16000}) = 1 (in [24], left {0, 15988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 16000}) = 1 (in [9], left {0, 3649})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 3000}) = 1 (in [24], left {0, 2988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 2000}) = 1 (in [24], left {0, 1988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 2000}) = 0 (Timeout)
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {18446744073709551, 615000}) = -1 EINVAL (Invalid argument)

Ok, X clearly *is* doing something wrong, and as you point out, it
looks like X has taken a negative milliseconds value, treated it as
positive, and then done a (unsigned) divide by 1000 on that to create
a large positive seconds value.

It's a fairly "natural" error to make if you keep timeouts as
milliseconds, and then convert those to a timeval or timespec.

At the same time, that value is not really "invalid". It's a
crazy-long timeout, but it's a valid timeout. The fact that it doesn't
happen to fit into our "jiffies" value is an internal kernel
implementation issue that we shouldn't be exposing to user space as an
error. Turning the crazy-long timeout into "forever" (or
alternatively, into the longest timeout we support) should be the
right fix.

Andreas, can you check whether John Stultz patch works for you? It
looks like it should.

Also, letting the X people know that they are doing something crazy
sounds like a good idea. Looking at that strace, I'm also struct by
the fact that always giving "256" as the number of fds is going to be
a performance thing. It would be much better if X actually knew how
many bits it really needs, as that means the kernel isn't going to
need to look at the bits. So opening a bugzilla for this sounds like a
good idea.

Linus

2012-09-01 02:02:58

by Andreas Bombe

[permalink] [raw]
Subject: Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"

On Fri, Aug 31, 2012 at 10:43:42AM -0700, John Stultz wrote:
> On 08/30/2012 09:05 PM, Andreas Bombe wrote:
> >With that somewhat easy test I bisected it down to 4e8b14526 "time:
> >Improve sanity checking of timekeeping inputs". The latest Linus git
> >(155e36d40) with a revert of the bisected commit does not show the
> >problem.
>
> Thanks so much for bisecting this down!
> I'm guessing X is passing crazy large timespecs into select (via
> WaitForSomething()) values that are catching on the ktime_t overflow
> check in timespec_valid(). Previously these would be clamped to
> KTIME_MAX (which basically is infinity) in the timer subsystem
> before.
>
> So the issue is the patch in question is too strict in its
> validation. We want to be strict on things like timekeeping inputs,
> but for timers wait to infinity is still valid.
>
> The attached (sorry not inline, on the road) patch should fix this,
> but could you verify it? (I'm running my testing concurrently)

I'm running it now and it's looking good. I did the video test again and
confirmed with strace that X was doing the giant timeout in select
again, but this time without any errors.

--
Andreas Bombe