As to the real reason for stalls on /proc/<pid>/stat, I bet it has nothing
to do with IO except indirectly (the IO is necessary to trigger the
problem, but the _reason_ for the problem lies elsewhere).
And it has everything to do with the fact that the way Linux semaphores
are implemented, a non-blocking process has a HUGE advantage over a
blocking one. Linux kernel semaphores are extremely unfair in that way.
What happens is that some process is getting a lot of VM faults and gets
its VM semaphore. No contention yet. It holds the semaphore over the
IO, and now another process does a "ps".
The "ps" process goes to sleep on the semaphore. So far so good.
The original process releases the semaphore, which increments the count,
and wakes up the process waiting for it. Note that it _wakes_ it, it does
not give the semaphore to it. Big difference.
The process that got woken up will run eventually. Probably not all that
immediately, because the process that woke it (and held the semaphore)
just slept on a page fault too, so it's not likely to immediately
relinquish the CPU.
The original running process comes back faulting again, finds the
semaphore still unlocked (the "ps" process is awake but has not gotten to
run yet), gets the semaphore, and falls asleep on the IO for the next
page.
The "ps" process actually gets to run now, but it's a bit late. The
semaphore is locked again.
Repeat until luck breaks the bad circle.
(This scenario, btw, is much harder to trigger on SMP than on UP. And
it's completely separate from the issue of simple disk bandwidth issues
which can obviously cause no end of stalls on anything that needs the
disk, and which can also happen on SMP).
NOTE! If somebody wants to fix this, the fix should be reasonably simple
but needs to be quite exhaustively checked and double-checked. It's just
too easy to break the semaphores by mistake.
The way to make semaphores more fair is to NOT allow a new process to just
come in immediately and steal the semaphore in __down() if there are other
sleepers. This is most easily accomplished by something along the lines of
the following in __down() in arch/i386/kernel/semaphore.c:
	spin_lock_irq(&semaphore_lock);
	sem->sleepers++;
+
+	/*
+	 * Are there other people waiting for this?
+	 * They get to go first.
+	 */
+	if (sem->sleepers > 1)
+		goto inside;
	for (;;) {
		int sleepers = sem->sleepers;

		/*
		 * Add "everybody else" into it. They aren't
		 * playing, because we own the spinlock.
		 */
		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
			sem->sleepers = 0;
			break;
		}
		sem->sleepers = 1;	/* us - see -1 above */
+inside:
		spin_unlock_irq(&semaphore_lock);

		schedule();
		tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
		spin_lock_irq(&semaphore_lock);
	}
	spin_unlock_irq(&semaphore_lock);
But note that the above is UNTESTED, and also note that from a throughput
(as opposed to latency) standpoint, being unfair tends to be nice.
Anybody want to try out something like the above? (And no, I'm not
applying it to my tree yet. It needs about a hundred pairs of eyes to
verify that there isn't some subtle "lost wakeup" race somewhere).
Linus
On Thu, 9 Nov 2000, Linus Torvalds wrote:
>
>
> As to the real reason for stalls on /proc/<pid>/stat, I bet it has nothing
> to do with IO except indirectly (the IO is necessary to trigger the
> problem, but the _reason_ for the problem lies elsewhere).
>
> And it has everything to do with the fact that the way Linux semaphores
> are implemented, a non-blocking process has a HUGE advantage over a
> blocking one. Linux kernel semaphores are extremely unfair in that way.
>
> What happens is that some process is getting a lot of VM faults and gets
> its VM semaphore. No contention yet. It holds the semaphore over the
> IO, and now another process does a "ps".
>
> The "ps" process goes to sleep on the semaphore. So far so good.
>
> The original process releases the semaphore, which increments the count,
> and wakes up the process waiting for it. Note that it _wakes_ it, it does
> not give the semaphore to it. Big difference.
>
> The process that got woken up will run eventually. Probably not all that
> immediately, because the process that woke it (and held the semaphore)
> just slept on a page fault too, so it's not likely to immediately
> relinquish the CPU.
>
> The original running process comes back faulting again, finds the
> semaphore still unlocked (the "ps" process is awake but has not gotten to
> run yet), gets the semaphore, and falls asleep on the IO for the next
> page.
>
> The "ps" process actually gets to run now, but it's a bit late. The
> semaphore is locked again.
>
> Repeat until luck breaks the bad circle.
>
> (This scenario, btw, is much harder to trigger on SMP than on UP. And
> it's completely separate from the issue of simple disk bandwidth issues
> which can obviously cause no end of stalls on anything that needs the
> disk, and which can also happen on SMP).
Unfortunately, it didn't help in the scenario I'm running.
time make -j30 bzImage:
real 14m19.987s (within stock variance)
user 6m24.480s
sys 1m12.970s
procs                 memory            swap          io      system      cpu
 r  b  w   swpd   free  buff  cache     si     so    bi    bo    in    cs us sy id
31  2  1     12   1432  4440  12660      0     12    27   151   202   848 89 11  0
34  4  1   1908   2584   536   5376    248   1904   602   763   785  4094 63 32  5
13 19  1  64140  67728   604  33784 106500  84612 43625 21683 19080 52168 28 22 50
I understood the above well enough to be very interested in seeing what
happens with flush IO restricted.
-Mike
[try_to_free_pages()->swap_out()/shm_swap().. can fight over who gets
to shrink the best candidate's footprint?]
Thanks!
> I understood the above well enough to be very interested in seeing what
> happens with flush IO restricted.
>
> -Mike
>
> [try_to_free_pages()->swap_out()/shm_swap().. can fight over who gets
> to shrink the best candidate's footprint?]
>
> Thanks!
The results:
pre2+semaphore
real 14m19.987s
user 6m24.480s
sys 1m12.970s
pre2+semaphore+throttle_IO
real 10m13.953s
user 6m19.980s
sys 0m28.960s
pre2+semaphore+throttle_IO extended to refill_inactive()
real 9m46.395s
user 6m23.510s
sys 0m29.420s
pre2+semaphore+throttle_IO + above + tiny little tweak to page_launder()
real 8m56.808s
user 6m23.420s
sys 0m29.430s
Unfortunately, when I try to get past this point I burn trees :-)
-Mike
In article <[email protected]>,
Mike Galbraith <[email protected]> wrote:
>>
>> (This scenario, btw, is much harder to trigger on SMP than on UP. And
>> it's completely separate from the issue of simple disk bandwidth issues
>> which can obviously cause no end of stalls on anything that needs the
>> disk, and which can also happen on SMP).
>
>Unfortunately, it didn't help in the scenario I'm running.
>
>time make -j30 bzImage:
>
>real 14m19.987s (within stock variance)
>user 6m24.480s
>sys 1m12.970s
Note that the above kind of "throughput performance" should not have been
affected, and was not what I was worried about.
>procs                 memory            swap          io      system      cpu
> r  b  w   swpd   free  buff  cache     si     so    bi    bo    in    cs us sy id
>31  2  1     12   1432  4440  12660      0     12    27   151   202   848 89 11  0
>34  4  1   1908   2584   536   5376    248   1904   602   763   785  4094 63 32  5
>13 19  1  64140  67728   604  33784 106500  84612 43625 21683 19080 52168 28 22 50
Looks like there was a big delay in vmstat there - that could easily be
due to simple disk throughput issues..
Does it feel any different under the original load that got the original
complaint? The patch may have just been buggy and ineffective, for all I
know.
Linus
Linus Torvalds wrote:
...
>
> And it has everything to do with the fact that the way Linux semaphores
> are implemented, a non-blocking process has a HUGE advantage over a
> blocking one. Linux kernel semaphores are extremely unfair in that way.
>
...
> The original running process comes back faulting again, finds the
> semaphore still unlocked (the "ps" process is awake but has not gotten to
> run yet), gets the semaphore, and falls asleep on the IO for the next
> page.
>
> The "ps" process actually gets to run now, but it's a bit late. The
> semaphore is locked again.
>
> Repeat until luck breaks the bad circle.
>
But doesn't __down have a fast path coded in assembly? In other words,
it only hits your patched code if there is already contention, which
there isn't in this case, and therefore the bug...?
David Mansfield
In article <[email protected]>,
David Mansfield <[email protected]> wrote:
>Linus Torvalds wrote:
>...
>>
>> And it has everything to do with the fact that the way Linux semaphores
>> are implemented, a non-blocking process has a HUGE advantage over a
>> blocking one. Linux kernel semaphores are extremely unfair in that way.
>>
>...
>> The original running process comes back faulting again, finds the
>> semaphore still unlocked (the "ps" process is awake but has not gotten to
>> run yet), gets the semaphore, and falls asleep on the IO for the next
>> page.
>>
>> The "ps" process actually gets to run now, but it's a bit late. The
>> semaphore is locked again.
>>
>> Repeat until luck breaks the bad circle.
>>
>
>But doesn't __down have a fast path coded in assembly? In other words,
>it only hits your patched code if there is already contention, which
>there isn't in this case, and therefore the bug...?
The __down() case should be hit if there's a waiter, even if that waiter
has not yet been able to pick up the lock (the waiter _will_ have
decremented the count to negative in order to trigger the proper logic
at release time).
But as I mentioned, the pseudo-patch was certainly untested, so
somebody should probably walk through the cases to check that I didn't
miss something.
Linus