2004-09-15 15:18:10

by Ingo Molnar

Subject: [patch] remove the BKL (Big Kernel Lock), this time for real


the attached patch is a new approach to get rid of Linux's Big Kernel
Lock as we know it today.

The trick is to turn the BKL spinlock + depth counter into a special
type of cpu-affine, recursive semaphore, which gets released by
schedule() but not by preempt_schedule().

this gives the following advantages:

- BKL critical sections are fully preemptable with this patch applied

- there is no time wasted spinning on the BKL if there's BKL contention

- no changes are needed for lock_kernel() users - the new
semaphore-based approach is fully compatible with the BKL.

code using lock_kernel()/unlock_kernel() will see very similar semantics
to what it got from the BKL, so correctness should be fully preserved.
Per-CPU assumptions still work, and locking exclusion and lock-recursion
still work the same way they did with the BKL.
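
For readers without the attachment, a minimal sketch of the core idea
(simplified from the inline version posted later in this thread; the real
patch also handles the !SMP/!PREEMPT case and the schedule() integration):

#include <linux/sched.h>
#include <asm/semaphore.h>

/*
 * Sketch only: the BKL spinlock becomes a semaphore, and the per-task
 * lock_depth counter provides the recursive lock/unlock semantics.
 */
static DECLARE_MUTEX(kernel_sem);

void lock_kernel(void)
{
	struct task_struct *task = current;
	int depth = task->lock_depth + 1;

	if (likely(!depth))
		/* outermost acquisition: actually take the semaphore */
		down(&kernel_sem);

	task->lock_depth = depth;
}

void unlock_kernel(void)
{
	struct task_struct *task = current;

	BUG_ON(task->lock_depth < 0);

	if (likely(--task->lock_depth < 0))
		/* outermost release: drop the semaphore */
		up(&kernel_sem);
}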

non-BKL code sees no overhead from this approach (other than the
slightly smaller code due to the uninlining of the BKL APIs).

(the patch is against vanilla 2.6.9-rc2. I have tested it on x86
UP+PREEMPT, UP+!PREEMPT, SMP+PREEMPT, SMP+!PREEMPT and x64 SMP+PREEMPT.)

Ingo


Attachments:
remove-bkl.patch (9.70 kB)

2004-09-15 15:41:35

by Linus Torvalds

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real



On Wed, 15 Sep 2004, Ingo Molnar wrote:
>
> the attached patch is a new approach to get rid of Linux's Big Kernel
> Lock as we know it today.

I really think this is wrong.

Maybe not from a conceptual standpoint, but that implementation with the
scheduler doing "reacquire_kernel_lock()" and doing a down() there is just
wrong, wrong, wrong.

If we're going to do a down() and block immediately after being scheduled,
I don't think we should have been picked in the first place.

Yeah, yeah, you have all that magic to not recurse by setting lock-depth
negative before doing the down(), but it still feels fundamentally wrong
to me. There's also the question whether this actually _helps_ anything,
since it may well just replace the spinning with lots of new scheduler
activity.

And you make schedule() a lot more expensive for kernel lock holders by
copying the CPU map. You may have tested it on a machine where the CPU map
is just a single word, but what about the big machines?

Spinlocks really _are_ cheaper. Wouldn't it be nice to just continue
removing kernel lock users and keeping a very _simple_ kernel lock for
legacy issues?

In other words, I'd _really_ like to see some serious numbers for this.

Linus

2004-09-15 15:46:54

by Andi Kleen

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

Ingo Molnar <[email protected]> writes:

> the attached patch is a new approach to get rid of Linux's Big Kernel
> Lock as we know it today.

Interesting approach. Did you measure what it does to context
switch rates? Usually adding semaphores tends to increase them
a lot.

One minor comment only:
Please CSE "current" manually. It generates much better code
on some architectures because the compiler cannot do it for you.
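
Concretely, the manual CSE amounts to reading "current" once into a local
variable and reusing it (a hypothetical before/after sketch; do_something()
is a made-up helper, not code from the patch):

#include <linux/sched.h>

void do_something(struct task_struct *task);	/* made-up helper */

/* Before: every use of "current" re-evaluates the current-task lookup,
 * which the compiler cannot always eliminate by itself. */
void touch_lock_depth_before(void)
{
	current->lock_depth++;
	do_something(current);
	current->lock_depth--;
}

/* After: manual CSE - load "current" once, then reuse the local. */
void touch_lock_depth_after(void)
{
	struct task_struct *task = current;

	task->lock_depth++;
	do_something(task);
	task->lock_depth--;
}

(Ingo's follow-up patch later in the thread does exactly this in
lock_kernel()/unlock_kernel().)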

-Andi

2004-09-15 15:57:05

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


* Linus Torvalds <[email protected]> wrote:

> > the attached patch is a new approach to get rid of Linux's Big Kernel
> > Lock as we know it today.
>
> I really think this is wrong.
>
> Maybe not from a conceptual standpoint, but that implementation with
> the scheduler doing "reaquire_kernel_lock()" and doing a down() there
> is just wrong, wrong, wrong.
>
> If we're going to do a down() and block immediately after being
> scheduled, I don't think we should have been picked in the first
> place.

Agreed, and that codepath is only there for correctness - the semaphore
code will make sure most of the time that this doesn't happen. It's the
same as the need_resched check - it doesn't happen in 99.9% of the cases,
but when it happens it must happen.

> Yeah, yeah, you have all that magic to not recurse by setting
> lock-depth negative before doing the down(), but it still feels
> fundamentally wrong to me. There's also the question whether this
> actually _helps_ anything, since it may well just replace the spinning
> with lots of new scheduler activity.

Rare activity that still runs under the BKL (e.g. mounting, or ioctls)
can introduce many milliseconds of scheduling delays that hurt
latencies. None of this is really performance-sensitive, as the BKL is
almost always used for some big old piece of code. Anything that is
frequently taken/dropped we've replaced with proper spinlocks already. So
what's left is the code for which 1) everyone hurts most from it not being
a real semaphore, and 2) no _user_ or maintainer of that code cares about
the BKL because it's not performance-critical.

> And you make schedule() a lot more expensive for kernel lock holders
> by copying the CPU map. [...]

We don't really care about BKL _users_, other than correctness. We've got
cpusets for reducing the bitmap overhead.

> [...] You may have tested it on a machine where the CPU map is just a
> single word, but what about the big machines?

And it's not that bad - if it's, say, 512 CPUs then _BKL users_ copy 64
bytes. I don't think the 512-CPU guys will complain about this patch. If
there are any frequent BKL users on such big boxes then they need
serious and quick fixing anyway, because they are bouncing a global
spinlock madly across 512 CPUs, 32 cross-connects and 4 backplanes! The
64-byte copy within the CPU-local task structure really pales in
comparison...

> Spinlocks really _are_ cheaper. Wouldn't it be nice to just continue
> removing kernel lock users and keeping a very _simple_ kernel lock for
> legacy issues?

Yes, but progress in this area seems to have slowed down, and in the
meantime people are hurting from the latencies introduced by the BKL.
Who cares if some rare big chunk of code runs under a semaphore, as long
as it's preemptable?

A global spinlock isn't all that much cheaper than a semaphore, even if
it's untaken or lightly taken. If it's heavily taken then a semaphore
wins. There is clearly an interim area where a spinlock will win - but
only if the spinlock is well-localized to some resource or CPU - and this
is absolutely not true for the BKL, which is as global as it gets.

> In other words, I'd _really_ like to see some serious numbers for
> this.

Fully agreed about that too!

Ingo

2004-09-15 16:00:23

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


* Andi Kleen <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > the attached patch is a new approach to get rid of Linux's Big Kernel
> > Lock as we know it today.
>
> Interesting approach. Did you measure what it does to context switch
> rates? Usually adding semaphores tends to increase them a lot.

Not yet - I've coded it up today. Perhaps the low-latency audio folks
(who are most interested in BKL-latency removal) could report some
numbers?

But as I've replied to Linus too, I believe that _if_ the context-switch
rate goes up, then some piece of code is using the BKL way too often! So
having a semaphore here might in fact help fix those rare cases.

> One minor comment only:
> Please CSE "current" manually. It generates much better code
> on some architectures because the compiler cannot do it for you.

Yeah, agreed, will do.

Ingo

2004-09-15 17:11:04

by Ricky Beam

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Wed, 15 Sep 2004, Ingo Molnar wrote:
>yes, but progress in this area seems to have slowed down, and people are
>hurting from the latencies introduced by the BKL meanwhile. Who cares if
>some rare big chunk of code runs under a semaphore, as long as it's
>preemptable?

"as long as it's preemptable" is the key there. Not all arch's can run
with PREEMPT enabled (yet) -- sparc/sparc64 for but one. And at the moment,
PREEMPT is a bit on the hosed side.

--Ricky


2004-09-15 18:29:15

by Chris Wedgwood

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Wed, Sep 15, 2004 at 05:55:55PM +0200, Ingo Molnar wrote:

> Rare activity that still runs under the BKL (e.g. mounting, or
> ioctls) can introduce many milliseconds of scheduling delays that
> hurt latencies.

Is it not worth looking at the worst offenders (ioctl will be hard I
suspect) and fixing these as an interim step?


--cw

2004-09-15 19:38:53

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


* Ricky Beam <[email protected]> wrote:

> On Wed, 15 Sep 2004, Ingo Molnar wrote:
> >yes, but progress in this area seems to have slowed down, and people are
> >hurting from the latencies introduced by the BKL meanwhile. Who cares if
> >some rare big chunk of code runs under a semaphore, as long as it's
> >preemptable?
>
> "as long as it's preemptable" is the key there. Not all arch's can
> run with PREEMPT enabled (yet) -- sparc/sparc64 for but one. And at
> the moment, PREEMPT is a bit on the hosed side.

There's no problem at all - with my patch you'll still get the other
advantage:

- there is no time wasted spinning on the BKL if there's BKL contention

So if a platform doesn't support PREEMPT (or a user doesn't enable it),
then BKS-holding kernel code won't be preempted, of course.

Ingo

2004-09-15 20:16:25

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


* Andi Kleen <[email protected]> wrote:

> One minor comment only:
> Please CSE "current" manually. It generates much better code
> on some architectures because the compiler cannot do it for you.

New patch attached - this one uses 'current' as few times as possible.

Ingo


Attachments:
remove-bkl.patch (9.79 kB)

2004-09-15 21:31:31

by William Lee Irwin III

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Wed, 15 Sep 2004, Ingo Molnar wrote:
>> the attached patch is a new approach to get rid of Linux's Big Kernel
>> Lock as we know it today.
>
On Wed, Sep 15, 2004 at 08:40:55AM -0700, Linus Torvalds wrote:
> I really think this is wrong.
> Maybe not from a conceptual standpoint, but that implementation with the
> scheduler doing "reaquire_kernel_lock()" and doing a down() there is just
> wrong, wrong, wrong.
> If we're going to do a down() and block immediately after being scheduled,
> I don't think we should have been picked in the first place.

Well, I'm more concerned that the users all need to be swept anyway.
This could make sense to do anyway, but the sweeps IMHO are necessary
because the users almost universally are those that haven't been taught
to do any locking yet, stale and crusty code that needs porting, or
things where the locking hasn't quite been debugged or straightened out.

One thing I like is that this eliminates the implicit dropping on sleep
as a source of bugs (e.g. it was recently pointed out to me that setfl()
uses the BKL to protect ->f_flags in a codepath spanning a sleeping
call to ->fasync()), where the semaphore may be retained while sleeping.
I originally wanted to make sleeping under the BKL illegal and sweep
users to repair the places where it happens, but maybe that's unrealistic,
particularly considering that the sum of my BKL sweeps to date is one
removal from procfs "protecting" its access of nr_threads.


-- wli

2004-09-16 01:18:55

by Nick Piggin

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

Ingo Molnar wrote:

> code using lock_kernel()/unlock_kernel() will see very similar semantics
> as they got from the BKL, so correctness should be fully preserved.
> Per-CPU assumptions still work, locking exclusion and lock-recursion
> still works the same way as it did with the BKL.
>

Hi Ingo,

One change is that lock_kernel now sleeps, while it previously didn't, I
think?

Is this going to be a problem? Or have you checked the remaining bkl users
to ensure this is OK? Am I on drugs? :)

Nick

2004-09-16 14:27:31

by Bill Davidsen

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

Andi Kleen wrote:
> Ingo Molnar <[email protected]> writes:
>
>
>>the attached patch is a new approach to get rid of Linux's Big Kernel
>>Lock as we know it today.
>
>
> Interesting approach. Did you measure what it does to context
> switch rates? Usually adding semaphores tends to increase them
> a lot.

Is that (necessarily) a bad thing? If it results in less time waiting
for the BKL, and/or more time doing user work, that may drive throughput
and responsiveness up. It depends on whether the time for two context
switches is greater or less than the spin time on the BKL.

It would be nice to have the best of both worlds: use the semaphore if
there is a process on the run queue, and spin if not. That sounds
complex, and hopefully not worth the effort.

High ctx rates are not necessarily bad: when we implemented O_DIRECT for
an application, the rate went up 30%, the outbound bandwidth went up
10-15%, and waitio dropped by half at peak load. As long as something
useful is being done with the time previously wasted in spinning, I
would expect it to be a win.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-09-16 22:29:46

by Bill Huey

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Thu, Sep 16, 2004 at 10:28:08AM -0400, Bill Davidsen wrote:
> Is that (necessarily) a bad thing? If it results in less time waiting
> for BKL, and/or more time doing user work, that may drive throughput and
> responsiveness up. It depends if the time for two ctx is greater or less
> than the spin time on BKL.
>
> It would be nice to have the best of both worlds, use the semaphore if
> there is a process on the run queue, and spin if not. That sounds
> complex, and hopefully not worth the effort.

FreeBSD-current uses adaptive mutexes. However, they spin on the mutex
only if the thread owning it is running on another CPU at that time;
otherwise they sleep, possibly with priority inheritance depending on the
circumstance. It's kind of heavyweight, I know, but it might be a good
place to look for an example implementation of this thing.

I'm sure a simplified adaptive mutex could be used that's compatible
with the Linux SMP/preemption methodology, if this becomes relevant.
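
A rough sketch of that adaptive idea (hypothetical types and helpers, not
FreeBSD's or Solaris's actual code; it also glosses over owner-lifetime
and memory-ordering details):

#include <linux/sched.h>
#include <linux/wait.h>
#include <asm/atomic.h>
#include <asm/processor.h>

struct adaptive_mutex {
	atomic_t		locked;		/* 0 = free, 1 = held */
	struct task_struct	*owner;
	wait_queue_head_t	wait;
};

static void adaptive_lock(struct adaptive_mutex *m)
{
	for (;;) {
		if (atomic_cmpxchg(&m->locked, 0, 1) == 0) {
			m->owner = current;
			return;
		}
		/*
		 * Owner currently running on another CPU?  Then a short
		 * spin is likely cheaper than two context switches.
		 */
		if (m->owner && task_curr(m->owner)) {
			cpu_relax();
			continue;
		}
		/* Owner is sleeping: block until the lock is released. */
		wait_event(m->wait, atomic_read(&m->locked) == 0);
	}
}

static void adaptive_unlock(struct adaptive_mutex *m)
{
	m->owner = NULL;
	atomic_set(&m->locked, 0);
	wake_up(&m->wait);
}

Whether the spin-vs-sleep heuristic actually beats a plain semaphore
depends on how long the owner typically holds the lock, which is exactly
the "serious numbers" question raised earlier in the thread.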

> High ctx rates are not necessarily bad, when we implemented O_DIRECT for
> an application the rate went up 30%, the outbound bandwidth went up
> 10-15%, and waitio dropped by half at peak load. As long as something
> useful is being done with the time previously wasted in spinning, I
> would expect it to be a win.

It's critical for other devices as well, like keeping a GPU fully engaged
so that there aren't computational stalls when the process queue is empty,
etc...

Has anybody thought about using this to protect RCU-ed critical sections
yet, even though it doesn't guarantee quiescence? Or is the recent work
softirq-ing them sufficient? Can't this be used as a kind of per-CPU-local
data guard, much like get_cpu()/put_cpu() and friends?

bill

2004-09-16 22:43:15

by David Miller

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Thu, 16 Sep 2004 15:29:03 -0700
Bill Huey (hui) <[email protected]> wrote:

> FreeBSD-current uses adaptive mutexes. However they spin on that mutex
> only if the thread owning it is running across another CPU at that time,
> otherwise it sleeps, maybe priority inherited depending on the
> circumstance.

This is how Solaris MUTEX objects work too.

2004-09-16 22:52:04

by Bill Huey

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Thu, Sep 16, 2004 at 03:40:11PM -0700, David S. Miller wrote:
> On Thu, 16 Sep 2004 15:29:03 -0700
> Bill Huey (hui) <[email protected]> wrote:
>
> > FreeBSD-current uses adaptive mutexes. However they spin on that mutex
> > only if the thread owning it is running across another CPU at that time,
> > otherwise it sleeps, maybe priority inherited depending on the
> > circumstance.
>
> This is how Solaris MUTEX objects work too.

Yeah, I know it from Solaris Internals, and FreeBSD can be considered a
Solaris-style kernel. In contrast, I think the Linux community has a
few things up on FreeBSD/Solaris-style SMP. Specifically, the FreeBSD
community has ignored a lot of the really hard work of pushing down
locks in favor of "getting fancier locks", which only abuses thread
priorities and the scheduler. A large part of it is that they have
created a very complicated SMP infrastructure that less than a handful
of their kernel engineers - 2-3, it seems - really know how to use.

Judging from how the Linux code is done and the numbers I get from
Bill Irwin in casual conversation, the Linux SMP approach, with its
hand-honed per-CPU awareness of things, is clearly on the right track at
this time. The only serious problem spinlocks have is that they aren't
preemptable, which is what Ingo is trying to fix.

bill

2004-09-16 23:00:19

by David Miller

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Thu, 16 Sep 2004 15:51:02 -0700
Bill Huey (hui) <[email protected]> wrote:

> Judging from how the Linux code is done and the numbers I get from
> Bill Irwin in casual conversation, the Linux SMP approach is clearly
> the right track at this time with it's hand honed per-CPU awareness of
> things. The only serious problem that spinlocks have as they aren't
> preemptable, which is what Ingo is trying to fix.

This is what Linus proclaimed 6 or 7 years ago when people were
trying to convince us to do things like Solaris and other big
Unixes at the time.

2004-09-16 23:03:46

by Bill Huey

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Thu, Sep 16, 2004 at 03:54:12PM -0700, David S. Miller wrote:
> On Thu, 16 Sep 2004 15:51:02 -0700
> Bill Huey (hui) <[email protected]> wrote:
>
> > Judging from how the Linux code is done and the numbers I get from
> > Bill Irwin in casual conversation, the Linux SMP approach is clearly
> > the right track at this time with it's hand honed per-CPU awareness of
> > things. The only serious problem that spinlocks have as they aren't
> > preemptable, which is what Ingo is trying to fix.
>
> This is what Linus proclaimed 6 or 7 years ago when people were
> trying to convince us to do things like Solaris and other big
> Unixes at the time.

FreeBSD's SMPng project is stalled for the most part, and developers who
disagree with that approach have moved on to the DragonFly BSD community.
It has a much more top-down-driven locking system called tokens, which is
conceptually CPU-local, effectively deadlock-free and difficult to misuse.
It has already been able to multi-thread the networking stack using
lock-less techniques, while the FreeBSD-current tree had to retract its
"all or nothing" approach to threading its network stack. Jeffery Hsu is
the main developer pushing that subsystem.

bill

2004-09-16 23:40:17

by William Lee Irwin III

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Thu, 16 Sep 2004 15:51:02 -0700 Bill Huey (hui) wrote:
>> Judging from how the Linux code is done and the numbers I get from
>> Bill Irwin in casual conversation, the Linux SMP approach is clearly
>> the right track at this time with it's hand honed per-CPU awareness of
>> things. The only serious problem that spinlocks have as they aren't
>> preemptable, which is what Ingo is trying to fix.

On Thu, Sep 16, 2004 at 03:54:12PM -0700, David S. Miller wrote:
> This is what Linus proclaimed 6 or 7 years ago when people were
> trying to convince us to do things like Solaris and other big
> Unixes at the time.

Just in case, I didn't claim it was my idea, I merely gave empirical
evidence.


-- wli

2004-09-17 06:42:12

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


* Bill Huey <[email protected]> wrote:

> Judging from how the Linux code is done and the numbers I get from
> Bill Irwin in casual conversation, the Linux SMP approach is clearly
> the right track at this time with it's hand honed per-CPU awareness of
> things. The only serious problem that spinlocks have as they aren't
> preemptable, which is what Ingo is trying to fix.

A clarification: note that the current BKL is a special case. In no way
do I suggest that the BKS is the proper model for any SMP implementation.
It is a narrow special case because it wraps historic UP-only kernel
code.

Our primary multiprocessing primitives are still the following 4:
lockless data structures, RCU, spinlocks and mutexes (in reverse order
by level of parallelism). The BKS is basically a fifth method, a special
type of semaphore that I'd never want to see used by any new SMP code.
It is completely local to sched.c.

Ingo

2004-09-17 07:21:29

by Tony Lee

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Fri, 17 Sep 2004 08:43:21 +0200, Ingo Molnar <[email protected]> wrote:
>
> * Bill Huey <[email protected]> wrote:
>
> > Judging from how the Linux code is done and the numbers I get from
> > Bill Irwin in casual conversation, the Linux SMP approach is clearly
> > the right track at this time with it's hand honed per-CPU awareness of
> > things. The only serious problem that spinlocks have as they aren't
> > preemptable, which is what Ingo is trying to fix.
>
> a clarification: note that the current BKL is a special case. No way do
> i suggest that the BKS is the proper model for any SMP implementation.
> It is a narrow special-case because it wraps historic UP-only kernel
> code.
>
> our primary multiprocessing primitives are still the following 4:
> lockless data structures, RCU, spinlocks and mutexes. (reverse ordered
> by level of parallelism.) The BKS is basically a fifth method, a special
> type of semaphore that i'd never want to be seen used by any new SMP
> code. It is completely local to sched.c.
>
> Ingo

I coded an IPC system before using atomic add + shared memory.
It works very well (fast) on a 4-CPU SMP system, since it doesn't use
any locking API at all. It is very good for resource allocation on
SMP. I implemented a special malloc/free usable by ISRs and by processes
of different priorities, completely without any lock. The negative side
is that it uses more memory.
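
A rough illustration of the "atomic add + shared memory" pattern described
above (hypothetical code, not Tony's actual implementation):

#include <asm/atomic.h>

#define POOL_SLOTS	1024

struct msg_slot {
	char data[128];
};

/*
 * A fixed, shared pool of slots handed out by bumping a counter with a
 * single atomic op - no lock taken, usable from ISRs and from processes
 * of any priority.  The cost is memory: the pool is sized for the worst
 * case, and this simple version never reclaims slots.
 */
static struct msg_slot pool[POOL_SLOTS];
static atomic_t next_slot = ATOMIC_INIT(0);

static struct msg_slot *alloc_slot(void)
{
	int idx = atomic_inc_return(&next_slot) - 1;

	if (idx >= POOL_SLOTS)
		return NULL;		/* pool exhausted */
	return &pool[idx];
}

As Manfred points out further down, the atomic op still costs a cacheline
transfer on the shared counter, so "no lock" is not the same as "no
shared-cacheline contention".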


--
-Tony
Having a lot of fun with Xilinx Virtex Pro II reconfigurable HW + ppc + Linux

2004-09-17 10:38:37

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


The attached patch is a simplified variant of the remove-bkl patch I
sent two days ago: it doesn't do the ->cpus_allowed trick.

The question is: is there any BKL-using kernel code that relies on the
task not migrating to another CPU within the BKL critical section?
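
A hypothetical example of the kind of dependence being asked about
(per_cpu_counts and do_lots_of_work() are made up):

#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/smp_lock.h>

static int per_cpu_counts[NR_CPUS];	/* made-up per-CPU data */

void do_lots_of_work(void);		/* made-up helper */

static void bkl_percpu_example(void)
{
	int cpu;

	lock_kernel();
	cpu = smp_processor_id();
	per_cpu_counts[cpu]++;
	/*
	 * With the semaphore-based BKL under CONFIG_PREEMPT, and without
	 * the ->cpus_allowed trick, the task can be preempted here and
	 * resume on a different CPU while still holding the BKS...
	 */
	do_lots_of_work();
	per_cpu_counts[cpu]--;	/* ...so "cpu" may now be the wrong CPU */
	unlock_kernel();
}

Code like this would have to switch to get_cpu()/put_cpu(), which is what
the smp_processor_id() debug patch later in this thread is meant to flag.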

(Patch is against 2.6.9-rc2-mm1. It booted fine on x86 SMP+PREEMPT
and UP, and the previous patch was tested on x64 as well, so I'd expect
this one to work fine on all platforms.)

Ingo

--- linux/include/linux/smp_lock.h.orig
+++ linux/include/linux/smp_lock.h
@@ -7,59 +7,14 @@

#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)

-extern spinlock_t kernel_flag;
-
-#define kernel_locked() (current->lock_depth >= 0)
-
-#define get_kernel_lock() spin_lock(&kernel_flag)
-#define put_kernel_lock() spin_unlock(&kernel_flag)
-
-/*
- * Release global kernel lock.
- */
-static inline void release_kernel_lock(struct task_struct *task)
-{
- if (unlikely(task->lock_depth >= 0))
- put_kernel_lock();
-}
-
-/*
- * Re-acquire the kernel lock
- */
-static inline void reacquire_kernel_lock(struct task_struct *task)
-{
- if (unlikely(task->lock_depth >= 0))
- get_kernel_lock();
-}
-
-/*
- * Getting the big kernel lock.
- *
- * This cannot happen asynchronously,
- * so we only need to worry about other
- * CPU's.
- */
-static inline void lock_kernel(void)
-{
- int depth = current->lock_depth+1;
- if (likely(!depth))
- get_kernel_lock();
- current->lock_depth = depth;
-}
-
-static inline void unlock_kernel(void)
-{
- BUG_ON(current->lock_depth < 0);
- if (likely(--current->lock_depth < 0))
- put_kernel_lock();
-}
+extern int kernel_locked(void);
+extern void lock_kernel(void);
+extern void unlock_kernel(void);

#else

#define lock_kernel() do { } while(0)
#define unlock_kernel() do { } while(0)
-#define release_kernel_lock(task) do { } while(0)
-#define reacquire_kernel_lock(task) do { } while(0)
#define kernel_locked() 1

#endif /* CONFIG_SMP || CONFIG_PREEMPT */
--- linux/include/linux/hardirq.h.orig
+++ linux/include/linux/hardirq.h
@@ -27,12 +27,12 @@
#define in_softirq() (softirq_count())
#define in_interrupt() (irq_count())

+#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
+
#ifdef CONFIG_PREEMPT
-# define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked())
# define preemptible() (preempt_count() == 0 && !irqs_disabled())
# define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1)
#else
-# define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
# define preemptible() 0
# define IRQ_EXIT_OFFSET HARDIRQ_OFFSET
#endif
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -2288,6 +2288,105 @@ static inline int dependent_sleeper(int
}
#endif

+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
+
+/*
+ * The 'big kernel semaphore'
+ *
+ * This mutex is taken and released recursively by lock_kernel()
+ * and unlock_kernel(). It is transparently dropped and reaquired
+ * over schedule(). It is used to protect legacy code that hasn't
+ * been migrated to a proper locking design yet.
+ *
+ * Note: code locked by this semaphore will only be serialized against
+ * other code using the same locking facility. The code guarantees that
+ * the task remains on the same CPU.
+ *
+ * Don't use in new code.
+ */
+static __cacheline_aligned_in_smp DECLARE_MUTEX(kernel_sem);
+
+int kernel_locked(void)
+{
+ return current->lock_depth >= 0;
+}
+
+EXPORT_SYMBOL(kernel_locked);
+
+/*
+ * Release global kernel semaphore:
+ */
+static inline void release_kernel_sem(struct task_struct *task)
+{
+ if (unlikely(task->lock_depth >= 0))
+ up(&kernel_sem);
+}
+
+/*
+ * Re-acquire the kernel semaphore.
+ *
+ * This function is called with preemption off.
+ *
+ * We are executing in schedule() so the code must be extremely careful
+ * about recursion, both due to the down() and due to the enabling of
+ * preemption. schedule() will re-check the preemption flag after
+ * reacquiring the semaphore.
+ */
+static inline void reacquire_kernel_sem(struct task_struct *task)
+{
+ int saved_lock_depth = task->lock_depth;
+
+ if (likely(saved_lock_depth < 0))
+ return;
+
+ task->lock_depth = -1;
+ preempt_enable_no_resched();
+
+ down(&kernel_sem);
+
+ preempt_disable();
+ task->lock_depth = saved_lock_depth;
+}
+
+/*
+ * Getting the big kernel semaphore.
+ */
+void lock_kernel(void)
+{
+ struct task_struct *task = current;
+ int depth = task->lock_depth + 1;
+
+ if (likely(!depth))
+ /*
+ * No recursion worries - we set up lock_depth _after_
+ */
+ down(&kernel_sem);
+
+ task->lock_depth = depth;
+}
+
+EXPORT_SYMBOL(lock_kernel);
+
+void unlock_kernel(void)
+{
+ struct task_struct *task = current;
+
+ BUG_ON(task->lock_depth < 0);
+
+ if (likely(--task->lock_depth < 0))
+ up(&kernel_sem);
+}
+
+EXPORT_SYMBOL(unlock_kernel);
+
+#else
+
+static inline void release_kernel_sem(struct task_struct *task) { }
+static inline void reacquire_kernel_sem(struct task_struct *task) { }
+
+#endif
+
+
/*
* schedule() is the main scheduler function.
*/
@@ -2328,7 +2427,7 @@ need_resched:
dump_stack();
}

- release_kernel_lock(prev);
+ release_kernel_sem(prev);
schedstat_inc(rq, sched_cnt);
now = clock_us();
run_time = now - prev->timestamp;
@@ -2443,7 +2542,7 @@ switch_tasks:
} else
spin_unlock_irq(&rq->lock);

- reacquire_kernel_lock(current);
+ reacquire_kernel_sem(current);
preempt_enable_no_resched();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
@@ -2460,6 +2559,8 @@ EXPORT_SYMBOL(schedule);
asmlinkage void __sched preempt_schedule(void)
{
struct thread_info *ti = current_thread_info();
+ struct task_struct *task = current;
+ int saved_lock_depth;

/*
* If there is a non-zero preempt_count or interrupts are disabled,
@@ -2470,7 +2571,15 @@ asmlinkage void __sched preempt_schedule

need_resched:
ti->preempt_count = PREEMPT_ACTIVE;
+ /*
+ * We keep the big kernel semaphore locked, but we
+ * clear ->lock_depth so that schedule() doesnt
+ * auto-release the semaphore:
+ */
+ saved_lock_depth = task->lock_depth;
+ task->lock_depth = 0;
schedule();
+ task->lock_depth = saved_lock_depth;
ti->preempt_count = 0;

/* we could miss a preemption opportunity between schedule and now */
@@ -3519,11 +3628,7 @@ void __devinit init_idle(task_t *idle, i
spin_unlock_irqrestore(&rq->lock, flags);

/* Set the preempt count _outside_ the spinlocks! */
-#ifdef CONFIG_PREEMPT
- idle->thread_info->preempt_count = (idle->lock_depth >= 0);
-#else
idle->thread_info->preempt_count = 0;
-#endif
}

/*
@@ -3924,21 +4029,6 @@ int __init migration_init(void)
}
#endif

-/*
- * The 'big kernel lock'
- *
- * This spinlock is taken and released recursively by lock_kernel()
- * and unlock_kernel(). It is transparently dropped and reaquired
- * over schedule(). It is used to protect legacy code that hasn't
- * been migrated to a proper locking design yet.
- *
- * Don't use in new code.
- *
- * Note: spinlock debugging needs this even on !CONFIG_SMP.
- */
-spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
-EXPORT_SYMBOL(kernel_flag);
-
#ifdef CONFIG_SMP
/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
--- linux/init/main.c.orig
+++ linux/init/main.c
@@ -436,6 +436,7 @@ static void noinline rest_init(void)
kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
numa_default_policy();
unlock_kernel();
+ preempt_enable_no_resched();
cpu_idle();
}

@@ -501,6 +502,12 @@ asmlinkage void __init start_kernel(void
* time - but meanwhile we still have a functioning scheduler.
*/
sched_init();
+ /*
+ * The early boot stage up until we run the first idle thread
+ * is a very volatile affair for the scheduler. Disable preemption
+ * up until the init thread has been started:
+ */
+ preempt_disable();
build_all_zonelists();
page_alloc_init();
trap_init();

2004-09-17 13:31:33

by Andrea Arcangeli

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Fri, Sep 17, 2004 at 12:39:45PM +0200, Ingo Molnar wrote:
> task not migrating to another CPU within the BLK critical section?

I very much doubt it - I'd expect this to work, but it really should be a
config option if you don't open 2.7. This is the kind of thing that
cannot happen in a 2.6.* release without a config option to leave off in
production, IMHO, since it can have implications well outside the mainline
kernel (every driver outside the kernel would be affected too).

2004-09-17 13:47:55

by William Lee Irwin III

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Fri, Sep 17, 2004 at 12:39:45PM +0200, Ingo Molnar wrote:
>> task not migrating to another CPU within the BLK critical section?

On Fri, Sep 17, 2004 at 03:26:41PM +0200, Andrea Arcangeli wrote:
> I very much doubt, I'd expect this to work, but it really should be a
> config option if you don't open 2.7. This is the kind of thing that
> cannot happen in a 2.6.* release without a config option to leave off in
> production IMHO since it can have implications well outside the mainline
> kernel (every driver outside the kernel would be affected too).

IMHO this would be more likely to fix bugs than introduce them, so long
as the BKL is acquired before spinlocks in all instances. There are
many instances where scheduling under the BKL is inadvertent and thus
racy, but very few (if any) where the BKL is an inner lock or where a
holder of the BKL yields so that something that will wake it can acquire
the BKL.


-- wli

2004-09-17 14:01:39

by Andrea Arcangeli

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Fri, Sep 17, 2004 at 06:47:37AM -0700, William Lee Irwin III wrote:
> as the BKL is acquired before spinlocks in all instances. There are

That reinforces my argument - that's another case that can break with
this patch.

I don't think it has a chance to fix any bug; if anything, it will
trigger some deadlock or race condition. OTOH I agree it should work
fine (all code I can recall from memory takes the BKL before any inner
spinlock too, and I don't think we left anything that depends on
smp_processor_id()), but again, this is the kind of change that requires
some testing and cannot be shipped by default in a stable tree; it
requires a config option at the very least.

2004-09-17 14:22:21

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


* Ingo Molnar <[email protected]> wrote:

> the attached patch is a simplified variant of the remove-bkl patch i
> sent two days ago: it doesnt do the ->cpus_allowed trick.
>
> The question is, is there any BKL-using kernel code that relies on the
> task not migrating to another CPU within the BLK critical section?

The attached debug patch is on top of the above patch and prints warnings
if code uses smp_processor_id() in a preemptible section of code. The
patch gets rid of a number of common false positives, but more false
positives are more than likely.

The patch has already uncovered a bug: there was a softirq invocation
race possibly leading to (<1msec) softirq processing delays in the
PREEMPT case. But more importantly, it has found no BKL user so far that
relied on smp_processor_id(), which makes me hopeful that the problem is
not widespread at all and that we could opt for the simpler,
semaphore-based BKS design.

(patch tested on x86 and x64, should work on all architectures.)

Ingo


--- linux/arch/i386/lib/delay.c
+++ linux/arch/i386/lib/delay.c
@@ -34,7 +34,7 @@ inline void __const_udelay(unsigned long
xloops *= 4;
__asm__("mull %0"
:"=d" (xloops), "=&a" (d0)
- :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy * (HZ/4)));
+ :"1" (xloops),"0" (boot_cpu_data.loops_per_jiffy * (HZ/4)));
__delay(++xloops);
}

--- linux/arch/x86_64/lib/delay.c
+++ linux/arch/x86_64/lib/delay.c
@@ -34,7 +34,7 @@ void __delay(unsigned long loops)

inline void __const_udelay(unsigned long xloops)
{
- __delay(((xloops * current_cpu_data.loops_per_jiffy) >> 32) * HZ);
+ __delay(((xloops * boot_cpu_data.loops_per_jiffy) >> 32) * HZ);
}

void __udelay(unsigned long usecs)
--- linux/include/asm-i386/smp.h
+++ linux/include/asm-i386/smp.h
@@ -50,7 +50,7 @@ extern u8 x86_cpu_to_apicid[];
* from the initial startup. We map APIC_BASE very early in page_setup(),
* so this is correct in the x86 case.
*/
-#define smp_processor_id() (current_thread_info()->cpu)
+#define __smp_processor_id() (current_thread_info()->cpu)

extern cpumask_t cpu_callout_map;
#define cpu_possible_map cpu_callout_map
--- linux/include/asm-x86_64/smp.h
+++ linux/include/asm-x86_64/smp.h
@@ -66,7 +66,7 @@ static inline int num_booting_cpus(void)
return cpus_weight(cpu_callout_map);
}

-#define smp_processor_id() read_pda(cpunumber)
+#define __smp_processor_id() read_pda(cpunumber)

extern __inline int hard_smp_processor_id(void)
{
--- linux/include/linux/smp.h
+++ linux/include/linux/smp.h
@@ -95,8 +95,11 @@ void smp_prepare_boot_cpu(void);
/*
* These macros fold the SMP functionality into a single CPU system
*/
-
-#define smp_processor_id() 0
+
+#if !defined(__smp_processor_id) || !defined(CONFIG_PREEMPT)
+# define smp_processor_id() 0
+
+#endif
#define hard_smp_processor_id() 0
#define smp_threads_ready 1
#define smp_call_function(func,info,retry,wait) ({ 0; })
@@ -107,6 +110,17 @@ static inline void smp_send_reschedule(i

#endif /* !SMP */

+#ifdef __smp_processor_id
+# ifdef CONFIG_PREEMPT
+ extern unsigned int smp_processor_id(void);
+# else
+# define smp_processor_id() __smp_processor_id()
+# endif
+# define _smp_processor_id() __smp_processor_id()
+#else
+# define _smp_processor_id() smp_processor_id()
+#endif
+
#define get_cpu() ({ preempt_disable(); smp_processor_id(); })
#define put_cpu() preempt_enable()
#define put_cpu_no_resched() preempt_enable_no_resched()
--- linux/include/net/route.h
+++ linux/include/net/route.h
@@ -105,7 +105,7 @@ struct rt_cache_stat

extern struct rt_cache_stat *rt_cache_stat;
#define RT_CACHE_STAT_INC(field) \
- (per_cpu_ptr(rt_cache_stat, smp_processor_id())->field++)
+ (per_cpu_ptr(rt_cache_stat, _smp_processor_id())->field++)

extern struct ip_rt_acct *ip_rt_acct;

--- linux/include/net/snmp.h
+++ linux/include/net/snmp.h
@@ -128,18 +128,18 @@ struct linux_mib {
#define SNMP_STAT_USRPTR(name) (name[1])

#define SNMP_INC_STATS_BH(mib, field) \
- (per_cpu_ptr(mib[0], smp_processor_id())->mibs[field]++)
+ (per_cpu_ptr(mib[0], _smp_processor_id())->mibs[field]++)
#define SNMP_INC_STATS_OFFSET_BH(mib, field, offset) \
- (per_cpu_ptr(mib[0], smp_processor_id())->mibs[field + (offset)]++)
+ (per_cpu_ptr(mib[0], _smp_processor_id())->mibs[field + (offset)]++)
#define SNMP_INC_STATS_USER(mib, field) \
- (per_cpu_ptr(mib[1], smp_processor_id())->mibs[field]++)
+ (per_cpu_ptr(mib[1], _smp_processor_id())->mibs[field]++)
#define SNMP_INC_STATS(mib, field) \
- (per_cpu_ptr(mib[!in_softirq()], smp_processor_id())->mibs[field]++)
+ (per_cpu_ptr(mib[!in_softirq()], _smp_processor_id())->mibs[field]++)
#define SNMP_DEC_STATS(mib, field) \
- (per_cpu_ptr(mib[!in_softirq()], smp_processor_id())->mibs[field]--)
+ (per_cpu_ptr(mib[!in_softirq()], _smp_processor_id())->mibs[field]--)
#define SNMP_ADD_STATS_BH(mib, field, addend) \
- (per_cpu_ptr(mib[0], smp_processor_id())->mibs[field] += addend)
+ (per_cpu_ptr(mib[0], _smp_processor_id())->mibs[field] += addend)
#define SNMP_ADD_STATS_USER(mib, field, addend) \
- (per_cpu_ptr(mib[1], smp_processor_id())->mibs[field] += addend)
+ (per_cpu_ptr(mib[1], _smp_processor_id())->mibs[field] += addend)

#endif
--- linux/kernel/printk.c
+++ linux/kernel/printk.c
@@ -641,8 +641,9 @@ void release_console_sem(void)
_con_start = con_start;
_log_end = log_end;
con_start = log_end; /* Flush */
- spin_unlock_irqrestore(&logbuf_lock, flags);
+ spin_unlock(&logbuf_lock);
call_console_drivers(_con_start, _log_end);
+ local_irq_restore(flags);
}
console_locked = 0;
console_may_schedule = 0;
--- linux/kernel/sched.c
+++ linux/kernel/sched.c
@@ -2288,6 +2288,56 @@ static inline int dependent_sleeper(int
}
#endif

+#if defined(CONFIG_PREEMPT) && defined(__smp_processor_id)
+/*
+ * Debugging check.
+ */
+unsigned int smp_processor_id(void)
+{
+ unsigned long preempt_count = preempt_count();
+ int this_cpu = __smp_processor_id();
+ cpumask_t this_mask;
+
+ if (likely(preempt_count))
+ goto out;
+
+ if (irqs_disabled())
+ goto out;
+
+ /*
+ * Kernel threads bound to a single CPU can safely use
+ * smp_processor_id():
+ */
+ this_mask = cpumask_of_cpu(this_cpu);
+
+ if (cpus_equal(current->cpus_allowed, this_mask))
+ goto out;
+
+ /*
+ * It is valid to assume CPU-locality during early bootup:
+ */
+ if (system_state != SYSTEM_RUNNING)
+ goto out;
+
+ /*
+ * Avoid recursion:
+ */
+ preempt_disable();
+
+ if (!printk_ratelimit())
+ goto out_enable;
+
+ printk(KERN_ERR "using smp_processor_id() in preemptible code: %s/%d\n",
+ current->comm, current->pid);
+ dump_stack();
+
+out_enable:
+ preempt_enable_no_resched();
+out:
+ return this_cpu;
+}
+#endif
+
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)

/*
@@ -3406,7 +3456,7 @@ EXPORT_SYMBOL(yield);
*/
void __sched io_schedule(void)
{
- struct runqueue *rq = this_rq();
+ struct runqueue *rq = &per_cpu(runqueues, _smp_processor_id());

atomic_inc(&rq->nr_iowait);
schedule();
@@ -3417,7 +3467,7 @@ EXPORT_SYMBOL(io_schedule);

long __sched io_schedule_timeout(long timeout)
{
- struct runqueue *rq = this_rq();
+ struct runqueue *rq = &per_cpu(runqueues, _smp_processor_id());
long ret;

atomic_inc(&rq->nr_iowait);
--- linux/kernel/softirq.c
+++ linux/kernel/softirq.c
@@ -137,12 +137,19 @@ EXPORT_SYMBOL(do_softirq);

void local_bh_enable(void)
{
- __local_bh_enable();
WARN_ON(irqs_disabled());
- if (unlikely(!in_interrupt() &&
- local_softirq_pending()))
+ if (unlikely(!in_interrupt() && local_softirq_pending())) {
+ /*
+ * Keep preemption disabled until we are done with
+ * softirq processing:
+ */
+ preempt_count() -= SOFTIRQ_OFFSET - 1;
invoke_softirq();
- preempt_check_resched();
+ preempt_enable();
+ } else {
+ __local_bh_enable();
+ preempt_check_resched();
+ }
}
EXPORT_SYMBOL(local_bh_enable);

2004-09-17 14:22:30

by William Lee Irwin III

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Fri, Sep 17, 2004 at 06:47:37AM -0700, William Lee Irwin III wrote:
>> as the BKL is acquired before spinlocks in all instances. There are

On Fri, Sep 17, 2004 at 03:56:57PM +0200, Andrea Arcangeli wrote:
> that reinforces my argument, that's another case that can break with
> this patch.
> I don't think it has a chance to fix any bug, if something it will
> trigger some deadlock or race condition, but OTOH I agree it should work
> fine (all code I can recall by memory takes the BKL before any inner
> spinlock too, and I don't think we left anything that depends on
> smp_processor_id()), but again this is the kind of change that requires
> some testing and cannot be shipped by default in a stable tree, it
> requires config option at the very least.

But the corner cases I outlined don't exist; the case that does exist
is where the BKL is dropped by mistake by e.g. a call to a function
that may sleep. c.f. fs/fcntl.c:setfl().


-- wli

2004-09-17 15:34:33

by Linus Torvalds

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real



On Fri, 17 Sep 2004, Ingo Molnar wrote:
>
> the attached patch is a simplified variant of the remove-bkl patch i
> sent two days ago: it doesnt do the ->cpus_allowed trick.
>
> The question is, is there any BKL-using kernel code that relies on the
> task not migrating to another CPU within the BLK critical section?

I don't think there can be any _valid_ such use.

Anything that knows about CPUs has to use "get_cpu()/put_cpu()" anyway.
You might add back in the debugging checks that we used to have for
"smp_processor_id()" in this patch for testing, and if it goes into -mm
we'd see if anything triggers.

I still don't exactly love this patch to death, but making the cpumask
trickery go away does make it look a lot simpler, I have to say.

Linus

2004-09-17 21:07:10

by Andrea Arcangeli

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Fri, Sep 17, 2004 at 02:53:34PM +0200, Ingo Molnar wrote:
> the attached debug patch is ontop of the above patch and prints warnings
> if code uses smp_processor_id() in a preemptible section of code. The
> patch gets rid of a number of common false positives but more false
> positives are more than likely.

This cannot be applied to 2.6, IMHO. It still doesn't track
lock_kernel usage inside a spinlock, and even if it did, we cannot use
the stable 2.6 userbase for beta testing to see if some production
machine triggers those warnings.

If you made this a config option *then* it could be included; OTOH, for
this kind of thing 2.7 would be more appropriate (the config option is
just a waste of resources, since eventually the whole thing will go
away).

Overall I think this is not a change that a vendor kernel could ever
make in the middle of a stable series (it sounds quite similar to
preempt, even if less risky, and it really buys the end user nothing, so
it's definitely not justified), and in turn I don't think it's
appropriate for 2.6 mainline either (at least not by default, without a
config option, the way you're posting it).

In fact I'm not even convinced this is the right step forward in the
removal of the BKL; rather than wasting our time discussing this, it'd
be better to start removing the lock_kernel calls.

2004-09-18 05:45:23

by Manfred Spraul

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

Tony wrote:

>I coded a IPC system before use atomic add + share memory.
>It works very well (fast) on 4 CPU SMP system, since it doesn't use
>any locking API at all. Very good for resource allocation for
>SMP. I implemented speciall malloc/free use by ISR, different
>prority process completely without any lock.
>
Without any lock, or without any common cacheline that is accessed by
atomic operations?
I usually consider the costs of
atomic_inc(&global_atomic_var);
and
spin_lock(&global_lock);
global_var++;
spin_unlock(&global_lock);
as nearly identical (assuming that global_var and global_lock are in the
same cacheline): one cacheline transfer per run. The 5 instructions under
the spinlock and the theoretical chance that spin_lock() blocks are
noise compared to the cost of the line transfer.
And that's without thinking about smp_mb__{before,after}_atomic_inc().

Btw, Ingo forgot to mention sequence locks and percpu_counter as two
high-scalability locking primitives.

--
Manfred

2004-09-18 08:00:36

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


* Andrea Arcangeli <[email protected]> wrote:

> [...] it still doesn't track the lock_kernel usage inside a spinlock,
> [...]

What do you mean? If something does lock_kernel() within spin_lock(),
then the might_sleep() check in down() catches it and will print a
warning. (None of these warnings has triggered on any of my 3 systems
yet, suggesting that the problem is not widespread. In any case it's
very likely a _bug_, so we want to know!)
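
For reference, the pattern being discussed looks like this (hypothetical
code; some_lock is made up). With the semaphore-based BKL, the
lock_kernel() below becomes a down() in atomic context, so the
might_sleep() check in down() fires and prints a warning:

#include <linux/spinlock.h>
#include <linux/smp_lock.h>

static spinlock_t some_lock = SPIN_LOCK_UNLOCKED;

static void bad_nesting(void)
{
	spin_lock(&some_lock);
	lock_kernel();		/* down() while atomic -> might_sleep() warns */
	/* ... */
	unlock_kernel();
	spin_unlock(&some_lock);
}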

Ingo

2004-09-18 13:09:43

by Ingo Molnar

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real


* Manfred Spraul <[email protected]> wrote:

> Btw, Ingo forgot to mention sequence locks and percpu_counter as two
> high-scalability locking primitives.

yeah.

Ingo

2004-09-18 23:47:49

by Andrea Arcangeli

Subject: Re: [patch] remove the BKL (Big Kernel Lock), this time for real

On Sat, Sep 18, 2004 at 10:02:14AM +0200, Ingo Molnar wrote:
> what do you mean? If something does lock_kernel() within spin_lock()
> then the might_sleep() check in down() catches it and will print a
> warning. [..]

I didn't see any patch that removed the CONFIG_DEBUG_SPINLOCK_SLEEP
config option and avoided might_sleep being defined to a noop. If we apply
your patch, that's something we need IMHO, and that's what I meant by
"it still doesn't track the lock_kernel usage inside a spinlock". I
didn't say "_always_ track the lock_kernel usage inside a spinlock",
but I thought the "always" was implicit; clearly it was not, sorry for
not clarifying this.

> [..] In any case it's very likely a _bug_ so we want to know!)

Wanting to know is fine with me; what's not fine with me is deadlocking
a production box after that, when it could be a legitimate usage. It's
not about the printk, it's about the crash that follows. As said, this
thing spreads all over the place, especially in possibly bogus drivers
out of the tree, and I don't think it provides any significant benefit
(OK, I'm biased since I tend to think about PREEMPT=n, where a semaphore
won't provide any benefit, and any latency issue with the BKL can be
fixed without risk using cond_resched_bkl, as pointed out earlier, and
that definitely doesn't need this patch). And as Linus mentioned, the
risk of overscheduling is also possible with the semaphore, but that's
the minor issue here IMHO.

So in short, if you change your patch to only add checks that lead to a
printk, that's very much fine with me (basically you would have to change
lock_kernel() to add an unconditional might_sleep in there, plus the
other stuff for smp_processor_id). I'm not against wanting to know.

Overall I still don't see any benefit. The thing to fix is the
lock_kernel global lock concept, which doesn't scale and doesn't
self-document what is being locked against what. IMHO changing the
lock_kernel internal details in a way that changes the semantics in a
subtle manner cannot bring any long-term benefit, and it can only hurt
in the short term due to the potential breakage it introduces.