2001-10-29 16:57:42

by Juergen Doelle

[permalink] [raw]
Subject: Pls apply this spinlock patch to the kernel

Alan,

I have created a patch that improves spinlock behavior on IA32 SMP
systems under a file system workload generated with dbench
(ftp://samba.org/pub/tridge/dbench). The workload is a mix of create,
delete, write, and read operations executed by a scalable number of
clients. It is handled mainly in the buffer cache.

The patch introduces a new spinlock type which is cacheline aligned
and occupies a full cacheline. The five hottest spinlocks for this
workload are converted to that type (kmap_lock, lru_list_lock,
pagecache_lock, kernel_flag, pagemap_lru_lock). The change is made in
common code, except for the kernel_flag.

Because each lock now has a cacheline of its own, the communication
overhead for L1 cache synchronization is minimal for these spinlocks.
This results in significant improvements on an IA32 system
(PIII Xeon 700 MHz) with 4 and 8 CPUs.
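For illustration only (my_hot_lock is a made-up name, not one of the five
locks in the patch), the conversion pattern looks roughly like this; the
#define keeps the change invisible to the lock's users:

/* Simplified, hypothetical example of the conversion pattern: the lock
 * lives inside the padded spinlock_cacheline_t union from spinlock.h,
 * and the #define lets all existing callers keep using the plain name. */
#include <linux/spinlock.h>

static spinlock_cacheline_t my_hot_lock_cacheline = {SPIN_LOCK_UNLOCKED};
#define my_hot_lock (my_hot_lock_cacheline.lock)

static void my_hot_lock_user(void)
{
	spin_lock(&my_hot_lock);	/* callers are completely unchanged */
	/* ... critical section ... */
	spin_unlock(&my_hot_lock);
}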

The improvement by the patch on 2.4.12-ac5 is:

#CPUs   peak throughput   average over 8,10,12,...,20 clients
  UP         0.0%                       0.1%
   1        -0.1%                       0.0%
   2        -0.8%                      -1.2%
   4         3.7%                       4.4%
   8        16.5%                      23.2%

Note:
o Differences of 0.1% can be considered noise. UP behavior should be
unchanged.
o The throughput for 8 CPUs on the base kernel declines immediately
after its maximum at 6 clients.

With this patch the 8-way still does not really scale, but it is no longer
worse than a 4-way for this kind of workload.

I have already posted detailed measurement results and the patch to
the lse-tech mailing list and LKML. The current version of the patch
incorporates feedback from Stephen Tweedie, Daniel Phillips, Andrea
Arcangeli, and Andrew Morton.

Because the patch modifies common code, I also verified it on linux/390.
On that architecture it does not change the results. It seems the
overhead for cache synchronization is already minimal there, because the
CPUs share one L2 cache. But this shows that the patch works on other
architectures and, in the worst case, does no harm, even on a system
with a 256-byte cacheline.

I think this change is worth including in the kernel, because it improves
scaling on servers with 4 or more CPUs, while the degradation on 2-CPU
systems is very small.

Can you apply it to your kernel?

Thanks!
Juergen

______________________________________________________________
Juergen Doelle
IBM Linux Technology Center - kernel performance
[email protected]


= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

--- linux/include/linux/spinlock.h.orig Tue Sep 25 09:05:09 2001
+++ linux/include/linux/spinlock.h Tue Sep 25 11:42:51 2001
@@ -133,4 +133,20 @@
extern int atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);
#endif

+#ifdef CONFIG_SMP
+#include <linux/cache.h>
+
+typedef union {
+ spinlock_t lock;
+ char fill_up[(SMP_CACHE_BYTES)];
+} __attribute__ ((aligned(SMP_CACHE_BYTES))) spinlock_cacheline_t;
+
+#else /* SMP */
+
+typedef struct {
+ spinlock_t lock;
+} spinlock_cacheline_t;
+
+
+#endif
#endif /* __LINUX_SPINLOCK_H */
--- linux/include/linux/swap.h.orig Tue Sep 25 09:59:15 2001
+++ linux/include/linux/swap.h Tue Sep 25 11:42:51 2001
@@ -86,7 +86,10 @@
extern atomic_t nr_async_pages;
extern atomic_t page_cache_size;
extern atomic_t buffermem_pages;
-extern spinlock_t pagecache_lock;
+
+extern spinlock_cacheline_t pagecache_lock_cacheline;
+#define pagecache_lock (pagecache_lock_cacheline.lock)
+
extern void __remove_inode_page(struct page *);

/* Incomplete types for prototype declarations: */
@@ -159,7 +162,8 @@
extern unsigned long swap_cache_find_success;
#endif

-extern spinlock_t pagemap_lru_lock;
+extern spinlock_cacheline_t pagemap_lru_lock_cacheline;
+#define pagemap_lru_lock pagemap_lru_lock_cacheline.lock

extern void FASTCALL(mark_page_accessed(struct page *));

--- linux/include/asm-i386/smplock.h.orig Tue Sep 25 11:29:51 2001
+++ linux/include/asm-i386/smplock.h Tue Sep 25 11:42:52 2001
@@ -8,7 +8,8 @@
#include <linux/sched.h>
#include <asm/current.h>

-extern spinlock_t kernel_flag;
+extern spinlock_cacheline_t kernel_flag_cacheline;
+#define kernel_flag kernel_flag_cacheline.lock

#define kernel_locked() spin_is_locked(&kernel_flag)

--- linux/arch/i386/kernel/i386_ksyms.c.orig Tue Sep 25 09:14:57 2001
+++ linux/arch/i386/kernel/i386_ksyms.c Tue Sep 25 10:13:06 2001
@@ -120,7 +120,7 @@

#ifdef CONFIG_SMP
EXPORT_SYMBOL(cpu_data);
-EXPORT_SYMBOL(kernel_flag);
+EXPORT_SYMBOL(kernel_flag_cacheline);
EXPORT_SYMBOL(smp_num_cpus);
EXPORT_SYMBOL(cpu_online_map);
EXPORT_SYMBOL_NOVERS(__write_lock_failed);
--- linux/arch/i386/kernel/smp.c.orig Tue Sep 25 09:15:16 2001
+++ linux/arch/i386/kernel/smp.c Tue Sep 25 10:13:06 2001
@@ -101,7 +101,7 @@
*/

/* The 'big kernel lock' */
-spinlock_t kernel_flag = SPIN_LOCK_UNLOCKED;
+spinlock_cacheline_t kernel_flag_cacheline = {SPIN_LOCK_UNLOCKED};

struct tlb_state cpu_tlbstate[NR_CPUS] = {[0 ... NR_CPUS-1] = { &init_mm, 0 }};

--- linux/fs/buffer.c.orig Tue Sep 25 09:15:47 2001
+++ linux/fs/buffer.c Wed Sep 26 12:17:29 2001
@@ -81,7 +81,10 @@
static rwlock_t hash_table_lock = RW_LOCK_UNLOCKED;

static struct buffer_head *lru_list[NR_LIST];
-static spinlock_t lru_list_lock = SPIN_LOCK_UNLOCKED;
+
+static spinlock_cacheline_t lru_list_lock_cacheline = {SPIN_LOCK_UNLOCKED};
+#define lru_list_lock lru_list_lock_cacheline.lock
+
static int nr_buffers_type[NR_LIST];
static unsigned long size_buffers_type[NR_LIST];

--- linux/mm/filemap.c.orig Tue Sep 25 09:16:06 2001
+++ linux/mm/filemap.c Tue Sep 25 10:13:06 2001
@@ -46,12 +46,13 @@
unsigned int page_hash_bits;
struct page **page_hash_table;

-spinlock_t pagecache_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
+spinlock_cacheline_t pagecache_lock_cacheline = {SPIN_LOCK_UNLOCKED};
+
/*
* NOTE: to avoid deadlocking you must never acquire the pagecache_lock with
* the pagemap_lru_lock held.
*/
-spinlock_t pagemap_lru_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
+spinlock_cacheline_t pagemap_lru_lock_cacheline = {SPIN_LOCK_UNLOCKED};

#define CLUSTER_PAGES (1 << page_cluster)
#define CLUSTER_OFFSET(x) (((x) >> page_cluster) << page_cluster)
--- linux/mm/highmem.c.orig Tue Sep 25 09:29:49 2001
+++ linux/mm/highmem.c Tue Sep 25 10:21:14 2001
@@ -32,7 +32,8 @@
*/
static int pkmap_count[LAST_PKMAP];
static unsigned int last_pkmap_nr;
-static spinlock_t kmap_lock = SPIN_LOCK_UNLOCKED;
+static spinlock_cacheline_t kmap_lock_cacheline = {SPIN_LOCK_UNLOCKED};
+#define kmap_lock kmap_lock_cacheline.lock

pte_t * pkmap_page_table;


2001-10-29 17:16:02

by Alan

[permalink] [raw]
Subject: Re: Pls apply this spinlock patch to the kernel

> and occupies the full cacheline. The five hottest spinlocks for this
> work load are set to that type (kmap_lock, lru_list_lock,
> pagecache_lock, kernel_flag, pagemap_lru_lock) The change is done in
> the common code, except for the kernel_flag.

That seems like overkill. What you are saying is that, in some cases,
static spinlocks need to be declared followed by padding.

> +static spinlock_cacheline_t lru_list_lock_cacheline = {SPIN_LOCK_UNLOCKED};
> +#define lru_list_lock lru_list_lock_cacheline.lock

So why not just add a macro for the alignment and then do

spinlock_t pagecache_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
cache_line_pad;

where cache_line_pad is an asm(".align") - I would assume that is
sufficient - Linus ?
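Roughly, something like the following (hypothetical sketch only; the exact
spelling is made up here, and 128 just stands in for SMP_CACHE_BYTES):

/* Hypothetical sketch of the cache_line_pad idea: a top-level asm that
 * aligns whatever the compiler emits next, so the remainder of the lock's
 * cacheline is left as padding. */
#ifdef CONFIG_SMP
#define cache_line_pad	__asm__(".align 128")	/* 128: placeholder for the line size */
#else
#define cache_line_pad	__asm__("")
#endif

spinlock_t pagecache_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
cache_line_pad;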

Alan

2001-10-29 17:28:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: Pls apply this spinlock patch to the kernel


On Mon, 29 Oct 2001, Juergen Doelle wrote:
>
> I had created a patch for improving the spinlock behavior on IA32 SMP
> systems for file system work load created with dbench
> (ftp://samba.org/pub/tridge/dbench). The work load is a mix of create,
> delete, write, and read operations executed from a scalable number of
> clients. It is mainly handled in buffer cache.

Fair enough. However, I wonder why you didn't just use the existing
(unaligned) types and then, on a lock-by-lock basis, just mark them
aligned. That implies no code changes.

Something like this should do it:

.. regular "spinlock_t" type

#define cachealign \
__attribute__((section("aligned"),__aligned__(SMP_CACHELINE_SIZE)))

(use a separate section so that subsequent data structures are also
guaranteed to be aligned - otherwise you might get false sharing from
non-aligned data structures that follow this one).

Eh?

Yes, we already try to do something like this, but due to the false
sharing with other stuff it doesn't _guarantee_ an exclusive cacheline.
Sometimes that is what you want (ie once you get the lock, it _can_ be
advantageous to have the hottest data structure associated with the lock
be in the same cacheline).
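Concretely, something like this (sketch only -- SMP_CACHE_BYTES standing in
for the line size, a ".data.cacheline_aligned"-style section name assumed,
and the linker script has to align that section as well):

/* Sketch: the lock keeps its plain spinlock_t type, but this particular
 * definition is cacheline aligned and placed in a section that contains
 * only cacheline-aligned objects, so nothing else can share its line. */
#include <linux/cache.h>
#include <linux/spinlock.h>

#define cachealign \
	__attribute__((__section__(".data.cacheline_aligned"), \
		       __aligned__(SMP_CACHE_BYTES)))

spinlock_t pagecache_lock cachealign = SPIN_LOCK_UNLOCKED;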

Linus

2001-10-29 17:43:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: Pls apply this spinlock patch to the kernel


On Mon, 29 Oct 2001, Alan Cox wrote:
>
> So why not just add a macro for aligning then do
>
> spinlock_t pagecache_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
> cache_line_pad;
>
> where cache_line_pad is an asm(".align") - I would assume that is
> sufficient - Linus ?

Gcc won't guarantee that it puts different variables adjacently - the
linker (or even the compiler) can move things around to make them fit
better. Which is why it would be better to use the separate section trick.

(The union trick Juergen uses obviously also works around this, but
requires that the _users_ be aware of what kind of lock it is, which I
don't particularly like - then you can't just change the lock, you have to
change the users too).

Linus

2001-10-30 13:19:09

by Anton Blanchard

[permalink] [raw]
Subject: Re: Pls apply this spinlock patch to the kernel


Hi,

> spinlock_t pagecache_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
> cache_line_pad;
>
> where cache_line_pad is an asm(".align") - I would assume that is
> sufficient - Linus ?

In include/asm-ppc64/cache.h I ended up redefining ____cacheline_aligned_in_smp
to avoid the problem of things sharing a cacheline:

#ifdef CONFIG_SMP
#define ____cacheline_aligned_in_smp __cacheline_aligned
#else
#define ____cacheline_aligned_in_smp
#endif /* CONFIG_SMP */

so that these things end up in the cacheline_aligned section and no variables
share a cacheline. Maybe there are some places where we want to group
variables in the same cacheline, but for consistency I think we should
have:

__cacheline_aligned{,_in_smp}
	variable goes into cacheline_aligned section

____cacheline_aligned{,_in_smp}
	just align to a cacheline
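One possible spelling of that convention (sketch only; the section name and
SMP_CACHE_BYTES are assumed from <linux/cache.h>):

/* ____cacheline_aligned: alignment only.
 * __cacheline_aligned:   alignment plus the dedicated section.
 * The _in_smp variants collapse to nothing on UP kernels. */
#define ____cacheline_aligned	__attribute__((__aligned__(SMP_CACHE_BYTES)))

#define __cacheline_aligned					\
	__attribute__((__aligned__(SMP_CACHE_BYTES),		\
		       __section__(".data.cacheline_aligned")))

#ifdef CONFIG_SMP
#define __cacheline_aligned_in_smp	__cacheline_aligned
#define ____cacheline_aligned_in_smp	____cacheline_aligned
#else
#define __cacheline_aligned_in_smp
#define ____cacheline_aligned_in_smp
#endif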

Anton

2001-11-03 19:56:32

by Richard Henderson

[permalink] [raw]
Subject: Re: Pls apply this spinlock patch to the kernel

On Mon, Oct 29, 2001 at 09:32:52AM -0800, Linus Torvalds wrote:
> On Mon, 29 Oct 2001, Alan Cox wrote:
> > spinlock_t pagecache_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
> > cache_line_pad;
> >
> > where cache_line_pad is an asm(".align") - I would assume that is
> > sufficient - Linus ?

The "cache_line_pad" is useless. The __attribute__((aligned(N)))
is completely sufficient.

> Gcc won't guarantee that it puts different variables adjacently - the
> linker (or even the compiler) can move things around to make them fit
> better. Which is why it would be better to use the separate section trick.

Separate sections are also not needed. While you can't guarantee
adjacency, the object file *does* record the required alignment
and that must be honored by the linker.

Now, separate sections do make sense for minimizing accumulated
padding, but that is a separate issue.


r~

2001-11-03 20:24:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: Pls apply this spinlock patch to the kernel


On Sat, 3 Nov 2001, Richard Henderson wrote:
>
> The "cache_line_pad" is useless. The __attribute__((aligned(N)))
> is completely sufficient.

I think you missed the important part: there must be no false sharing with
ANYTHING ELSE.

If you have a 4-byte entry that is aligned to 128 bytes, you have 124
bytes of stuff that the linker _will_ fill up with other things.

And if you don't want false sharing, that MUST NOT HAPPEN.

Try it. You'll see.

> Separate sections are also not needed. While you can't guarantee
> adjacency, the object file *does* record the required alignment
> and that must be honored by the linker.

It's not just alignment: it wants an exclusive cacheline. Thus the
padding.

And I'm claiming, based on past experiences with the linker, that the
padding won't guarantee anything, because the linker can re-order things
to "pack" them tighter. So the padding either has to be inside a structure
or a union (which implies a new type, and thus that the users care about
whether the spinlock is padded or not), or it needs a separate section, so
that it doesn't _matter_ if the linker re-orders anything, because
everything in that section is aligned, and as such you cannot get false
sharing even with reordering.

Linus

2001-11-03 21:02:27

by Richard Henderson

[permalink] [raw]
Subject: Re: Pls apply this spinlock patch to the kernel

On Sat, Nov 03, 2001 at 12:20:53PM -0800, Linus Torvalds wrote:
> If you have a 4-byte entry that is aligned to 128 bytes, you have 124
> bytes of stuff that the linker _will_ fill up with other things.

If you put the alignment on the type, not the variable, e.g.

typedef int aligned_int __attribute__((aligned(128)));
aligned_int foo;

then sizeof(foo) == 128, and the linker sees a 128-byte object,
not a 4-byte object with 128-byte alignment.

It's a subtle difference between alignment of types and alignment
of variables, but it makes sense if you think about it.
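A tiny standalone test (illustration only, not from the thread) for
comparing the two on whatever gcc is at hand:

/* Compare attaching the alignment to the type vs. to the variable.
 * What sizeof reports for the aligned typedef depends on the compiler
 * version; __alignof__ shows the requested alignment either way. */
#include <stdio.h>

typedef int aligned_int __attribute__((aligned(128)));

aligned_int type_aligned;				/* alignment on the type */
int var_aligned __attribute__((aligned(128)));		/* alignment on the variable */

int main(void)
{
	printf("aligned_int:  sizeof=%u  alignof=%u\n",
	       (unsigned) sizeof(aligned_int), (unsigned) __alignof__(aligned_int));
	printf("var_aligned:  sizeof=%u  alignof=%u\n",
	       (unsigned) sizeof(var_aligned), (unsigned) __alignof__(var_aligned));
	return 0;
}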


r~

2001-11-04 00:46:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: Pls apply this spinlock patch to the kernel


On Sat, 3 Nov 2001, Richard Henderson wrote:
>
> On Sat, Nov 03, 2001 at 12:20:53PM -0800, Linus Torvalds wrote:
> > If you have a 4-byte entry that is aligned to 128 bytes, you have 124
> > bytes of stuff that the linker _will_ fill up with other things.
>
> If you put the alignment on the type, not the variable, e.g.

That doesn't work.

The whole _point_ here is to make the thing variable-specific, so that we
can say _this_ spinlock needs a cache-line of its own, without blowing up
all spinlocks to 128 bytes.

There's no way we want 128-byte spinlocks in general. We want to mark 4-5
spinlocks as being so critical that they can have a cacheline of their
own. But I do _not_ want to have special operations for those spinlocks,
so I don't want those spinlocks to have any special types (ie the actual
_user_ should not need to care, so that you can play around with testing
the different spinlocks one by one without having to edit all the users of
any specific spinlock).

Linus