folks, hi,
i've hit an unusual situation where i'm responsible for evaluating and
setting the general specification of a new multi-core processor, but
it's based around a RISC core that cannot be significantly changed
without going through some very expensive verification procedures. in
the discussions, the engineer responsible for it said that modifying
the cache is prohibitively expensive and time-consuming, but that one
possible workaround would be to have a hardware mechanism that detects
cache-write conflicts and generates a software interrupt, whose handler
would then simply run some assembly code to flush the 1st-level cache
line.
the hardware detection mechanism could be tacked on, would be very
quick and easy to implement, and would generate interrupts to the
specific processor whose data required flushing.
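to give an idea of what i mean, here is roughly the kind of handler
i'm imagining - a sketch only, in C, with a made-up register address
and a stubbed-out flush primitive, since i'm deliberately not naming
the actual core or ISA here:

    /* rough sketch of the proposed conflict-interrupt handler.  the
     * register address and the flush primitive are entirely
     * hypothetical - substitute whatever the real core provides. */
    #define CONFLICT_ADDR_REG ((volatile unsigned long *)0xfff00040UL)

    static inline void flush_l1_dcache_line(unsigned long addr)
    {
        /* on the real core this would be a single cache-control
         * instruction; stubbed out here */
        (void)addr;
    }

    void cache_conflict_irq_handler(void)
    {
        /* the hardware latches the address whose cache line conflicted */
        unsigned long addr = *CONFLICT_ADDR_REG;

        flush_l1_dcache_line(addr & ~7UL);  /* 8-byte cache lines */
    }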
now, whilst it tickles my hardware hacker fancy like anything, because
i feel that this could be used for many other purposes such as
implementing spin-locks, i have some concerns about the performance
implications, and i'm not qualified or experienced enough to say one
way or the other whether it's a stonking good idea or just outright mad.
so, bearing in mind that sensible answers will likely result in offers
of a consulting contract to actually *implement* the software /
assembly code for the linux kernel modifications required (yes, linux
is already available for this RISC processor type - but only in
single-core), i would greatly appreciate some help in getting answers
to these questions:
* is this even a good idea? does it "fly"?
* if it does work, at what point does the number of cores involved just
make it... completely impractical? over 2? over 4? 8? 16?
* i believe the cache lines in the 1st level data cache are 8 bytes
(and the AMBA / AXI bus on each is 64-bit wide) - is that reasonable?
* does anyone know of any other processors that have actually
implemented software-driven cache coherency, esp. ones with the linux
kernel running on them, and if so, how do they do?
considerate and informative answers would be much appreciated - i
must apologise that i will be immediately unsubscribing from the
linux-kernel list and re-subscribing in the near future, but will be
watching responses via the web-based list archives: the volume of
messages on lkml is too high to do otherwise. also, for those of you
who remember it: whilst it was fun in a scary kind of way, it would be
nice if this didn't turn into the free-for-all whopper-thread that
occurred back in 2005 or so - this multi-core processor is going to be
based around an existing proven 20-year-old well-established RISC core
that has been running linux for over a decade, it just has never been
put into an SMP arrangement before and we're on rather short
timescales to get it done.
l.
On Fri, Mar 25, 2011 at 9:52 PM, Luke Kenneth Casson Leighton
<[email protected]> wrote:
> so, bearing in mind that sensible answers will likely result in offers
> of a consulting contract to actually *implement* the software /
> assembly code for the linux kernel modifications required (yes, linux
> is already available for this RISC processor type - but only in
> single-core), i would greatly appreciate some help in getting answers
> to these questions:
>
> * is this even a good idea? does it "fly"?
Probably not. Is it a virtual or physical indexed cache? Do you have a
precise workload in mind? If you have a very precise workload and you
don't expect to get many write conflicts then it could be made to
work.
> * if it does work, at what point does the number of cores involved just
> make it... completely impractical? over 2? over 4? 8? 16?
You would have to simulate it with your workload to know the answer to
that, but if you're pushing to a higher number of cores I think it would
pay to do this properly.
> occurred back in 2005 or so - this multi-core processor is going to be
> based around an existing proven 20-year-old well-established RISC core
> that has been running linux for over a decade, it just has never been
> put into an SMP arrangement before and we're on rather short
> timescales to get it done.
There are a number of mature cores out there that can do this already
and can be bought off the shelf, I wouldn't underestimate the
difficulty of getting your cache coherency protocol right particularly
on a limited time/resource budget.
> Probably not. Is it a virtual or physical indexed cache? Do you have a
> precise workload in mind? If you have a very precise workload and you
> don't expect to get many write conflicts then it could be made to
> work.
I'm unconvinced. The user space isn't the hard bit - little user memory
is shared writable. The kernel data structures, on the other hand,
especially in the RCU realm, are going to be interesting.
> There are a number of mature cores out there that can do this already
> and can be bought off the shelf, I wouldn't underestimate the
> difficulty of getting your cache coherency protocol right particularly
> on a limited time/resource budget.
Architecturally you may want to look at running one kernel per device
(remembering that you can share the non writable kernel pages between
different instances a bit if you are careful) - and in theory certain
remote mappings.
Basically it would become a cluster with a very very fast "page transfer"
operation for moving data between nodes.
On Sat, Mar 26, 2011 at 12:08:47PM +0000, Alan Cox wrote:
> > Probably not. Is it a virtual or physical indexed cache? Do you have a
> > precise workload in mind? If you have a very precise workload and you
> > don't expect to get many write conflicts then it could be made to
> > work.
>
> I'm unconvinced. The user space isn't the hard bit - little user memory
> is shared writable. The kernel data structures, on the other hand,
> especially in the RCU realm, are going to be interesting.
Indeed. One approach is to flush the caches on each rcu_dereference().
Of course, this assumes that the updaters flush their caches on each
smp_wmb(). You probably also need to make ACCESS_ONCE() flush caches
(which would automatically take care of rcu_dereference()). So it
might work, but it won't be fast.
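As a very rough sketch of the shape of that (all names invented here,
not taken from any existing tree), with flush_local_dcache() standing
in for whatever writeback/invalidate primitive the hardware provides:

    /* sketch only: flush-augmented ordering primitives for a
     * non-cache-coherent SMP build */
    static inline void flush_local_dcache(void)
    {
        /* platform-specific writeback + invalidate of the local L1;
         * the empty asm is merely a compiler-barrier placeholder */
        __asm__ __volatile__("" ::: "memory");
    }

    /* updater side: push prior stores out to memory before publishing */
    #define nc_smp_wmb() \
        do { flush_local_dcache(); __sync_synchronize(); } while (0)

    /* reader side: discard any stale cached copy before the load */
    #define nc_read_once(x) \
        ({ flush_local_dcache(); *(volatile __typeof__(x) *)&(x); })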
You can of course expect a lot of odd bugs in taking this approach.
The assumption of cache coherence is baked pretty deeply into most
shared-memory parallel software. As you might have heard in the 2005
discussion. ;-)
> > There are a number of mature cores out there that can do this already
> > and can be bought off the shelf, I wouldn't underestimate the
> > difficulty of getting your cache coherency protocol right particularly
> > on a limited time/resource budget.
>
> Architecturally you may want to look at running one kernel per device
> (remembering that you can share the non writable kernel pages between
> different instances a bit if you are careful) - and in theory certain
> remote mappings.
>
> Basically it would become a cluster with a very very fast "page transfer"
> operation for moving data between nodes.
This works for applications coded specially for this platform, but unless
I am missing something, not for existing pthreads applications. Might
be able to handle things like Erlang that do parallelism without shared
memory.
Thanx, Paul
On Mon, Mar 28, 2011 at 7:06 PM, Paul E. McKenney
<[email protected]> wrote:
>> Basically it would become a cluster with a very very fast "page transfer"
>> operation for moving data between nodes.
>
> This works for applications coded specially for this platform, but unless
> I am missing something, not for existing pthreads applications. Might
> be able to handle things like Erlang that do parallelism without shared
> memory.
ok - well, having thought about this a little bit (in a non-detailed
high-level way) i was sort-of hoping, as alan hinted at, to still do
SMP, even if it's slow, for userspace. the primary thing is to
prevent kernelspace data structures from conflicting.
i found kerrigan, btw, spoke to the people on it: louis agreed that
the whole idea was mad as hell and was therefore actually very
interesting to attempt :)
as a first approximation i'm absolutely happy for existing pthreads
applications to be forced to run on the same core.
l.
On Mon, Mar 28, 2011 at 07:48:34PM +0100, Luke Kenneth Casson Leighton wrote:
> On Mon, Mar 28, 2011 at 7:06 PM, Paul E. McKenney
> <[email protected]> wrote:
> >> Basically it would become a cluster with a very very fast "page transfer"
> >> operation for moving data between nodes.
> >
> > This works for applications coded specially for this platform, but unless
> > I am missing something, not for existing pthreads applications. Might
> > be able to handle things like Erlang that do parallelism without shared
> > memory.
>
> ok - well, having thought about this a little bit (in a non-detailed
> high-level way) i was sort-of hoping, as alan hinted at, to still do
> SMP, even if it's slow, for userspace. the primary thing is to
> prevent kernelspace data structures from conflicting.
>
> i found kerrigan, btw, spoke to the people on it: louis agreed that
> the whole idea was mad as hell and was therefore actually very
> interesting to attempt :)
What was that old Chinese curse? "May you live in interesting times"
or something like that? ;-)
I suspect that you can get something that runs suboptimally but mostly
works. Getting something that really works likely requires that the
hardware support cache coherence.
> as a first approximation i'm absolutely happy for existing pthreads
> applications to be forced to run on the same core.
In a past life, we forced any given pthread process to have all of
its threads confined to a single NUMA node, so I guess that there is
precedent. The next step is figuring out how to identify apps that
use things like mmap() to share memory among otherwise unrelated
processes, and working out what to do with them.
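For what it is worth, the "same core" approximation needs no new
mechanism at all: a launcher can pin itself before exec()ing the
application, and any threads created afterwards inherit the mask.
A minimal sketch using the standard Linux affinity call (nothing here
is specific to the platform under discussion):

    /* pin the calling process to CPU 0, then exec the real program;
     * threads and children created afterwards inherit the mask */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        cpu_set_t set;

        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
    }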
Thanx, Paul
> ok - well, having thought about this a little bit (in a non-detailed
> high-level way) i was sort-of hoping, as alan hinted at, to still do
> SMP, even if it's slow, for userspace. the primary thing is to
> prevent kernelspace data structures from conflicting.
>
> i found kerrigan, btw, spoke to the people on it: louis agreed that
> the whole idea was mad as hell and was therefore actually very
> interesting to attempt :)
>
> as a first approximation i'm absolutely happy for existing pthreads
> applications to be forced to run on the same core.
The underlying problem across a cluster of nodes can be handled
transparently. MOSIX solved that problem a very long time ago using DSM
(distributed shared memory). It's not pretty, it requires a lot of tuning
to make it fly but they did it over comparatively slow interconnects.
Alan
On Mon, Mar 28, 2011 at 11:18 PM, Alan Cox <[email protected]> wrote:
>> ok - well, having thought about this a little bit (in a non-detailed
>> high-level way) i was sort-of hoping, as alan hinted at, to still do
>> SMP, even if it's slow, for userspace. the primary thing is to
>> prevent kernelspace data structures from conflicting.
>>
>> i found kerrigan, btw, spoke to the people on it: louis agreed that
>> the whole idea was mad as hell and was therefore actually very
>> interesting to attempt :)
>>
>> as a first approximation i'm absolutely happy for existing pthreads
>> applications to be forced to run on the same core.
>
> The underlying problem across a cluster of nodes can be handled
> transparently. MOSIX solved that problem a very long time ago using DSM
> (distributed shared memory). It's not pretty, it requires a lot of tuning
> to make it fly but they did it over comparatively slow interconnects.
hmmm, the question is, therefore: would the MOSIX DSM solution (which
i presume assumes that memory cannot be shared at all) be preferable
to a situation where you *could* at least get cache coherency in
userspace, if you're happy to tolerate a software interrupt handler
flushing the cache line manually?
it had occurred to me, btw, that it would be good to have separate
interrupts for userspace and kernelspace. kernelspace would have a
"serious problem occurred!" interrupt handler and userspace would have
the horribly-slow-but-tolerable cache-flush assembly code.
is that preferable over - faster than - the MOSIX DSM solution, do you think?
l.
p.s. alan, i am not ignoring what you wrote, it's just that if this goes
ahead, it has to be done _quickly_ and without requiring
re-verification of large VHDL macro blocks. of the two companies
whose cores are under consideration, neither of them has done SMP
variants (yet) and we haven't got the time to wait around whilst they
get it done. so these beautiful and hilarious hacks, which can be
tacked onto the outside, are what we have to live with for at least of
the order of 18 months - long enough to get a successful saleable core
out that pays for the work to be done proppa :)
On Tue, Mar 29, 2011 at 12:38 AM, Luke Kenneth Casson Leighton
<[email protected]> wrote:
> p.s. alan, i am not ignoring what you wrote, it's just that if this goes
alan? beh?? paul. sorry :)
On Tue, Mar 29, 2011 at 12:39:30AM +0100, Luke Kenneth Casson Leighton wrote:
> On Tue, Mar 29, 2011 at 12:38 AM, Luke Kenneth Casson Leighton
> <[email protected]> wrote:
>
> > p.s. alan, i am not ignoring what you wrote, it's just that if this goes
>
> alan? beh?? paul. sorry :)
;-)
FWIW, I believe that the http://www.scalemp.com/ folks do something
similar to what Alan suggests in order to glue multiple x86 systems into
one SMP system from the viewpoint of user applications.
Thanx, Paul
> hmmm, the question is, therefore: would the MOSIX DSM solution (which
> i presume assumes that memory cannot be shared at all) be preferable
> to a situation where you *could* at least get cache coherency in
> userspace, if you're happy to tolerate a software interrupt handler
> flushing the cache line manually?
In theory DSM goes further than this. One way to think about DSM is cache
coherency in software with page-size granularity. So you could imagine
a hypothetical example where the physical MMU of each node and a memory
manager layer communicating between them implemented a virtualised
machine on top which was cache coherent.
The detail (and devil no doubt) is in the performance.
Basically, however, providing your MMU can trap both reads and writes,
you can implement a MESI cache in software. MOSIX just took this to an
extreme as part of a distributed Unix (originally V7-based).
So you've got
Modified:  page on one node, MMU set to fault on any other node so you
           can fix it up
Exclusive: page on one node, MMU set to fault on any other node or on
           writes by self (latter taking you to modified so you know to
           write back)
Shared:    any write set to be caught by the MMU, the fun bit then is
           handling invalidation across other nodes with the page in
           cache (and the fact multiple nodes may fault the page at once)
Invalid:   our copy is invalid (it's M or E elsewhere probably), MMU set
           so we fault on any access. For shared this is also relevant
           so you can track for faster invalidates
and the rest is a software problem.
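A compilable toy sketch of the shape of that software problem - every
helper here is a stub standing in for whatever messaging, MMU and
cache-flush facilities the real platform provides, and the locking and
ack collection are waved away entirely:

    #include <stdio.h>

    enum page_state { P_MODIFIED, P_EXCLUSIVE, P_SHARED, P_INVALID };

    struct dsm_page {
        enum page_state state;
        unsigned long pfn;          /* page frame number */
    };

    /* hypothetical platform hooks, stubbed out for illustration */
    static void fetch_shared_copy(unsigned long pfn)
    { printf("ask the owner of pfn %lu to write back and share\n", pfn); }
    static void invalidate_other_copies(unsigned long pfn)
    { printf("tell other holders of pfn %lu to unmap and flush\n", pfn); }
    static void mmu_set_readonly(unsigned long pfn)
    { printf("map pfn %lu read-only (writes will trap)\n", pfn); }
    static void mmu_set_readwrite(unsigned long pfn)
    { printf("map pfn %lu read-write\n", pfn); }

    /* read fault on a page we hold Invalid: take a Shared copy */
    static void dsm_read_fault(struct dsm_page *pg)
    {
        fetch_shared_copy(pg->pfn);
        pg->state = P_SHARED;
        mmu_set_readonly(pg->pfn);
    }

    /* write fault (Invalid or Shared): invalidate elsewhere, go Modified */
    static void dsm_write_fault(struct dsm_page *pg)
    {
        invalidate_other_copies(pg->pfn);
        pg->state = P_MODIFIED;
        mmu_set_readwrite(pg->pfn);
    }

    int main(void)
    {
        struct dsm_page pg = { P_INVALID, 42 };

        dsm_read_fault(&pg);    /* simulated read miss  -> Shared   */
        dsm_write_fault(&pg);   /* simulated write miss -> Modified */
        return 0;
    }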
alan, paul, will, apologies for not responding sooner, i've just moved
to near stranraer, in scotland. och-aye. the removal lorry has been
rescued from the mud by an 18 tonne tractor and we have not run over
any sheep. yet.
On Tue, Mar 29, 2011 at 10:16 AM, Alan Cox <[email protected]> wrote:
>> hmmm, the question is, therefore: would the MOSIX DSM solution (which
>> i presume assumes that memory cannot be shared at all) be preferable
>> to a situation where you *could* at least get cache coherency in
>> userspace, if you're happy to tolerate a software interrupt handler
>> flushing the cache line manually?
>
> In theory DSM goes further than this. One way to think about DSM is cache
> coherency in software with page-size granularity. So you could imagine
> a hypothetical example where the physical MMU of each node and a memory
> manager layer communicating between them implemented a virtualised
> machine on top which was cache coherent.
> [...details of M.E.S.I ... ]
well... the thing is that there already exists an MMU per core.
standard page-faults occur, etc. in this instance (i think!), just as
would occur in any much more standard SMP architecture (with normal
hardware-based 1st-level cache coherency).
hm - does this statement sound reasonable: this is sort-of a second
tier of MMU principles, with a page-size granularity of 8 bytes (!)
and of the order of 4096 or 8192 such "pages" (32 or 64k or whatever
of 1st-level cache). thus, the principles you're describing [M.E.S.I]
could be applied, even at that rather small level of granularity.
or... wait... "invalid" is taken care of at a hardware level, isn't
it? [this is 1st level cache]
much appreciated the thoughts and discussion so far.
l.
On Thu, Apr 07, 2011 at 01:09:29PM +0100, Luke Kenneth Casson Leighton wrote:
> alan, paul, will, apologies for not responding sooner, i've just moved
> to near stranraer, in scotland. och-aye. the removal lorry has been
> rescued from the mud by an 18 tonne tractor and we have not run over
> any sheep. yet.
>
> On Tue, Mar 29, 2011 at 10:16 AM, Alan Cox <[email protected]> wrote:
> >> hmmm, the question is, therefore: would the MOSIX DSM solution (which
> >> i presume assumes that memory cannot be shared at all) be preferable
> >> to a situation where you *could* at least get cache coherency in
> >> userspace, if you're happy to tolerate a software interrupt handler
> >> flushing the cache line manually?
> >
> > In theory DSM goes further than this. One way to think about DSM is cache
> > coherency in software with page-size granularity. So you could imagine
> > a hypothetical example where the physical MMU of each node and a memory
> > manager layer communicating between them implemented a virtualised
> > machine on top which was cache coherent.
>
> > [...details of M.E.S.I ... ]
>
> well... the thing is that there already exists an MMU per core.
> standard page-faults occur, etc. in this instance (i think!), just as
> would occur in any much more standard SMP architecture (with normal
> hardware-based 1st-level cache coherency).
>
> hm - does this statement sound reasonable: this is sort-of a second
> tier of MMU principles, with a page-size granularity of 8 bytes (!)
> and of the order of 4096 or 8192 such "pages" (32 or 64k or whatever
> of 1st-level cache). thus, the principles you're describing [M.E.S.I]
> could be applied, even at that rather small level of granularity.
If your MMU supports 8-byte pages, this could work. If you are trying
to leverage the hardware caches, then you really do need hardware cache
coherence. If there is no hardware cache coherence (which I believe
is the situation you are dealing with), then you need to implement
M.E.S.I. in software. In this case, the hardware caches are blissfully
unaware of the "invalid" state -- instead, one core takes a page fault,
communicates its need for that page to the core that has it in either
"modified" or "exclusive" state (or to all cores that have it in "shared"
state in the case of a write). The recipient core(s) flush that page's
memory, mark the page as "invalid" in its/their MMU(s), then respond to
the original core's message. Once the original core has received all
the acks, it can map the page "shared" (in the case of a read access)
or "modified" (in the case of a write access).
The "exclusive" state can be used if the original core sees that no
other core has that page mapped.
Of course, the shared state tracking what page is in what state on
what core must be updated carefully, with appropriate cache flushing,
atomic operations (if available), and memory barriers.
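Something like the following is the level of care needed every time a
directory entry is touched (all names invented; flush_dcache_range()
stands in for the platform's writeback/invalidate routine, and the
global lock protecting the entry is assumed to be taken elsewhere):

    #include <stddef.h>

    struct page_dir_entry {
        int owner;              /* core holding the page M/E, or -1 */
        unsigned int sharers;   /* bitmask of cores with Shared copies */
    };

    static void flush_dcache_range(void *p, size_t len)
    {
        (void)p; (void)len;     /* platform-specific writeback+invalidate */
    }

    /* claim a page Modified for core "me"; returns how many other
     * copies must be invalidated (i.e. how many acks to wait for) */
    static int claim_page_modified(struct page_dir_entry *e, int me)
    {
        int acks_needed;

        flush_dcache_range(e, sizeof(*e));  /* don't read a stale copy */
        acks_needed = __builtin_popcount(e->sharers);

        e->owner = me;
        e->sharers = 0;

        flush_dcache_range(e, sizeof(*e));  /* push the update to memory */
        __sync_synchronize();               /* order it before the requests */

        return acks_needed;
    }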
> or... wait... "invalid" is taken care of at a hardware level, isn't
> it? [this is 1st level cache]
No. The only situation in which "invalid" is taken care of at the
hardware level (by the 1st level cache) is when the hardware implements
cache coherence, and you have stated that your hardware does not implement
cache coherence.
Now, the DSM approach that Alan suggested -does- in fact handle
"invalid" in hardware, but it is the MMU rather than the caches that
does the handling.
There are a number of DSM projects out there. The wikipedia article
lists several of them:
http://en.wikipedia.org/wiki/Distributed_shared_memory
Of course, one of the problems with DSM is that the cache-miss penalties
are quite high. After all, you must take a page fault, then communicate
to one (perhaps many) other cores, which must update their MMUs, flush
TLBs, and so on. But then again, that is why hardware cache coherence
exists and why DSM has not taken over the world.
But given the hardware you are expecting to work with, if you want
reliable operation, I don't see much alternative. And DSM can actually
perform very well, as long as your workload doesn't involve too much
high-frequency data sharing among the cores.
Thanx, Paul
> much appreciated the thoughts and discussion so far.
>
> l.