2003-09-09 17:27:40

by Luca Veraldi

Subject: Efficient IPC mechanism on Linux

Hi all.
At the web page
http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
You can find the results of my attempt at modifying the Linux kernel sources
to implement a new Inter Process Communication mechanism.

It is called ECBM for Efficient Capability-Based Messaging.

There you can also find a comparison of ECBM
against some other commonly used Linux IPC primitives
(such as read/write on pipes or SYS V tools).

The results are quite clear.

Enjoy.
Luca Veraldi


----------------------------------------
Luca Veraldi

Graduate Student of Computer Science
at the University of Pisa

[email protected]
[email protected]
ICQ# 115368178
----------------------------------------


2003-09-09 18:55:00

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

You're right. When I have a bit of time, I'll translate the page.

The kernel version is not so important.
The code is highly portable and depends very little on the internals of the kernel.
It only needs a few functions that still survive in newer versions of
Linux.
I used 2.2.4 only because it was ready to use on my machine.

The main purpose of the article was not to present a production-quality
mechanism that you can download and use in your own applications.

It is only the proof of a fact: Linux IPC mechanisms
are ***structurally*** inefficient.

Bye.
Luca


2003-09-09 21:19:25

by Alan

Subject: Re: Efficient IPC mechanism on Linux

On Maw, 2003-09-09 at 18:30, Luca Veraldi wrote:
> Hi all.
> At the web page
> http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
> You can find the results of my attempt in modifing the linux kernel sources
> to implement a new Inter Process Communication mechanism.
>
> It is called ECBM for Efficient Capability-Based Messaging.
>
> In the reading You can also find the comparison of ECBM
> against some other commonly-used Linux IPC primitives
> (such as read/write on pipes or SYS V tools).

The text is a little hard to follow for an English speaker. I can follow
a fair bit and there are a few things I'd take issue with (forgive me if
I'm misunderstanding your material)


1. Almost all IPC is < 512 bytes long. There are exceptions including
X11 image passing which is an important performance one

2. One of the constraints that a message system has is that if it uses
shared memory it must protect the memory queues from abuse. For example
if previous/next pointers are kept in the shared memory one app may be
able to trick another into modifying the wrong thing. Sometimes this
matters.

3. An interesting question exists as to whether you can create the
same effect with current 2.6 kernel primitives. We have posix shared
memory objects, we have read only mappings and we have extremely fast
task switch and also locks (futex locks)

Also, from the benchmarking point of view it is always instructive to run a
set of benchmarks that involve the sender writing all the data bytes,
sending them, and the receiver reading them all. Without this, zero-copy
mechanisms (especially MMU-based ones) look artificially good.


2003-09-09 21:53:28

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> The text is a little hard to follow for an English speaker. I can follow
> a fair bit and there are a few things I'd take issue with (forgive me if
> I'm misunderstanding your material)

I know. I'm very sorry about this. I'm rewriting the text in English,
so don't bother with the Italian. The new page will be up shortly.

> 1. Almost all IPC is < 512 bytes long. There are exceptions including
> X11 image passing which is an important performance one

Sure, for system needs. But what about application programming?
You can write applications of any complexity.
Why should I be limited by the kernel to sending 4056 bytes
(for example, if I use SYS V message queues)
instead of an arbitrarily sized message?
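
(A minimal sketch of the limit being complained about - mine, not from the
article: msgsnd() refuses any message larger than the kernel's per-message
limit, MSGMAX - the 4056 bytes quoted above on the kernel in question, a
sysctl on later kernels - no matter how the application is structured.)

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct big_msg {
    long mtype;
    char mtext[64 * 1024];          /* deliberately larger than MSGMAX */
};

int main(void)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid < 0) { perror("msgget"); return 1; }

    static struct big_msg m = { .mtype = 1 };
    memset(m.mtext, 'x', sizeof(m.mtext));

    /* fails with EINVAL: the message exceeds the per-message limit */
    if (msgsnd(qid, &m, sizeof(m.mtext), 0) < 0)
        perror("msgsnd (message larger than MSGMAX)");

    msgctl(qid, IPC_RMID, NULL);    /* clean up the queue */
    return 0;
}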

> 2. One of the constraints that a message system has is that if it uses
> shared memory it must protect the memory queues from abuse. For example
> if previous/next pointers are kept in the shared memory one app may be
> able to trick another into modifying the wrong thing. Sometimes this
> matters.

Please think in terms of address spaces.
If the piece of information you want to access is not
in your own private address space, the only thing that will happen
is a general protection fault, and your process will be killed.
Remember that, with my primitives, you're working with logical addresses
that are relocated in hardware. The MMU performs
memory protection at the same time, too.

> 3. An interesting question exists as to how whether you can create the
> same effect with current 2.6 kernel primitives. We have posix shared
> memory objects, we have read only mappings and we have extremely fast
> task switch and also locks (futex locks)

I also need to use locks in my primitives. The fact that the new kernel has
faster locks also reduces the completion time of my primitives.

Also, the central point is not to have 10 instead of 50 assembler lines
in the primitives. The central point is to implement communication
primitives
that do not require physical copying of the messages being sent and
received.
Faster locking probably reduces the time to write to or read from a pipe,
but the order of magnitude remains the same.

Task switching is relevant to all communication primitives.
A fast task switch means proportionally fast primitives, including mine.

> Also from the benching point of view it is always instructive to run a
> set of benchmarks that involve the sender writing all the data bytes,
> sending it and the receiver reading them all. Without this zero copy
> mechanisms (especially mmu based ones) look artificially good.

The benchmark you are talking about is exactly the one I implemented to
measure the IPC overhead.
We have two processes communicating over a channel in a pipeline fashion:
A writes some information into a buffer and sends it to B;
B receives and reads it.
This is repeated 1000 times.
The times reported are averages.
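
(A rough sketch of that kind of benchmark, reconstructed by me rather than
taken from the article: the sender pushes an N-byte buffer through a pipe
1000 times and the average cost per send is printed. As Alan notes, a
fuller test would also time the receiver touching every byte.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define ITERS    1000
#define MSG_SIZE 4096                     /* message size under test */

int main(void)
{
    int p[2];
    static char buf[MSG_SIZE];
    if (pipe(p) < 0) { perror("pipe"); return 1; }

    if (fork() == 0) {                    /* B: receives and reads */
        close(p[1]);
        for (int i = 0; i < ITERS; i++)
            for (int got = 0, n; got < MSG_SIZE; got += n)
                if ((n = read(p[0], buf + got, MSG_SIZE - got)) <= 0)
                    exit(1);
        exit(0);
    }

    close(p[0]);                          /* A: writes and sends */
    memset(buf, 'a', sizeof(buf));

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++)
        if (write(p[1], buf, MSG_SIZE) != MSG_SIZE) { perror("write"); return 1; }
    gettimeofday(&t1, NULL);

    close(p[1]);
    wait(NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%d bytes: %.2f us per send (average over %d runs)\n",
           MSG_SIZE, usec / ITERS, ITERS);
    return 0;
}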

Bye,
Luca

2003-09-09 22:11:43

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

Ok. The English version of the document is online, now.
I've translated the main part of it.
The address is the same.

For English speakers:
http://web.tiscali.it/lucavera/www/root/ecbm/index.htm

If you want it in Italian:
http://web.tiscali.it/lucavera/www/root/ecbm/indexIT.htm

Sorry if you find some typos. Here in Italy it's midnight...

2003-09-09 23:13:16

by Alan

Subject: Re: Efficient IPC mechanism on Linux

On Maw, 2003-09-09 at 22:57, Luca Veraldi wrote:
> Also. The central point is not to have10 instead of 50 assembler lines
> in the primitives. The central point is to implement communication
> primitives
> that do not require physical copying of the messages being sent and
> received.

The question (for smaller messages) is whether this is a win or not.
While the data fits in L1 cache the copy is close to free compared
with context switching and TLB overhead

> We have 2 process communicating over a channel in a pipeline fashion.
> A writes some information in a buffer and sends it to B.
> B receives and reads.
> This for 1000 times.
> Times reported are average time.

Ok - from your example I couldn't tell if B then touches each byte of
data as well as having it appear in its address space.

2003-09-10 09:00:15

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> The question (for smaller messages) is whether this is a win or not.
> While the data fits in L1 cache the copy is close to free compared
> with context switching and TLB overhead

I can answer for messages of 512 bytes (which I consider small).
This is the smallest case considered in the work.

Completion times for this case are reported in the graphs:
you have the time of write() on a pipe,
of SYS V shmget(), and of the new primitives.
It's easy to compare the numbers and say whether it is "a win or not".

Surely, transferring bytes from one cache line to another is at zero cost
(that is, a single clock cycle of the cache unit).
But here, how can you mathematically guarantee that,
using a traditional IPC mechanism,
you'll find the data you want already in the cache?
This is quite far from being realistic.

In the real world, what happens is that copying a page of memory
causes many cache misses... at least, more than copying a capability
structure
(16 bytes long, as reported in the article).

Of course, if you want to send a single int,
you probably have no reason to use a capability.
That's sure.

> Ok - from your example I couldn't tell if B then touches each byte of
> data as well as having it appear in its address space.

It is irrelevant. The times reported in the work are the times needed
to complete the IPC primitives.
After a send or a receive over the channel,
the two processes may do whatever they want.
That is not of interest here.

Bye,
Luca

2003-09-10 09:14:38

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> I mean for benchmarks. You compared your implementation to a very old
> linux' one. There were big changes in these areas.

The overhead implied by a memcpy() is the same, in order of magnitude,
***whatever*** kernel version you develop.

If you have to send, let's say, 3 memory pages from A to B,
using pipes you have to ***physically copy***
2 * 3 * PAGE_SIZE bytes
(first when sending, then when receiving: so two copies),
which evaluates to 8192 * 3.

Using capabilities, the only thing you have to do
is copy an amount of bytes that depends linearly
on the number of pages: so, proportional to 3.

If you want the (fairly) exact amount, it is
3 * sizeof(unsigned long) + sizeof(capability)
which evaluates to 12 + 16 = 28.
PAGE_SIZE is not present in the equation.
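
(The arithmetic above, spelled out; the 16-byte capability size is the
figure quoted from the article, the rest is just the numbers from this
paragraph, assuming 4 KB pages and a 32-bit unsigned long:)

#include <stdio.h>

int main(void)
{
    unsigned long pages = 3, page_size = 4096;
    unsigned long cap_size = 16;     /* sizeof(capability), per the article */

    unsigned long pipe_bytes = 2 * pages * page_size;   /* copy in + copy out */
    unsigned long ecbm_bytes = pages * sizeof(unsigned long) + cap_size;

    /* prints 24576 vs 28 on a 32-bit box (ecbm_bytes is 40 on 64-bit) */
    printf("pipe: %lu bytes copied, capability: %lu bytes copied\n",
           pipe_bytes, ecbm_bytes);
    return 0;
}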

I compared pipes and SYS V IPC on Linux 2.2.4 with the new mechanism,
which was also developed on kernel 2.2.4!
The same inefficiencies of the kernel support you are talking about
affect all the primitives examined in the article, mine included.

So the relative results would be much the same.

> you surely know, that it is just an implementation. The mechanisms have
> always been there, evolved from UNIX.
> That is mainly the reason for them to exists: support the applications
> which use them.
>
> Will it be possible to base existing facilities on your approach?
> SVR5 messages (msg{get,snd,rcv}), for example?
> (I have to ask, because I cannot understand the article, sorry)

Ah, ok. So let's keep doing inefficient things
only because it has always been done that way!

Compatibility is not a problem. Simply rewrite write() and read() for
pipes
so that they do the same thing as zc_send() and zc_receive().
Or, if you are not referring to pipes, rewrite the support layer of your
ancient IPC primitives
so that they do the same thing as zc_send() and zc_receive().

Bye,
Luca.

2003-09-10 09:23:25

by Arjan van de Ven

Subject: Re: Efficient IPC mechanism on Linux

On Wed, 2003-09-10 at 11:18, Luca Veraldi wrote:

> The overhead implied by a memcpy() is the same, in the oder of magnitude,
> ***whatever*** kernel version you can develop.


yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
uncached case. A pagetable operation (like the cpu setting the accessed
or dirty bit) is in that same order I suspect (maybe half this, but not
a lot less). Changing pagetable content is even more because all the
tlb's and internal cpu state will need to be flushed... which is also a
microcode operation for the cpu. And it's deadly in an SMP environment.
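
(For anyone who wants to sanity-check numbers like these on their own x86
box, a crude user-space sketch of mine: it times single 4 KB memcpy()s with
rdtsc over a working set large enough that most pages have fallen out of the
small caches of that era. It is only indicative; cache state dominates,
which is exactly the point being argued here.)

#include <stdio.h>
#include <string.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#define PAGE  4096
#define SLOTS 1024                       /* 4 MB of source + 4 MB of destination */

static char src[SLOTS][PAGE], dst[SLOTS][PAGE];

int main(void)
{
    uint64_t best = ~0ull, total = 0;

    memset(src, 1, sizeof(src));         /* fault everything in up front... */
    memset(dst, 1, sizeof(dst));         /* ...so page faults are not timed */

    for (int i = 0; i < SLOTS; i++) {
        uint64_t t0 = rdtsc();
        memcpy(dst[i], src[i], PAGE);    /* each iteration copies a fresh page */
        uint64_t t1 = rdtsc();
        total += t1 - t0;
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("page copy: best %llu cycles, mean %llu cycles\n",
           (unsigned long long)best, (unsigned long long)(total / SLOTS));
    return 0;
}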



2003-09-10 09:36:24

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
> uncached case. A pagetable operation (like the cpu setting the accessed
> or dirty bit) is in that same order I suspect (maybe half this, but not
> a lot less).

Probably you don't know what you're talking about.
I don't know where you studied computer architecture, but...
let me answer.

To set the accessed or dirty bit you use

__asm__ __volatile__( LOCK_PREFIX
        "btsl %1,%0"
        :"=m" (ADDR)
        :"Ir" (nr));

which is a ***SINGLE CLOCK CYCLE*** of the CPU.
I really don't think that on any machine's firmware
a btsl will require 4000 cycles.
Not even on Intel x86.

> Changing pagetable content is even more because all the
> tlb's and internal cpu state will need to be flushed... which is also a
> microcode operation for the cpu.

Good. You will find the same overhead when accessing a message
after a read from a pipe: many TLB misses will occur.
And the same applies when copying the message into the pipe.
Many, many TLB misses.

> And it's deadly in an SMP environment.

You say "tlb's and internal cpu state will need to be flushed".
The other cpus in an SMP environment can continue to work, indipendently.
TLBs and cpu state registers are ***PER-CPU*** resorces.

Probably the case of copying a memory page is worse,
because you have to hold some global lock the whole time.
That is deadly in an SMP environment,
because the critical section lasts for thousands of cycles,
instead of just a few.

2003-09-10 09:44:20

by Arjan van de Ven

Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 11:40:33AM +0200, Luca Veraldi wrote:
> To set the accessed or dirty bit you use
>
> __asm__ __volatile__( LOCK_PREFIX
>         "btsl %1,%0"
>         :"=m" (ADDR)
>         :"Ir" (nr));
>
> which is a ***SINGLE CLOCK CYCLE*** of cpu.
> I don't think really that on any machine Firmware
> a btsl will require 4000 cycles.
> Neither on Intel x86.

For fun do the measurement on a pIV cpu. You'll be surprised.
The microcode "mark dirty" (which is NOT a btsl, it gets done when you do a write
memory access to the page content) result will be in the 2000 to 4000 range I
predict. There are things like SMP synchronisation to do, but also
if the cpu marks a page dirty in the page table, that means the page table
changes which means the pagetable needs to be marked in the
PMD. Which means the PMD changes, which means the PGD needs the PMD marked
dirty. Etc Etc. It's microcode. It'll take several 1000 cycles.



> > Changing pagetable content is even more because all the
> > tlb's and internal cpu state will need to be flushed... which is also a
> > microcode operation for the cpu.
>
> Good. The same overhead you will find accessing a message
> after a read form a pipe. There will occur many TLB faults.
> And the same apply copying the message to the pipe.
> Many many TLB faults.

A TLB fault in the normal case is about 7 cycles, but that's for a TLB
entry not being present. A TLB entry that IS present being written to
means going to microcode.

>
> > And it's deadly in an SMP environment.
>
> You say "tlb's and internal cpu state will need to be flushed".
> The other cpus in an SMP environment can continue to work, indipendently.
> TLBs and cpu state registers are ***PER-CPU*** resorces.

if you change a page table, you need to flush the TLB on all other cpus
that have that same page table mapped, like a thread app running
on all cpu's at once with the same pagetables.

> Probably, it is worse the case of copying a memory page,
> because you have to hold some global lock all the time.

why would you need a global lock for copying memory ?

2003-09-10 09:53:17

by Jamie Lokier

Subject: Re: Efficient IPC mechanism on Linux

Arjan van de Ven wrote:
> > The overhead implied by a memcpy() is the same, in the oder of magnitude,
> > ***whatever*** kernel version you can develop.
>
> yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
> uncached case. A pagetable operation (like the cpu setting the accessed
> or dirty bit) is in that same order I suspect (maybe half this, but not
> a lot less). Changing pagetable content is even more because all the
> tlb's and internal cpu state will need to be flushed... which is also a
> microcode operation for the cpu. And it's deadly in an SMP environment.

I have just done a measurement on a 366MHz PII Celeron. The test is
quite crude, but these numbers are upper bounds on the timings:

mean 813 cycles to:

page fault
install zero page
remove zero page

(= read every page from MAP_ANON region and unmap, repeatedly)

mean 11561 cycles to:

page fault
copy zero page
install copy
remove copy

(= write every page from MAP_ANON region and unmap, repeatedly)

I think that makes a clear case for avoiding page copies.
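
(A sketch along the lines of the measurement described above; the exact
sizes and the rdtsc timing are my choices, not Jamie's code. It maps an
anonymous region, touches every page either with a read or a write, unmaps,
repeats, and reports rough cycles per page.)

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE   4096
#define NPAGES 1024
#define ROUNDS 50

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t run(int do_write)
{
    uint64_t t0 = rdtsc();
    for (int r = 0; r < ROUNDS; r++) {
        volatile char *p = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            exit(1);
        for (int i = 0; i < NPAGES; i++) {
            if (do_write)
                p[i * PAGE] = 1;          /* fault: copy a page and install it */
            else
                (void)p[i * PAGE];        /* fault: install the shared zero page */
        }
        munmap((void *)p, NPAGES * PAGE); /* remove the pages again */
    }
    return (rdtsc() - t0) / ((uint64_t)ROUNDS * NPAGES);
}

int main(void)
{
    printf("read fault:  ~%llu cycles/page\n", (unsigned long long)run(0));
    printf("write fault: ~%llu cycles/page\n", (unsigned long long)run(1));
    return 0;
}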

-- Jamie

2003-09-10 10:07:10

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> mean 11561 cycles to:
>
> page fault
> copy zero page
> install copy
> remove copy
>
> (= write every page from MAP_ANON region and unmap, repeatedly)
>
> I think that makes a clear case for avoiding page copies.

Thanks.

2003-09-10 10:07:39

by Arjan van de Ven

Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 10:52:55AM +0100, Jamie Lokier wrote:
> Arjan van de Ven wrote:
> > > The overhead implied by a memcpy() is the same, in the oder of magnitude,
> > > ***whatever*** kernel version you can develop.
> >
> > yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
> > uncached case. A pagetable operation (like the cpu setting the accessed
> > or dirty bit) is in that same order I suspect (maybe half this, but not
> > a lot less). Changing pagetable content is even more because all the
> > tlb's and internal cpu state will need to be flushed... which is also a
> > microcode operation for the cpu. And it's deadly in an SMP environment.
>
> I have just done a measurement on a 366MHz PII Celeron

This test is sort of the worst case against my argument:
1) It's a cpu with low memory bandwidth
2) It's a 1 CPU system
3) It's a pII not pIV; the pII is way more efficient cycle wise
for pagetable operations

2003-09-10 10:05:24

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> For fun do the measurement on a pIV cpu. You'll be surprised.
> The microcode "mark dirty" (which is NOT a btsl, it gets done when you do
> a write memory access to the page content) result will be in the 2000 to
> 4000 range I predict.

I'm not responsible for the stupidity of microarchitecture designers.
If a simple STORE assembly instruction eats up 4000 clock cycles,
as you say here, well, then I think all of us Computer Scientists can go
home and give up now.

> There are things like SMP synchronisation to do, but also
> if the cpu marks a page dirty in the page table, that means the page table
> changes which means the pagetable needs to be marked in the
> PMD. Which means the PMD changes, which means the PGD needs the PMD marked
> dirty. Etc Etc. It's microcode. It'll take several 1000 cycles.

Please, let's return to the facts.
Modifying the page contents is done by the part of the benchmark application
we are not measuring.
That is, the code ***BEFORE*** the zc_send() (write() on a pipe or whatever
you choose)
and ***AFTER*** the zc_receive() (or read() from a pipe or whatever else).
This is nice, thanks, but outside our interest.

We are only reading the relocation tables or inserting new entries into them,
not modifying page contents.

> if you change a page table, you need to flush the TLB on all other cpus
> that have that same page table mapped, like a thread app running
> on all cpu's at once with the same pagetables.

Ok. This is simply not the case in my experiment,
so it does not apply.
We have no threads, only separate process address spaces.
Threads are a bit different from processes.

> why would you need a global lock for copying memory ?

The system call sys_write() calls
locks_verify_area(), which calls
locks_mandatory_area(), which calls
lock_kernel().

Oops...
global spinlock locking...

SMP is crying...
But not that much, if the lock is not held for long, right?

THIS IS NOT WHAT I WOULD CALL A SHORT LOCK SECTION:

repeat:
        /* Search the lock list for this inode for locks that conflict with
         * the proposed read/write.
         */
        for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
                if (!(fl->fl_flags & FL_POSIX))
                        continue;
                if (fl->fl_start > new_fl->fl_end)
                        break;
                if (posix_locks_conflict(new_fl, fl)) {
                        error = -EAGAIN;
                        if (filp && (filp->f_flags & O_NONBLOCK))
                                break;
                        error = -EDEADLK;
                        if (posix_locks_deadlock(new_fl, fl))
                                break;

                        error = locks_block_on(fl, new_fl);
                        if (error != 0)
                                break;

                        /*
                         * If we've been sleeping someone might have
                         * changed the permissions behind our back.
                         */
                        if ((inode->i_mode & (S_ISGID | S_IXGRP)) != S_ISGID)
                                break;
                        goto repeat;
                }
        }

From the source code of your Operating System.

2003-09-10 10:14:58

by Arjan van de Ven

Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 12:09:21PM +0200, Luca Veraldi wrote:
> > For fun do the measurement on a pIV cpu. You'll be surprised.
> > The microcode "mark dirty" (which is NOT a btsl, it gets done when you do
> > a write memory access to the page content) result will be in the 2000 to
> > 4000 range I predict.
>
> I'm not responsible for microarchitecture designer stupidity.
> If a simple STORE assembler instruction will eat up 4000 clock cycles,
> as you say here, well, I think all we Computer Scientists can go home and
> give it up now.

I'm saying it can. I don't want to go too deep into an argument about
microarchitectural details, but my point was that a memory copy of a page
is NOT super expensive relative to several other effects that have to do
with pagetable manipulations.

> > if you change a page table, you need to flush the TLB on all other cpus
> > that have that same page table mapped, like a thread app running
> > on all cpu's at once with the same pagetables.
>
> Ok. Simply, this is not the case in my experiment.
> This does not apply.
> We have no threads. But only detached process address spaces.
> Threads are a bit different from processes.

but the pipe code cannot know this so it has to do a cross cpu invalidate.

> > why would you need a global lock for copying memory ?
>
> System call sys_write() calls
> locks_verify_area() which calls
> locks_mandatory_area() which calls
> lock_kernel()

and... which is also releasing it before the copy

2003-09-10 10:15:13

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> This test is sort of the worst case against my argument:
> 1) It's a cpu with low memory bandwidth
> 2) It's a 1 CPU system
> 3) It's a pII not pIV; the pII is way more efficient cycle wise
> for pagetable operations

I'm a Computer Science graduate, and there is at least one thing I've
learned in my studies.
More efficient firmware support
(such as an extremely wide memory bandwidth,
or tens of CPUs in an SMP/NUMA system,
or efficient cache-line transfer support)

DOES NOT ALLOW YOU TO WASTE CYCLES DOING USELESS THINGS,
such as making TWO physical copies of a message whenever a process wants to
cooperate with another one.

We are not engineers who simply have to make things work.

That's all.
Luca.

2003-09-10 10:21:26

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> I'm saying it can. I don't want to go too deep into an arguement about
> microarchitectural details, but my point was that a memory copy of a page
> is NOT super expensive relative to several other effects that have to do
> with pagetable manipulations.

Sorry, but I cannot believe it.
Reading a page table entry and storing it into a struct capability is not
comparable at all with the "for" loop needed to move bytes all around memory.

> but the pipe code cannot know this so it has to do a cross cpu invalidate.

Sorry for you. Not knowing does not justify it.
It's inefficient.

> and... which is also releasing it before the copy

Oh, yes. After wasting thousands of cycles, sure.

Bye,
Luca.

2003-09-10 10:32:18

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> Don't bother about people nagging about bits and pieces because they are
> jealous.
> Stand up for the basic idea since the rest is just details that always can
> be fixed.

The incredible thing is that it has been what? 10, 20 years, perhaps even
more,
that at my University in Pisa we have been talking about such communication
primitives.

It's not so new!
We have merely gone and actually implemented it.

Bye,
Luca.

2003-09-10 10:38:06

by Jamie Lokier

Subject: Re: Efficient IPC mechanism on Linux

Arjan van de Ven wrote:
> > I have just done a measurement on a 366MHz PII Celeron
>
> This test is sort of the worst case against my argument:
> 1) It's a cpu with low memory bandwidth
> 2) It's a 1 CPU system
> 3) It's a pII not pIV; the pII is way more efficient cycle wise
> for pagetable operations

I thought that later generation CPUs were supposed to have lower
memory bandwidth relative to the CPU core, so CPU operations are
better than copying. Hence all the fancy special memory instructions,
to work around that.

Not that it matters. I think the 366 Celeron is typical of a lot of
computers being used today. I still use it every day, after all.

-- Jamie

2003-09-10 10:50:19

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> Memory is sort of starting to be like disk IO in this regard.

Good. So the less you copy memory all around, the better you perform.

2003-09-10 10:41:59

by Arjan van de Ven

Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 11:37:52AM +0100, Jamie Lokier wrote:
> I thought that later generation CPUs were supposed to have lower
> memory bandwidth relative to the CPU core

this is far more true for "random/access" latency than for streaming bandwidth.
Memory is sort of starting to be like disk IO in this regard.

2003-09-10 10:54:27

by Arjan van de Ven

Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 12:54:32PM +0200, Luca Veraldi wrote:
> > Memory is sort of starting to be like disk IO in this regard.
>
> Good. So the less you copy memory all around, the better you permorm.

Actually it's "the less discontiguous memory you touch the better you
perform". Just like a 128Kb read is about the same cost as a 4Kb
disk read, the same is starting to become true for memory copies. So
in the extreme the memory copy is one operation only.

2003-09-10 11:17:31

by Nick Piggin

Subject: Re: Efficient IPC mechanism on Linux



Luca Veraldi wrote:

>>Memory is sort of starting to be like disk IO in this regard.
>>
>
>Good. So the less you copy memory all around, the better you permorm.
>

Hi Luca,
There was a zero-copy pipe implementation floating around a while ago
I think. Did you have a look at that? IIRC it had advantages and
disadvantages over regular pipes in performance.

Nick


2003-09-10 11:26:09

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> Hi Luca,
> There was a zero-copy pipe implementation floating around a while ago
> I think. Did you have a look at that? IIRC it had advantages and
> disadvantages over regular pipes in performance.

Sorry, but I subscribed to this mailing list only one day ago.
Advantages and disadvantages depend on what exactly you implement
and on how you do it.

I can only say that capabilities are a disadvantage only with very, very
short messages
(that is, a few bytes). And this disadvantage is theoretically demonstrable.

But let's also say that such elementary messages are meaningful only inside
the kernel
and for kernel purposes.

User processes are another story entirely.

Bye,
Luca

2003-09-10 11:44:38

by Nick Piggin

Subject: Re: Efficient IPC mechanism on Linux



Luca Veraldi wrote:

>>Hi Luca,
>>There was a zero-copy pipe implementation floating around a while ago
>>I think. Did you have a look at that? IIRC it had advantages and
>>disadvantages over regular pipes in performance.
>>
>
>Sorry, but i subscripted this mailing-list only one day ago.
>Advantages and disadvantages depends on what actually you implement
>and on how you do that.
>

Yes you're right. Just search for it. It would be interesting to compare
them with your implementation.

>
>I can only say that capabilities are a disadvantage only with very very
>short messages
>(that is, a few bytes). And this disadvantage is theoretically demonstrable.
>
>But, let's say also that such elementary messages are meaningful only in the
>kernel
>and for kernel purposes.
>
>User processes are another tail.
>

I'm not too sure, I was just pointing you to zero copy pipes because
they did have some disadvantages which is why they weren't included in
the kernel. Quite possibly your mechanism doesn't suffer from these.

What would really help convince everyone is, of course, "real world"
benchmarks. I like your ideas though ;)



2003-09-10 11:53:07

by Alex Riesen

Subject: Re: Efficient IPC mechanism on Linux

Luca Veraldi, Wed, Sep 10, 2003 11:18:48 +0200:
> > Will it be possible to base existing facilities on your approach?
> > SVR5 messages (msg{get,snd,rcv}), for example?
>
> Ah, ok. So let's continue to do ineffient things
> only because it has always been so!

It is because the interface is perfectly adequate. You can still
implement zero-copy local transfers for pipes behind the read and write
calls.
And for small amounts zero-copy, where it is possible at all, does not
bring noticeable advantages (as Alan already mentioned).

> Compatibility is not a problem. Simply rewrite the write() and read()
> for pipes in order to make them do the same thing done by zc_send()
> and zc_receive(). Or, if you are not referring to pipes, rewrite the
> support level of you anchient IPC primitives in order to make them do
> the same thing done by zc_send() and zc_receive().

If it is possible, why new user-side interface?

-alex

2003-09-10 12:09:47

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux


> > Compatibility is not a problem. Simply rewrite the write() and read()
> > for pipes in order to make them do the same thing done by zc_send()
> > and zc_receive(). Or, if you are not referring to pipes, rewrite the
> > support level of you anchient IPC primitives in order to make them do
> > the same thing done by zc_send() and zc_receive().
>
> If it is possible, why new user-side interface?

Because my programming model is clear and easy.
Linux's is far from being so, in my opinion.

Bye,
Luca

2003-09-10 12:11:52

by Alex Riesen

Subject: Re: Efficient IPC mechanism on Linux

Luca Veraldi, Wed, Sep 10, 2003 14:14:00 +0200:
> > > Compatibility is not a problem. Simply rewrite the write() and read()
> > > for pipes in order to make them do the same thing done by zc_send()
> > > and zc_receive(). Or, if you are not referring to pipes, rewrite the
> > > support level of you anchient IPC primitives in order to make them do
> > > the same thing done by zc_send() and zc_receive().
> >
> > If it is possible, why new user-side interface?
>
> Because my programming model is clear and easy.

Does anyone besides you say so?

> Linux one's is far from being so, in my opinion.

it is the widely accepted one.

2003-09-10 12:12:42

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> I've read your posting on the lkml and also the answers
> concerning IPC mechanisms on Linux.
> You speak English very well - why don't you translate your
> page into english, I think many people would be very interested
> in it... at least I am ;) Unfortunately not many kernel hackers
> are able to understand Italian, I think...

The page has been in English since last night (Italian time).
Please refresh your browser.

http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
for English users and

http://web.tiscali.it/lucavera/www/root/ecbm/indexIT.htm
for Italian ones.

Bye,
Luca.

2003-09-10 12:10:46

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> Yes you're right. Just search for it. It would be interesting to compare
> them with your implementation.
> I'm not too sure, I was just pointing you to zero copy pipes because
> they did have some disadvantages which is why they weren't included in
> the kernel. Quite possibly your mechanism doesn't suffer from these.

Ok. I'll search for them.

> What would really help convince everyone is, of course, "real world"
> benchmarks. I like your ideas though ;)

Thanks.

Luca

2003-09-10 12:25:26

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> it is widely accepted one.

Widely accepted does not necessarily imply clear.

Sorry if I consider this more readable:
ch = acquire(CHANNEL_ID);

instead of this:
while ((fd = open("fifo", O_WRONLY)) < 0) ;

But this is mostly an aesthetic matter.

If you like "open" and "write", open Emacs and do a global replace of
"acquire" with "open"
and of
"zc_send" with "write".

A few other minor details will make it a compatible interface.

------------------------------
Fashion is a variable
but style is a constant
------------------------------ Larry Wall

Bye,
Luca

2003-09-10 12:32:06

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> but it does imply "tested". And widely used. And well-known.

And inefficient, too.
Sorry, but I judge by theory, not by popular convention.

Bye,
Luca

2003-09-10 12:28:17

by Alex Riesen

Subject: Re: Efficient IPC mechanism on Linux

Luca Veraldi, Wed, Sep 10, 2003 14:29:37 +0200:
> > it is widely accepted one.
>
> Widely accepted does not necessary implies clear.
>

but it does imply "tested". And widely used. And well-known.

2003-09-10 12:36:55

by Alex Riesen

Subject: Re: Efficient IPC mechanism on Linux

Luca Veraldi, Wed, Sep 10, 2003 14:36:19 +0200:
> > but it does imply "tested". And widely used. And well-known.
>
> And inefficient, too.

Yes, probably. Until proven. What exactly is wrong with the API?

> Sorry, but I use theory not popular convention to judge.

I personally do not see deficiencies in the mentioned APIs.

2003-09-10 12:41:20

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> What occured to me:
> Perhaps you're interested in the L4KA microkernel project
> at my University, Karlsruhe, Germany.
> The URL is http://l4ka.org/

Sure, I know it. Great work.
But it is a completely new Operating System,
and a microkernel one at that.

Here, they don't even seem to understand what performance is;
imagine what will happen if I start talking about microkernels.

However, that is the future, not the old and dirty monolithic ones.
And ***THEORY*** supports it.

Bye,
Luca

2003-09-10 12:43:35

by Alan

Subject: Re: Efficient IPC mechanism on Linux

On Mer, 2003-09-10 at 12:16, Nick Piggin wrote:
> There was a zero-copy pipe implementation floating around a while ago
> I think. Did you have a look at that? IIRC it had advantages and
> disadvantages over regular pipes in performance.

FreeBSD had one; it was lightning fast until you used the data at the other
end of the pipe. I'm not sure what happened to it in the long term.


2003-09-10 12:58:13

by Alan

Subject: Re: Efficient IPC mechanism on Linux

On Mer, 2003-09-10 at 10:04, Luca Veraldi wrote:
> Surely, transferring bytes from a cache line to another is at zero cost
> (that is, a single clock cycle of the cache unit).

More complicated - when you move stuff you end up evicting another chunk
from the cache; you can't armwave that out of existence, because you
yourself have created new eviction overhead for someone further down the
line.

> But here, how can you matematically grant that,
> using traditional IPC mechanism,
> you'll find the info you want directly in cache?

Because programs normally pass data they've touched to other apps.
Your A/B example touches the data before sending even, but your
example on receive does not which is very atypical.

> > Ok - from your example I couldn't tell if B then touches each byte of
> > data as well as having it appear in its address space.
>
> It is irrilevant. The time reported in the work are the time needed
> to complete the IPC primitives.
> After a send or a receive over the channel,
> the two processes may do everything they want.
> It is not interesting for us.

It's vital to include it in the measurement of all three. Without it
your measurements are meaningless. Look at it as an engineer: IPC
involves writing to memory, making the data appear in the other task, and
then reading it. You also want to measure things like the throughput of your
IPC tasks running in parallel with other processes doing things like pure
memory bandwidth tests, so you can measure the impact you have on the
system, although in this case that's probably far less relevant.

2003-09-10 12:49:09

by Alan

Subject: Re: Efficient IPC mechanism on Linux

On Mer, 2003-09-10 at 10:40, Luca Veraldi wrote:
> To set the accessed or dirty bit you use
>
> __asm__ __volatile__( LOCK_PREFIX
>         "btsl %1,%0"
>         :"=m" (ADDR)
>         :"Ir" (nr));
>
> which is a ***SINGLE CLOCK CYCLE*** of cpu.

It's a _lot_ more than that in the normal case - upwards of 60 clocks on
a PIV. You then need to reload cr3 if you touched permissions, which
means every cached TLB entry in the system is lost, and you may need to do
a cross-CPU IPI on SMP (which takes a long time).

> You say "tlb's and internal cpu state will need to be flushed".
> The other cpus in an SMP environment can continue to work, indipendently.
> TLBs and cpu state registers are ***PER-CPU*** resorces.

Think of a threaded app passing a message to another app. You have to do
the cross-CPU flush in order to prevent races where another thread could
scribble on data it doesn't own. Assuming a reasonable TLB reuse rate
that's 120-plus TLB reloads. The newer CPUs cache TLB entries in L1/L2, so
that's not too bad, but it all adds up. On SMP it's a real pain.

> Probably, it is worse the case of copying a memory page,
> because you have to hold some global lock all the time.
> This is deadly in an SMP environment,

You don't need a global lock to copy memory.

One thing I do agree with you on is the API aspect - and that is
problematic. The current API leaves the data also owned by the source:
if I write "fred" down a pipe I still own the "fred" bits. The
method you propose was added to Solaris long ago (see "doors") and
it's not exactly the most used interface even there.


2003-09-10 12:52:10

by Alan

Subject: Re: Efficient IPC mechanism on Linux

On Mer, 2003-09-10 at 11:09, Luca Veraldi wrote:
> SMP is crying...
> But not so much, if the lock is not lasting much, is it right?
>
> THIS IS NOT WHAT WE CALL A SHORT LOCK SECTION:

This is the difference between science and engineering. It's a path that
isn't normally taken, and when it is taken it normally runs almost no loops.


2003-09-10 13:33:09

by Gábor Lénárt

Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 02:36:19PM +0200, Luca Veraldi wrote:
> > but it does imply "tested". And widely used. And well-known.
>
> And inefficient, too.
> Sorry, but I use theory not popular convention to judge.

But theory is often far from practice...

Theory is something which uses some abstraction to "model" reality, which
often means an oversimplification of the problem, or simply that a real
case has many more (possibly hidden) parameters affecting the problem.
Maybe these parameters are very hard to even detect at FIRST sight.
That's why most software development starts from theory, _BUT_ some "REAL
WORLD" benchmark/test should confirm the truth of that theory. Or think
about Einstein's theory of relativity: the theory is good, but it's even
better to prove it in real life too: the perturbation of Mercury's orbit,
or the effect on the visible positions of stars during an eclipse. That was
the point at which Einstein's theory came to be widely accepted by science.

So I think it's no use starting a flame thread here. Implement that IPC
solution, and test it. If it turns out to be a good one, everybody will be
happy with your work. If not, at least you've tried, and maybe you will also
learn from it. Since Linux is open source, you have nothing to lose with
your implementation.

- Gábor (larta'H)

2003-09-10 13:52:00

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> > Probably, it is worse the case of copying a memory page,
> > because you have to hold some global lock all the time.
> > This is deadly in an SMP environment,
>
> You don't need a global lock to copy memory.

See the sources. You need to lock_kernel() for a large number of instructions,
just like in many other kernel functions.

There are kernel locks all around the sources.
So don't come here and talk about locking inefficiency,
because the poor scalability of Linux on multiprocessors
is a reality nobody can hide.

With or without my primitives.

> One thing I do agree with you on is the API aspect - and that is
> problematic. The current API leaves data also owned by the source.
> If I write "fred" down a pipe I still own the "fred" bits.

The same is true with my API. Data doesn't magically disappear
from the sender's logical address space.

Also, with minor changes, you can use my primitives
to implement a one-copy channel, if shared memory among processes
sounds like blasphemy to you.

Still better than the two-phase copying of the current pipe implementation.

> The
> method you propose was added long ago go to Solaris (see "doors") and
> its not exactly the most used interface even there.

Sincerely, I COMPLETELY DON'T CARE.
Whether it is used or not is completely irrelevant when evaluating performance.

Good bye.
Luca

2003-09-10 13:59:53

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> So I thing it's no use to start a flame thread here. Implement that IPC
> solution, and test it. If it becomes a good one, everybody will be happy
> with your work. If not, at least you've tried, and maybe you will also
> learn from it. Since Linux is open source, you have nothing to lose with
> your implementation.

That is exactly what I've done. But did you read the page?
I've implemented the IPC mechanism... and tested it, too.
The numbers reported in the tables weren't the fruit of my imagination, dear.

Bye,
Luca

2003-09-10 14:26:55

by Stewart Smith

Subject: Re: Efficient IPC mechanism on Linux

hrm, you may find things related to the Password-Capability system and
the Walnut kernel of interest - these systems take this kind of IPC to
the extreme :) (ahhh... research OS hw & sw - except you *do not* want
to see the walnut source - it makes ppl want to crawl up and cry).

http://www.csse.monash.edu.au/~rdp/fetch/castro-thesis.ps

and check the Readme.txt at
http://www.csse.monash.edu.au/courseware/cse4333/rdp-material/
for stuff on Multi and password-capabilities.

interesting stuff, the Castro thesis does do some comparisons to FreeBSD
(1.1 amazingly enough) - although the number of real world applications
on these systems is minimal (and in the current state impossible -
nobody can remember how to get userspace going on Walnut, we may have
broken it) and so real-world comparisons just don't really happen these
days. Maybe after a rewrite (removing some brain-damage of the original
design).

This is all related to my honors work,
http://www.flamingspork.com/honors/
although my site needs an update.
I'm working on the design and simulation of an improved storage system.


On Wed, 2003-09-10 at 03:30, Luca Veraldi wrote:
> Hi all.
> At the web page
> http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
> You can find the results of my attempt in modifing the linux kernel sources
> to implement a new Inter Process Communication mechanism.
>
> It is called ECBM for Efficient Capability-Based Messaging.
>
> In the reading You can also find the comparison of ECBM
> against some other commonly-used Linux IPC primitives
> (such as read/write on pipes or SYS V tools).
>
> The results are quite clear.
>
> Enjoy.
> Luca Veraldi
>
>
> ----------------------------------------
> Luca Veraldi
>
> Graduate Student of Computer Science
> at the University of Pisa
>
> [email protected]
> [email protected]
> ICQ# 115368178
> ----------------------------------------

2003-09-10 14:36:00

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> hrm, you may find things related to the Password-Capability system and
> the Walnut kernel of interest - these systems take this kind of IPC to
> the extreme :) (ahhh... research OS hw & sw - except you *do not* want
> to see the walnut source - it makes ppl want to crawl up and cry).
>
> http://www.csse.monash.edu.au/~rdp/fetch/castro-thesis.ps
>
> and check the Readme.txt at
> http://www.csse.monash.edu.au/courseware/cse4333/rdp-material/
> for stuff on Multi and password-capabilities.
>
> interesting stuff, the Castro thesis does do some comparisons to FreeBSD
> (1.1 amazingly enough) - although the number of real world applications
> on these systems is minimal (and in the current state impossible -
> nobody can remember how to get userspace going on Walnut, we may have
> broken it) and so real-world comparisons just don't really happen these
> days. Maybe after a rewrite (removing some brain-damage of the original
> design).

Thanks. It's really very interesting...

Bye,
Luca

2003-09-10 14:53:27

by Larry McVoy

Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 02:16:40PM +0200, Luca Veraldi wrote:
> > I've read your posting on the lkml and also the answers
> > concerning IPC mechanisms on Linux.
> > You speak English very well - why don't you translate your
> > page into english, I think many people would be very interested
> > in it... at least I am ;) Unfortunately not many kernel hackers
> > are able to understand Italian, I think...
>
> Page is now in English since last night (Italian time).
> Please, refresh your browser.
>
> http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
> for English users and

I read it and I must be missing something, which is possible, I need more
coffee.

I question the measurement methodology. Why didn't you grab the sources
to LMbench and use them to measure this? It may well be that you disagree
with how it measures things, which may be fine, but I think you'd benefit
from understanding it and thinking about it. It's also trivial to add
another test to the system, you can do it in a few lines of code.

I also question the results. I modified lat_pipe.c from LMbench to measure
a range of sizes, code is included below. The results don't match yours at
all. This is on a 466Mhz Celeron running RedHat 7.1, Linux 2.4.2. The
time reported is the time to send and receive the data between two processes,
i.e.,

        Process A                 Process B
        write
                <context switch>
                                  read
                                  write
                <context switch>
        read

In other words, the time printed is for a round trip. Your numbers
appear to be off by a factor of two; they look pretty similar to mine,
but as far as I can tell you are saying that it costs 4 usecs for a send
and 4 for a recv, and that's not true: the real numbers include 2 sends,
2 receives and 2 context switches.

1 bytes: Pipe latency: 8.0272 microseconds
8 bytes: Pipe latency: 7.8736 microseconds
64 bytes: Pipe latency: 8.0279 microseconds
512 bytes: Pipe latency: 10.0920 microseconds
4096 bytes: Pipe latency: 19.6434 microseconds
40960 bytes: Pipe latency: 313.3328 microseconds
81920 bytes: Pipe latency: 1267.7174 microseconds
163840 bytes: Pipe latency: 3052.1020 microseconds
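
(Larry's modified lat_pipe.c is not reproduced in this archive; purely as an
illustration, a rough sketch of the same kind of round-trip measurement,
with the message size taken from the command line, could look like this:)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define N 10000                               /* round trips to average over */

static void echo(int fd_in, int fd_out, char *buf, int size)
{
    for (int got = 0, n; got < size; got += n)
        if ((n = read(fd_in, buf + got, size - got)) <= 0)
            exit(0);                          /* parent went away: stop */
    if (write(fd_out, buf, size) != size)
        exit(1);
}

int main(int argc, char **argv)
{
    int size = argc > 1 ? atoi(argv[1]) : 512;
    char *buf = calloc(1, size);
    int ab[2], ba[2];
    if (!buf || pipe(ab) < 0 || pipe(ba) < 0) { perror("setup"); return 1; }

    if (fork() == 0) {                        /* child: bounce everything back */
        for (;;)
            echo(ab[0], ba[1], buf, size);
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < N; i++) {             /* parent: send, wait for the echo */
        if (write(ab[1], buf, size) != size)
            return 1;
        for (int got = 0, n; got < size; got += n)
            if ((n = read(ba[0], buf + got, size - got)) <= 0)
                return 1;
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%d bytes: Pipe latency: %.4f microseconds\n", size, usec / N);
    return 0;
}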

I want to stick some other numbers in here from LMbench, the signal
handler cost and the select cost. On this machine it is about 5 usecs
to handle the signal and about 4 for select on 10 file descriptors.

If I were faced with the problem of moving data between processes at very
low cost the path I choose would depend on whether it was a lot of data
or just an event notification. It would also depend on whether the
receiving process is doing anything else. Let's walk a couple of those
paths:

If all I want to do is let another process know that something has happened
then a signal is darn close to as cheap as I can get. That's what SIGUSR1
and SIGUSR2 are for.

If I wanted to move large quantities of data I'd combine signals with
mmap and mutexes. This is easier than it sounds. Map a scratch file,
truncate it up to the size you need, start writing into it and when you
have enough signal the other side. It's a lot like the I/O loop in
many device drivers.
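
(A bare-bones sketch of that recipe; the file name, the sizes and the choice
of SIGUSR1 are mine, not Larry's: map a scratch file shared by both
processes, fill it, then signal the other side when the data is ready.)

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SIZE (1 << 20)                        /* 1 MB of shared scratch space */

int main(void)
{
    int fd = open("scratch.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || ftruncate(fd, SIZE) < 0) { perror("scratch"); return 1; }

    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_BLOCK, &set, NULL);       /* block it so we can sigwait() */

    pid_t child = fork();
    if (child == 0) {                         /* consumer: sleep until signalled */
        int sig;
        sigwait(&set, &sig);
        printf("consumer sees: %s\n", buf);
        _exit(0);
    }

    /* producer: fill the shared mapping, then poke the consumer */
    memset(buf, 0, SIZE);
    strcpy(buf, "hello through a shared mapping");
    kill(child, SIGUSR1);

    wait(NULL);
    unlink("scratch.dat");
    return 0;
}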

So I guess I'm not seeing why there needs to be a new interface here.
This looks to me like you combine cheap messaging (signals, select,
or even pipes) with shared data (mmap). I don't know why you didn't
measure that, it's the obvious thing to measure and you are going to be
running at memory speeds.

The only justification I can see for a different mechanism is if the
signaling really hurt but it doesn't. What am I missing?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-10 15:24:40

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> Can you forward-port your work to 2.6 kernel?
> Can you benchmarkt it against the same primitives in 2.6 kernel?

I would do it only if there were a real reason to do it.
And there isn't.

Someone has just posted me a message with pipe latency under kernel 2.4,
and it is exactly the same (apart from some minor variations in the
measurements, which are natural enough).

> You have just started your work - are you going to finish it? Or it
> was just-another-academical-study?

I consider my work finished.
I studied an efficient IPC mechanism and I tried to implement it
on an existing Operating System.

I ran some other IPC primitives through the benchmark tests
and reported the comparison of their completion times.

I reached my goal.

> If you can try to develop new sematics for old syntax (shm* & etc) it
> can be welcomed too.
> And if it would be poll()able - it would be great. Applications which
> do block on read()/getmsg() in real-life not that common, and as I've
> understood - this is the case for your message passing structure.

I'm not a kernel developer. Ask Linus Torvalds.

Bye,
Luca

2003-09-10 16:00:40

by Alan

Subject: Re: Efficient IPC mechanism on Linux

On Mer, 2003-09-10 at 14:56, Luca Veraldi wrote:
> There are kernel locks all around in sources.
> So don't come here and talk about locking ineffiency.
> Because scarse scalability of Linux on multiprocessor
> is a reality nobody can hide.

Well, if you insist on reading the 2.2 code, sure. If you want to argue
scalability I suggest you look at the current code and the SGI Altix
boxes.

2003-09-10 16:58:15

by Andrea Arcangeli

Subject: Re: Efficient IPC mechanism on Linux

Ciao Luca,

On Tue, Sep 09, 2003 at 07:30:58PM +0200, Luca Veraldi wrote:
> Hi all.
> At the web page
> http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
> You can find the results of my attempt in modifing the linux kernel sources
> to implement a new Inter Process Communication mechanism.
>
> It is called ECBM for Efficient Capability-Based Messaging.
>
> In the reading You can also find the comparison of ECBM
> against some other commonly-used Linux IPC primitives
> (such as read/write on pipes or SYS V tools).
>
> The results are quite clear.

In terms of design, as far as I can tell the most efficient way to do
message passing is not to pass the data through the kernel at all (no
matter whether you intend to copy it or not), and to simply use a futex on
top of shm to synchronize/wake up the access. If we want to make such an
API widespread, it should simply be a userspace library.

It's very inefficient to mangle pagetables and flush the TLB in a flood
like you're doing (or rather, like you should be doing), when you can keep
the memory mapped in *both* tasks at the same time, *always*, and there's
no need for any kernel modification at all for the much more efficient
design that I'm suggesting. Obviously lots of apps are already using
this design, and there's no userspace API simply because one isn't
needed. The only thing we need from the kernel is the wakeup mechanism,
and that's already provided by the futex (in the past, userspace apps
using this design used sched_yield, and that was very bad).
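
(A minimal sketch of the design Andrea describes, with details of my own
choosing - a single fixed-size message, no queueing: the message buffer and
the futex word live in one shared anonymous mapping, the writer fills the
buffer, and only the wakeup goes through the kernel.)

#include <linux/futex.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

struct channel {
    int ready;                 /* 0 = empty, 1 = message available (futex word) */
    char data[4096];
};

static long futex(int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

int main(void)
{
    struct channel *ch = mmap(NULL, sizeof(*ch), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ch == MAP_FAILED) { perror("mmap"); return 1; }
    ch->ready = 0;

    if (fork() == 0) {                             /* reader */
        while (ch->ready == 0)                     /* re-checked after each wakeup */
            futex(&ch->ready, FUTEX_WAIT, 0);      /* sleep while the word is 0 */
        printf("reader got: %s\n", ch->data);
        _exit(0);
    }

    /* writer: the payload never passes through the kernel, only the wakeup */
    strcpy(ch->data, "message via shared memory");
    __sync_synchronize();                          /* publish the data first */
    ch->ready = 1;
    futex(&ch->ready, FUTEX_WAKE, 1);

    wait(NULL);
    return 0;
}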

About the implementation - the locking looks very wrong: you miss the
page_table_lock in all the pte walking, you take a totally worthless
lock_kernel() all over the place for no good reason, and the
unconditional set_bit(PG_locked) / clear_bit(PG_locked) on random pieces
of ram almost guarantees that you'll corrupt ram quickly (PG_locked
is reserved for I/O serialization; the same ram that you're working on
can be sent to disk or to swap by the kernel at the same time and it can
already be locked; you can't clear_bit unless you're sure you're the guy
that owns the lock, and you aren't sure because you didn't test whether
you're the owner, so that smells like a huge bug that will randomly
corrupt ram, like the pte walking race).

There's also an obvious DoS that is trivial to generate by locking some
64G of ram with ecbm_create_capability(); see the for(count=0;
count<pages; ++count) atomic_inc loop (btw, you should use get_page, and
operations like LockPage, to play with pages).

I also don't see where you flush the tlb after the set_pte, or where
you release the ram pointed to by the pte (it seems you're leaking plenty
of memory that way).

I didn't check the credential checks at all (I didn't run into them while
reading the code, but I assume I overlooked them). (Do you rely on a
random number? That's probably statistically secure, but we can
guarantee security on a local box; we must not work by luck whenever
possible.)

this was a very quick review, hope this helps,

Andrea

/*
* If you refuse to depend on closed software for a critical
* part of your business, these links may be useful:
*
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
* http://www.cobite.com/cvsps/
*
* svn://svn.kernel.org/linux-2.6/trunk
* svn://svn.kernel.org/linux-2.4/trunk
*/

2003-09-10 17:04:06

by Andrea Arcangeli

Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 06:59:44PM +0200, Andrea Arcangeli wrote:
> About the implementation - the locking looks very wrong: you miss the
> page_table_lock in all the pte walking, you take a totally worthless
> lock_kernel() all over the place for no good reason, and the

Sorry for the self-followup, but I just read another message where you
mention 2.2; if that was for 2.2, the locking was ok.

All the other problems remain, though.

Andrea

/*
* If you refuse to depend on closed software for a critical
* part of your business, these links may be useful:
*
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
* http://www.cobite.com/cvsps/
*
* svn://svn.kernel.org/linux-2.6/trunk
* svn://svn.kernel.org/linux-2.4/trunk
*/

2003-09-10 17:16:56

by Luca Veraldi

Subject: Re: Efficient IPC mechanism on Linux

> sorry for the self followup, but I just read another message where you
> mention 2.2, if that was for 2.2 the locking was the ok.

Oh, good. I was already about to bury my head in the sand. :-)

I'm no expert on the kernel. I've just read [ORLL]
and some bits of the kernel sources,
so errors in the code are not so surprising.

But it's better now that you know we are talking about version 2.2...

I'm glad to hear that the locking is ok.

You say:

> in terms of design as far as I can tell the most efficient way to do
> message passing is not pass the data through the kernel at all (no
> matter if you intend to copy it or not), and to simply use futex on top
> of shm to synchronize/wakeup the access. If we want to make an API
> widespread, that should be simply an userspace library only.
>
> It's very inefficient to mangle pagetables and flush the tlb in a flood
> like you're doing (or better like you should do), when you can keep the

I guess futex are some kind of semaphore flavour under linux 2.4/2.6.
However, you need to use SYS V shared memory in any case.
Tests for SYS V shared memory are included in the web page
(even though using SYS V semaphores).

I don't think, reading the numbers, that managing pagetables "is very
inefficient".
I think very inefficient are SYS V semaphores or the double-copying channel
you call a pipe.

> there's also an obvious DoS that is trivial to generate by locking in
> ram some 64G of ram with ecbm_create_capability() see the for(count=0;
> count<pages; ++count) atomic_inc (btw, you should use get_page, and all
> the operations like LockPage to play with pages).

As I say in the web page,

having all the pages locked in memory is not a necessary condition
for the applicability of communication mechanisms based on capabilities.
Simply, it makes it easier to write the code and does not make me crazy
with the Linux swapping system.

Bye bye,
Luca


2003-09-10 17:40:10

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 07:21:09PM +0200, Luca Veraldi wrote:
> > sorry for the self followup, but I just read another message where you
> > mention 2.2; if that was for 2.2, the locking was ok.
>
> Oh, good. I was already going to put my head under sand. :-)

;)

>
> I'm not so expert about the kernel. I've just read [ORLL]
> and some bits of the kernel sources.
> So, errors in the code are not so strange.

Ok no problem then.

> But it's better now that you know we are talking about version 2.2...
>
> I'm glad to hear that the locking is ok.
>
> You say:
>
> > in terms of design as far as I can tell the most efficient way to do
> > message passing is not pass the data through the kernel at all (no
> > matter if you intend to copy it or not), and to simply use futex on top
> > of shm to synchronize/wakeup the access. If we want to make an API
> > widespread, that should be simply an userspace library only.
> >
> > It's very inefficient to mangle pagetables and flush the tlb in a flood
> > like you're doing (or better like you should do), when you can keep the
>
> I guess futexes are some kind of semaphore flavour under linux 2.4/2.6.
> However, you need to use SYS V shared memory in any case.

I need shared memory yes, but I can generate it with /dev/shm or clone(),
it doesn't really make any difference. I would never use the ugly SYSV
IPC API personally ;).

> Tests for SYS V shared memory are included in the web page
> (even though using SYS V semaphores).

IPC semaphores are a totally different thing.

Basically the whole ipc/ directory is an inefficient obsolete piece of
code that nobody should use.

The real thing is (/dev/shm || clone()) + futex.

> I don't think, reading the numbers, that managing pagetables "is very
> inefficient".

Yep, it can certainly be more efficient than IPC semaphores or whatever
other IPC thing in ipc/.

However the speed of the _shm_ API doesn't matter: you map the shm only
once and you serialize the messaging with the futex. So even if it takes
a bit longer to set up the shm with IPC shm, it won't matter; even using
IPC shm + futex would be fine (though I find /dev/shm much more friendly
as an API).
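
For comparison, the setup for the two flavours looks roughly like this (a
hedged sketch: the name "/ecbm-demo", the 1 MB size and the missing error
handling are only for illustration, this is not code from Luca's patch):

#include <fcntl.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

#define REGION_SIZE (1 << 20)

/* SysV flavour: a numeric id, attached with shmat(). */
static void *map_sysv(void)
{
        int id = shmget(IPC_PRIVATE, REGION_SIZE, IPC_CREAT | 0600);

        return shmat(id, NULL, 0);
}

/* POSIX flavour: a named object under /dev/shm that you ftruncate and
 * mmap like an ordinary file. */
static void *map_posix(void)
{
        int fd = shm_open("/ecbm-demo", O_CREAT | O_RDWR, 0600);

        ftruncate(fd, REGION_SIZE);
        return mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
}

int main(void)
{
        void *a = map_sysv();
        void *b = map_posix();

        return (a != (void *) -1 && b != MAP_FAILED) ? 0 : 1;
}

Link with -lrt on older glibc for shm_open; either way you end up with an
ordinary shared mapping, the difference is only in how you name and create it.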

> I think very inefficient are SYS V semaphores or the double-copying channel
> you call a pipe.

you're right in terms of bandwidth if the only thing the reader and the
writer do is the data transfer.

However if the app is computing stuff besides listening to the pipe (for
example if it's in nonblocking mode) the double copying allows the writer to
return way before the reader has started reading the info. So the
intermediate ram increases the parallelism a bit. One could play tricks
with COW as well, though it would be way more complex than the
current double copy ;).
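
A trivial toy that shows the effect (hypothetical demo, not a benchmark;
both ends live in the same process just to keep it short):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd[2];
        char buf[64];

        pipe(fd);
        fcntl(fd[1], F_SETFL, O_NONBLOCK);      /* the nonblocking case above */

        /* The bytes land in the kernel's intermediate buffer and write()
         * returns at once; the writer is free to keep computing. */
        write(fd[1], "hello", 5);
        puts("writer continued before any read happened");

        read(fd[0], buf, sizeof(buf));          /* the reader picks it up later */
        return 0;
}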

(as somebody pointed out) you may want to compare the pipe code with the
new zerocopy-pipe one (the one that IMHO has a chance to decrease
parallelism)

> having all the pages locked in memory is not a necessary condition
> for the applicability of communication mechanisms based on capabilities.
> Simply, it makes it easier to write the code and does not make me crazy
> with the Linux swapping system.

ok ;)

Overall my point is that the best is to keep the ram mapped in both
tasks at the same time and to use the kernel only for synchronization
(i.e. the wakeup/schedule calls in your code), and to remove the pinning/
pte mangling entirely from the design. The futex provides a more
efficient synchronization since it doesn't even enter kernel space if
the writer completes before the reader starts and the reader completes
before the writer starts again (so it has a chance to be better than any
other kernel-based scheme that is always forced to enter the kernel even in
the very best case).
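
For readers who haven't played with futexes yet, here is a minimal sketch of
that design: a one-slot channel living in a shared mapping, with the futex
word used only to sleep/wake when the fast path fails. All the names, the
4096-byte slot and the missing error handling are invented for illustration;
this is not a finished library and not Luca's ECBM.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/futex.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

struct channel {
        uint32_t full;                  /* futex word: 0 = empty, 1 = full */
        char data[4096];
};

static long futex(uint32_t *addr, int op, uint32_t val)
{
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Both sides map the same named object; fresh shm is zero-filled (empty). */
static struct channel *channel_map(const char *name)
{
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);

        ftruncate(fd, sizeof(struct channel));
        return mmap(NULL, sizeof(struct channel), PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
}

static void channel_send(struct channel *ch, const void *msg, size_t len)
{
        /* Sleep only while the slot is still full; FUTEX_WAIT re-checks the
         * word in the kernel, so a wakeup cannot be lost. */
        while (__atomic_load_n(&ch->full, __ATOMIC_ACQUIRE) != 0)
                futex(&ch->full, FUTEX_WAIT, 1);
        memcpy(ch->data, msg, len);
        __atomic_store_n(&ch->full, 1, __ATOMIC_RELEASE);
        futex(&ch->full, FUTEX_WAKE, 1);
}

static void channel_recv(struct channel *ch, void *buf, size_t len)
{
        while (__atomic_load_n(&ch->full, __ATOMIC_ACQUIRE) != 1)
                futex(&ch->full, FUTEX_WAIT, 0);
        memcpy(buf, ch->data, len);
        __atomic_store_n(&ch->full, 0, __ATOMIC_RELEASE);
        futex(&ch->full, FUTEX_WAKE, 1);
}

int main(void)
{
        struct channel *ch = channel_map("/futex-demo");
        char buf[64];

        if (fork() == 0) {                      /* reader */
                channel_recv(ch, buf, sizeof(buf));
                write(1, buf, strlen(buf));
                return 0;
        }
        channel_send(ch, "ping from the writer\n", 22);        /* writer */
        wait(NULL);
        shm_unlink("/futex-demo");
        return 0;
}

A real implementation would also keep track of whether anybody is actually
sleeping, so the FUTEX_WAKE can be skipped in the uncontended case; that is
exactly the case where no kernel entry happens at all. Link with -lrt on
older glibc.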

Andrea

/*
* If you refuse to depend on closed software for a critical
* part of your business, these links may be useful:
*
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
* http://www.cobite.com/cvsps/
*
* svn://svn.kernel.org/linux-2.6/trunk
* svn://svn.kernel.org/linux-2.4/trunk
*/

2003-09-10 17:45:46

by Martin Konold

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

On Wednesday 10 September 2003 06:59 pm, Andrea Arcangeli wrote:

Hi,

> design that I'm suggesting. Obviously lots of apps are already using
> this design and there's no userspace API simply because that's not
> needed.

HPC people have investigated high-performance IPC many times; basically it
boils down to:

1. Userspace is much more efficient than kernel space. So efficient
implementations avoid kernel space even for message transfers over networks
(e.g. DMA directly to userspace).

2. The optimal protocol to use and the number of copies to do depend on
the message size.

Small messages are most efficiently handled with a single/dual copy (short
protocol / eager protocol) and large messages are more efficient with
single/zero-copy techniques (get protocol), depending on whether a network is
involved or not.

Traditionally in a networked environment single copy means PIO and zero copy
means DMA when using network hardware.

The idea is that while DMA has much higher bandwidth than PIO, DMA is more
expensive to initiate. So DMA is only useful for large messages.

In the local SMP case there exist userspace APIs like MPI which have been
doing single-copy message passing at memory speed in pure userspace for many
years.

The following PDF documents a measurement where the communication between two
processes running on different CPUs in an SMP system reaches exactly the
memcpy bandwidth for large messages, using a single-copy get protocol.

http://ipdps.eece.unm.edu/1999/pc-now/takahash.pdf

Numbers from a Dual P-II-333, Intel 440LX (100 MB/s memcpy):

                               eager       get
  min. latency (µs)             8.62      9.98
  max. bandwidth (MB/s)        48.03    100.02
  half-bandwidth point (KB)     0.3        2.5

You can easily see that the eager has better latency for very short messages
and that the get reaches a max bandwidth equivalent to a memcpy (single
copy).
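
(A side note on the metric, in case it is unfamiliar: the half-bandwidth
point n_1/2 is the message size at which half of the asymptotic bandwidth is
reached. Under the usual linear cost model T(n) = T_0 + n/B_max it works out
to n_1/2 = T_0 * B_max; the measured values above also contain per-protocol
overhead, so they will not match that naive estimate exactly.)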

True zero copy has unlimited (sigh!) bandwidth within an SMP and, unlike on a
network, does not really make sense there.

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: [email protected]

2003-09-10 17:59:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

On Wed, Sep 10, 2003 at 07:39:17PM +0200, Martin Konold wrote:
> On Wednesday 10 September 2003 06:59 pm, Andrea Arcangeli wrote:
>
> Hi,
>
> > design that I'm suggesting. Obviously lots of apps are already using
> > this design and there's no userspace API simply because that's not
> > needed.
>
> HPC people have investigated High performance IPC many times basically it
> boils down to:
>
> 1. Userspace is much more efficient than kernel space. So efficient
> implementations avoid kernel space even for message transfers over networks
> (e.g. DMA directly to userspace).
>
> 2. The optimal protocol to use and the number of copies to do is depending on
> the message size.
>
> Small messages are most efficiently handled with a single/dual copy (short
> protocol / eager protocol) and large messages are more efficient with
> single/zero copy techniques (get protocol) depending if a network is involved
> or not.
>
> Traditionally in a networked environment single copy means PIO and zero copy
> means DMA when using network hardware.
>
> The idea is while DMA has much higher bandwidth than PIO DMA is more expensive
> to initiate than PIO. So DMA is only useful for large messages.

agreed.

>
> In the local SMP case there do exist userspace APIs like MPI which can do

btw, so far we were only discussing IPC in a local box (UP or SMP or
NUMA) w/o networking involved. Luca's current implementation was also
only working locally.

> single copy Message passing at memory speed in pure userspace since many
> years.
>
> The following PDF documents a measurement where the communication between two
> processes running on different CPUs in an SMP system is exactly the memcpy
> bandwidth for large messages using a single copy get protocol.
>
> http://ipdps.eece.unm.edu/1999/pc-now/takahash.pdf
>
> Numbers from a Dual P-II-333, Intel 440LX (100 MB/s memcpy):
>
>                                eager       get
>   min. latency (µs)             8.62      9.98
>   max. bandwidth (MB/s)        48.03    100.02
>   half-bandwidth point (KB)     0.3        2.5
>
> You can easily see that the eager has better latency for very short messages
> and that the get reaches a max bandwidth equivalent to a memcpy (single
> copy).
>
> True zero copy has unlimited (sigh!) bandwidth within an SMP and does not
> really make sense in contrast to a network.

if you can avoid entering the kernel, you'd better do that, because entering
the kernel will take much more time than the copy itself.

with the shm/futex approach you can also have a ring buffer to handle
parallelism better while it's at the same time zerocopy and entirely
userspace based in the best case (though that's not the common case).
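
A hedged sketch of such a ring (single producer, single consumer, fixed-size
slots, all names invented; it assumes the struct sits in a shared mapping
like the /dev/shm one sketched earlier, and a real version would add
futex-based sleeping for the full/empty cases instead of just returning 0):

#include <stdint.h>
#include <string.h>

#define RING_SLOTS 64                   /* power of two */
#define SLOT_BYTES 256

/* head is only written by the producer, tail only by the consumer;
 * both are free-running counters taken modulo RING_SLOTS. */
struct ring {
        uint32_t head;                  /* next slot to fill */
        uint32_t tail;                  /* next slot to drain */
        char slot[RING_SLOTS][SLOT_BYTES];
};

/* Returns 0 if the ring is full; len must be <= SLOT_BYTES. */
int ring_put(struct ring *r, const void *msg, size_t len)
{
        uint32_t h = r->head;           /* own index, plain read is fine */

        if (h - __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE) == RING_SLOTS)
                return 0;               /* full: a real one would futex-wait */
        memcpy(r->slot[h % RING_SLOTS], msg, len);
        __atomic_store_n(&r->head, h + 1, __ATOMIC_RELEASE);
        return 1;
}

/* Returns 0 if the ring is empty. */
int ring_get(struct ring *r, void *buf, size_t len)
{
        uint32_t t = r->tail;

        if (__atomic_load_n(&r->head, __ATOMIC_ACQUIRE) == t)
                return 0;               /* empty: a real one would futex-wait */
        memcpy(buf, r->slot[t % RING_SLOTS], len);
        __atomic_store_n(&r->tail, t + 1, __ATOMIC_RELEASE);
        return 1;
}

In the common case each side only touches its own index and the slot data,
so no system call is made at all.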

thanks,

Andrea

/*
* If you refuse to depend on closed software for a critical
* part of your business, these links may be useful:
*
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
* http://www.cobite.com/cvsps/
*
* svn://svn.kernel.org/linux-2.6/trunk
* svn://svn.kernel.org/linux-2.4/trunk
*/

2003-09-10 18:12:49

by Martin Konold

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

On Wednesday 10 September 2003 08:01 pm, Andrea Arcangeli wrote:

Hi Andrea,

> > The idea is while DMA has much higher bandwidth than PIO DMA is more
> > expensive to initiate than PIO. So DMA is only useful for large messages.
>
> agreed.
>
> > In the local SMP case there do exist userspace APIs like MPI which can do
>
> btw, so far we were only discussing IPC in a local box (UP or SMP or
> NUMA) w/o networking involved. Luca's currnet implementation as well was
> only working locally.

Yes, and I claim that the best you can get for large messages is a plain
single copy userspace implementation as already implemented by some people
using the MPI API.

> > True zero copy has unlimited (sigh!) bandwidth within an SMP and does not
> > really make sense in contrast to a network.
>
> if you can avoid to enter kernel, you'd better do that, because entering
> kernel will take much more time than the copy itself.

Yes, doing HPC the kernel may only be used to initialize the initial
communication channels (e.g. handling permissions etc.). The kernel must be
avoided by all means for the actual communication.

> with the shm/futex approach you can also have a ring buffer to handle
> parallelism better while it's at the same time zerocopy

How fast will you get? I think you will get the bandwidth of a memcpy for
large chunks?!

This is imho not really zerocopy. The data has to travel over the memory bus
involving the CPU so I would call this single copy ;-)

> and entirely
> userspace based in the best case (though that's not the common case).

Yes, a userspace library using the standard MPI API is the proven best
approach, and one is freely downloadable from

http://www.pccluster.org/score/dist/pub/score-5.4.0/source/score-5.4.0.mpi.tar.gz
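
For reference, the user-visible side of that approach is just the standard
MPI point-to-point calls; whether a given implementation (the SCore MPI
above, MPICH, LAM, ...) does the local transfer with a single copy through
shared memory is internal to the library. A minimal two-process example
(illustrative only):

#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
        static char buf[1 << 20];       /* 1 MB message */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
                memset(buf, 'x', sizeof(buf));
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
}

Build with mpicc and run both ranks on the same SMP box, e.g.
mpirun -np 2 ./a.out, for the local case discussed here.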

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: [email protected]

2003-09-10 18:08:29

by Larry McVoy

[permalink] [raw]
Subject: Inappropriate signatures

Dear Andrea,

We're providing the service which enables all of the below and without
our good will that service is at risk. You're publicly slamming the
providers of that service.

If you want to do that, that's your right, but that leads us to ask:

- wasn't the deal that we do the gateway and you stop whining?
- why should we provide this gateway service at all if there is whining?
- why shouldn't we firewall you off since you are whining?
- or if firewalling fails, put in a day or two delay in the gateway?

> /*
> * If you refuse to depend on closed software for a critical
> * part of your business, these links may be useful:
> *
> * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
> * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
> * http://www.cobite.com/cvsps/
> *
> * svn://svn.kernel.org/linux-2.6/trunk
> * svn://svn.kernel.org/linux-2.4/trunk
> */
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-10 18:30:56

by John Bradford

[permalink] [raw]
Subject: Re: Inappropriate signatures

> Dear Andrea,
>
> We're providing the service which enables all of the below and without
> our good will that service is at risk. You're publicly slamming the
> providers of that service.
>
> If you want to do that, that's your right, but that leads us to ask:
>
> - wasn't the deal that we do the gateway and you stop whining?
> - why should we provide this gateway service at all if there is whining?
> - why shouldn't we firewall you off since you are whining?
> - or if firewalling fails, put in a day or two delay in the gateway?
>
> > /*
> > * If you refuse to depend on closed software for a critical
> > * part of your business, these links may be useful:
> > *
> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
> > * http://www.cobite.com/cvsps/
> > *
> > * svn://svn.kernel.org/linux-2.6/trunk
> > * svn://svn.kernel.org/linux-2.4/trunk
> > */

Maybe you are mis-interpreting the signature - the information is
'commented out', in the same way that obsolete code is often commented
out. It could be argued that Andrea used to think that way, but no
longer does, and is trying to spread the message that criticising your
product is 'so last year', and no longer trendy. We need
clarification on this.

John.

2003-09-10 18:32:02

by Chris Friesen

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

Martin Konold wrote:
> On Wednesday 10 September 2003 08:01 pm, Andrea Arcangeli wrote:

>>with the shm/futex approch you can also have a ring buffer to handle
>>parallelism better while it's at the same time zerocopy
>>
>
> How fast will you get? I think you will get the bandwidth of a memcpy for
> large chunks?!
>
> This is imho not really zerocopy. The data has to travel over the memory bus
> involving the CPU so I would call this single copy ;-)

Even with zerocopy, you have to build the message somehow. If process A
builds a message and passes it to process B, then isn't that as
efficient as you can get with message passing? How does MPI do any better?

However, if both processes directly map the same memory and use it
directly (without "messages" as such), then I would see *that* as zerocopy.

Chris

--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2003-09-10 18:42:32

by Manfred Spraul

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

>
>
>Hi Luca,
>There was a zero-copy pipe implementation floating around a while ago
>I think. Did you have a look at that? IIRC it had advantages and
>disadvantages over regular pipes in performance.
>
It doesn't have any performance disadvantages over regular pipes.
The problem is that it only helps for lmbench. Real-world apps use
buffered I/O, which uses PAGE_SIZEd buffers, which mandate certain
atomicity requirements, which limit the gains that can be achieved.
If someone wants to try it again I can search for my patches.
Actually there are two independent improvements that are possible for
the pipe code:
- zero-copy. The name is misleading. In reality this does a single copy:
the sender creates a list of the pages with the data instead of copying
the data to kernel buffers. The receiver copies from the pages directly
to user space. The main advantage is that this implementation avoids one
copy without requiring any tlb flushes. It works with arbitrarily aligned
user space buffers.
- use larger kernel buffers. Linux uses a 4 kB internal buffer. The
result is a very simple implementation that causes an incredible number
of context switches. Larger buffers reduce that. The problem is
accounting, and avoiding using too much memory for the pipe buffers. A
patch without accounting should be somewhere. The BSD unices have larger
pipe buffers, and a few pipes with huge buffers [perfect for lmbench
bragging].

Luca: An algorithm that uses page flipping, or requires tlb flushes,
probably performs badly in real-world scenarios. First you have the cost
of the tlb flush, then the inter-processor interrupt to flush all cpus
on SMP. And if the target app doesn't expect that you use page flipping,
then you get a page fault [with another tlb flush] to break the COW
share. OTOH if you redesign the app for page flipping, then it could
also use sysv shm, and send a short message with a pointer to the shm
segment.
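
A hedged toy of that last alternative (struct desc and everything else here
is invented): the bulk payload sits in a SysV shm segment and only a small
descriptor travels through the pipe, so the pipe copy stays a few bytes no
matter how big the payload is:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

struct desc { int shmid; size_t off; size_t len; };

int main(void)
{
        int fd[2];
        int shmid = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
        char *base = shmat(shmid, NULL, 0);

        pipe(fd);
        if (fork() == 0) {                      /* receiver */
                struct desc d;
                char *msg;

                read(fd[0], &d, sizeof(d));     /* short control message */
                /* a separate receiver would attach via the id it received;
                 * the fork()ed child also inherits the attachment anyway */
                msg = (char *) shmat(d.shmid, NULL, 0) + d.off;
                printf("got %zu bytes starting with \"%.5s\"\n", d.len, msg);
                return 0;
        }

        /* sender: payload goes into shm, only the descriptor into the pipe */
        strcpy(base + 128, "hello through the shm segment");
        struct desc d = { shmid, 128, strlen(base + 128) + 1 };
        write(fd[1], &d, sizeof(d));
        wait(NULL);
        shmctl(shmid, IPC_RMID, NULL);
        return 0;
}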

--
Manfred

2003-09-10 18:52:48

by Jamie Lokier

[permalink] [raw]
Subject: Re: Inappropriate signatures

> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
> > * http://www.cobite.com/cvsps/

Oh, thanks! I didn't know about those links and have been using the
daily snapshots from http://www.kernel.org's front page.

Those URLs (*and* BitKeeper ones) really should appear on or near the
http://www.kernel.org front page, near all the list of snapshots etc.

-- Jamie

2003-09-10 19:20:17

by Shawn

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

On Wed, 2003-09-10 at 05:09, Luca Veraldi wrote:
> > For fun do the measurement on a pIV cpu. You'll be surprised.
> > The microcode "mark dirty" (which is NOT a btsl, it gets done when you do
> > a write memory access to the page content) result will be in the 2000 to
> > 4000 range I predict.
>
> I'm not responsible for microarchitecture designer stupidity.
> If a simple STORE assembler instruction will eat up 4000 clock cycles,
> as you say here, well, I think all we Computer Scientists can go home and
> give it up now.
Unfortunately you are responsible for working with the constructs reality
has given you. Giving up is childish. Where there is a loss, there's a
chance it was a tradeoff for a win elsewhere.

2003-09-10 19:29:43

by Pavel Machek

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

Hi!

> > > The overhead implied by a memcpy() is the same, in the oder of magnitude,
> > > ***whatever*** kernel version you can develop.
> >
> > yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
> > uncached case. A pagetable operation (like the cpu setting the accessed
> > or dirty bit) is in that same order I suspect (maybe half this, but not
> > a lot less). Changing pagetable content is even more because all the
> > tlb's and internal cpu state will need to be flushed... which is also a
> > microcode operation for the cpu. And it's deadly in an SMP environment.
>
> I have just done a measurement on a 366MHz PII Celeron. The test is
> quite crude, but these numbers are upper bounds on the timings:
>
> mean 813 cycles to:
>
> page fault
> install zero page
> remove zero page
>
> (= read every page from MAP_ANON region and unmap, repeatedly)
>
> mean 11561 cycles to:
>
> page fault
> copy zero page
> install copy
> remove copy
>
> (= write every page from MAP_ANON region and unmap, repeatedly)
>
> I think that makes a clear case for avoiding page copies.

Can you make it available so we can test on, say, a 900MHz athlon? Or
you can have it tested on an 1800MHz athlon64, that's about as high end
as it can get.

Pavel

--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-09-10 19:54:53

by Joe Perches

[permalink] [raw]
Subject: rsync head? [was inappropriate signatures]

On Wed, 2003-09-10 at 11:08, Larry McVoy wrote:

> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/

cvs/bk/svn/bz2? How about just plain text instead?

How about an rsync to the latest uncompressed source tree?

rsync.kernel.org::pub/linux/kernel/head?


2003-09-10 19:41:37

by Jamie Lokier

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

Pavel Machek wrote:
> Can you make it available so we can test on, say, 900MHz athlon? Or
> you can have it tested on 1800MHz athlon64, that's about as high end
> as it can get.

I just deleted the program, so here's a rewrite :)

#include <sys/mman.h>

int main()
{
        int i, j;
        for (j = 0; j < 64; j++) {
                volatile char * ptr =
                        mmap (0, 4096 * 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANON, -1, 0);
                for (i = 0; i < 4096; i++) {
#if 1
                        *(ptr + 4096 * i) = 0; /* Write */
#else
                        (void) *(ptr + 4096 * i); /* Read */
#endif
                }
                munmap ((void *) ptr, 4096 * 4096);
        }
        return 0;
}

Smallest results, from "gcc -o test test.c -O2; time ./test" on a
1500MHz dual Athlon 1800 MP:

Write:
real 0m1.316s
user 0m0.059s
sys 0m1.256s

==> 7531 cycles per page

Read:
real 0m0.199s
user 0m0.053s
sys 0m0.146s

==> 1139 cycles per page

As I said, it's a crude upper bound.
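
(Presumably the cycles-per-page figures are just elapsed time times clock
rate divided by the number of pages touched: 64 iterations x 4096 pages =
262144 faults, so 1.316 s * 1.5e9 Hz / 262144 is about 7530 cycles for the
write case and 0.199 s * 1.5e9 Hz / 262144 is about 1140 for the read case,
consistent with the numbers above.)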

-- Jamie

2003-09-10 20:05:45

by Rik van Riel

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

On Wed, 10 Sep 2003, Luca Veraldi wrote:

> I'm not responsible for microarchitecture designer stupidity.
> If a simple STORE assembler instruction will eat up 4000 clock cycles,
> as you say here, well,

If current trends continue, an L2 cache miss will be
taking 5000 cycles in 15 to 20 years.

> I think all we Computer Scientists can go home and give it up now.

While I have seen some evidence of computer scientists
going home and ignoring the problems presented to them
by current hardware constraints, I'd really prefer it
if they took up the challenge and did the research on
how we should deal with hardware in the future.

In fact, I've made up a little (incomplete) list of
things that I suspect are in need of serious CS research,
because otherwise both OS theory and practice will be
unable to deal with the hardware of the next decade.

http://surriel.com/research_wanted/

If you have any suggestions for the list, please let
me know.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2003-09-10 21:35:54

by Pavel Machek

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

Hi!

> > Can you make it available so we can test on, say, 900MHz athlon? Or
> > you can have it tested on 1800MHz athlon64, that's about as high end
> > as it can get.
>
> I just deleted the program, so here's a rewrite :)
>
> #include <sys/mman.h>
>
> int main()
> {
>         int i, j;
>         for (j = 0; j < 64; j++) {
>                 volatile char * ptr =
>                         mmap (0, 4096 * 4096, PROT_READ | PROT_WRITE,
>                               MAP_PRIVATE | MAP_ANON, -1, 0);
>                 for (i = 0; i < 4096; i++) {
> #if 1
>                         *(ptr + 4096 * i) = 0; /* Write */
> #else
>                         (void) *(ptr + 4096 * i); /* Read */
> #endif
>                 }
>                 munmap ((void *) ptr, 4096 * 4096);
>         }
>         return 0;
> }
>
> Smallest results, from "gcc -o test test.c -O2; time ./test" on a
> 1500MHz dual Athlon 1800 MP:
>
> Write:
> real 0m1.316s
> user 0m0.059s
> sys 0m1.256s
>
> ==> 7531 cycles per page
>
> Read:
> real 0m0.199s
> user 0m0.053s
> sys 0m0.146s
>
> ==> 1139 cycles per page
>
> As I said, it's a crude upper bound.

But you should also time the memory remapping stuff on the same cpu, no?

Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-09-10 22:07:03

by Jamie Lokier

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

Pavel Machek wrote:
> But you should also time memory remapping stuff on same cpu, no?

Yes, but can I be bothered ;)
Like I said, it's a crude upper bound.

We know the remapping time will be about the same in both tests,
because all it does is the same vma operations and the same TLB
invalidations.

-- Jamie

2003-09-12 18:19:58

by Timothy Miller

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux



Luca Veraldi wrote:

>
> Sorry, but I cannot believe it.
> Reading a page table entry and storing it into a struct capability is not
> comparable at all with the "for" needed to move bytes all around memory.


Pardon my ignorance here, but the impression I get is that changing a
page table entry is not as simple as just writing to a bit somewhere. I
suppose it is if the page descriptor is not loaded into the TLB, but if
it is, then you have to ensure that the TLB entry matches the page
table; this may not be a quick operation.

I can think of a lot of other possible complications to this.

2003-09-12 19:01:22

by Luca Veraldi

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

> Pardon my ignorance here, but the impression I get is that changing a
> page table entry is not as simple as just writing to a bit somewhere. I
> suppose it is if the page descriptor is not loaded into the TLB, but if
> it is, then you have to ensure that the TLB entry matches the page
> table; this may not be a quick operation.
>
> I can think of a lot of other possible complications to this.

I stopped writing about this question a long time ago.
However, here we are.

If you modify the page tables as in my example (and if you do so only
for B's pagetable),
you can be sure the things you're modifying were not present in the hardware
TLBs, not yet.

Because those pagetable entries refer to a logical address interval
you've just allocated in B's address space.

Bye,
Luca

2003-09-12 22:39:06

by Alan

[permalink] [raw]
Subject: Re: Efficient IPC mechanism on Linux

On Gwe, 2003-09-12 at 20:05, Luca Veraldi wrote:
> If you modify the page tables as in my example (and if you do so only
> for B's pagetable),
> you can be sure the things you're modifying were not present in the hardware
> TLBs, not yet.
>
> Because those pagetable entries refer to a logical address interval
> you've just allocated in B's address space.

They may be present in some form, but you are right that the task switch will
flush them in the non-SMP case on the x86 platform. In the SMP case it
won't work.

Alan

2003-09-13 15:40:08

by Pavel Machek

[permalink] [raw]
Subject: Re: Inappropriate signatures

Hi!

> We're providing the service which enables all of the below and without
> our good will that service is at risk.

Eh? You are providing a service and he provides you advertising. I
can't see anything about slamming you. I guess that Andrea does not
think about kernel development as a business, so it is not really
targeted at you.

But anyway it's *his* signature.
Pavel

> > /*
> > * If you refuse to depend on closed software for a critical
> > * part of your business, these links may be useful:
> > *
> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
> > * http://www.cobite.com/cvsps/
> > *
> > * svn://svn.kernel.org/linux-2.6/trunk
> > * svn://svn.kernel.org/linux-2.4/trunk
> > */

--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-09-13 16:50:11

by Larry McVoy

[permalink] [raw]
Subject: Re: Inappropriate signatures

Other people may not agree with your view on this one Pavel. Most people
read that signature and saw it as a negative comment about BitKeeper.

People have told me they believe that if BitMover isn't getting benefit
from the free use of BK, free BK will go away. People don't want to have
to depend on my goodwill, they want BitMover to derive benefit so that
my goodwill doesn't matter, it's just smart business to give BK away.

If that is really what people here think then that means there has to
be some benefit for BitMover. One of the benefits is that we get to say
that the kernel team uses it and get some marketing advantage out of that.
But that benefit diminishes if what people are saying is negative.

It makes sense that people are uneasy about depending on BK given the
amount of work it takes to provide it and the amount of grief we take
for providing it. I don't know why I didn't see that earlier, it's an
unstable situation.

I know there are some people who will never be happy until everything
is GPLed, I can't help those people other than provide the gateway.
In return, those people need to stop whining, the gateway has to be
enough.

For the rest of the people, I'm looking for suggestions on how to make
this situation more stable. It took me a while but I can see why you
are nervous, I'd be nervous in your position. I'm nervous about doing
any real marketing of the kernel's use of BK because I figured it would
lead to more flame wars. I'm starting to think that if we were doing
that it might actually lead to less flames, based on the theory that we
would then need you so you continue to get BK for free. If you have an
opinion on that I'd like to know it.

On Sat, Sep 13, 2003 at 05:39:56PM +0200, Pavel Machek wrote:
> Hi!
>
> > We're providing the service which enables all of the below and without
> > our good will that service is at risk.
>
> Eh? You are providing a service and he provides you advertising. I
> can't see anything about slamming you. I guess that Andrea does not
> think about kernel development as a business, so it is not really
> targeted at you.
>
> But anyway it's *his* signature.
> Pavel
>
> > > /*
> > > * If you refuse to depend on closed software for a critical
> > > * part of your business, these links may be useful:
> > > *
> > > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
> > > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
> > > * http://www.cobite.com/cvsps/
> > > *
> > > * svn://svn.kernel.org/linux-2.6/trunk
> > > * svn://svn.kernel.org/linux-2.4/trunk
> > > */
>
> --
> When do you have a heart between your knees?
> [Johanka's followup: and *two* hearts?]

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-09-17 09:52:44

by Terje Eggestad

[permalink] [raw]
Subject: Re: Rik's list of CS challenges

I got one:

I've been paying attention to MRAM
http://www.research.ibm.com/resources/news/20030610_mram.shtml (and the
other NV RAMs coming up, most notably the optical/polymer
http://www.pcw.co.uk/news/1143633 techs).

What interests me is that with multiple candidates in the pipeline, there is
a high probability that we'll have a non-volatile DRAM replacement
hitting the mainstream computer market within 5-10 years.

You have the obvious PDA/Laptop application, which I'll not dwell on.

You have the HA server application where you want fast reboot on power
failure etc.

I'd expect that there will be a demand for never rebooting, only seamlessly
continuing. Booting is after all a "hack" that gives you
- a way around OS bugs
- an easy way of detecting new HW and configuring the OS according to it
- an easy way to upgrade the kernel

Hotplug demands are already rendering point 2 obsolete.
Upgrading the kernel while running would be a cool feature; it was one of the
promises of the microkernel archs in the early 90's.

Is it possible to "reboot" the kernel without affecting the running apps
on the machine?



What becomes more interesting is that while you may have NV RAM, it's not
likely that MRAM is viable on the processor chip. The manufacturing
process may be too expensive, or outright impossible (polymers on chips
that hold 80 degrees C are not likely), leaving you with volatile
registers and cache but NV main RAM.
- should the OS now schedule "paging" between cache and RAM?
- atomic syncing/updates of regs/cache to RAM? (checkpointing)

A merge of FS and RAM? (didn't the AS/400 have mmap'ed disks?)


And all the applications I haven't even thought of.

TJ


On Wed, 2003-09-10 at 22:05, Rik van Riel wrote:
> On Wed, 10 Sep 2003, Luca Veraldi wrote:
>
> > I'm not responsible for microarchitecture designer stupidity.
> > If a simple STORE assembler instruction will eat up 4000 clock cycles,
> > as you say here, well,
>
> If current trends continue, a L2 cache miss will be
> taking 5000 cycles in 15 to 20 years.
>
> > I think all we Computer Scientists can go home and give it up now.
>
> While I have seen some evidence of computer scientists
> going home and ignoring the problems presented to them
> by current hardware constraints, I'd really prefer it
> if they took up the challenge and did the research on
> how we should deal with hardware in the future.
>
> In fact, I've made up a little (incomplete) list of
> things that I suspect are in need of serious CS research,
> because otherwise both OS theory and practice will be
> unable to deal with the hardware of the next decade.
>
> http://surriel.com/research_wanted/
>
> If you have any suggestions for the list, please let
> me know.
--

Terje Eggestad
Senior Software Engineer
dir. +47 22 62 89 61
mob. +47 975 31 57
fax. +47 22 62 89 51
[email protected]

Scali - http://www.scali.com
High Performance Clustering

2003-09-17 13:41:51

by Alan

[permalink] [raw]
Subject: Re: Rik's list of CS challenges

On Mer, 2003-09-17 at 10:52, Terje Eggestad wrote:
> What becomes more interesting is that while you may have NV RAM, it's not
> likely that MRAM is viable on the processor chip. The manufacturing
> process may be too expensive, or outright impossible (polymers on chips
> that hold 80 degrees C are not likely), leaving you with volatile
> registers and cache but NV main RAM.

We effectively handle that case now with the suspend-to-ram feature.

> A merge of FS and RAM? (didn't the AS/400 have mmap'ed disks?)

Persistent storage systems. These tend to look very unlike Linux because
they throw out the idea of a file system as such. The issues with
debugging them if they break, and with backups, make my head hurt, but other
folk seem to think those are solved problems.