2000-10-29 23:23:57

by Jeff Merkey

Subject: 2.2.18Pre Lan Performance Rocks!


Alan,

I don't know what all changes you guys did to 2.2.18pre, but the LAN I/O
performance absolutely rocks on our tests here vs. NetWare. Good job.
It's still got some problems with NFS (I am seeing a few RPC timeout
errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
but it's most impressive.

:-)

Jeff


2000-10-30 01:39:16

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > It's still got some problems with NFS (I am seeing a few RPC timeout
> > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > but it's most impressive.
>
> Can you send a summary of the NFS reports to [email protected]

Yes. I just went home, so I am emailing from my house. I'll post late
tonight or in the morning. Performance on 100Mbit with NFS going
Linux->Linux is getting better throughput than IPX NetWare Client ->
NetWare 5.x on the same network by about 3%. When you start loading up a
Linux server, it drops off sharply and NetWare keeps scaling; however,
this does indicate that the LAN code paths are equivalent relative to
latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
(but we'll fix this in Linux next). I think the ring transitions to
user space daemons are what are causing the scaling problems
Linux vs. NetWare.

:-)

Jeff

2000-10-30 06:47:31

by Andi Kleen

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > but it's most impressive.
> >
> > Can you send a summary of the NFS reports to [email protected]
>
> Yes. I just went home, so I am emailing from my house. I'll post late
> tonight or in the morning. Performance on 100Mbit with NFS going
> Linux->Linux is getting better throughput than IPX NetWare Client ->
> NetWare 5.x on the same network by @ 3%. When you start loading up a
> Linux server, it drops off sharply and NetWare keeps scaling, however,
> this does indicate that the LAN code paths are equivalent relative to
> latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> (but we'll fix this in Linux next). I think the ring transitions to
> user space daemons are what are causing the scaling problems
> Linux vs. NetWare.

There are no user space daemons involved in the knfsd fast path, only in slow paths
like mounting.
The main problem I think in knfsd is the numerous copies of the data (e.g. two
copies plus checksumming for RX with fragments, up to four in some specific
configurations). They're unfortunately not trivial to fix. TX is a bit better;
it usually does only one copy out of the page cache. For RX it also helps to
have a network card that supports hardware checksumming.
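
To make the copy cost concrete: a toy user-space sketch (not the kernel's
actual code, which does this in tuned asm via the csum_partial_copy* family)
of why folding the Internet checksum into the copy loop matters -- the data
is touched once instead of twice, so a whole memory pass disappears:

#include <stdint.h>
#include <stddef.h>

/* Toy combined copy+checksum loop (illustrative only). */
static uint32_t copy_and_csum(uint16_t *dst, const uint16_t *src,
                              size_t len, uint32_t sum)
{
        while (len >= 2) {
                uint16_t w = *src++;
                *dst++ = w;        /* the copy ...              */
                sum += w;          /* ... and the checksum, in  */
                len -= 2;          /*     the same pass         */
        }
        if (len) {                 /* trailing odd byte         */
                uint8_t b = *(const uint8_t *)src;
                *(uint8_t *)dst = b;
                sum += b;
        }
        while (sum >> 16)          /* fold carries into 16 bits */
                sum = (sum & 0xffff) + (sum >> 16);
        return sum;
}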



-Andi

2000-10-30 07:02:07

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > but it's most impressive.
> > >
> > > Can you send a summary of the NFS reports to [email protected]
> >
> > Yes. I just went home, so I am emailing from my house. I'll post late
> > tonight or in the morning. Performance on 100Mbit with NFS going
> > Linux->Linux is getting better throughput than IPX NetWare Client ->
> > NetWare 5.x on the same network by @ 3%. When you start loading up a
> > Linux server, it drops off sharply and NetWare keeps scaling, however,
> > this does indicate that the LAN code paths are equivalent relative to
> > latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> > (but we'll fix this in Linux next). I think the ring transitions to
> > user space daemons are what are causing the scaling problems
> > Linux vs. NetWare.
>
> There are no user space daemons involved in the knfsd fast path, only in slow paths
> like mounting.

So why is it spawning off nfsd servers in user space? I did notice that all
the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
looks to be pretty fast. I know about the checksumming problem with TCP/IP.
But cycles are cheap on today's processors, so even with this overhead, it
could still get faster. IPX uses small packets that are less wire-efficient,
since the ratio of header size to payload is larger than what NFS in Linux
is doing, plus there's more of them for an equivalent data transfer, even
with packet burst. IPX would tend to be faster if there were multiple
routers involved, since the latency of smaller packets would be less and
IPX never has to deal with the problem of fragmentation.
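
To put rough numbers on the wire-efficiency point (these are illustrative
assumptions -- a 30-byte IPX header on 576-byte packets versus an 8 KB NFS
read over UDP fragmented at a 1500-byte MTU -- not measurements from this
thread):

#include <stdio.h>

int main(void)
{
        int ipx_hdr = 30, ipx_pkt = 576;              /* assumed    */
        int ip_hdr = 20, udp_hdr = 8;                 /* RFC sizes  */
        int mtu = 1500, nfs_read = 8192;              /* assumed    */

        double ipx_eff = (double)(ipx_pkt - ipx_hdr) / ipx_pkt;

        int per_frag = mtu - ip_hdr;                  /* 1480 bytes */
        int frags = (nfs_read + udp_hdr + per_frag - 1) / per_frag;
        double udp_eff = (double)nfs_read /
                         (nfs_read + udp_hdr + frags * ip_hdr);

        printf("IPX payload efficiency:     %4.1f%%\n", 100.0 * ipx_eff);
        printf("NFS/UDP payload efficiency: %4.1f%% (%d fragments)\n",
               100.0 * udp_eff, frags);
        return 0;
}

So per byte moved, the big fragmented datagrams win, which is the "less
wire-efficient" half of the argument; the lower per-packet latency and
no-reassembly point is the other half.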

Is there any CR3 reloading or task switching going on for each interrupt
that comes in, that you know of, BTW? EMON stats show the server's bus
utilization gets disproportionately higher on Linux than on NetWare 5.x, and
when it hits 60% of total clock cycles, Linux starts dropping off. NetWare 5.x
has 1/8 the bus utilization, which would suggest clocks are being used for
something other than pushing packets in and out of the box (the checksumming
is done in the tcp code when it copies the fragments, which is very low
overhead).

Jeff


> The main problem I think in knfsd are the numerous copies of the data (e.g. 2+checksumming for
> RX with fragments, upto 4 in some specific configurations). They're unfortunately
> not trivial to fix. TX is a bit better, it does only one copy usually out of
> the page cache. For RX it also helps to have a network card that supports hardware
> checksumming.
>
>
>
> -Andi

2000-10-30 07:09:18

by Andi Kleen

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Sun, Oct 29, 2000 at 11:58:21PM -0700, Jeff V. Merkey wrote:
> On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> > On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > > but it's most impressive.
> > > >
> > > > Can you send a summary of the NFS reports to [email protected]
> > >
> > > Yes. I just went home, so I am emailing from my house. I'll post late
> > > tonight or in the morning. Performance on 100Mbit with NFS going
> > > Linux->Linux is getting better throughput than IPX NetWare Client ->
> > > NetWare 5.x on the same network by @ 3%. When you start loading up a
> > > Linux server, it drops off sharply and NetWare keeps scaling, however,
> > > this does indicate that the LAN code paths are equivalent relative to
> > > latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> > > (but we'll fix this in Linux next). I think the ring transitions to
> > > user space daemons are what are causing the scaling problems
> > > Linux vs. NetWare.
> >
> > There are no user space daemons involved in the knfsd fast path, only in slow paths
> > like mounting.
>
> So why is it spawning off nfsd servers in user space? I did notice that all

They just provide a process context; the actual work is only done in kernel mode.
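
A rough sketch of that pattern (the helper names here are made up for
illustration; the real code lives in fs/nfsd and net/sunrpc): the user-space
nfsd program makes one syscall that never returns, and from then on the
process just donates its context to a kernel loop.

/* Illustrative only -- not the actual 2.2 sources. */
static int nfsd_service_loop(void)
{
        for (;;) {
                struct svc_rqst *rqstp;

                rqstp = svc_wait_for_request();    /* sleeps in kernel   */
                if (rqstp)
                        svc_handle_request(rqstp); /* decode, I/O, reply */
        }
}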

> the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
> looks to be pretty fast. I know about the checksumming problem with TCP/IP.
> But cycles are cheap on today's processors, so even with this overhead, it
> could still get faster. IPX uses small packets that are less wire efficient
> since the ratio of header size to payload is larger than what NFS in Linux
> is doing, plus there's more of them for an equivalent data transfer, even
> with packet burst. IPX would tend to be faster if there were multiple
> routers involved, since the latency of smaller packets would be less and
> IPX never has to deal with the problem of fragmentation.
>
> Is there any CR3 reloading or task switching going on for each interrupt
> that comes in, that you know of, BTW? EMON stats show the server's bus utilization

No, interrupts do not change CR3.

One problem in Linux 2.2 is that kernel threads reload their VM on context switch
(that would include the nfsd threads); this should be fixed in 2.4 with lazy mm.
Hmm, actually it should only be fixed for true kernel threads that have been started
with kernel_thread(); the "pseudo kernel threads" like nfsd uses probably do not
get that optimization, because they don't set their MM to init_mm.

> to get disproportionately higher in Linux than NetWare 5.x and when it hits
> 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8

I think that can be explained by the copying.

To be sure, you could e.g. use the ktrace patch from IKD; it will give you
cycle-accurate traces of function execution.

> the bus utilization, which would suggest clocks are being used for something
> other than pushing packets in and out of the box (the checksumming is done
> in the tcp code when it copies the fragments, which is very low overhead).

No, unfortunately not. NFS uses UDP and the UDP defragmenting doesn't do copy-checksum
currently.


-Andi

2000-10-30 07:17:32

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Andi Kleen wrote:

> One problem in Linux 2.2 is that kernel threads reload their VM on
> context switch (that would include the nfsd thread), this should be
> fixed in 2.4 with lazy mm. Hmm actually it should be only fixed for
> true kernel threads that have been started with kernel_thread(), the
> "pseudo kernel threads" like nfsd uses probably do not get that
> optimization because they don't set their MM to init_mm.

yes, but for this there is an explicit mechanism to lazy-MM during lengthy
system calls; an example is in buffer.c:

user_mm = start_lazy_tlb();
error = sync_old_buffers();
end_lazy_tlb(user_mm);
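
and in 2.4 the scheduler does the lazy part itself; the context-switch logic
is shaped roughly like this (a paraphrase of 2.4's schedule(), not an exact
quote):

/* A task without its own mm (a kernel thread) borrows the previous
   task's active_mm, so CR3 is left alone; CR3 only changes when the
   next task really has a different address space. */
if (!next->mm) {                               /* kernel thread     */
        next->active_mm = prev->active_mm;     /* borrow, no switch */
        atomic_inc(&prev->active_mm->mm_count);
        enter_lazy_tlb(prev->active_mm, next, cpu);
} else {
        if (next->active_mm != prev->active_mm)
                switch_mm(prev->active_mm, next->mm, next, cpu);
}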

> > to get disproportionately higher in Linux than NetWare 5.x and when it hits
> > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
>
> I think that can be explained by the copying.

yes. Constant copying contaminates the L1/L2 caches and creates dirty
cachelines all around the place. Fixed in 2.4 + TUX ;-)

Ingo

2000-10-30 07:20:32

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 08:08:58AM +0100, Andi Kleen wrote:
> On Sun, Oct 29, 2000 at 11:58:21PM -0700, Jeff V. Merkey wrote:
> > On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> > > On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > > > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > > > but it's most impressive.
> > > > >
> > > > > Can you send a summary of the NFS reports to [email protected]
> > > >
> > > > Yes. I just went home, so I am emailing from my house. I'll post late
> > > > tonight or in the morning. Performance on 100Mbit with NFS going
> > > > Linux->Linux is getting better throughput than IPX NetWare Client ->
> > > > NetWare 5.x on the same network by @ 3%. When you start loading up a
> > > > Linux server, it drops off sharply and NetWare keeps scaling, however,
> > > > this does indicate that the LAN code paths are equivalent relative to
> > > > latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> > > > (but we'll fix this in Linux next). I think the ring transitions to
> > > > user space daemons are what are causing the scaling problems
> > > > Linux vs. NetWare.
> > >
> > > There are no user space daemons involved in the knfsd fast path, only in slow paths
> > > like mounting.
> >
> > So why is it spawning off nfsd servers in user space? I did notice that all
>
> They just provide a process context, the actual work is only done in kernel mode.
>
> > the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
> > looks to be pretty fast. I know about the checksumming problem with TCP/IP.
> > But cycles are cheap on today's processors, so even with this overhead, it
> > could still get faster. IPX uses small packets that are less wire efficient
> > since the ratio of header size to payload is larger than what NFS in Linux
> > is doing, plus there's more of them for an equivalent data transfer, even
> > with packet burst. IPX would tend to be faster if there were multiple
> > routers involved, since the latency of smaller packets would be less and
> > IPX never has to deal with the problem of fragmentation.
> >
> > Is there any CR3 reloading or task switching going on for each interrupt
> > that comes in, that you know of, BTW? EMON stats show the server's bus utilization
>
> No, interrupts do not change CR3.
>
> One problem in Linux 2.2 is that kernel threads reload their VM on context switch
> (that would include the nfsd thread), this should be fixed in 2.4 with lazy mm.

When you say it reloads its VM, you mean it reloads the CR3 register?
This will cost you about 15 clocks up front, and about 150 clocks over
time as each TLB is reloaded (plus it will do LOCK# assertions invisibly
underneath for PTE fetches). One major difference is that in NetWare,
CR3 does not ever get changed in any network I/O fast paths, and none of
the TCP or IPX processes exist in user space, so there's never this
background overhead with page tables being loaded in and out. CR3 only
gets mucked with when someone allocates memory and pages in an address
frame (when memory is allocated and freed).

The user space activity is what's causing this, plus the copying. Is there
an easy way to completely disable multiple PTE/PDE address spaces and just
map the entire address space linearly (like NetWare) with a compile option,
so I could do a true apples-to-apples comparison?

Jeff


> Hmm actually it should be only fixed for true kernel threads that have been started
> with kernel_thread(), the "pseudo kernel threads" like nfsd uses probably do not
> get that optimization because they don't set their MM to init_mm.
>
> > to get disproportionately higher in Linux than NetWare 5.x and when it hits
> > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
>
> I think that can be explained by the copying.

It's more. I am seeing tons of segment register reloads, which are very
heavy, and atomic reads of PTE entries.

>
> To be sure you could e.g. use the ktrace patch from IKD, it will give you
> cycle accurate traces of execution of functions

I'll look at this.

> > the bus utilization, which would suggest clocks are being used for something
> > other than pushing packets in and out of the box (the checksumming is done
> > in the tcp code when it copies the fragments, which is very low overhead).
>
> No, unfortunately not. NFS uses UDP and the UDP defragmenting doesn't do copy-checksum

Correct. You're right -- it's in the tcp code. However, I remember going over
this when I was reviewing Alan's code -- that's what I was thinking of.

Jeff



> currently.
>
>
> -Andi

2000-10-30 07:24:12

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 09:26:59AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Andi Kleen wrote:
>
> > One problem in Linux 2.2 is that kernel threads reload their VM on
> > context switch (that would include the nfsd thread), this should be
> > fixed in 2.4 with lazy mm. Hmm actually it should be only fixed for
> > true kernel threads that have been started with kernel_thread(), the
> > "pseudo kernel threads" like nfsd uses probably do not get that
> > optimization because they don't set their MM to init_mm.
>
> yes, but for this there is an explicit mechanizm to lazy-MM during lengthy
> system calls, an example is in buffer.c:
>
> user_mm = start_lazy_tlb();
> error = sync_old_buffers();
> end_lazy_tlb(user_mm);
>
> > > to get disproportiantely higher in Linux than NetWare 5.x and when it hits
> > > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
> >
> > I think that can be explained by the copying.
>
> yes. Constant copying contaminates the L1/L2 caches and creates dirty
> cachelines all around the place. Fixed in 2.4 + TUX ;-)
>

Ingo, we need a build option to completely disable multiple address spaces
for a start, and just map everything to a linear address space. This
will eliminate the overhead of the CR3 activity. The use of segment registers
for copy_to_user, etc. causes segment register reloads, which are very
heavyweight on Intel.

Is there an option to map Linux into a flat address space like NetWare so
I can do an apples to apples comparison of raw LAN I/O scaling?


> Ingo

2000-10-30 07:30:13

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> Is there an option to map Linux into a flat address space [...]

nope, Linux is fundamentally multitasked.

what you can do to hack around this is to not switch to the idle thread
after having done work in nfsd. Some simple & stupid thing in schedule:

if (next == idle_task) {
        while (nr_running)
                barrier();
        goto repeat_schedule;
}

(provided you are testing this on a UP system.) This way we do not destroy
the TLB cache when we wait a few microseconds for the next network
interrupt.

we do this in 2.4 already - i.e. nfsd doesn't have to mark itself lazy-MM;
the idle thread will automatically 'inherit' the MM of nfsd, and is going
to switch CR3 only if the next process is not nfsd. So you can get an
apples-to-apples comparison by using 2.4.

Ingo

2000-10-30 07:38:43

by Andi Kleen

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:16:46AM -0700, Jeff V. Merkey wrote:
> On Mon, Oct 30, 2000 at 08:08:58AM +0100, Andi Kleen wrote:
> > On Sun, Oct 29, 2000 at 11:58:21PM -0700, Jeff V. Merkey wrote:
> > > On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> > > > On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > > > > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > > > > but it's most impressive.
> > > > > >
> > > > > > Can you send a summary of the NFS reports to [email protected]
> > > > >
> > > > > Yes. I just went home, so I am emailing from my house. I'll post late
> > > > > tonight or in the morning. Performance on 100Mbit with NFS going
> > > > > Linux->Linux is getting better throughput than IPX NetWare Client ->
> > > > > NetWare 5.x on the same network by @ 3%. When you start loading up a
> > > > > Linux server, it drops off sharply and NetWare keeps scaling, however,
> > > > > this does indicate that the LAN code paths are equivalent relative to
> > > > > latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> > > > > (but we'll fix this in Linux next). I think the ring transitions to
> > > > > user space daemons are what are causing the scaling problems
> > > > > Linux vs. NetWare.
> > > >
> > > > There are no user space daemons involved in the knfsd fast path, only in slow paths
> > > > like mounting.
> > >
> > > So why is it spawning off nfsd servers in user space? I did notice that all
> >
> > They just provide a process context, the actual work is only done in kernel mode.
> >
> > > the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
> > > looks to be pretty fast. I know about the checksumming problem with TCP/IP.
> > > But cycles are cheap on today's processors, so even with this overhead, it
> > > could still get faster. IPX uses small packets that are less wire efficient
> > > since the ratio of header size to payload is larger than what NFS in Linux
> > > is doing, plus there's more of them for an equivalent data transfer, even
> > > with packet burst. IPX would tend to be faster if there were multiple
> > > routers involved, since the latency of smaller packets would be less and
> > > IPX never has to deal with the problem of fragmentation.
> > >
> > > Is there any CR3 reloading or task switching going on for each interrupt
> > > that comes in, that you know of, BTW? EMON stats show the server's bus utilization
> >
> > No, interrupts do not change CR3.
> >
> > One problem in Linux 2.2 is that kernel threads reload their VM on context switch
> > (that would include the nfsd thread), this should be fixed in 2.4 with lazy mm.
>
> When you say it reloads its VM, you mean it reloads the CR3 register?

Yes.

> This will cost you about 15 clocks up front, and about 150 clocks over
> time as each TLB is reloaded (plus it will do LOCK# assertions invisibly
> underneath for PTE fetches). One major difference is that in NetWare,
> CR3 does not ever get changed in any network I/O fast paths, and none of
> the TCP or IPX processes exist in user space, so there's never this
> background overhead with page tables being loaded in and out. CR3 only
> gets mucked with when someone allocates memory and pages in an address
> frame (when memory is alloced and freed).
>
> The user space activity is what's causing this, plus the copying. Is there
> an easy way to completely disable multiple PTE/PDE address spaces and just
> map the entire address space linear (like NetWare) with a compile option
> so I could do a true apples to apples comparison?

No. In 2.4 you could probably use the on-demand lazy vm mechanism Ingo described for
the nfsd processes. In 2.2 it is a bit more tricky; if I remember right, lazy mm needed
quite a few changes.

But before doing too many changes I would first verify if that is really the problem.


> > Hmm actually it should be only fixed for true kernel threads that have been started
> > with kernel_thread(), the "pseudo kernel threads" like nfsd uses probably do not
> > get that optimization because they don't set their MM to init_mm.
> >
> > > to get disproportionately higher in Linux than NetWare 5.x and when it hits
> > > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
> >
> > I think that can be explained by the copying.
>
> It's more. I am seeing tons of segment register reloads, which are very
> heavy, and atomic reads of PTE entries.

PTEs are read for aging on memory pressure in vmscan.

Segment register reloads happen on interrupt entry, to load the kernel cs/ds (are they that
costly?). If it was really costly, you could probably check in interrupt entry if you're
already running in kernel space and skip it.

>
> > > the bus utilization, which would suggest clocks are being used for something
> > > other than pushing packets in and out of the box (the checksumming is done
> > > in the tcp code when it copies the fragments, which is very low overhead).
> >
> > No, unfortunately not. NFS uses UDP and the UDP defragmenting doesn't do copy-checksum
>
> Correct. You're right -- it's in the tcp code, however, I remember going over
> this when I was reviewing Alan's code -- that's what I was thinking of.

2.2 does not use checksum-copy-to-user in TCP RX; csum is either done separately or in hardware.
It does csum-copy-fragment for TX, as well as UDP.

-Andi

2000-10-30 08:08:23

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

> >
> > When you say it reloads its VM, you mean it reloads the CR3 register?
>
> Yes.
>
> No. In 2.4 you could probably use the on demand lazy vm mechanism ingo described for
> the nfsd processes. In 2.2 it is a bit more tricky, if I remember right lazy mm needed
> quite a few changes.
>
> But before doing too many changes i would first verify if that is really the problem.

We will never beat NetWare on scaling if this is the case, even in 2.4.
Andre's and my first job will be to create an arch port with MANOS that
disables this and restructures the VM.

> PTEs are read for aging on memory pressure in vmscan.
>
> segment register reloads happen on interrupt entry, to load the kernel cs/ds (are they that
> costly?). If it was really costly you could probably check in interrupt entry if you're
> already running in kernel space and skip it.

They are. Segment register reloads will trigger the following:

IDT table atomic fetch to verify (LOCK#) (if triggered by task gate from INTR)
GDT table atomic fetch to verify (LOCK#)
LDT table atomic fetch to verify (LOCK#) (if present)
PDE table atomic fetch to verify (LOCK#)

The processor has to verify that the loaded segment descriptor is valid, and
it will fetch from all these tables to do it, with up to 4 (LOCK#)
assertions occurring invisibly in the hardware underneath (which will
generate 4 non-cacheable memory references, in addition to wreaking
havoc on the affected L1/L2 cache lines). Oink. It only does this
when you load one, not when you save one, like pushing it on the
stack. If you look at the MANOS code, you'll note that in
CONTEXT.386, I do an add esp, 3 * 4 instead of popping the segment
registers off the stack if they are in the kernel address space.

Linux should do the same, if possible as an optimization.
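
If anyone wants to put a number on a single segment load, here is a quick
user-space micro-benchmark (rough: it includes loop overhead, and a ring 3
reload of the current selector is the cheap case; the iteration count is
arbitrary):

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        enum { N = 1000000 };
        uint64_t t0, t1;
        int i;

        t0 = rdtsc();
        for (i = 0; i < N; i++)
                /* reload %es with its current value; this still forces
                   the descriptor fetch and checks described above */
                __asm__ __volatile__("movw %%es, %%ax\n\t"
                                     "movw %%ax, %%es" ::: "eax");
        t1 = rdtsc();

        printf("~%.1f cycles per %%es reload (incl. loop overhead)\n",
               (double)(t1 - t0) / N);
        return 0;
}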


>
> >
>
> 2.2 does not use checksum-copy-to-user in TCP RX; csum is either done separately or in hardware.
> It does csum-copy-fragment for TX, as well as UDP.

Yes, I know, this is what I was referring to. Receives are always ugly;
there's always at least one copy into cache for the write, sometimes
more...

Jeff

>
> -Andi

2000-10-30 08:12:04

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 09:39:37AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > Is there an option to map Linux into a flat address space [...]
>
> nope, Linux is fundamentally multitasked.
>
> what you can do to hack around this is to not switch to the idle thread
> after having done work in nfsd. Some simple & stupid thing in schedule:
>
> if (next == idle_task) {
>         while (nr_running)
>                 barrier();
>         goto repeat_schedule;
> }
>
> (provided you are testing this on a UP system.) This way we do not destroy
> the TLB cache when we wait a few microseconds for the next network
> interrupt.
>
> we do this in 2.4 already - i.e. nfsd doesn't have to mark itself lazy-MM,
> the idle thread will automatically 'inherit' the MM of nfsd, and is going
> to switch CR3 only if the next process is not nfsd. So you can get an
> apples to apples comparison by using 2.4.

Ingo, I will attempt this, but I seriously doubt it will allow
Linux to defeat NetWare 5.x on LAN I/O scaling. All protection
has to go away in all LAN paths for this to happen, and user space
apps set to ring 0. NetWare 5.x does support ring 3 applications,
but the model is different from Linux's. I will look at a potential
compromise between the two with the MANOS /arch merge. It may
allow some incarnation of Linux to smoke NetWare 5.x on LAN
performance. Doing this would move MARS-NWE and SAMBA into the kernel
without changing a line of code in either...

Jeff


>
> Ingo

2000-10-30 08:16:34

by Andi Kleen

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 01:04:34AM -0700, Jeff V. Merkey wrote:
> > >
> > > When you say it reloads its VM, you mean it reloads the CR3 register?
> >
> > Yes.
> >
> > No. In 2.4 you could probably use the on demand lazy vm mechanism ingo described for
> > the nfsd processes. In 2.2 it is a bit more tricky, if I remember right lazy mm needed
> > quite a few changes.
> >
> > But before doing too many changes i would first verify if that is really the problem.
>
> We will never beat NetWare on scaling if this is the case, even in 2.4.
> Andre and my first job will be to create an arch port with MANOS that
> disables this and restructures the VM.

I just guess the end result will be as crash-prone as NetWare when you install any third
party software ;)

lazy mm is probably a better path; as long as you stay in kernel threads and a single user mm,
it'll never switch VMs.

>
> > PTEs are read for aging on memory pressure in vmscan.
> >
> segment register reloads happen on interrupt entry, to load the kernel cs/ds (are they that
> > costly?). If it was really costly you could probably check in interrupt entry if you're
> > already running in kernel space and skip it.
>
> They are. segment register reloads will trigger the following:
>
> IDT table atomic fetch to verify (LOCK#) (if triggered by task gate from INTR)
> GDT table atomic fetch to verify (LOCK#)
> LDT table atomic fetch to verify (LOCK#) (if present)
> PDE table atomic fetch to verify (LOCK#)
>
> The processor has to verify that the loaded segment descriptor is valid, and
> it will fetch from all these tables to do it, with up to 4 (LOCK#)
> assertions occurring invisibly in the hardware underneath (which will
> generate 4 non-cacheable memory references, in addition to wreaking
> havoc on the affected L1/L2 cache lines). Oink. It only does this
> when you load one, not when you save one, like pushing it on the
> stack. If you look at the MANOS code, you'll note that in
> CONTEXT.386, I do an add esp, 3 * 4 instead of popping the segment
> registers off the stack if they are in the kernel address space.
>
> Linux should do the same, if possible as an optimization.

Interesting. You could easily do the same in Linux by changing (in 2.2) the SAVE_ALL macro
in arch/i386/kernel/irq.h and doing the same in ret_from_intr's RESTORE_ALL macro (after
changing it to a RESTORE_ALL_INT, or you'll break system calls).

Actually I don't even know why irq.h's SAVE_ALL even loads __KERNEL_CS; it should already
be set by the interrupt gate. __KERNEL_DS probably still needs to be loaded, but only when
you came from user space.

-Andi

2000-10-30 08:40:27

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

>
> I just guess the end result will be as crash-prone as NetWare when you install any third
> party software ;)
>
> lazy mm is probably a better path; as long as you stay in kernel threads and a single user mm,
> it'll never switch VMs.

It's not that bad. Some code has been running for years and will be OK.
Also, the number of Oopses people get anyway on Linux today is about the
same, in frequency, as Abends were in NetWare at live customer
accounts. I've also seen miscreant user apps in Linux wreak havoc
from user space as deadly as an Oops.

It's not the activity in the lan path that's the problem, it's the switching
period that's the problem. The fix goes in the loader -- trusted apps get
loaded in the kernel, non-trusted apps get loaded in a ring 3 address space.
The WTD thing I described stops them from context switching until the
system goes idle, since LAN I/O always gets moved to the front of the
work queue, if you remember my post describing it. NetWare does this
trick by holding off CR3 reloads to apps until the server goes idle
relative to incoming or outgoing I/O. Linux just lets both happen
all over the place. I'll show you end of December with the port. Ring 3
apps will still work - you'll just be able to load some of them at ring 0,
and run like a bat out of hell.

You are right though, Linux is like a tank going 25 miles an hour crushing
everything in its path, while NetWare is an aluminum frame speed racer
that weighs 20 lbs., gets 1000 miles to the gallon, runs at MACH VI, and
will explode in flames if it hits a tack in the road...

:-)

> > stack. If you look at the MANOS code, you'll note that in
> > CONTEXT.386, I do an add esp, 3 * 4 instead of popping the segment
> > registers off the stack if they are in the kernel address space.
> >
> > Linux should do the same, if possible as an optimization.
>
> Interesting. You could easily do the same in Linux by changing (in 2.2) the SAVE_ALL macro
> in arch/i386/kernel/irq.h and doing the same in ret_from_intr's RESTORE_ALL macro (after
> changing it to a RESTORE_ALL_INT, or you'll break system calls).
>
> Actually I don't even know why irq.h's SAVE_ALL even loads __KERNEL_CS; it should already
> be set by the interrupt gate. __KERNEL_DS probably still needs to be loaded, but only when
> you came from user space.

Sounds like a fun patch. I'll get to work on it. Might want to ping Linus
and let him put this one in, since it's his code, and it would be a very easy
fix. The more lipo-suction we can do, the better.

:-)

Jeff

> -Andi

2000-10-30 08:42:37

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> [...] All protection has to go away in all LAN paths for this to
> happen, and user space apps set to ring 0. [...]

i found that this is not a requirement for good network scalability. We do
not do a syscall for every packet, so the cost evens out. Sure, it does
not hurt to not eat ~1 microsecond per system-call, but it causes no
overhead or scalability limit otherwise. In the TUX webserver we have
user-space modules doing context-switch-less webserving, and it scales
quite well, and is generic.

Ingo

2000-10-30 08:59:43

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 10:52:08AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > [...] All protection has to go away in all LAN paths for this to
> > happen, and user space apps set to ring 0. [...]
>
> i found that this is not a requirement for good network scalability. We do
> not do a syscall for every packet, so the cost evens out. Sure, it does
> not hurt to not eat ~1 microsecond per system-call, but it causes no
> overhead or scalability limit otherwise. In the TUX webserver we have
> user-space modules doing context-switch-less webserving, and it scales
> quite well, and is generic.
>
> Ingo

No argument here, but the overhead of reloading CR3, period, will kill
performance. 2.4 does not beat NetWare, BTW; it gets a little further,
but still hits the wall, and NetWare keeps going strong, and scales
up to 5000 users on file and print, with a SINGLE processor at less than
40% utilization. Linux hits the wall at a few hundred file and print
users. It's the overhead of CR3 switching at all that is causing this,
and the heavy usage of Intel's protection model.

For example, if you put a MOV EAX, CR3; MOV CR3, EAX; in a context switching
path, on a PPro 200, you can do about 35,000 context switches/second
(I did this in NetWare 4.x in 1994). With the same code without the
CR3 reload, I could do over 2,000,000 context switches/second. An
INVLPG of page[0] in the same path is less heavy, but still
reduced context switches to 135,000/second. It's the CR3 activity, period,
that kills performance. There's also the use of segment registers
all over the place to copy from kernel to user and user to kernel
space memory. Having the fast paths you mention does help a lot,
but it's the fact that this goes on at all that will make it tough
to walk into a NetWare shop with Linux and rip out NetWare servers
and replace them, unless we look at a "NetLinux" (that's
what we call it: a NetWare-like Linux platform).
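
(For anyone who wants to reproduce the flavor of that measurement, here is a
ring-0 fragment in the same spirit -- i386-only, has to live in a kernel
module or similar, and is illustrative rather than from either OS's sources:)

/* Time one CR3 writeback with the TSC. The cost shows up twice:
   the write itself, plus the later TLB refills, because writing
   CR3 flushes all non-global TLB entries. */
static unsigned long long time_cr3_reload(void)
{
        unsigned long cr3;
        unsigned long long t0, t1;

        __asm__ __volatile__("rdtsc" : "=A"(t0));
        __asm__ __volatile__("movl %%cr3, %0\n\t"
                             "movl %0, %%cr3"
                             : "=r"(cr3) : : "memory");
        __asm__ __volatile__("rdtsc" : "=A"(t1));
        return t1 - t0;
}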

If we had this, MS would be running for the trenches. Coupling the
internet application base of Linux with the speed of NetWare would
be Mr. Gates's worst nightmare. They've been trying to kill off NetWare
for almost 15 years, and it's still out there and they are still
slower, and everyone knows it. It's not Linux or NT that's killing
off NetWare right now, it's the incompetent management in San Jose
who are trying to rape Novell's rich bank accounts
(Schmidt and Sonsini and friends), and the fact they have no IA64
NetWare (they would be shipping it now if I were still there ....).

It could be as easy as a compile option in .config (NETWARE_SERVER_MODE [Y/N])
with WTD (which is a more sophisticated method for doing lazy VM and
controlling context switching activity for fast path I/O) and
mapping of the Linux address space as linear -- you can still have
protection with this model; just some apps would be able to load
in the kernel address space and run at ring 0.

:-)

Jeff

2000-10-30 09:04:34

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> No argument here, but the overhead of reloading CR3 period will kill
> performance. [...]

2.4 does not reload CR3, unless you are using multiple user-space
processes.

> 2.4 does not beat NetWare, BTW, it gets a little further, but still
> hits the wall, [...]

as i told you in the previous mail, the main overhead is not CR3, it's the
copying & dirtying of all data, and the subsequent DMA-initiated dirty
cacheline writeback. I can serve 100 MB/sec web content with 2.4 & TUX
just fine - it relies on a zero-copying infrastructure.
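
(a user-space taste of the same zero-copy idea, for the curious -- this is
sendfile(2), which 2.2 already has, not TUX itself: one syscall moves
page-cache data to the socket, so no user buffer is copied through or
dirtied:)

#include <sys/sendfile.h>
#include <sys/types.h>

static ssize_t serve_file(int sock_fd, int file_fd, size_t count)
{
        off_t off = 0;
        size_t total = 0;
        ssize_t sent;

        while (total < count) {
                /* file -> socket, no pass through user memory */
                sent = sendfile(sock_fd, file_fd, &off, count - total);
                if (sent <= 0)
                        return sent;     /* error or EOF */
                total += (size_t)sent;
        }
        return (ssize_t)total;
}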

Ingo

2000-10-30 09:14:57

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:13:58AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > No argument here, but the overhead of reloading CR3 period will kill
> > performance. [...]
>
> 2.4 does not reload CR3, unless you are using multiple user-space
> processes.
>
> > 2.4 does not beat NetWare, BTW, it gets a little further, but still
> > hits the wall, [...]
>
> as i told you in the previous mail, the main overhead is not CR3, it's the
> copying & dirtying of all data, and the subsequent DMA-initiated dirty
> cacheline writeback. I can serve 100 MB/sec web content with 2.4 & TUX
> just fine - it relies on a zero-copying infrastructure.
>
> Ingo


Great, Ingo, you've got the web server covered. What about file and print?
I think this is great, but most web servers are connected to a T1 or T3
line, and all the fancy optimization means absolutely squat, since about
99.999999% of the planet has a max bandwidth of a T1, ADSL, or T3 line;
this is a far cry from Gigabit Ethernet, or even 100Mbit Ethernet.

How many users can you put on the web server? Web servers are also
read-only data, the easiest of all LAN cases to deal with. It's
incoming writes that present all the tough problems, as Andi Kleen
wisely observed. Not to knock Tux, I think it's great, but it does
not solve the file and print scaling problem, now does it?
Jeff

2000-10-30 09:17:47

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> For example, if you put a MOV EAX, CR3; MOV CR3, EAX; in a context
> switching path, on a PPro 200, you can do about 35,000 context
> switches/second

in 2.4 & Xeons we can do more than 100,000 context switches/second, and
that is more than enough. But the point is: network IO performance does
not depend on context switching speed too much. Also, in Linux we are
using global pages, which make kernel-space TLB entries persistent even across
CR3 flushes.
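
(the mechanism, sketched: kernel mappings carry the global bit and CR4.PGE
is enabled, so a CR3 load stops evicting kernel TLB entries. Ring 0, i386;
the constants are from the Intel manuals, the helper name is made up:)

#define X86_CR4_PGE   0x0080    /* CR4 bit 7: page global enable */
#define _PAGE_GLOBAL  0x0100    /* PTE bit 8: global mapping     */

/* TLB entries for PTEs marked _PAGE_GLOBAL survive CR3 loads once
   PGE is on; only INVLPG or toggling PGE flushes them. */
static inline void enable_global_pages(void)
{
        unsigned long cr4;

        __asm__ __volatile__("movl %%cr4, %0" : "=r"(cr4));
        cr4 |= X86_CR4_PGE;
        __asm__ __volatile__("movl %0, %%cr4" : : "r"(cr4));
}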

> [...] There's also the use of segment registers all over the place to
> copy from kernel to user and user to kernel space memory. [...]

we do not use the fs segment register for user-space copies anymore,
neither in 2.2, nor in 2.4. You must be reading old books and probably
forgot to cross-check with the kernel source? :-)

> [...] Having the fast paths you mention does help a lot, but it's the
> fact that this goes on at all that will make it tough to walk into a
> NetWare shop with Linux and rip out NetWare servers and replace them
> unless we look at a NetWare vs. NetLinux (that's what we call it! a
> NetWare-like Linux platform).

the worst thing you can do is to mis-identify performance problems and
spend braincells on the wrong problem. The problems limiting Linux network
scalability have been identified during the last 12 months by a small
team, and solved in TUX. TUX is a fileserver; it shouldn't be a lot of work
to enable it for (TCP-only?) NetWare serving. It's *done*, Jeff, it's not
a hypothetical thing, it's here, it works and it performs.

Ingo

2000-10-30 09:24:18

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:27:04AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > For example, if you put a MOV EAX, CR3; MOV CR3, EAX; in a context
> > switching path, on a PPro 200, you can do about 35,000 context
> > switches/second
>
> in 2.4 & Xeons we can do more than 100,000 context switches/second, and
> that is more than enough. But the point is: network IO performance does
> not depend on context switching speed too much. Also, in Linux we are
> using global pages which makes kernel-space TLBs persistent even across
> CR3 flushes.

This is putrid. NetWare does 353,000,000/second on a Xeon, pumping out
gobs of packets in between them. MANOS does 857,000,000/second. This
is terrible. No wonder it's so f_cking slow!!!

>
> > [...] There's also the use of segment registers all over the place to
> > copy from kernel to user and user to kernel space memory. [...]
>
> we do not use the fs segment register for user-space copies anymore,
> neither in 2.2, nor in 2.4. You must be reading old books and probably
> forgot to cross-check with the kernel source? :-)


ds: and es: are both used in copy-to-user and copy-from-user and they get
reloaded.


>
> > [...] Having the fast paths you mention does help a lot, but it's the
> > fact that this goes on at all that will make it tough to walk into a
> > NetWare shop with Linux and rip out NetWare servers and replace them
> > unless we look at a NetWare vs. NetLinux (that's what we call it! a
> > NetWare-like Linux platform).
>
> the worst thing you can do is to mis-identify performance problems and
> spend braincells on the wrong problem. The problems limiting Linux network
> scalability have been identified during the last 12 months by a small
> team, and solved in TUX. TUX is a fileserver; it shouldn't be a lot of work
> to enable it for (TCP-only?) NetWare serving. It's *done*, Jeff, it's not
> a hypothetical thing, it's here, it works and it performs.
>

NetWare is here too, and it handles 5000+ file and print users; Linux does not.
Let's fix it. I know why NetWare is fast. Let's apply some of the same
principles and see what happens. Love to have you involved.

> Ingo

2000-10-30 09:32:09

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> [...] you've got the web server covered. What about file and print.

a web server, as you probably know, is a read-mostly fileserver that
serves files via the HTTP protocol. The rest is only protocol fluff.

> I think this is great, but most web servers are connected to a T1 or
> T3 line, and all the fancy optimization means absolutely squat since
> about 99.999999% of the planet has a max bandwidth of a T1, ADSL, or
> T3 Line, this is a far cry from Gigabit ethernet, or even 100Mbit
> ethernet.

Your argument is curious - first you cry for performance, then you say
'nobody wants that much bandwidth'. Of course, if the network is bandwidth
limited then we cannot scale above that bandwidth. But that's not the
point. The point is to put 10 cards into a server and still be able to
saturate them. The point is also to spend fewer cycles on saturating
available bandwidth. The point is also to not flush the L1 just because
someone requested a 10K webpage.

> How many users can you put on the web server? [...]

tens of thousands, on a single CPU. Can probably handle over 100 thousand
users as well, with IP aliasing so the socket space is spread out.

> Web servers are also read only data, the easiest of all LAN cases to
> deal with. It's incoming writes that present all the tough problems,

reads dominate writes in almost all workloads, that's common wisdom. Why
write if nobody reads the data? And while web servers are mostly read-only
data, they can write data as well, see POST and PUT. The fact that
incoming writes are hard should not distract you from the fact that
reads are also extremely important.

Ingo

2000-10-30 09:34:59

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> This is putrid. NetWare does 353,000,000/second on a Xeon, pumping out
> gobs of packets in between them. MANOS does 857,000,000/second. This
> is terrible. No wonder it's so f_cking slow!!!

(no need to get emotional.) And please check your numbers: 857 million
context switches per second means that on a 1 GHz CPU you do one context
switch per 1.16 clock cycles. Wow!

Ingo

2000-10-30 09:37:29

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:41:35AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > [...] you've got the web server covered. What about file and print.
>
> a web server, as you probably know, is a read-mostly fileserver that
> serves files via the HTTP protocol. The rest is only protocol fluff.
>
> > I think this is great, but most web servers are connected to a T1 or
> > T3 line, and all the fancy optimization means absolutely squat since
> about 99.999999% of the planet has a max bandwidth of a T1, ADSL, or
> > T3 Line, this is a far cry from Gigabit ethernet, or even 100Mbit
> > ethernet.
>
> Your argument is curious - first you cry for performance, then you say
> 'nobody wants that much bandwidth'. Of course, if the network is bandwidth
> limited then we cannot scale above that bandwidth. But that's not the
> point. The point is to put 10 cards into a server and still be able to
> saturate them. The point is also to spend fewer cycles on saturating
> available bandwidth. The point is also to not flush the L1 just because
> someone requested a 10K webpage.

It's not curious; it's not about bandwidth, it's about latency, and getting
packets in and out of the server as fast as possible, and ahead of everything
else. Cache affinity and all the Tux stuff is a great piece of work. Let's
talk about file and print; that's the present problem, not how to
pump out read-only data from a web server.


>
> > How many users can you put on the web server? [...]
>
> tens of thousands, on a single CPU. Can probably handle over 100 thousand
> users as well, with IP aliasing so the socket space is spread out.
>
> > Web servers are also read only data, the easiest of all LAN cases to
> > deal with. It's incoming writes that present all the tough problems,
>
> reads dominate writes in almost all workloads, that's common wisdom. Why
> write if nobody reads the data? And while web servers are mostly read-only
> data, they can write data as well, see POST and PUT. The fact that
> incoming writes are hard should not distract you from the fact that
> reads are also extremely important.
>
> Ingo


Web servers don't do writes, unless a CGI script is running somewhere,
or some Java or Perl or something; then this stuff goes through a
wrapper, which is slow. Or did I miss something?

Jeff

2000-10-30 09:41:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> ds: and es: are both used in copy-to-user and copy-from-user and they
> get reloaded.

And they all share the same segment descriptor. What's your point? ES is
the default target segment for string operations. DS is the default data
segment. Have you ever profiled how many cycles it takes to do a "mov
__KERNEL_DS, %es" in entry.S, before making your (ridiculous) claim? I
have.

Ingo

2000-10-30 09:41:59

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:44:26AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > This is putrid. NetWare does 353,000,000/second on a Xeon, pumping out
> > gobs of packets in between them. MANOS does 857,000,000/second. This
> > is terrible. No wonder it's so f_cking slow!!!

> And please check your numbers, 857 million
> context switches per second means that on a 1 GHZ CPU you do one context
> switch per 1.16 clock cycles. Wow!

Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
a second -- on a PII system at 350 MHz. It's due to AGI optimization.
Download MANOS and verify for yourself; it has a built-in EMON in the monitor.
After I complete the port, not even NetWare will be able to touch it.

Your Tux web server will also run on it, at significantly increased
performance.

Jeff

>
> Ingo

2000-10-30 09:44:29

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:50:24AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > ds: and es: are both used in copy-to-user and copy-from-user and they
> > get reloaded.
>
> And they all share the same segment descriptor. What's your point? ES is
> the default target segment for string operations. DS is the default data
> segment. Have you ever profiled how many cycles it takes to do a "mov
> __KERNEL_DS, %es" in entry.S, before making your (ridiculous) claim? I
> have.
>

No. I used a hardware analyzer to show me how many LOCK# assertions it does,
invisibly to your software tools, underneath. Try using EMON to profile;
it gives hardware numbers and lets you watch the cache controllers
issue non-cacheable memory references to fetch the descriptors.

Jeff


> Ingo

2000-10-30 09:46:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> > reads dominate writes in almost all workloads, that's common wisdom. Why
> > write if nobody reads the data? And while web servers are mostly read-only
> > data, they can write data as well, see POST and PUT. The fact that
> > incoming writes are hard should not distract you from the fact that
> > reads are also extremely important.
>
> Web servers don't do writes, unless a CGI script is running somewhere
> or some Java or Perl or something, then this stuff goes through a
> wrapper, which is slow, or did I miss something.

yes, you missed TUX modules.

Ingo

2000-10-30 09:49:39

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:56:06AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > > reads dominate writes in almost all workloads, that's common wisdom. Why
> > > write if nobody reads the data? And while web servers are mostly read-only
> > > data, they can write data as well, see POST and PUT. The fact that
> > > incoming writes are hard should not distract you from the fact that
> > > reads are also extremely important.
> >
> > Web servers don't do writes, unless a CGI script is running somewhere
> > or some Java or Perl or something, then this stuff goes through a
> > wrapper, which is slow, or did I miss something.
>
> yes, you missed TUX modules.
>


Great. I can load a TUX module and use it with my T1 line. If I could
spare an extra $100,000/month, perhaps I could lease an SDM-172 or TAT-8, or
even an OC-172; then I would be able to take advantage of it in the real
world.

Jeff


> Ingo

2000-10-30 09:51:50

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> > And please check your numbers, 857 million
> > context switches per second means that on a 1 GHZ CPU you do one context
> > switch per 1.16 clock cycles. Wow!
>
> Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
> a second -- on a PII system at 350 Mhz. [...]

so it does 1.3 context switches per clock cycle? Wow! And i can type
100000000000000000000 characters a second, just measured it. Really!

> Your Tux web server will also run on it, at significantly increased
> performance.

as i told you in the previous mails, TUX does not depend on schedule()
performance. schedule() cost does not even show up in the top 20 entries
of the profiler.

Ingo

2000-10-30 09:55:11

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> It's not curious, it's not about bandwidth, it's about latency, and
> getting packets in and out of the server as fast as possible, and
> ahead of everything else. [...]

TUX prepares an HTTP reply in about 30 microseconds (plus network latency),
good enough? Network latency is the limit, even on gigabit - not to mention
T1 lines.

Ingo


2000-10-30 09:58:20

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:01:08PM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > > And please check your numbers, 857 million
> > > context switches per second means that on a 1 GHZ CPU you do one context
> > > switch per 1.16 clock cycles. Wow!
> >
> > Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
> > a second -- on a PII system at 350 Mhz. [...]
>
> so it does 1.3 context switches per clock cycle? Wow! And i can type
> 100000000000000000000 characters a second, just measured it. Really!

Go download it and try it, then come back with that smirk wiped off your face.
I'll enjoy it.....

:-)

>
> > Your Tux web server will also run on it, at significantly increased
> > performance.
>
> as i told you in the previous mails, TUX does not depend on schedule()
> performance. schedule() cost does not even show up in the top 20 entries
> of the profiler.


And as I told you, your code has nothing to do with it; it's the fact it
goes on at all. Ingo, go get a copy of NetWare 3.12 (I'll even send you
one -- I've got extra licensed copies), install it, put a load of 5000
connections on it, with 4 adapters. Dual boot Linux on it, and attempt
the same with SAMBA or MARS-NWE, and watch it oink.

Go do it. What's your address so I can ship you the CDs for 3.12? Then come
back and tell me how TUX is going to solve the file and print performance
issues in Linux.

:-)

Jeff


>
> Ingo

2000-10-30 10:00:00

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:04:43PM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > It's not curious, it's not about bandwidth, it's about latency, and
> > getting packets in and out of the server as fast as possible, and
> > ahead of everything else. [...]
>
> TUX prepares a HTTP reply in about 30 microseconds (plus network latency),
> good enough? Network latency is the limit, even on gigabit - not to talk
> about T1 lines.

Great. Now how do we get the same numbers on SAMBA and MARS-NWE? That's
the question, not whether your baby TUX is pretty. I already said it
was pretty; focus on the other issue.

Jeff

>
> Ingo
>

2000-10-30 10:03:20

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> > > Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
> > > a second -- on a PII system at 350 Mhz. [...]

> Go download it and try it, then come back with that smirk wiped off
> your face. I'll enjoy it.....

so in 0.53 clock cycles you are implementing things like address space
separation, process priorities, fairness and other essential scheduling
features? Truly awesome ...

Ingo

2000-10-30 10:04:20

by Ingo Molnar

Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> > TUX prepares a HTTP reply in about 30 microseconds (plus network latency),
> > good enough? Network latency is the limit, even on gigabit - not to talk
> > about T1 lines.
>
> Great. Now how do we get the same numbers on SAMBA and MARS-NWE? [...]

simple, write a TUX protocol module for it. FTP protocol module is on its
way. Stay tuned.

Ingo

2000-10-30 10:10:11

by Jeff V. Merkey

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:12:44PM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > > > Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
> > > > a second -- on a PII system at 350 Mhz. [...]
>
> > Go download it and try it, then come back with that smirk wiped off
> > your face. I'll enjoy it.....
>
> so in 0.53 clock cycles you are implementing things like address space
> separation, process priorities, fairness and other essential scheduling
> features? Truly awesome ...

Ingo, This original thread was regarding Linux vs. NetWare 5.x performance
metrics and responses from Linux folks about how to affect and
improve them, not a diatribe on the features of TUX.

I wrote the kernel in NetWare 4.x that later became 5.x. To date, my
NOS kernel has grossed over 8 billion dollars. This is more money
than all the Linux companies have made combined in their entire
history, on your work, Linus's work and everyone else's. I don't have
anything to prove to myself, or anyone else, which is why I do as I
please in life, and no one's comments or snipes dissuade me from moving
forward. I also am not "someone's employee".

If you have some ideas on how to improve file and print or help me
get a linux incarnation that can stomp NetWare, I'd love to hear
your ideas. I think TUX is great, BTW. Otherwise, end-of-line...

:-)

Jeff


>
> Ingo

2000-10-30 10:12:21

by Jeff V. Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:13:52PM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > > TUX prepares a HTTP reply in about 30 microseconds (plus network latency),
> > > good enough? Network latency is the limit, even on gigabit - not to talk
> > > about T1 lines.
> >
> > Great. Now how do we get the same numbers on SAMBA and MARS-NWE? [...]
>
> simple, write a TUX protocol module for it. FTP protocol module is on its
> way. Stay tuned.
>
> Ingo


I do not believe this approach will allow Linux to match NetWare's
file and print performance, but I am willing to give it a whirl. Where
is the TUX module for MARS? Let's start with this one. What help
would you require? You understand these TUX modules, so you should
take the lead.

I am listening. Instruct me....


Jeff

2000-10-30 10:22:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> Ingo, This original thread was regarding Linux vs. NetWare 5.x
> performance metrics and responses from Linux folks about how to affect
> and improve them, not a diatribe on the features of TUX.

oh, i believe you misunderstand. TUX itself is quite simple. But it
extended the Linux TCP stack and scalability to new levels (which has
nothing to do with TUX itself, it's the scalability of the Linux
networking stack that evolved gradually over the past 10 years - we spend
95% of the time outside of TUX!). And if you claim that Linux needs this
and that for scalability, then i'd like to point you humbly towards those
existing, generic and well-performing results. TUX is just a 5% 'HTTP
protocol fluff' around the generic stuff.

Ingo

2000-10-30 10:57:14

by john slee

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 03:06:25AM -0700, Jeff V. Merkey wrote:
> Ingo, This original thread was regarding Linux vs. NetWare 5.x performance
> metrics and responses from Linux folks about how to affect and
> improve them, not a diatribe on the features of TUX.

while beating netware in certain areas is certainly a noble goal,
it is far from the only one.

[snip]

> If you have some ideas on how to improve file and print or help me
> get a linux incarnation that can stomp NetWare, I'd love to hear
> your ideas. I think TUX is great, BTW. Otherwise, end-of-line...

if netware users are happy with netware, i can't see why they'd want to
switch to linux or a linux-netware-ish equivalent while netware is
still around to provide support. "ain't broke, don't fix," etc.

j.

2000-10-30 12:47:35

by Alan

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> We will never beat NetWare on scaling if this is the case, even in 2.4.
> Andre and my first job will be to create an arch port with MANOS that
> disables this and restructures the VM.

In the 2.4 case if you are just running NFS daemons then there are no TLB
reloads going on at all. What's murdering you is mostly memory copies. I
suspect that if you rerun the profiles on a box with much lower memory
bandwidth, the effect will be even more visible.

2000-10-30 12:51:27

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:47:10PM +0000, Alan Cox wrote:
> > We will never beat NetWare on scaling if this is the case, even in 2.4.
> > Andre and my first job will be to create an arch port with MANOS that
> > disables this and restructures the VM.
>
> In the 2.4 case if you are just running NFS daemons then there are no TLB
> reloads going on at all. What's murdering you is mostly memory copies. I
> suspect that if you rerun the profiles on a box with much lower memory
> bandwidth, the effect will be even more visible.

I don't think that's true. As far as I can see, the nfsd processes do not
do the lazy VM magic (but it would be reasonably easy to add).


-Andi

2000-10-30 12:57:37

by Alan

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> one -- I've got extra licensed copies), install it, put a load of 5000
> connections on it, with 4 adapters. Dual boot Linux on it, and attempt
> the same with SAMBA or MARS-NWE, and watch it oink.

SAMBA and Mars-nwe are running in user space, that's why. They have
flexibility, protection and can run unprivileged. If you want to put
mars-nwe in the kernel, then it will certainly be interesting.

> back and tell me how TUX is going to solve the File and Print performance
> issues in Linux.

There's an http file system around 8)

2000-10-30 17:42:01

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:13:52PM +0100, Ingo Molnar wrote:
> simple, write a TUX protocol module for it. FTP protocol module is on its
> way. Stay tuned.

TUX modules are kernel modules (I mean you have to write kernel space
code for doing TUX ftp). Don't you agree that zero-copy sendfile-like ftp
serving would be able to perform equally well too? I mean: isn't it better
to spend the effort making a userspace API run fast instead of moving
every network functionality that needs high performance completely into
the kernel? People may need to write high performance network code for
custom protocols; this way they will end up creating kernel modules with
system-crashing bugs, memory leaks and kernel buffer overflows
(chroot+nobody+logging won't work anymore). (Plus they will get into pain
while debugging.)

It's obvious kernel code runs faster and you can also make assumptions
about the scheduler and current CPU that you can't make in userspace, but
is that so relevant in terms of ftp server numbers compared to only
skipping the memory copies?

About the TUX CGI module: I had a quick look and I noticed CGIs run by TUX
execute two clone(2) calls and one exec(2), and then they have to pay the
startup cost of the interpreter for each CGI request. So at this stage of
TUX I guess that for perl/php pages TUX would do better to redirect them
to an Apache. Maybe php and perl TUX (kernel) modules are on your todo
list? (I think they would be a bad idea though.) One other way to handle
the interpreted-CGI load efficiently without redirecting the CGI request
to a full Apache is to have a background php/perl interpreter listening
for new CGIs on input and filling the TUX pipe on output.

Andrea

2000-10-30 17:59:10

by Chris Evans

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Andrea Arcangeli wrote:

> functionality that needs high performance completely into the kernel?
> People may need to write high performance network code for custom
> protocols; this way they will end up creating kernel modules with
> system-crashing bugs, memory leaks and kernel buffer overflows
> (chroot+nobody+logging won't work anymore). (Plus they will get into
> pain while debugging.)

I'm glad _someone_ is connected to reality with regard to the security
implications of throwing loads of servers into kernel space.

Cheers
Chris

2000-10-30 18:01:24

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


Alan,

I've been studying Linux for the past two years the same way a diamond
cutter studies a prized and immensely valuable raw stone. I think I am
nearing the point where I know where to strike so I can cleave it into
something Novell's installed base would like and could move forward with.
The easiest path I see for me to create a Linux NetWare hybrid is for you
guys to get a 2.2.X (not 2.4.X -- not just yet) tree that runs in a flat
memory space; then I can slip in our optimizations and we will be able to
run all existing trusted code in kernel. The changes needed to do this
involve more than just the /arch area and the asm macro includes, though
those will cover about 80% of it; the loader would also need to be able
to create address domains around apps that choose to run protected.

If I attempt these changes without proper guidance, two things will
happen: 1) people who own some of this code will not be aware of all the
aspects of the changes, and may no longer be able to support it, and
2) it will fork away from the tree and become unsupportable and diverge
over time. This is the reason I have resisted making any kernel changes
myself other than code I write for Linux that I own, and I have always
solicited other folks to make the patches. This is about to change
relative to this project.

The first step I need is to be able to load Linux as a ring 0 OS in a
linear address space with all protection disabled. Paging can still go
on, just no separate CR3 reloads between context switches. Profiling
ring 0 Linux vs. NetWare will give me an excellent idea of where the
optimizations will need to be inserted. A straight MARS-NWE port to the
kernel would just happen, since we would be able to load it in kernel
space and run it with no code changes.

:-)

Jeff

Alan Cox wrote:
>
> > one -- I've got extra licensed copies), install it, put a load of 5000
> > connections on it, with 4 adapters. Dual boot Linux on it, and attempt
> > the same with SAMBA or MARS-NWE, and watch it oink.
>
> SAMBA and Mars-nwe are running user space thats why. They have flexibility,
> protection and can run unpriviledged. If you want to put mars-nwe in the kernel
> then it will certainly be interesting
>
> > back and tell me how TUX is going to solve the File and Print performance
> > issues in Linux.
>
> There's an http file system around 8)
>

2000-10-30 18:03:24

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Andrea Arcangeli wrote:
>
> On Mon, Oct 30, 2000 at 12:13:52PM +0100, Ingo Molnar wrote:
> > simple, write a TUX protocol module for it. FTP protocol module is on its
> > way. Stay tuned.
>
> TUX modules are kernel modules (I mean you have to write kernel space
> code for doing TUX ftp). Don't you agree that zero-copy sendfile-like
> ftp serving would be able to perform equally well too? I mean: isn't it
> better to spend the effort making a userspace API run fast instead of
> moving every network functionality that needs high performance
> completely into the kernel? People may need to write high performance
> network code for custom protocols; this way they will end up creating
> kernel modules with system-crashing bugs, memory leaks and kernel buffer
> overflows (chroot+nobody+logging won't work anymore). (Plus they will
> get into pain while debugging.)

Ingo's helping me get the info together on this for putting a MARS-NWE
TUX module in the kernel. He had to go do some things this week, he told
me, before he would be ready to look at it. He did point me over to the
info, and I agreed we would attempt to implement it as something to look
at. If it performs well enough, I will have something reasonable to send
out to Novell Resellers (CNEs) and Customers.

Jeff

2000-10-30 18:05:44

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Chris Evans wrote:
>
> On Mon, 30 Oct 2000, Andrea Arcangeli wrote:
>
> > functionality that needs high performance completely into the kernel?
> > People may need to write high performance network code for custom
> > protocols; this way they will end up creating kernel modules with
> > system-crashing bugs, memory leaks and kernel buffer overflows
> > (chroot+nobody+logging won't work anymore). (Plus they will get into
> > pain while debugging.)
>
> I'm glad _someone_ is connected to reality with regards the security
> implications of throwing loads of servers into kernel space.
>

If we implement a ring 0 Linux, all of this will remain intact without
the need to port modules into the kernel at all.

Jeff

> Cheers
> Chris

2000-10-30 18:08:44

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



john slee wrote:
>
> On Mon, Oct 30, 2000 at 03:06:25AM -0700, Jeff V. Merkey wrote:
> > Ingo, This original thread was regarding Linux vs. NetWare 5.x performance
> > metrics and responses from Linux folks about how to affect and
> > improve them, not a diatribe on the features of TUX.
>
> while beating netware in certain areas is certainly a noble goal,
> it is far from the only one.

You're up on that higher plane again. Send me some 'shrooms, please.

>
> [snip]
>
> > If you have some ideas on how to improve file and print or help me
> > get a linux incarnation that can stomp NetWare, I'd love to hear
> > your ideas. I think TUX is great, BTW. Otherwise, end-of-line...
>
> if netware users are happy with netware, i can't see why they'd want to
> switch to linux or a linux-netware-ish equivalent while netware are
> still around to provide support. "ain't broke, don't fix," etc.

Linux has great internet connectivity, and a real application layer with
lots of apps. NetWare customers would love it. I am one, and I love it
already.

Jeff

>
> j.

2000-10-30 18:22:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:01:25AM -0700, Jeff V. Merkey wrote:
> If we implement a ring 0 Linux, all of this will remain intact without
> the need to port modules into the kernel at all.

I don't understand what you mean, sorry; could you elaborate?

Andrea

2000-10-30 18:35:47

by Alan

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> context switches. Profiling ring 0 Linux vs. NetWare will give me an
> excellent idea of where the optimizations will need to be inserted. A
> straight MARS-NWE port to the kernel would just happen, since we would
> be able to load it in kernel space and run it with no code changes.

There is one bunch of people running Linux in a flat memory space with no
protection, although their goal was to make Linux run on MMU-less embedded
hardware.

See http://www.uclinux.org; the uclinux guys started a 2.4 port recently.
Basically the idea is to have an mm-nommu/ directory which implements a
mostly compatible replacement for the mm layer (obviously stuff like mmap
doesn't work without an MMU and fork is odd), and a set of binary loaders
to load flat binaries with relocations.

That, I think, is the project that overlaps.

2000-10-30 19:12:43

by Dan Hollis

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Andrea Arcangeli wrote:
> TUX modules are kernel modules (I mean you have to write kernel space
> code for doing TUX ftp). Don't you agree that zero-copy sendfile-like
> ftp serving would be able to perform equally well too?

For this to be useful for ftp we need a sendfile() that can write from a
socket to a disk file as well.
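[For contrast, a sketch of what the upload direction has to do without
such a socket-to-file sendfile(): a plain read()/write() loop, with every
byte copied through a userspace buffer once in each direction. The helper
name and buffer size are arbitrary, chosen for illustration:]

#include <unistd.h>
#include <sys/types.h>

/* receive from a connected socket into an already-open file;
   every byte crosses userspace -- the copy sendfile() avoids
   in the other direction */
ssize_t sock_to_file(int sock, int fd)
{
    char buf[65536];
    ssize_t total = 0, n;

    while ((n = read(sock, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                 /* write() may be partial */
            ssize_t w = write(fd, buf + off, n - off);
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}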

-Dan

2000-10-30 21:22:37

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


Thanks,

It will make merging the MANOS kernel happen faster. My DLL prototypes are
using subsets of Linux 2.2.16 for MANOS at present, and what I really need
is for the support issues to dovetail into a supported effort. This one
might fit the bill. I have no desire for TRG to support the hundreds of
LAN and disk drivers all by our little lonesome in a divergent code base.

Jeff

Alan Cox wrote:
>
> > context switches. Profiling ring 0 Linux vs. NetWare will give me an
> > excellent idea of where the optimizations will need to be inserted. A
> > straight MARS-NWE port to the kernel would just happen, since we would
> > be able to load it in kernel space and run it with no code changes.
>
> There is one bunch of people running Linux in a flat memory space with
> no protection, although their goal was to make Linux run on MMU-less
> embedded hardware.
>
> See http://www.uclinux.org; the uclinux guys started a 2.4 port recently.
> Basically the idea is to have an mm-nommu/ directory which implements a
> mostly compatible replacement for the mm layer (obviously stuff like mmap
> doesn't work without an MMU and fork is odd), and a set of binary loaders
> to load flat binaries with relocations.
>
> That, I think, is the project that overlaps.
>

2000-10-30 23:27:13

by David Woodhouse

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Ingo Molnar wrote:

> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > Is there an option to map Linux into a flat address space [...]
>
> nope, Linux is fundamentally multitasked.

uClinux may be able to do this, at the cost of a dramatically reduced
userspace functionality.

--
dwmw2


2000-10-30 23:53:48

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


David/Alan,

Andre Hedrick is now the CTO of TRG and Chief Scientist over Linux
Development. After talking to him, we are going to do our own ring 0 2.4
and 2.2.x code bases for the MANOS merge. uClinux is interesting, but I
agree it is limited.

MANOS schedules should be unaffected. The current DLL prototype of
Linux 2.2 is ring 0, but I shudder at trying to merge all the changes
I've done to it into core 2.2.X as a .config option. There's also the
gravity-well force of different views on this effort. With Andre on the
job, I am more confident in co-opting the Linux drivers, just biting the
bullet on the support issues, and doing a full fork of Linux.

Jeff

David Woodhouse wrote:
>
> On Mon, 30 Oct 2000, Ingo Molnar wrote:
>
> > On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
> >
> > > Is there an option to map Linux into a flat address space [...]
> >
> > nope, Linux is fundamentally multitasked.
>
> uClinux may be able to do this, at the cost of a dramatically reduced
> userspace functionality.
>
> --
> dwmw2

2000-10-31 06:59:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> Ingo's helping me get the info together on this for putting a MARS-NWE
> TUX module in the kernel. [...]

TUX modules are user-space, so i certainly cannot help you in 'putting
MARS-NWE in the kernel'. While you (apparently) are trying to move server
applications into ring0, i agree with Andrea and i'm trying to move kernel
functionality out to user-space.

> He had to go do some things this week, he told me, before he would be
> ready to look at it. He did point me over to the info, and I agreed we
> would attempt to implement it as something to look at. If it performs
> well enough, I will have something reasonable to send out to Novell
> Resellers (CNEs) and Customers.

All i did was to inform you that the next release of TUX is imminent and
that you might want to take a look at the new code. You interpreted that
in a very interesting way. You are certainly free and welcome to take a
look at any code and documentation released, but as visible in the past
couple of email exchanges, our technical views about Linux networking
scalability differ in fundamental ways.

Ingo

2000-10-31 09:21:25

by Erik Andersen

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon Oct 30, 2000 at 06:34:55PM +0000, Alan Cox wrote:
>
> See http://www.uclinux.org; the uclinux guys started a 2.4 port recently.
> Basically the idea is to have an mm-nommu/ directory which implements a
> mostly compatible replacement for the mm layer (obviously stuff like mmap
> doesn't work without an MMU and fork is odd), and a set of binary loaders
> to load flat binaries with relocations.

mmap works -- you just can't do MAP_PRIVATE. Everything has to be
MAP_SHARED. There is no fork (though vfork works). brk doesn't work,
and things like iopl and ioperm are pointless, etc.

Oh, and when you do malloc something (malloc is mmap based), you really
want to remember to free it, since the system can't clean up after you.
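[A minimal sketch of the pattern a no-MMU box forces: vfork() followed by
an immediate exec, since there is no fork(). The /bin/echo command is just
a placeholder:]

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = vfork();   /* vfork shares the parent's memory */

    if (pid == 0) {
        /* child: exec (or _exit) immediately, touching as little
           of the shared state as possible */
        execl("/bin/echo", "echo", "hello from a no-mmu box", (char *)0);
        _exit(127);        /* reached only if the exec failed */
    } else if (pid > 0) {
        int status;
        waitpid(pid, &status, 0);  /* parent resumes once the child execs or exits */
    } else {
        perror("vfork");
        return 1;
    }
    return 0;
}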

-Erik

--
Erik B. Andersen email: [email protected]
--This message was written using 73% post-consumer electrons--

2000-10-31 20:04:21

by Pavel Machek

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Hi!

> > > This is putrid. NetWare does 353,000,000/second on a Xeon, pumping
> > > out gobs of packets in between them. MANOS does 857,000,000/second.
> > > This is terrible. No wonder it's so f_cking slow!!!
>
> > And please check your numbers: 857 million context switches per second
> > means that on a 1 GHz CPU you do one context switch per 1.16 clock
> > cycles. Wow!
>
> Excuse me, 857,000,000 instructions executed and 460,000,000 context
> switches a second -- on a PII system at 350 MHz. It's due to AGI
> optimization.

That's more than one context switch per clock. I do not think
so. Really go and check those numbers.
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]

2000-10-31 20:05:31

by Pavel Machek

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Hi!

> > TUX modules are kernel modules (I mean you have to write kernel space
> > code for doing TUX ftp). Don't you agree that zero-copy sendfile-like
> > ftp serving would be able to perform equally well too?
>
> For this to be useful for ftp we need a sendfile() that can write from a
> socket to a disk file as well.

I had a patch to fix sendfile this way... Sendfile is really ugly as of
now. (It basically fell back to read/write, giving only a small
performance advantage, but it made things cleaner.)

Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]

2000-10-31 20:08:21

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
>
> All i did was to inform you that the next release of TUX is imminent and
> that you might want to take a look at the new code. You interpreted that
> in a very interesting way.

I seem to remember a "don't post this email on the list, here's what's up"
message from you; I honored your instructions and did not post it, nor
will I, but I recall some reasonable suggestions from it. :-)

> You are certainly free and welcome to take a
> look at any code and documentation released, but as visible in the past
> couple of email exchanges, our technical views about Linux networking
> scalability differ in fundamental ways.

I've been working on LAN scalability for almost 20 years, and NetWare
LAN performance vs. Linux (or TUX) speaks for itself. I would agree that
someone coming from a Unix background would have radically different
views than someone coming from the group who created the networking
industry that made "LAN" a household word to the world at large.

It's the differences of all those who work on Linux that make it such a
potent mix of skills and views. I think this is a very good thing, and I
take your comment as a compliment.

:-)

Jeff

> Ingo

2000-10-31 20:11:21

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Pavel Machek wrote:
>
> Hi!
>
> > > > This is putrid. NetWare does 353,000,000/second on a Xeon, pumping
> > > > out gobs of packets in between them. MANOS does 857,000,000/second.
> > > > This is terrible. No wonder it's so f_cking slow!!!
> >
> > > And please check your numbers: 857 million context switches per
> > > second means that on a 1 GHz CPU you do one context switch per 1.16
> > > clock cycles. Wow!
> >
> > Excuse me, 857,000,000 instructions executed and 460,000,000 context
> > switches a second -- on a PII system at 350 MHz. It's due to AGI
> > optimization.
>
> That's more than one context switch per clock. I do not think
> so. Really go and check those numbers.


Pavel, the optimization exploits the multiple pipelines in Intel's
processors, and yes, it does execute more than one instruction per clock;
it's optimized to execute in the processor's parallel pipelines. The EMON
numbers are accurate, and you can download the kernel and verify them for
yourself. These types of optimizations are possible when people have
access to Intel Red Cover documents; then you get to know just how
Intel's internal architectures are affected by different coding
optimizations.
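[A sanity-check sketch anyone can run on a Pentium-class box: read the
time-stamp counter with rdtsc around a code region to count elapsed
cycles. GCC inline assembly, x86 only; the region under test is a
placeholder:]

#include <stdio.h>

/* read the time-stamp counter (Pentium and later) */
static unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    unsigned long long start, end;

    start = rdtsc();
    /* ... code path under test goes here ... */
    end = rdtsc();

    printf("elapsed cycles: %llu\n", end - start);
    return 0;
}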

Jeff
> Pavel
> --
> I'm [email protected]. "In my country we have almost anarchy and I don't care."
> Panos Katsaloulis describing me w.r.t. patents at [email protected]

2000-10-31 20:17:52

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



"Jeff V. Merkey" wrote:
>
> Pavel Machek wrote:
> >
> > Hi!
> >
> > > > > This is putrid. NetWare does 353,000,000/second on a Xeon,
> > > > > pumping out gobs of packets in between them. MANOS does
> > > > > 857,000,000/second. This is terrible. No wonder it's so f_cking
> > > > > slow!!!
> > >
> > > > And please check your numbers: 857 million context switches per
> > > > second means that on a 1 GHz CPU you do one context switch per
> > > > 1.16 clock cycles. Wow!
> > >
> > > Excuse me, 857,000,000 instructions executed and 460,000,000 context
> > > switches a second -- on a PII system at 350 MHz. It's due to AGI
> > > optimization.
> >
> > That's more than one context switch per clock. I do not think
> > so. Really go and check those numbers.
>
> Pavel, the optimization exploits the multiple pipelines in Intel's
> processors, and yes, it does execute more than one instruction per
> clock; it's optimized to execute in the processor's parallel pipelines.
> The EMON numbers are accurate, and you can download the kernel and
> verify them for yourself. These types of optimizations are possible
> when people have access to Intel Red Cover documents; then you get to
> know just how Intel's internal architectures are affected by different
> coding optimizations.
>
> Jeff

There's also another optimization in this kernel that allows it to
achieve greater than 100% scaling per processor by using a strong
affinity algorithm (I hold the patent on this algorithm, and by posting
code based on it under the GPL, I have released it to the general
public).

It relies on an anomaly in the design of Intel's cache controllers,
and with memory-based applications, I can get 120% scaling per
processor by juggling the working set of executable code cached across
each processor. There's sample code with this kernel you can use to
verify....

:-)

Jeff


> > Pavel
> > --
> > I'm [email protected]. "In my country we have almost anarchy and I don't care."
> > Panos Katsaloulis describing me w.r.t. patents at [email protected]

2000-10-31 20:21:43

by Reto Baettig

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

When I'm following this thread, you guys seem to forget the _basics_:
The Linux networking stack sucks!

Everybody tries to work around the networking stack. We just recently
developed an RPC protocol which does 180 MBytes/second (over a Quadrics
network) because the Linux network layer was way too slow. At speeds
above 100 MBytes/s, copies start to hurt.

Why not solve the problem at the source and completely redesign the
network stack? Get rid of the old sk_buff & co! Rip the whole network
layer out! Redesign it and give the user a possibility of Zero-Copy
networking!

Reto

2000-10-31 20:22:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

> It relies on an anomaly in the design of Intel's cache controllers,
> and with memory-based applications, I can get 120% scaling per
> processor by juggling the working set of executable code cached across
> each processor. There's sample code with this kernel you can use to
> verify....

FYI, this is a very old concept and a scalability FAQ item. It's called
"sublinear scaling", and SGI folks have already published articles about
it 10 years ago.

Ingo

2000-10-31 20:25:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Pavel Machek wrote:

> > Excuse me, 857,000,000 instructions executed and 460,000,000
> > context switches a second -- on a PII system at 350 Mhz. [...]

> That's more than one context switch per clock. I do not think so.
> Really go and check those numbers.

yep, you cannot have 460 million context switches on that system,
unless you have some Clintonesque definition for 'context switch' ;-)

Ingo

2000-10-31 20:26:43

by Alan

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> Why not solve the problem at the source and completely redesign the
> network stack? Get rid of the old sk_buff & co! Rip the whole network
> layer out! Redesign it and give the user a possibility of Zero-Copy
> networking!

For one because you don't need to do that to get zero-copy networking for
most real-world cases. Tux already implements the necessary infrastructure
for these.

2000-10-31 20:33:55

by Reto Baettig

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Alan Cox wrote:
>
> > Why not solve the problem at the source and completely redesign the
> > network stack? Get rid of the old sk_buff & co! Rip the whole network
> > layer out! Redesign it and give the user a possibility of Zero-Copy
> > networking!
>
> For one because you don't need to do that to get zero-copy networking
> for most real-world cases. Tux already implements the necessary
> infrastructure for these.

And what if I'd like to use the network for something different than
html?

I'm sorry but I don't know TUX. Does it implement its own TCP stack?

Reto

2000-10-31 20:37:06

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, 31 Oct 2000, Reto Baettig wrote:

> When I'm following this thread, you guys seem to forget the
> _basics_: The Linux networking stack sucks!

Ummm, last I looked Linux held the Specweb99 record;
by a wide margin...

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

2000-10-31 20:37:06

by Alan

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> And what if I'd like to use the network for something different than
> html?

Read the tux source. Then come back and ask sensible questions


2000-10-31 20:46:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


> "sublinear scaling",
^-- extralinear. whatever.

2000-10-31 20:49:49

by Jesse Pollard

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


>
> > And what if I'd like to use the network for something different than
> > html?
>
> Read the tux source. Then come back and ask sensible questions

Also pay attention to the security aspects of a true "zero copy" TCP stack.
It means that SOMETIMES a user buffer will receive data that is destined
for a different process.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2000-10-31 20:51:09

by Reto Baettig

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Rik van Riel wrote:
> Ummm, last I looked Linux held the Specweb99 record;
> by a wide margin...

...does that remove any memory copies???

Being the best does not mean there's no room for improvement.

Can anybody please help me and tell me where to start understanding what
tux does?

http://www.tux.org does not seem to be the right place :-(

I don't want to make Linux look bad or step on anybody's toes. I'd just
like to improve Linux like everybody else, and I do not know every single
piece of source code floating around the world by heart ;-)

Reto

2000-10-31 20:58:29

by Alan

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> Also pay attention to the security aspects of a true "zero copy" TCP
> stack. It means that SOMETIMES a user buffer will receive data that is
> destined for a different process.

The moment you try to do zero copy like that, you end up playing so many
MMU games that the copy is faster. We do zero copy from the kernel side
of the universe; that's a lot, lot saner.

2000-10-31 21:05:59

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, 31 Oct 2000, Reto Baettig wrote:
> Rik van Riel wrote:
> > Ummm, last I looked Linux held the Specweb99 record;
> > by a wide margin...
>
> ...does that remove any memory copies???

> I don't want to make linux bad or stand on anybodys toes.

Good to know, your previous message might have fooled some
people when it comes to these intentions ;)

--------------
On Tue, 31 Oct 2000, Reto Baettig wrote:

> When I'm following this thread, you guys seem to forget the
> _basics_: The Linux networking stack sucks!
--------------

cheers,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/



2000-10-31 21:36:59

by Paul Menage

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, 31 Oct 2000, Rik van Riel wrote:
>
>Ummm, last I looked Linux held the Specweb99 record;
>by a wide margin...
>

... but since then IBM/Zeus appear to have taken the lead:

http://www.zeus.com/news/articles/001004-001/
http://www.spec.org/osg/web99/results/res2000q3/

But they were using a somewhat beefier machine - has anyone got Tux
SpecWeb99 figures for a 12 CPU, 64 GB, 12 NIC system?

Paul

2000-10-31 21:38:09

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Rik van Riel wrote:
>
> On Tue, 31 Oct 2000, Reto Baettig wrote:
>
> > When I'm following this thread, you guys seem to forget the
> > _basics_: The Linux networking stack sucks!
>
> Ummm, last I looked Linux held the Specweb99 record;
> by a wide margin...
>

It doesn't hold the file and print scaling record. NetWare does..

Jeff

> Rik
> --
> "What you're running that piece of shit Gnome?!?!"
> -- Miguel de Icaza, UKUUG 2000
>
> http://www.conectiva.com/ http://www.surriel.com/
>

2000-10-31 21:47:23

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Alan Cox wrote:
>
> > Why not solve the problem at the source and completely redesign the
> > network stack? Get rid of the old sk_buff & co! Rip the whole network
> > layer out! Redesign it and give the user a possibility of Zero-Copy
> > networking!
>
> For one because you don't need to do that to get zero-copy networking
> for most real-world cases. Tux already implements the necessary
> infrastructure for these.

The code in the networking layer is fine; in fact, it's absolutely
great. This is not the problem. The problem is all the clocks wasted
reloading TLBs and the background memory activity caused by Linux's
heavy dependence on Intel's protection hardware model.

Step 1 is to load the entire OS at ring 0 with protection completely
disabled system-wide, put in the kernel optimizations for AGI and
context switch locking, and stub the top half of Linus's scheduler. The
global memory structures in his kernel may or may not hurt performance,
depending on how they are accessed by multiple processors. I will need
to start profiling with a ring 0 port first and determine the frequency
of access that's hardware measured.

The next step would be to start peeling off different subsystems and
re-parallelizing them on the merged kernel. There's an awful lot of
areas in Linus's kernel that are going to be a problem, but I'll work
through them one at a time. When I first parallelized NetWare, I wrote
an independent SMP kernel, then loaded the entire NetWare 4.1 image as
a single process. The next step was to start peeling off layers from
NetWare and plugging them in one by one and parallelizing them. I am
using the same approach here. When I was finished, I had peeled NetWare
like a banana and completely reintegrated it on a new kernel. This
approach works because it allows you to stage each layer.

The next step will be to modify the loader to allow the existing
protection scheme to exist in true user space. WRDs will hold off CR3
switching so long as I/O is coming in or out of the box. I anticipate
this will take until March of next year to get right.

Jeff







2000-10-31 21:49:03

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Paul Menage wrote:
>
> On Tue, 31 Oct 2000, Rik van Riel wrote:
> >
> >Ummm, last I looked Linux held the Specweb99 record;
> >by a wide margin...
> >
>
> ... but since then IBM/Zeus appear to have taken the lead:
>
> http://www.zeus.com/news/articles/001004-001/
> http://www.spec.org/osg/web99/results/res2000q3/
>
> But they were using a somewhat beefier machine - has anyone got Tux
> SpecWeb99 figures for a 12 CPU, 64 GB, 12 NIC system?

NetWare holds the file and print scaling record, which is what this thread
is about, not web servers copying read only data from cache...

Jeff

>
> Paul

2000-10-31 21:49:13

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
> Rik van Riel wrote:
> > On Tue, 31 Oct 2000, Reto Baettig wrote:
> >
> > > When I'm following this thread, you guys seem to forget the
> > > _basics_: The Linux networking stack sucks!
> >
> > Ummm, last I looked Linux held the Specweb99 record;
> > by a wide margin...
>
> It doesn't hold the file and print scaling record. NetWare
> does..

Indeed, we haven't made a file serving plugin for
the TUX zero-copy stuff yet...

Oh, and I haven't found a bunch of printers yet that are
fast enough to beat the printserving record ;))
*runs like hell*

cheers,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

2000-10-31 21:54:53

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



"Jeff V. Merkey" wrote:
>
> The code in the networking layer is fine; in fact, it's absolutely
> great. This is not the problem. The problem is all the clocks wasted
> reloading TLBs and the background memory activity caused by Linux's
> heavy dependence on Intel's protection hardware model. [...]
>


One other point. You guys can try as hard as you can to hack around
this problem, but it will never scale like NetWare unless a
fundamentally different approach is adopted. I've built this before,
and what we are doing is building all the Linux functionality, with the
identical code -- basically a Linux NetWare -- into a NetWare
framework. All the protection can still be there; it just needs to be
restructured with some LAN-based design assumptions. I think the
industry will be pleased with the result. I do not believe any of the
proposed hacks will get there unless this different approach is
investigated.

Jeff





2000-10-31 21:56:43

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Ingo Molnar wrote:
>
> On Tue, 31 Oct 2000, Pavel Machek wrote:
>
> > > Excuse me, 857,000,000 instructions executed and 460,000,000
> > > context switches a second -- on a PII system at 350 Mhz. [...]
>
> > That's more than one context switch per clock. I do not think so.
> > Really go and check those numbers.
>
> yep, you cannot have 460 million context switches on that system,
> unless you have some Clintonesque definition for 'context switch' ;-)

The numbers don't lie. You know where the code is. You'll notice that
there is a version of the kernel hand-coded in assembly language. You'll
also notice that it's SMP and takes ZERO LOCKS during context switching;
in fact, most of the design is completely lockless.
Jeff

>
> Ingo
>

2000-10-31 21:57:33

by Reto Baettig

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Rik

Is there any documentation about the Tux zero-copy implementation so
that I don't have to read half of the 2.4 kernel sources before having a
clue?

Are the kernel changes going to be in the mainstream kernel?

Does Tux implement a new interface so that a userspace app can do
zero-copy stuff with the network?

Sorry for the newbie questions, but it's really hard to find information
about Tux (other than the holy source code, of course ;-)

TIA

Reto

Rik van Riel wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
> > Rik van Riel wrote:
> > > On Tue, 31 Oct 2000, Reto Baettig wrote:
> > >
> > > > When I'm following this thread, you guys seem to forget the
> > > > _basics_: The Linux networking stack sucks!
> > >
> > > Ummm, last I looked Linux held the Specweb99 record;
> > > by a wide margin...
> >
> > It doesn't hold the file and print scaling record. NetWare
> > does..
>
> Indeed, we haven't made a file serving plugin for
> the TUX zero-copy stuff yet...
>
> Oh, and I haven't found a bunch of printers yet that are
> fast enough to beat the printserving record ;))
> *runs like hell*
>
> cheers,
>
> Rik
> --
> "What you're running that piece of shit Gnome?!?!"
> -- Miguel de Icaza, UKUUG 2000
>
> http://www.conectiva.com/ http://www.surriel.com/

2000-10-31 21:57:43

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Rik van Riel wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
> > Rik van Riel wrote:
> > > On Tue, 31 Oct 2000, Reto Baettig wrote:
> > >
> > > > When I'm following this thread, you guys seem to forget the
> > > > _basics_: The Linux networking stack sucks!
> > >
> > > Ummm, last I looked Linux held the Specweb99 record;
> > > by a wide margin...
> >
> > It doesn't hold the file and print scaling record. NetWare
> > does..
>
> Indeed, we haven't made a file serving plugin for
> the TUX zero-copy stuff yet...
>
> Oh, and I haven't found a bunch of printers yet that are
> fast enough to beat the printserving record ;))

You got me on this one. I would also agree that NPDS is the worst piece
of crap ever written.

:-)

Jeff


> *runs like hell*
>
> cheers,
>
> Rik
> --
> "What you're running that piece of shit Gnome?!?!"
> -- Miguel de Icaza, UKUUG 2000
>
> http://www.conectiva.com/ http://www.surriel.com/

2000-10-31 21:58:53

by David Miller

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Date: Tue, 31 Oct 2000 14:44:51 -0700
From: "Jeff V. Merkey" <[email protected]>

not web servers copying read only data from cache...

Actually, a sizable portion of SpecWEB99 is dynamic content, so it's
not all read-only.

Later,
David S. Miller
[email protected]

2000-10-31 21:59:44

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, 31 Oct 2000, Reto Baettig wrote:

> Is there any documentation about the Tux zero-copy
> implementation so that I don't have to read half of the 2.4
> kernel sources before having a clue?

Reading the 2.4 sources won't do you much good since
the Tux layer isn't integrated ;)

> Are the kernel changes going to be in the mainstream kernel?

Dunno, ask Ingo and/or Linus...

> Sorry for the newbie questions, but it's really hard to find
> information about Tux (other than the holy source code, of
> course ;-)

Maybe there's some documentation included with the Tux
source code or maybe there's something on the Tux web
page? Have you tried looking there?

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

2000-10-31 22:02:04

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Ingo Molnar wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
>
> > It relies on an anomaly in the design of Intel's cache controllers,
> > and with memory-based applications, I can get 120% scaling per
> > processor by juggling the working set of executable code cached across
> > each processor. There's sample code with this kernel you can use to
> > verify....
>
> FYI, this is a very old concept and a scalability FAQ item. It's called
> "sublinear scaling", and SGI folks have already published articles about
> it 10 years ago.

Ingo,

You don't even know what it is enough to comment on it intelligently.
You can write the patent office and obtain a copy. The patent is
currently in dispute between Novell and several other companies over S&E
ownership, and there's a court hearing scheduled to resolve it (luckily
I don't have to deal with this one). The nice thing about being an
inventor, though, is that I have rights to it no matter who ends up with
the S&E assignment (dogs fighting over a bone...).
Jeff

>
> Ingo

2000-10-31 22:02:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

> > > > Excuse me, 857,000,000 instructions executed and 460,000,000
> > > > context switches a second -- on a PII system at 350 Mhz. [...]
> >
> > > That's more than one context switch per clock. I do not think so.
> > > Really go and check those numbers.
> >
> > yep, you cannot have 460 million context switches on that system,
> > unless you have some Clintonesque definition for 'context switch' ;-)
>
> The numbers don't lie. [...]

sure ;) I can do infinite context switches! You don't believe me? See:

#define schedule() do { } while (0)

[there is one small restriction: it should only be used in single-task
systems.]

Ingo

2000-10-31 22:05:54

by Larry McVoy

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

{lots of perf stuff deleted}

I'm posting this to point out that Linux networking is getting better at
a substantial pace.

I've already sent this to Davem and Linus a while back, but I have a
pretty nice lab here at BitMover, 4 100Mbit switched networks, servers
with 4 cards, and enough clients to generate load. I actually have
two servers, both of which have a NIC on each network; one server has
2.2.15pre9 on it and the other has 2.4.0-test5 on it.

I don't have a lot of spare time, but if you are one of the kernel
developers and you have tests you want run, contact me privately.

I ran some tests to see how things have changed. What follows are the
details, the short summary is that 2.4 looks to me to be about 2x better
in both latency and bandwidth, no mean feat. I'm very impressed with
this, and I'm especially tickled to see the hand that Dave has had in
this, he's really come into his own as a senior kernel hacker. I'm sure
he doesn't need me to stroke his ego, but I'm doing it anyways because
I'm proud of him (with no disrespect to the many other people who have
worked on this intended).

So here's what I did. I fired up the lat_tcp and bw_tcp servers from
lmbench on the server and then generated load from all the clients.
I noodled around until I found the right mix which gave the best numbers
and that's roughly what is reported below. I don't have the 2.2 numbers
handy but I can get them if you care, it was very close to 2x worse,
like about 1.9x or so.

The server is running Linux 2.4 test9, I believe. It has 3 Intel EEpro's
and one 3c905B. It's a GHz K7.

Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 8).
Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (#2) (rev 8).
Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (#3) (rev 8).
Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 48).

All are going into Netgear Fs308 8 port switches. There are 13 clients,
mostly Intel Linux boxes, but various others as well, let me know if
you care. A couple of the clients were behind two levels of switches
(I have 6 here).

Run a single copy of lat_tcp on each client against the server, we see:
load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt int ctx usr sys idl
4.68 443M 21M 0 0 0 0 0 0 0 42K 39K 55K 46K 4 96 0
4.68 443M 21M 0 0 2.0K 0 0 0 0 40K 38K 55K 44K 2 98 0
4.68 443M 21M 0 0 0 0 0 0 0 40K 38K 55K 44K 3 97 0
4.55 443M 21M 0 0 0 0 0 0 0 42K 40K 54K 48K 4 96 0
4.55 443M 21M 0 0 0 0 0 0 0 41K 39K 54K 45K 3 97 0
4.50 443M 21M 0 0 0 0 0 0 0 40K 38K 54K 44K 2 98 0
4.50 443M 21M 0 0 0 0 0 0 0 41K 38K 55K 44K 3 97 0
4.50 443M 21M 0 0 0 0 0 0 0 41K 41K 54K 45K 7 93 0
4.86 443M 21M 0 0 0 0 0 0 0 38K 38K 54K 44K 3 97 0


OK, now bandwidth. Each client is capable of getting at least 11MB/sec from
the server when run one at a time. I ran just 4 clients, one per network.

load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt int ctx usr sys idl
0.28 444M 22M 0 0 0 0 0 0 0 14K 27K 15K 2.9K 2 55 43
0.28 444M 22M 0 0 0 0 0 0 0 14K 29K 16K 3.1K 2 66 32
0.26 444M 22M 0 0 0 0 0 0 0 14K 29K 16K 3.0K 1 67 32
0.26 444M 22M 0 0 0 0 0 0 0 15K 29K 16K 3.0K 1 65 34
0.24 444M 22M 0 0 0 0 0 0 0 15K 29K 16K 3.0K 0 70 30
0.24 444M 22M 0 0 0 0 0 0 0 15K 29K 16K 3.0K 0 63 37
0.24 444M 22M 0 0 0 0 0 0 0 14K 28K 16K 3.0K 1 62 37
0.22 444M 22M 0 2.0K 0 0 0 0 0 14K 28K 16K 2.9K 1 65 34

It works out to an average of 10.4MB/sec per client, or 41.6MB/sec at the
server, on a PCI/32 @ 33MHz bus. Same GHz server. Note the idle cycles;
bandwidth is a lot easier than latency.

Hope this is useful to someone.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2000-10-31 22:06:04

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 02:52:11PM -0700, Jeff V. Merkey wrote:
> The numbers don't lie. You know where the code is. You'll notice that
> there is a version of the kernel hand-coded in assembly language. You'll
> also notice that it's SMP and takes ZERO LOCKS during context switching;
> in fact, most of the design is completely lockless.

I suspect most of the confusion in this thread comes about because you
seem to use a different definition of context switch than Ingo and others.
Could you explain what exactly you mean by a context switch?

-Andi

2000-10-31 22:19:47

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


Larry,

The quality of the networking code in Linux is quite excellent. There
are some scaling problems relative to NetWare. We are firmly committed
to getting something out with a Linux code base and NetWare metrics.
I'd love to have your help.

Jeff

Larry McVoy wrote:
>
> {lots of perf stuff deleted}
>
> I'm posting this to point out that Linux networking is getting better at
> a substantial pace.
>
> I've already sent this to Davem and Linus a while back, but I have a
> pretty nice lab here at BitMover, 4 100Mbit switched networks, servers
> with 4 cards, and enough clients to generate load. I actually have
> two servers both of which have a NIC on each network; one server has
> .2.15pre9 on it and the other has 2.4.0-test5 on it.
>
> I don't have a lot of spare time, but if you are one of the kernel
> developers and you have tests you want run, contact me privately.
>
> I ran some tests to see how things have changed. What follows are the
> details, the short summary is that 2.4 looks to me to be about 2x better
> in both latency and bandwidth, no mean feat. I'm very impressed with
> this, and I'm especially tickled to see the hand that Dave has had in
> this, he's really come into his own as a senior kernel hacker. I'm sure
> he doesn't need me to stroke his ego, but I'm doing it anyways because
> I'm proud of him (with no disrespect to the many other people who have
> worked on this intended).
>
> So here's what I did. I fired up the lat_tcp and bw_tcp servers from
> lmbench on the server and then generated load from all the clients.
> I noodled around until I found the right mix which gave the best numbers
> and that's roughly what is reported below. I don't have the 2.2 numbers
> handy but I can get them if you care, it was very close to 2x worse,
> like about 1.9x or so.
>
> The server is running Linux 2.4 test9, I believe. It has 3 Intel EEpro's
> and one 3c905B. It's a Ghz K7.
>
> Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 8).
> Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (#2) (rev 8).
> Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (#3) (rev 8).
> Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 48).
>
> All are going into Netgear Fs308 8 port switches. There are 13 clients,
> mostly Intel Linux boxes, but various others as well, let me know if
> you care. A couple of the clients were behind two levels of switches
> (I have 6 here).
>
> Run a single copy of lat_tcp on each client against the server, we see:
> load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt int ctx usr sys idl
> 4.68 443M 21M 0 0 0 0 0 0 0 42K 39K 55K 46K 4 96 0
> 4.68 443M 21M 0 0 2.0K 0 0 0 0 40K 38K 55K 44K 2 98 0
> 4.68 443M 21M 0 0 0 0 0 0 0 40K 38K 55K 44K 3 97 0
> 4.55 443M 21M 0 0 0 0 0 0 0 42K 40K 54K 48K 4 96 0
> 4.55 443M 21M 0 0 0 0 0 0 0 41K 39K 54K 45K 3 97 0
> 4.50 443M 21M 0 0 0 0 0 0 0 40K 38K 54K 44K 2 98 0
> 4.50 443M 21M 0 0 0 0 0 0 0 41K 38K 55K 44K 3 97 0
> 4.50 443M 21M 0 0 0 0 0 0 0 41K 41K 54K 45K 7 93 0
> 4.86 443M 21M 0 0 0 0 0 0 0 38K 38K 54K 44K 3 97 0
>
> OK, now bandwidth. Each client is capable of getting at least 11MB/sec from
> the server when run one at a time. I ran just 4 clients, one per network.
>
> load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt int ctx usr sys idl
> 0.28 444M 22M 0 0 0 0 0 0 0 14K 27K 15K 2.9K 2 55 43
> 0.28 444M 22M 0 0 0 0 0 0 0 14K 29K 16K 3.1K 2 66 32
> 0.26 444M 22M 0 0 0 0 0 0 0 14K 29K 16K 3.0K 1 67 32
> 0.26 444M 22M 0 0 0 0 0 0 0 15K 29K 16K 3.0K 1 65 34
> 0.24 444M 22M 0 0 0 0 0 0 0 15K 29K 16K 3.0K 0 70 30
> 0.24 444M 22M 0 0 0 0 0 0 0 15K 29K 16K 3.0K 0 63 37
> 0.24 444M 22M 0 0 0 0 0 0 0 14K 28K 16K 3.0K 1 62 37
> 0.22 444M 22M 0 2.0K 0 0 0 0 0 14K 28K 16K 2.9K 1 65 34
>
> It works out to an average of 10.4MB/sec per client or 41.6MB/sec on the
> server on a PCI/32 @ 33Mhz bus. Same Ghz server. Note the idle cycles,
> bandwidth is a lot easier than latency.
>
> Hope this is useful to someone.
> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
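
(For anyone reproducing a setup like this: in lmbench, the server side of
these two tests is started with "lat_tcp -s" and "bw_tcp -s", and each
client then runs "lat_tcp <server>" or "bw_tcp <server>" against it.)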

2000-10-31 22:27:57

by Larry McVoy

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 03:15:37PM -0700, Jeff V. Merkey wrote:
> The quality of the networking code in Linux is quite excellent. There are
> some scaling problems relative to NetWare. We are firmly committed to
> getting something out with a Linux code base and NetWare metrics. Love
> to have your help.

Jeff, I'm a little concerned with some of your statements. Netware may
be the greatest thing since sliced bread, but it isn't a full operating
system, so comparing it to Linux is sort of meaningless. Consider your
recent context switch claims. Yes, I believe that you can do the moral
equiv of a longjmp() in the kernel in a few cycles, but that isn't a
context switch, at least, it isn't the same as a context switch in the
operating system sense. It's different - last I checked, Netware was
essentially a kernel and nothing else. Is there a file system? Are there
processes with virtual memory? Are they preemptive? Does it support
all of P1003.1? Etc. If the answers to all of the above are "yes"
and you can support all that and get user to user context switches in a
clock cycle, well, jeez, you really do walk on water and I'll publicly
apologize for ever doubting your statements. On the other hand, if the
answers to that are not all "yes", then how about you do a little truth
in advertising with your postings? Without it, they are misleading to
the point of being purposefully deceptive.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
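
Larry's "moral equiv of a longjmp()" point can be made concrete in a few
lines of portable C: setjmp()/longjmp() save and reload the callee-saved
registers and the stack pointer -- essentially the two-mov switch discussed
below -- yet involve no scheduler, no privilege transition, and no
address-space (CR3) change. A minimal sketch, not anyone's kernel code:

#include <setjmp.h>
#include <stdio.h>

int main(void)
{
        jmp_buf ctx;
        volatile int passes = 0;   /* volatile: modified between jumps */

        setjmp(ctx);               /* save registers + stack pointer */
        if (++passes < 3)
                longjmp(ctx, 1);   /* reload them: a "switch" only in
                                      the register/stack sense */

        printf("passed setjmp %d times\n", passes);
        return 0;
}

Everything this sketch omits -- run queues, preemption, separate address
spaces -- is where the cost of a real context switch lives, which is
exactly Larry's point.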

2000-10-31 22:27:57

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


A "context" is usually assued to be a "stack". The simplest of all
context switches
is:

mov x, esp
mov esp, y

A context switch can be as short as two instructions, or as big as a TSS
with CR3 hardware switching,

i.e.

ltr ax
jmp task_gate

(500 clocks later)

ts->eip gets exec'd

you can also have a context switch that does an int X where X is a gate
or TSS.

you can also have a context switch (like linux) that does

mov x, esp
mov esp, y
mov z, CR3
mov CR3, w

etc.

In NetWare, a context switch is an in-line assembly language macro that
is 2 instructions long for a stack switch and 4 instructions for a CR3
reload -- this is a lot shorter than Linux. Only EBX, EBP, ESI, and EDI
are saved, and this is never done in the kernel, but is a natural effect
of the Watcom C compiler. There are also strict rules about register
assignments that are enforced between assembler modules in NetWare to
reduce the overhead of a context switch. The code path is very complex
in NetWare, and priorities and all this stuff exist, but these code
paths are segregated so these types of checks only happen once in a
while and check a pre-calc'd "scoreboard" that is read-only across
processors and updated and recalc'd by a timer every 18 ticks.

Jeff





Andi Kleen wrote:
>
> On Tue, Oct 31, 2000 at 02:52:11PM -0700, Jeff V. Merkey wrote:
> > The numbers don't lie. You know where the code is. You notice that
> > there is a version of
> > the kernel hand coded in assembly language. You'll also notice that
> > it's SMP and takes ZERO LOCKS during context switching, in fact, most of
> > the design is completely lockless.
>
> I suspect most of the confusion in this thread comes because you seem to
> use a different definition of context switch than Ingo and others. Could
> you explain what you exactly mean with a context switch ?
>
> -Andi

2000-10-31 22:32:29

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Ingo Molnar wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
>
> > > > > Excuse me, 857,000,000 instructions executed and 460,000,000
> > > > > context switches a second -- on a PII system at 350 Mhz. [...]
> > >
> > > > That's more than one context switch per clock. I do not think so.
> > > > Really go and check those numbers.
> > >
> > > yep, you cannot have 460 million context switches on that system,
> > > unless you have some Clintonesque definition for 'context switch' ;-)
> >
> > The numbers don't lie. [...]
>
> sure ;) I can do infinite context switches! You don't believe? See:
>
> #define schedule() do { } while (0)

Actually, I think the compiler would optimize this statement completely
out of the code.

Jeff

>
> [there is a small restriction, should only be used in single-task
> systems.]
>
> Ingo
>

2000-10-31 22:42:12

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Larry McVoy wrote:
>
> On Tue, Oct 31, 2000 at 03:15:37PM -0700, Jeff V. Merkey wrote:
> > The quality of the networking code in Linux is quite excellent. There are
> > some scaling problems relative to NetWare. We are firmly committed to
> > getting something out with a Linux code base and NetWare metrics. Love
> > to have your help.
>
> Jeff, I'm a little concerned with some of your statements. Netware may
> be the greatest thing since sliced bread, but it isn't a full operating
> system, so comparing it to Linux is sort of meaningless.

It makes more money in a week than Linux has ever made.

> Consider your
> recent context switch claims. Yes, I believe that you can do the moral
> equiv of a longjmp() in the kernel in a few cycles, but that isn't a
> context switch, at least, it isn't the same as a context switch in the
> operating system sense.

A context switch, in an operating-system context, in its simplest form is

mov x, esp
mov esp, y

> It's different - last I checked, Netware was
> essentially a kernel and nothing else.

> Is there a file system?

Several.

> Are there
> processes with virtual memory?

Yes.

> Are they preemptive?

Yes.

> Does it support
> all of P1003.1?

Yes.

> Etc. If the answers to all of the above are "yes"
> and you can support all that and get user to user context switches in a
> clock cycle, well, jeez, you really do walk on water and I'll publicly
> apologize for ever doubting your statements.

Apology accepted.

Jeff


> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2000-10-31 22:45:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

> A "context" is usually assued to be a "stack". [...]

a very clintonesque definition indeed ;-)

what is relevant is the latency to switch from one process to another one.
And this is what we call a context switch. It includes scheduling
decisions and all sorts of other stuff. You are comparing stack &
caller-saved register switching performance (which is just a small part of
context switching and has no standing alone) against full Linux context
switch performance (this is what i quoted), and thus you have won my
'Mindcraft benchmark of the day' award :-)

Ingo
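
(This process-to-process latency, scheduler included, is what lmbench's
lat_ctx reports: for example, "lat_ctx -s 0 2" bounces a token between two
processes through pipes and times the switches.)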

2000-10-31 22:49:22

by Larry McVoy

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 03:38:00PM -0700, Jeff V. Merkey wrote:
> Larry McVoy wrote:
> > On Tue, Oct 31, 2000 at 03:15:37PM -0700, Jeff V. Merkey wrote:
> > > The quality of the networking code in Linux is quite excellent. There are
> > > some scaling problems relative to NetWare. We are firmly committed to
> > > getting something out with a Linux code base and NetWare metrics. Love
> > > to have your help.
> >
> > Jeff, I'm a little concerned with some of your statements. Netware may
> > be the greatest thing since sliced bread, but it isn't a full operating
> > system, so comparing it to Linux is sort of meaningless.
>
> It makes more money in a week than Linux has ever made.

And the relevance of that to this conversation is exactly what?

> A context switch, in an operating-system context, in its simplest form is
>
> mov x, esp
> mov esp, y
>
> > and you can support all that and get user to user context switches in a
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Apology accepted.

No apology was extended. You're spouting nonsense. User to user means
process A in VM 1 switching to process B in VM 2. I'm sorry, Mr Merkey,
but a

mov x, esp
mov esp, y

doesn't begin to approach a user to user context switch. Please go learn
what a user to user context switch is. Then come back when you can do
one of those in a few cycles.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2000-10-31 22:50:02

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



One more optimization it has. NetWare never "calls" functions in the
kernel. There's a template of register assignments in between kernel
modules that's very strict (esi contains a WTD head, edi has the target
thread, etc.) and all function calls are jumps in a linear space.
The layout of all functions is 16-byte aligned for speed, and all
arguments in kernel are passed via registers. It's a level of
optimization no C compiler does -- all of it was done by hand, and most
functions in fast paths are laid out in 512-byte chunks to increase
speed. Stack memory activity in the NetWare kernel is almost
non-existent in almost all of the "fast paths".

Jeff

"Jeff V. Merkey" wrote:
>
> A "context" is usually assued to be a "stack". The simplest of all
> context switches
> is:
>
> mov x, esp
> mov esp, y
>
> A context switch can be as short as two instructions, or as big as a TSS
> with CR3 hardware switching,
>
> i.e.
>
> ltr ax
> jmp task_gate
>
> (500 clocks later)
>
> ts->eip gets exec'd
>
> you can also have a context switch that does an int X where X is a gate
> or TSS.
>
> you can also have a context switch (like linux) that does
>
> mov x, esp
> mov esp, y
> mov z, CR3
> mov CR3, w
>
> etc.
>
> In NetWare, a context switch is an in-line assembly language macro that
> is 2 instructions long for a stack switch and 4 instructions for a CR3
> reload -- this is a lot shorter than Linux.
> Only EBX, EBP, ESI, and EDI are saved and this is never done in the
> kernel, but is a natural
> affect of the Watcom C compiler. There's also strict rules about
> register assignments that re enforced between assembler modules in
> NetWare to reduce the overhead of a context switch. The code path is
> very complex in NetWare, and priorities and all this stuff exists, but
> these code paths are segragated so these types of checks only happen
> once in a while and check a pre-calc'd "scoreboard" that is read only
> across processors and updated and recal'd by a timer every 18 ticks.
>
> Jeff
>
>
>
> Andi Kleen wrote:
> >
> > On Tue, Oct 31, 2000 at 02:52:11PM -0700, Jeff V. Merkey wrote:
> > > The numbers don't lie. You know where the code is. You notice that
> > > there is a version of
> > > the kernel hand coded in assembly language. You'l also noticed that
> > > it's SMP and takes ZERO LOCKS during context switching, in fact, most of
> > > the design is completely lockless.
> >
> > I suspect most of the confusion in this thread comes because you seem to
> > use a different definition of context switch than Ingo and others. Could
> > you explain what you exactly mean with a context switch ?
> >
> > -Andi

2000-10-31 22:52:02

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
> Larry McVoy wrote:

> > Consider your
> > recent context switch claims. Yes, I believe that you can do the moral
> > equiv of a longjmp() in the kernel in a few cycles, but that isn't a
> > context switch, at least, it isn't the same as a context switch in the
> > operating system sense.

> A context switch, in an operating-system context, in its simplest form is
>
> mov x, esp
> mov esp, y

> > processes with virtual memory?
> Yes.

Maybe you could tell us how long the context switch
between processes with virtual memory takes, that
would be a more meaningful comparison...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

2000-10-31 22:52:13

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Ingo Molnar wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
>
> > A "context" is usually assued to be a "stack". [...]
>
> a very clintonesque definition indeed ;-)
>
> what is relevant is the latency to switch from one process to another one.
> And this is what we call a context switch. It includes scheduling
> decisions and all sorts of other stuff. You are comparing stack &
> caller-saved register switching performance (which is just a small part of
> context switching and has no standing alone) against full Linux context
> switch performance (this is what i quoted), and thus you have won my
> 'Mindcraft benchmark of the day' award :-)

Ingo,

It kicks Linux's butt in LAN I/O scaling. It would be nice for Linux to
have an incarnation that's competitive. The only reason people are
still buying NetWare is because nothing out there has come along to
replace it. That is going to change...

Jeff


>
> Ingo

2000-10-31 22:54:52

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Rik van Riel wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
> > Larry McVoy wrote:
>
> > > Consider your
> > > recent context switch claims. Yes, I believe that you can do the moral
> > > equiv of a longjmp() in the kernel in a few cycles, but that isn't a
> > > context switch, at least, it isn't the same as a context switch in the
> > > operating system sense.
>
> > A context switch, in an operating-system context, in its simplest form is
> >
> > mov x, esp
> > mov esp, y
>
> > > processes with virtual memory?
> > Yes.
>
> Maybe you could tell us how long the context switch
> between processes with virtual memory takes, that
> would be a more meaningful comparison...
>

Rik,

I'll bring down a NetWare 5 server, and snapshot the context switch code
from the NetWare debugger and email it to you.

Jeff

> regards,
>
> Rik
> --
> "What you're running that piece of shit Gnome?!?!"
> -- Miguel de Icaza, UKUUG 2000
>
> http://www.conectiva.com/ http://www.surriel.com/

2000-10-31 22:57:02

by Larry McVoy

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 03:47:56PM -0700, Jeff V. Merkey wrote:
> It kicks Linux's butt in LAN I/O scaling.

Really? So, since in a few messages back you claimed that it has a fully
supported userland which implements all of P1003.1 as well as sockets,
obviously, since it is a networking operating system, it should be the
work of a few minutes to download LMbench, compile it, and come back with
the lat_tcp performance numbers which "kicks Linux's butt". Please do so.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2000-10-31 22:57:52

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Larry McVoy wrote:
>
> On Tue, Oct 31, 2000 at 03:38:00PM -0700, Jeff V. Merkey wrote:
> > Larry McVoy wrote:
> > > On Tue, Oct 31, 2000 at 03:15:37PM -0700, Jeff V. Merkey wrote:
> > > > The quality of the networking code in Linux is quite excellent. There are
> > > > some scaling problems relative to NetWare. We are firmly committed to
> > > > getting something out with a Linux code base and NetWare metrics. Love
> > > > to have your help.
> > >
> > > Jeff, I'm a little concerned with some of your statements. Netware may
> > > be the greatest thing since sliced bread, but it isn't a full operating
> > > system, so comparing it to Linux is sort of meaningless.
> >
> > It makes more money in a week than Linux has ever made.
>
> And the relevance of that to this conversation is exactly what?
>
> > A context switch, in an operating-system context, in its simplest form is
> >
> > mov x, esp
> > mov esp, y
> >
> > > and you can support all that and get user to user context switches in a
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Apology accepted.
>
> No apology was extended.

Hypocrite.

You're spouting nonsense. User to user means
> process A in VM 1 switching to process B in VM 2. I'm sorry, Mr Merkey,
> but a
>
> mov x, esp
> mov esp, y
>
> doesn't begin to approach a user to user context switch. Please go learn
> what a user to user context switch is. Then come back when you can do
> one of those in a few cycles.

You have angry fingers (one of my problems). You don't need a user
context switch for kernel paths in a NOS kernel. In NetWare, user
context switches are done in gates in the GDT with TSS descriptors, not
in kernel fast paths with LAN I/O, which isn't what I was talking about.

Jeff

> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2000-10-31 22:58:22

by David Lang

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Jeff, one other thing. Linux is not x86 hand-crafted assembler; it's
capable of running on many platforms. Are you planning on giving up this
capability or hand-crafting the kernel for each chip?

Linux on x86 is nice (and I do use it a lot) but one of the things that
makes it so useful is that when you do outgrow what you can do on an x86
platform you can move to a more powerful platform without having to change
to a different OS.

David Lang

On Tue, 31 Oct 2000,
Jeff V. Merkey wrote:

> Date: Tue, 31 Oct 2000 15:45:45 -0700
> From: Jeff V. Merkey <[email protected]>
> To: Andi Kleen <[email protected]>, [email protected], Pavel Machek <[email protected]>,
> [email protected]
> Subject: Re: 2.2.18Pre Lan Performance Rocks!
>
>
>
> One more optimization it has. NetWare never "calls" functions in the
> kernel. There's a template of register assignments in between kernel
> modules that's very strict (esi contains a WTD head, edi has the target
> thread, etc.) and all function calls are jumps in a linear space.
> The layout of all functions is 16-byte aligned for speed, and all
> arguments in kernel are passed via registers. It's a level of
> optimization no C compiler does -- all of it was done by hand, and most
> functions in fast paths are laid out in 512-byte chunks to increase
> speed. Stack memory activity in the NetWare kernel is almost
> non-existent in almost all of the "fast paths".
>
> Jeff
>
> "Jeff V. Merkey" wrote:
> >
> > A "context" is usually assued to be a "stack". The simplest of all
> > context switches
> > is:
> >
> > mov x, esp
> > mov esp, y
> >
> > A context switch can be as short as two instructions, or as big as a TSS
> > with CR3 hardware switching,
> >
> > i.e.
> >
> > ltr ax
> > jmp task_gate
> >
> > (500 clocks later)
> >
> > ts->eip gets exec'd
> >
> > you can also have a context switch that does an int X where X is a gate
> > or TSS.
> >
> > you can also have a context switch (like linux) that does
> >
> > mov x, esp
> > mov esp, y
> > mov z, CR3
> > mov CR3, w
> >
> > etc.
> >
> > In NetWare, a context switch is an in-line assembly language macro that
> > is 2 instructions long for a stack switch and 4 instructions for a CR3
> > reload -- this is a lot shorter than Linux.
> > Only EBX, EBP, ESI, and EDI are saved and this is never done in the
> > kernel, but is a natural
> > affect of the Watcom C compiler. There's also strict rules about
> > register assignments that re enforced between assembler modules in
> > NetWare to reduce the overhead of a context switch. The code path is
> > very complex in NetWare, and priorities and all this stuff exists, but
> > these code paths are segragated so these types of checks only happen
> > once in a while and check a pre-calc'd "scoreboard" that is read only
> > across processors and updated and recal'd by a timer every 18 ticks.
> >
> > Jeff
> >
> >
> >
> > Andi Kleen wrote:
> > >
> > > On Tue, Oct 31, 2000 at 02:52:11PM -0700, Jeff V. Merkey wrote:
> > > > The numbers don't lie. You know where the code is. You notice that
> > > > there is a version of
> > > > the kernel hand coded in assembly language. You'l also noticed that
> > > > it's SMP and takes ZERO LOCKS during context switching, in fact, most of
> > > > the design is completely lockless.
> > >
> > > I suspect most of the confusion in this thread comes because you seem to
> > > use a different definition of context switch than Ingo and others. Could
> > > you explain what you exactly mean with a context switch ?
> > >
> > > -Andi
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/
>


2000-10-31 22:59:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

> One more optimization it has. NetWare never "calls" functions in the
> kernel. There's a template of register assignments in between kernel
> modules that's very strict (esi contains a WTD head, edi has the target
> thread, etc.) and all function calls are jumps in a linear space.

this might be a win on a i486, but is a loss with any branch predicting,
large-pipeline CPUs (think Pentium IV), which are optimized for CALLs, not
for JMP *EAX instructions. This is the problem with assembly optimizations
that try to compete with the compiler's work: hand-made assembly can only
get worse over time (stay constant in the best case), while compilers are
known to improve slowly but steadily. Plus hand-made assembly is a huge
stone tied to your legs if you try to swim to other architectures. Eg. we
quite often make use of GCC's register-based function parameter passing
optimization. We do use hand-made assembly in a number of cases in Linux
as well, and double-check GCC's assembly output in critical code paths,
but we try to not make it an essential facility.

Ingo
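
A minimal sketch of the register-based parameter passing Ingo mentions,
assuming a 32-bit x86 target (gcc -m32); the function names are made up
for illustration:

/* Build with "gcc -m32 -O2 -S regparm_demo.c" and compare the two call
 * sites: regparm(3) passes the first three integer arguments in EAX,
 * EDX and ECX instead of pushing them on the stack.  noinline keeps
 * the calls visible in the assembly output. */
#define REGPARM3 __attribute__((regparm(3)))
#define NOINLINE __attribute__((noinline))

NOINLINE int add_stack(int a, int b, int c) { return a + b + c; }

NOINLINE REGPARM3 int add_regs(int a, int b, int c) { return a + b + c; }

int main(void)
{
        /* the add_regs() call site loads registers and pushes nothing */
        return add_stack(1, 2, 3) + add_regs(1, 2, 3);
}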

2000-10-31 22:59:43

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


Larry,

What's your mailing address and I'll send you out a legally licensed
copy of NetWare 3.12 and transfer the license to you; then you can do the
comparison and see for yourself.

:-)

Jeff

Larry McVoy wrote:
>
> On Tue, Oct 31, 2000 at 03:47:56PM -0700, Jeff V. Merkey wrote:
> > It kicks Linux's butt in LAN I/O scaling.
>
> Really? So, since in a few messages back you claimed that it has a fully
> supported userland which implements all of P1003.1 as well as sockets,
> obviously, since it is a networking operating system, it should be the
> work of a few minutes to download LMbench, compile it, and come back with
> the lat_tcp performance numbers which "kicks Linux's butt". Please do so.
> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2000-10-31 23:00:52

by Michael H. Warfield

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 02:52:11PM -0700, Jeff V. Merkey wrote:

> Ingo Molnar wrote:
> >
> > On Tue, 31 Oct 2000, Pavel Machek wrote:
> >
> > > > Excuse me, 857,000,000 instructions executed and 460,000,000
> > > > context switches a second -- on a PII system at 350 Mhz. [...]
> >
> > > That's more than one context switch per clock. I do not think so.
> > > Really go and check those numbers.
> >
> > yep, you cannot have 460 million context switches on that system,
> > unless you have some Clintonesque definition for 'context switch' ;-)

> The numbers don't lie. You know where the code is. You notice that
> there is a version of
> the kernel hand coded in assembly language. You'll also notice that
> it's SMP and takes ZERO LOCKS during context switching, in fact, most of
> the design is completely lockless.

Ah ha ha ha!!!! Sure they do! You're just quoting statistics
measured under whatever conditions you imposed.

Numbers lying? I think the famous line has been variously
attributed to either Mark Twain or Disraeli (don't know which really
coined the phrase) but it's been said that there are three kinds of
lies "Lies, Damn Lies, and Statistics". Yes numbers do lie. Sometimes
it's the GIGO law and sometimes its just the fact that if you abuse
statistics long enough they will tell you ANYTHING. Sometimes it merely
the person manipulating^H^H^H^H^H^H^H^H^H^H^H^Hproviding the numbers.

BTW... I was going to stay OUT of this rat trap, but since I'm
in for a dime, I might as well be in for a dollar...

<Minor Rant>

Comparisons have been made between the performance of Linux and
early (3.x, 4.x) versions of Novell. ANYONE who wants to compare Linux
with that bug ridden, unreliable swamp of headaches and security holes
(somewhere in my archives I have the virus launcher that BYPASSES the
3.x login program) should be beaten about the head with a good textbook
on reliable coding techniques. Novell made its heyday by beating the
bejesus out of TCP/IP and others primarily by disabling checksumming,
memory protection, and other reliability techniques. Yes, they got
better performance on low-performance processors, but at what cost? Now
we cover their performance with reliability and superior hardware. I
remember one misguided soul wanting to run IPX over SLIP pleading for help
on the Novell mailing list years ago. Let's see... SLIP eliminates the
MAC layer checksumming and IPX eliminates the error checking on the next
layer up... Yup... There was a recipe for random acts of terrorism.
Now we have PPP (this was in the days BEFORE PPP) and you could do it.
IPX depended on the lower layers for data integrity and SLIP depended
on the layer above it. Ooooppppssss....

Then we had the Novell 5.x NFS server that allowed you to create
scripts that were SUID to root just by making them read only to the
Novell workstations (ok - that's not performance related - I just think
that security should be given a LITTLE thought). I worked at an outfit
(Syntrex) that saw themselves as becoming the "K-Mart of Novell" and I
was told that Novell was the be all and end all of networking and there
was really no future in this antiquated TCP/IP stuff. I was laid off and
given all sorts of nice neat little toys like an AT&T source license
because they saw no future in Unix or TCP/IP. (Bitter - no... I have
had my revenge in spades... They had no clue what they gave away and
let slip through their fingers! :-) )

Now, Novell has been dragged kicking and screaming into the TCP/IP
world, and Novell has been forced to add memory protection (at a performance
cost) to their servers, and the outfit that thought TCP/IP was history
is now history (Syntrex went Chapter 13 about 10 years ago), and I've had
the pleasure of slamming one particularly simple minded Novell rep (another
ex Syntrex inDUHvidual) with more than one security hole (the perl module
on the Novell web server was an absolute classic).

My point here is that packets per second don't mean jack shit if
you can't do it reliably and you can't do it securely. Novell failed on
both of those counts and those are a contributing factor in their current
troubles. They built their reputation on performance that was achieved at
the expense of reliability and security. Now they have to play with the
big boys and all the nasty kiddies out there who don't play nice.

Performance is important. Performance is desirable. Efforts to
improve performance are worthwhile. But performance should NEVER come at
the cost of security or reliability or integrity. Comparisons with high
performance systems which lacked security, reliability, and data integrity
are suspect AT BEST. We should NEVER give up the quest for better
performance but comparisons to an inferior operating system which can pump
out packets faster than us is not the threat some people would like it to be.

</Minor Rant>

My regards and respects to Jeff. He says he was responsible for the
Novell 4.x and 5.x systems. I note that he omitted the 3.x OS. Acknowledged
and respected! In my earlier days, I was the kernel jock responsible for a
proprietary version of XENIX and worked on Microport UNIX (may someone
forever drive a stake through that bastard's heart) and SCO Unix. I'm
"contaminated goods" for certain projects so I can't contribute to certain
applications like Taylor UUCP because I have legal source code to HDB UUCP
(as if UUCP means jack in today's world). Does the current Taylor UUCP have
the most efficient possible checksumming algorithm for the UUCP 'g' protocol?
No... No way... I've seen one that stomps Taylor UUCP's ass and takes names
from the AT&T SVR5 release 3.2 source tapes (took me a week to figure out just
how it worked - damn those comment strippers). Does it matter? Not one bit.

I want to see Linux excel. Does that mean that it has to beat
every benchmark set by an operating system that cut every corner that
would cause Linus to turn into a Quake Balrog? I think not.

> Jeff

> > Ingo

Mike
--
Michael H. Warfield | (770) 985-6132 | [email protected]
(The Mad Wizard) | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0xDF1DD471 | possible worlds. A pessimist is sure of it!

2000-10-31 23:01:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

> It kicks Linux's butt in LAN I/O scaling. [...]

brain cacheflush? Restart the same thread? Sorry i've got better things to
do.

Ingo

2000-10-31 23:01:32

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



David Lang wrote:
>
> Jeff, one other thing. Linux is not x86 hand-crafted assembler; it's
> capable of running on many platforms. Are you planning on giving up this
> capability or hand-crafting the kernel for each chip?
>
> Linux on x86 is nice (and I do use it a lot) but one of the things that
> makes it so useful is that when you do outgrow what you can do on an x86
> platform you can move to a more powerful platform without having to change
> to a different OS.

How about a NetWare-like NOS with all the application support of Linux?
I think a lot of people will love it. Novell's largest customers have
told me they want to see it, and would deploy it. I do believe there's
a market for such a beast. A very lucrative one...

Jeff

> {rest of the quoted thread trimmed}

2000-10-31 23:02:22

by Alan

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> One more optimization it has. NetWare never "calls" functions in the
> kernel. There's a template of register assignments in between kernel
> modules that's very strict (esi contains a WTD head, edi has the target
> thread, etc.) and all function calls are jumps in a linear space.

What if I jump to an invalid address - does it crash ?

2000-10-31 23:03:32

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Ingo Molnar wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
>
> > One more optimization it has. NetWare never "calls" functions in the
> > kernel. There's a template of register assignments in between kernel
> > modules that's very strict (esi contains a WTD head, edi has the target
> > thread, etc.) and all function calls are jumps in a linear space.
>
> this might be a win on a i486, but is a loss with any branch predicting,
> large-pipeline CPUs (think Pentium IV), which are optimized for CALLs, not
> for JMP *EAX instructions. This is the problem with assembly optimizations
> that try to compete with the compiler's work: hand-made assembly can only
> get worse over time (stay constant in the best case), while compilers are
> known to improve slowly but steadily. Plus hand-made assembly is a huge
> stone tied to your legs if you try to swim to other architectures. Eg. we
> quite often make use of GCC's register-based function parameter passing
> optimization. We do use hand-made assembly in a number of cases in Linux
> as well, and double-check GCC's assembly output in critical code paths,
> but we try to not make it an essential facility.

It's hand-optimized for all these cases to one of the highest degrees
that exist anywhere in the world. Intel and Novell have been in bed
together for a very long time...

Jeff

>
> Ingo

2000-10-31 23:06:12

by David Lang

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

I don't doubt it. A port of NetWare that can run Linux apps would be very
useful to people who want to run NetWare, but this is not the same thing
as what it has sounded like you were working on.

David Lang

On Tue, 31 Oct 2000, Jeff
V. Merkey wrote:

> Date: Tue, 31 Oct 2000 15:57:23 -0700
> From: Jeff V. Merkey <[email protected]>
> To: David Lang <[email protected]>
> Cc: [email protected]
> Subject: Re: 2.2.18Pre Lan Performance Rocks!
>
>
>
> David Lang wrote:
> >
> > Jeff, one other thing. Linux is not x86 hand-crafted assembler; it's
> > capable of running on many platforms. Are you planning on giving up this
> > capability or hand-crafting the kernel for each chip?
> >
> > Linux on x86 is nice (and I do use it a lot) but one of the things that
> > makes it so useful is that when you do outgrow what you can do on an x86
> > platform you can move to a more powerful platform without having to change
> > to a different OS.
>
> How about a NetWare-like NOS with all the application support of Linux?
> I think a lot of people will love it. Novell's largest customers have
> told me they want to see it, and would deploy it. I do believe there's
> a market for such a beast. A very lucrative one...
>
> Jeff
>
> > {rest of the quoted thread trimmed}

2000-10-31 23:20:25

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



"Richard B. Johnson" wrote:
>

Dick,

In NetWare this:

>
> One could create a 'kernel' that does:
> for(;;)
> {
> proc0();
> proc1();
> proc2();
> proc3();
> etc();
> }

would be coded like this (no C compiler):

proc0:

proc1:

proc2:

proc3:

etc:

label:
jmp proc0


I just avoided 5 x 20 bytes of pushes and pops on the stack and optimized
for a simple fall-through case.

:-)

Jeff

>
> Cheers,
> Dick Johnson
>
> Penguin : Linux version 2.2.17 on an i686 machine (801.18 BogoMips).
>
> "Memory is like gasoline. You use it up when you are running. Of
> course you get it all back when you reboot..."; Actual explanation
> obtained from the Micro$oft help desk.

2000-10-31 23:21:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

> [...] These types of optimizations are possible when people have
> access to Intel Red Cover documents, [...]

optimizing away AGIs has been documented by public Intel PDFs for years:

[...] Since the Pentium processor has two integer pipelines, a register
used as the base or index component of an effective address calculation
(in either pipe) causes an additional clock cycle if that register is the
destination of either instruction from the immediately preceding clock
cycle. This effect is known as Address Generation Interlock (AGI).

(ditto for the p6 core CPUs), and GCC (IIRC) tries to avoid AGI conflicts
as much as possible. (and this kind of stuff belongs into the compiler)

Ingo

2000-10-31 23:21:25

by Nathan Paul Simons

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 03:38:00PM -0700, Jeff V. Merkey wrote:
> It makes more money in a week than Linux has ever made.

The same could be said about Windows; that doesn't make it a
technically superior solution.
Speaking of Windows, a lot of your arguments are starting to sound
more and more like arguments made a while back by a certain OS vendor from
Seattle . . .

Wish not to seem, but to be, the best.
-- Aeschylus

--
Nathan Paul Simons, Junior Software Engineer for FSMLabs
http://www.fsmlabs.com/

2000-10-31 23:22:05

by Matti Aarnio

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 01:36:32PM -0800, Paul Menage wrote:
> On Tue, 31 Oct 2000, Rik van Riel wrote:
> >Ummm, last I looked Linux held the Specweb99 record;
> >by a wide margin...
>
> ... but since then IBM/Zeus appear to have taken the lead:
>
> http://www.spec.org/osg/web99/results/res2000q3/
>
> But they were using a somewhat beefier machine - has anyone got Tux
> SpecWeb99 figures for a 12 CPU, 64 GB, 12 NIC system?

Good grief, what monster hardware...

Those are (of course) system results which give some impression of
how much users can pull out of the box.

Trying to make them a bit more comparable, scaling the number with
the number of processors:

Zeus 12x600MHz IBM RS64-III 7288 SpecWEB99 ~ 607 SpecWEB99/CPU
Zeus 4x375MHz IBM Power3-II 2175 SpecWEB99 ~ 544 SpecWEB99/CPU
TUX 1.0 8x700MHz Pentium-III-Xeon 6387 SpecWEB99 ~ 798 SpecWeb99/CPU
IIS 2x800MHz Pentium-III-Xeon 1060 SpecWEB99 ~ 530 SpecWEB99/CPU
IIS 1x700MHz Pentium-III-Xeon 971 SpecWEB99 = 971 SpecWEB99/CPU

Ok, more workers to do the thing, but each can achieve a bit less in
the IBM/Zeus case than TUX 1.0. The smaller IBM/Zeus test case with
older and slower processors yields almost as good results per CPU as
the big one. CPU clock speed increase has been lost into inter-CPU
collisions? (that is, bad scaling)

The IIS results are also interesting in their own right. Single-CPU IIS
yields an impressive PER CPU result, but adding a second CPU is apparently
a quite useless exercise. Hmm... Can't be.. As if that DUAL CPU
result is actually run in single-CPU mode. The difference can
directly be explained by the clock rate difference..
(Surely the runners of that test *can't* make such an elementary
mistake!)


To be able to compare apples and apples, I would like to see single,
and dual CPU SpecWEB99 results with TUX. Then that apparent 20%
better "per CPU result" of the single-CPU IIS could not be explained
away with SMP inter-CPU communication overhead/collisions.

> Paul

/Matti Aarnio

2000-10-31 23:22:15

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Ingo Molnar wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
>
> > [...] These types of optimizations are possible when people have
> > access to Intel Red Cover documents, [...]
>
> optimizing away AGIs has been documented by public Intel PDFs for years:
>
> [...] Since the Pentium processor has two integer pipelines, a register
> used as the base or index component of an effective address calculation
> (in either pipe) causes an additional clock cycle if that register is the
> destination of either instruction from the immediately preceding clock
> cycle. This effect is known as Address Generation Interlock (AGI).
>
> (ditto for the p6 core CPUs), and GCC (IIRC) tries to avoid AGI conflicts
> as much as possible. (and this kind of stuff belongs into the compiler)

Odd. When I profile Linux with EMON, I see tons of them. Anywhere code
does

mov eax, addr
mov [addr], ebx

one will get generated. Even something as simple as:

mov eax, addr
inc eax
dec eax
mov [addr], ebx

will avoid an AGI (since the other pipeline has now been allowed a few
clocks to fetch the address and load it).

Jeff



>
> Ingo

2000-10-31 23:22:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

> > One could create a 'kernel' that does:
> > for(;;)
> > {
> > proc0();
> > proc1();
> > proc2();
> > proc3();
> > etc();
> > }
>
> would be coded like this (no C compiler):
>
> proc0:
>
> proc1:
>
> proc2:
>
> proc3:
>
> etc:
>
> label:
> jmp proc0

oh, and what happens if it turns out that some other place wants to call
proc3 as well? Recode the assembly - cool! Not.

> I just avoided 5 x 20 bytes of pushes and pops on the stack and optimized
> for a simple fall-through case.

FYI, GCC does not generate 5 x 20 bytes of pushes and pops. In fact in the
above specific case it will not generate a single push (automatically -
you don't have to worry about it).

Ingo
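
Ingo's claim is easy to check; a sketch with made-up function names,
inspected via "gcc -O2 -S":

#define NOINLINE __attribute__((noinline))

/* argument-less functions: GCC pushes nothing at these call sites */
NOINLINE void proc0(void) {}
NOINLINE void proc1(void) {}
NOINLINE void proc2(void) {}

void run(void)
{
        proc1();
        proc2();
        proc0();   /* tail position: emitted as "jmp proc0" at -O2
                      (sibling-call optimization), no push, no ret */
}

int main(void)
{
        run();
        return 0;
}

The jmp in run() is the same fall-through Jeff hand-codes, produced
automatically by the compiler.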

2000-10-31 23:24:35

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Nathan Paul Simons wrote:
>
> On Tue, Oct 31, 2000 at 03:38:00PM -0700, Jeff V. Merkey wrote:
> > It makes more money in a week than Linux has ever made.
>
> The same could be said about Windows; that doesn't make it a
> technically superior solution.
> Speaking of Windows, a lot of your arguments are starting to sound
> more and more like arguments made a while back by a certain OS vendor from
> Seattle . . .

Not really. We ship Linux. I just want a Linux that NetWare customers
won't laugh at when they try to put over 1000 people on it (a load the
NetWare server we are trying to replace was already handling).

Jeff

>
> Wish not to seem, but to be, the best.
> -- Aeschylus
>
> --
> Nathan Paul Simons, Junior Software Engineer for FSMLabs
> http://www.fsmlabs.com/

2000-10-31 23:27:55

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Ingo Molnar wrote:
>
> On Tue, 31 Oct 2000, Jeff V. Merkey wrote:
>
> > > One could create a 'kernel' that does:
> > > for(;;)
> > > {
> > > proc0();
> > > proc1();
> > > proc2();
> > > proc3();
> > > etc();
> > > }
> >
> > would be coded like this (no C compiler):
> >
> > proc0:
> >
> > proc1:
> >
> > proc2:
> >
> > proc3:
> >
> > etc:
> >
> > label:
> > jmp proc0
>
> oh, and what happens if it turns out that some other place wants to call
> proc3 as well? Recode the assembly - cool! Not.

They would load the registers to the proper values and jump to it. The
return address would be stored in the ESI register, and when the routine
completed, it would do

jmp esi

to return, with 0 stack usage.

>
> > I just avoided 5 x 20 bytes of pushes and pops on the stack and optimized
> > for a simple fall-through case.
>
> FYI, GCC does not generate 5 x 20 bytes of pushes and pops. In fact in the
> above specific case it will not generate a single push (automatically -
> you don't have to worry about it).

If the compiler were set to optimize.

Jeff

>
> Ingo

2000-10-31 23:33:55

by Roger Larsson

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

"Jeff V. Merkey" wrote:
>
> David/Alan,
>
> Andre Hedrick is now the CTO of TRG and Chief Scientist over Linux
> Development. After talking
> to him, we are going to do our own ring 0 2.4 and 2.2.x code bases for
> the MANOS merge.
> uClinux is interesting, but I agree it is limited.
>

Jeff,

What would be missing in this approach:
* Using MontaVista's "fully" preemptible kernel.
* Using kernel threads for all services (File, Print, Web, etc.).

/RogerL

--
Home page:
http://www.norran.net/nra02596/

2000-10-31 23:37:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

> Odd. When I profile Linux with EMON, I see tons of them. Anywhere code
> does
>
> mov eax, addr
> mov [addr], ebx

AGIs were a real problem on P5 class Intel CPUs. On P6 core CPUs, most
forms of addresses (except memory writes) do not generate any AGIs. And
the AGI on P6 cores does not hold up the pipeline, unless you reuse the
same address (which would be stupid in most cases). I bet Crusoes have no
AGIs at all. Do you see the trend in CPU design?

Ingo



2000-10-31 23:39:45

by David Weinehall

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Wed, Nov 01, 2000 at 01:21:03AM +0200, Matti Aarnio wrote:
> On Tue, Oct 31, 2000 at 01:36:32PM -0800, Paul Menage wrote:
> > On Tue, 31 Oct 2000, Rik van Riel wrote:
> > >Ummm, last I looked Linux held the Specweb99 record;
> > >by a wide margin...
> >
> > ... but since then IBM/Zeus appear to have taken the lead:
> >
> > http://www.spec.org/osg/web99/results/res2000q3/
> >
> > But they were using a somewhat beefier machine - has anyone got Tux
> > SpecWeb99 figures for a 12 CPU, 64 GB, 12 NIC system?
>
> Good grief, what monster hardware...
>
> Those are (of course) system results which give some impression of
> how much users can pull out of the box.
>
> Trying to make them a bit more comparable, scaling the number with
> the number of processors:
>
> Zeus 12x600MHz IBM RS64-III 7288 SpecWEB99 ~ 607 SpecWEB99/CPU
> Zeus 4x375MHz IBM Power3-II 2175 SpecWEB99 ~ 544 SpecWEB99/CPU
> TUX 1.0 8x700MHz Pentium-III-Xeon 6387 SpecWEB99 ~ 798 SpecWeb99/CPU
> IIS 2x800MHz Pentium-III-Xeon 1060 SpecWEB99 ~ 530 SpecWEB99/CPU
> IIS 1x700MHz Pentium-III-Xeon 971 SpecWEB99 = 971 SpecWEB99/CPU
>
> Ok, more workers to do the thing, but each can achieve a bit less in
> the IBM/Zeus case than TUX 1.0. The smaller IBM/Zeus test case with
> older and slower processors yields almost as good results per CPU as
> the big one. CPU clock speed increase has been lost into inter-CPU
> collisions? (that is, bad scaling)
>
> The IIS results are also interesting in their own right. Single-CPU IIS
> yields an impressive PER CPU result, but adding a second CPU is apparently
> a quite useless exercise. Hmm... Can't be.. As if that DUAL CPU
> result is actually run in single-CPU mode. The difference can
> directly be explained by the clock rate difference..
> (Surely the runners of that test *can't* make such an elementary
> mistake!)
>
>
> To be able to compare apples and apples, I would like to see single,
> and dual CPU SpecWEB99 results with TUX. Then that apparent 20%
> better "per CPU result" of the single-CPU IIS could not be explained
> away with SMP inter-CPU communication overhead/collisions.

You mean like:

TUX 1.0 1x667MHz Pentium-IIIEB 1270 SpecWeb99
TUX 1.0 2x800MHz Pentium-III-Xeon 2200 SpecWeb99
TUX 1.0 4x700MHz Pentium-III-Xeon 4200 SpecWeb99

(Check out quarter 2 instead of q3)

Truly impressive figures imho.


/David Weinehall
_ _
// David Weinehall <[email protected]> /> Northern lights wander \\
// Project MCA Linux hacker // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </

2000-10-31 23:41:55

by Davide Libenzi

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Wed, 01 Nov 2000, Jeff V. Merkey wrote:
>
> mov eax, addr
> mov [addr], ebx
>

Probably you mean this:

mov r/imm, %eax
mov (%eax), %ebx


- Davide
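
(In the Intel syntax used elsewhere in the thread, Davide's corrected pair
reads roughly "mov eax, imm" followed by "mov ebx, [eax]": the second
instruction uses eax as a base register right after it is written, which
is exactly the pattern Intel's manuals describe as an AGI.)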

2000-10-31 23:45:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


On Tue, 31 Oct 2000, Richard B. Johnson wrote:

> However, these techniques are not useful with a kernel that has an
> unknown number of tasks that execute 'programs' that are not known to
> the kernel at compile-time, such as a desk-top operating system.

yep, exactly. It simply optimizes the wrong thing and restricts
architectural flexibility. It is very easy to optimize by making
a system more specific (this in fact is more or less automatic
engineering work). The real optimizations are the ones that do not
take away from the generic nature of the system.

Ingo

2000-11-01 00:01:59

by Michael H. Warfield

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 04:20:30PM -0700, Jeff V. Merkey wrote:

> Nathan Paul Simons wrote:

> > On Tue, Oct 31, 2000 at 03:38:00PM -0700, Jeff V. Merkey wrote:
> > > It makes more money in a week than Linux has ever made.

> > The same could be said about Windows; that doesn't make it a
> > technically superior solution.
> > Speaking of Windows, a lot of your arguments are starting to sound
> > more and more like arguments made a while back by a certain OS vendor from
> > Seattle . . .

> Not really. We ship Linux. I just want a Linux that NetWare customers
> won't laugh at when they try to put over 1000 people on it (a load the
> NetWare server we are trying to replace was already handling).

Oh! That's funny!

Back in "the bad old days" Novell got laughed right off several
campuses for this very reason. Why? Because Netware, at the time, could
not handle more than 256 users on a given server. One admin I knew
here at Georgia Tech reported on the expression on the face of the Novell
presales support person when he was informed that the average public
server had several thousand accounts.

Netware customers that are worried about more than a few hundred
users must be fairly recent (4.x and above - 3.x has come into the
discussion but doesn't count here) customers. Obviously, they are big and
SIGNIFICANT customers. Do we know that Linux can't handle the load, though,
or is this just more supposition based on statistics?

> Jeff

> >
> > Wish not to seem, but to be, the best.
> > -- Aeschylus
> >
> > --
> > Nathan Paul Simons, Junior Software Engineer for FSMLabs
> > http://www.fsmlabs.com/

Mike
--
Michael H. Warfield | (770) 985-6132 | [email protected]
(The Mad Wizard) | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0xDF1DD471 | possible worlds. A pessimist is sure of it!

2000-11-01 00:05:50

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Alan Cox wrote:
>
> > One more optimization it has. NetWare never "calls" functions in the
> > kernel. There's a template of register assignments in between kernel
> > modules that's very strict (esi contains a WTD head, edi has the target
> > thread, etc.) and all function calls are jumps in a linear space.
>
> What if I jump to an invalid address - does it crash?

The jumps are all to defined labels in the code, so there's never one
that's invalid. The only exceptions are the jump tables in the MSM and
TSM lan modules, and yes, it does crash if a bad driver gets loaded, but
that's the same as Linux. NetWare avoids using jump tables since they
cause an AGI to be generated, which will interlock the processor
pipelines. If you load an address on Intel, then immediately attempt to
use it, the processor will generate an Address Generation Interlock
(AGI) that interlocks the pipelines until the request can be serviced.
This costs 2 clocks per jump, plus it shuts down the parallelism in the
pipelines.
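
To make the pattern concrete, here is a hedged sketch of the kind of
jump-table dispatch being avoided (the table and names are hypothetical,
not from NetWare or Linux source):

	/* the compiler loads a handler address and uses it for control
	 * flow right away - the load-then-use pattern behind the AGI
	 * stalls described above, e.g. roughly
	 * "movl handler_table(,%eax,4),%ecx ; call *%ecx" */
	typedef void (*handler_t)(void);

	extern handler_t handler_table[16];	/* hypothetical table */

	void dispatch(unsigned int idx)
	{
		handler_table[idx & 15]();
	}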

Jeff

2000-11-01 00:06:00

by Richard B. Johnson

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, 31 Oct 2000, Jeff V. Merkey wrote:

>
> A "context" is usually assued to be a "stack". The simplest of all
> context switches
> is:
>
> mov x, esp
> mov esp, y
>
> A context switch can be as short as two instructions, or as big as a TSS
> with CR3 hardware switching,
>
> i.e.
>
> ltr ax
> jmp task_gate
>
> (500 clocks later)
>
> ts->eip gets exec'd
>
> you can also have a context switch that does an int X where X is a gate
> or TSS.
>
> you can also have a context switch (like linux) that does
>
> mov x, esp
> mov esp, y
> mov z, CR3
> mov CR3, w
>
> etc.
>
> In NetWare, a context switch is an in-line assembly language macro that
> is 2 instructions long for a stack switch and 4 instructions for a CR3
> reload -- this is a lot shorter than Linux. Only EBX, EBP, ESI, and EDI
> are saved, and this is never done in the kernel, but is a natural
> effect of the Watcom C compiler. There are also strict rules about
> register assignments that are enforced between assembler modules in
> NetWare to reduce the overhead of a context switch. The code path is
> very complex in NetWare, and priorities and all this stuff exist, but
> these code paths are segregated so these types of checks only happen
> once in a while and check a pre-calculated "scoreboard" that is read-only
> across processors and updated and recalculated by a timer every 18 ticks.
>
> Jeff
>
>

I have this feeling that this is an April Fools joke. Unfortunately
it's Halloween.

One could create a 'kernel' that does:
	for (;;) {
		proc0();
		proc1();
		proc2();
		proc3();
		etc();
	}

... and loop forever. All 'tasks' would just be procedures and no
context-switching would even be necessary. This is how some network
file-servers worked in the past (Vines comes to mind). Since all
possible 'tasks' are known at compile-time, there isn't even any
need for memory protection because every task cooperates and doesn't
destroy anything that it doesn't own.

The only time you need to save anything is for interrupt handlers.
That was just some simple push/pops of only the registers actually
used in the ISR.

Now, the above example may seem absurd; however, it's not. Inside
each of the proc()'s is a global state-variable that allows the
code to start executing at the place it left off the last time
through. If the code was written in 'C' it would be a 'switch'
statement. The state-variable for each of the procedures is global
and can be changed in an interrupt-service-routine. This allows
interrupts to change the state of the state-machines.
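
A minimal sketch of the switch-based state machine just described (the
states and names here are hypothetical, purely to show the shape):

	/* one 'task': runs one step per pass through the main loop */
	enum rx_state { RX_IDLE, RX_HAVE_PACKET, RX_REPLY_PENDING };

	/* volatile: an interrupt handler may advance this state */
	static volatile enum rx_state rx_state = RX_IDLE;

	void rx_proc(void)
	{
		switch (rx_state) {
		case RX_IDLE:
			return;			/* nothing to do this pass */
		case RX_HAVE_PACKET:
			/* parse the request, queue a reply ... */
			rx_state = RX_REPLY_PENDING;
			return;
		case RX_REPLY_PENDING:
			/* push the reply out onto the wire ... */
			rx_state = RX_IDLE;
			return;
		}
	}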

This kind of 'kernel' is very fast and very effective for things
like file-servers and routers, things that do the same stuff over
and over again.

However, these techniques are not useful with a kernel that
has an unknown number of tasks that execute 'programs' that are
not known to the kernel at compile-time, such as a desk-top
operating system.

These operating systems require context-switching. This requires
that every register that a user could possibly alter be saved
and restored. It also requires that the state of any hardware
that a user could be using also be saved and restored. This
cannot be done in 2 instructions as stated. Further, this saving
and restoring cannot be a side-effect of a particular compiler, as
stated.

Cheers,
Dick Johnson

Penguin : Linux version 2.2.17 on an i686 machine (801.18 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


2000-11-01 00:09:59

by Alan

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> users must be fairly recent (4.x and above - 3.x has come into the
> discussion but doesn't count here) customers. Obviously, they are big and
> SIGNIFICANT customers. Do we know that Linux can't handle the load, though,
> or is this just more supposition based on statistics?

On the same hardware netware 3 at least tended to beat us flat, but then it
wasn't a general-purpose OS. I think what Jeff is trying to build is basically
a box that runs netware in the netware 3/4 style - i.e. fast and a little
unprotected, with a standard protected-mode linux application space on top of
it - it's an interesting concept.

2000-11-01 00:14:59

by Michael H. Warfield

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Wed, Nov 01, 2000 at 12:07:50AM +0000, Alan Cox wrote:
> > users must be fairly recent (4.x and above - 3.x has come into the
> > discussion but doesn't count here) customers. Obviously, they are big and
> > SIGNIFICANT customers. Do we know that Linux can't handle the load,
> > though, or is this just more supposition based on statistics?

> On the same hardware netware 3 at least tended to beat us flat, but then it
> wasn't a general-purpose OS. I think what Jeff is trying to build is
> basically a box that runs netware in the netware 3/4 style - i.e. fast and
> a little unprotected, with a standard protected-mode linux application
> space on top of it - it's an interesting concept.

Alan, I'll grant that what Jeff is attempting to do is laudable.
It's just that I have experience with Netware 3.x and 4.x in the field
(Ok... more 3.x than 4.x) and I have more than a passing familiarity with
5.x security problems. I know he wants to achieve some of the good things
that Novell managed. I know, from first-hand experience, what costs some
of those come at. Can he achieve the goal of matching the good while
avoiding the bad? I hope so. He won't do it by promoting what he
perceives as the advantages of Novell over Linux without at least some
passing acknowledgement of the failures of Novell, the disadvantages, and
the costs. I think his goals are good and his head is in the right spot;
I just don't want to see history repeat itself. He has my best wishes,
since both he and Linux will benefit.

My $0.02.

Mike
--
Michael H. Warfield | (770) 985-6132 | [email protected]
(The Mad Wizard) | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0xDF1DD471 | possible worlds. A pessimist is sure of it!

2000-11-01 01:31:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

> Larry McVoy wrote:
>> Are there processes with virtual memory?
On Tue, Oct 31, 2000 at 03:38:00PM -0700, Jeff V. Merkey wrote:
> Yes.

If that stack switch is your context switch then you share the same VM for all
tasks. I think the above answer "yes" just means you have pagetables so you can
swap, but you _must_ miss memory protection across different processes. That
also means any program can corrupt the memory of all the other programs. Even
on the Palm that's a showstopper limitation (and on the Palm that's a hardware
limitation, not a software deficiency of PalmOS).

That will never happen in linux, nor in windows, nor internally to kde2. It
happens in uclinux to deal with hardware without an MMU. And in fact the
Agenda uses a MIPS CPU with memory protection even on an organizer, with
obvious advantages.

Just think: kde2 could have all the kde apps sharing the same VM, skipping
all the tlb flushes, by simply using clone instead of fork. Guess why they
aren't doing that? And even if they did, the first bug would _only_
destabilize kde, so kill it and restart it and everything else will keep
running fine (you don't even need to kill X). With your ring 0 linux
_everything_ will crash, not just kde.
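
A minimal userspace sketch of the clone-versus-fork point (nothing to do
with kde2's actual code; the stack size and names are made up):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/wait.h>

	static int shared;	/* one copy, visible to both tasks */

	static int child(void *arg)
	{
		shared = 42;	/* scribbles directly on the parent's memory */
		return 0;
	}

	int main(void)
	{
		char *stack = malloc(65536);
		int pid;

		/* CLONE_VM shares the address space (no tlb flush on switch,
		 * and no protection); plain fork() would copy it instead */
		pid = clone(child, stack + 65536, CLONE_VM | SIGCHLD, NULL);
		waitpid(pid, NULL, 0);
		printf("shared = %d\n", shared);	/* prints 42 */
		return 0;
	}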

And on sane architectures like alpha you don't even need to flush the TLB
during "real" context switching, so all your worry about sharing the same VM
for everything is almost irrelevant there, since it happens all the time
anyway (until you overflow the available ASN bits, which takes a lot of forks
to happen).

So IMHO it's much saner for you to move all your performance-critical code
into kernel space (that will be only as stability-risky as khttpd and tux
are). In 2.4.x that will avoid all the cr3 reloads, and that will be enough,
since what you really care about during fileserving is avoiding the copies.

Andrea

2000-11-01 01:34:14

by Horst von Brand

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Jesse Pollard <[email protected]> said:

[...]

> Also pay attention to the security aspects of a true "zero copy" TCP stack.
> It means that SOMETIMES a user buffer will receive data that is destined
> for a different process.

Why? AFAIKS, given proper handling of the issues involved, this can't
happen (sure, it can get tricky, but it can be done in principle; or am I
off-base?)
--
Horst von Brand [email protected]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616

2000-11-01 02:31:21

by Horst von Brand

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

"Jeff V. Merkey" <[email protected]> said:
> One more optimization it has. NetWare never "calls" functions in the
> kernel. There's a template of register assignments in between kernel
> modules that's very strict (esi contains a WTD head, edi has the target
> thread, etc.) and all function calls are jumps in a linear space. The
> layout of all functions is 16-byte aligned for speed, and all
> arguments in the kernel are passed via registers. It's a level of
> optimization no C compiler does -- all of it was done by hand, and most
> functions in fast paths are laid out in 512-byte chunks to increase
> speed. Stack memory activity in the NetWare kernel is almost
> non-existent in almost all of the "fast paths".

Nice! Now run that (i386-optimized?) beast on a machine that works
differently (the latest K7s perhaps?), and many optimizations break.

When you get that fixed, would you please port it to Alpha?

Sure, using C (with a not-overly-bright compiler) has a non-negligible
cost. But huge benefits too. The whole of software (including OS) design is
an exercise in delicate juggling among conflicting goals. Had Linus gone
down the "all-assembler, bummed to its limits" route, Linux would have been
dead by 0.03 or so, depending on the stubbornness of its creator to be sure.
--
Horst von Brand [email protected]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616

2000-11-01 03:52:48

by Jesse Pollard

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, 31 Oct 2000, Horst von Brand wrote:
>Jesse Pollard <[email protected]> said:
>
>[...]
>
>> Also pay attention to the security aspects of a true "zero copy" TCP stack.
>> It means that SOMETIMES a user buffer will receive data that is destined
>> for a different process.
>
>Why? AFAIKS, given proper handling of the issues involved, this can't
>happen (sure, it can get tricky, but it can be done in principle; or am I
>off-base?)

As I understand the current implementation, this can't happen. One of the
optimizations I had read about (for a linux test) used zero copy to/from the
user buffer as well as zero copy in the kernel. I believe the DMA went
directly to the user's memory.

This causes a problem when/if there is a context switch before the data is
actually transferred to the proper location. The buffer isn't ready for use,
but could be examined by the user application (hence the security problem).

It was posed that this is not a problem IF the cluster (and it was a beowulf
cluster under discussion) is operated in a single-user, dedicated mode, in
which case examining the buffer would either be a bug in the program or a
debugger looking at the buffer directly.

To my knowledge, zero copy is only done between device and kernel. Userspace
has to go through a buffer copy (one on input into user space, one on output
from user space) for all IP handling. All checksums are either done by the
device, or done without copying the data.

--
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2000-11-01 05:02:01

by Peter Samuelson

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


[Jeff Merkey]
> > > The numbers don't lie. [...]
> >
[Ingo Molnar]
> > sure ;) I can do infinite context switches! You dont believe? See:
> >
> > #define schedule() do { } while (0)

[Jeff]
> Actually, I think the compiler would optimize this statement
> completely out of the code.

That was Ingo's point. He is doing infinite context switches per
second, where a context switch is defined as "schedule()", because he
turned it into a noop.

I.e. the numbers can easily be made to lie, by playing with the rules
of the game.

The point of confusion, Jeff, is that to *you* a context switch means
stack switch, with no baggage like scheduling or reloading registers.
Everyone else here thinks of a context switch as meaning a pre-emptive
switch between two unrelated processes -- which as you know involves
not only the stack, but MM adjustments, registers, floating point
registers (expensive on pre-P6), IP, and some form of scheduling.

Obviously some of these can be optimized out if you can make
assumptions about the processes: you might drop memory protection if
you like the stability of Windows 95, floating point if you can get
away with telling people they can't use it, maybe use FIFO scheduling
if you don't care about fairness and you know the processes are more or
less uniform. Linux cannot make any of these assumptions -- it is far
too general-purpose.

In Linux, in fact, jumping from ring 3 to ring 0 (ie system call) is
not considered a context switch. I suppose you would consider it one.
So the real question is, how many gettimeofday() per sec can Linux do?

Peter

2000-11-01 05:09:40

by Larry McVoy

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 11:01:20PM -0600, Peter Samuelson wrote:
> So the real question is, how many gettimeofday() per sec can Linux do?

Oh, about 3,531,073 on a 1GHz AMD Thunderbird running
Linux disks.bitmover.com 2.4.0-test5.

That's 283.2 nanoseconds per call, to save you the math.
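
For anyone wanting to reproduce that kind of number, a rough sketch of the
measurement (not Larry's actual harness; lmbench is the real tool, and the
5-second window here is arbitrary):

	#include <stdio.h>
	#include <sys/time.h>

	int main(void)
	{
		struct timeval start, now;
		long calls = 0;

		gettimeofday(&start, NULL);
		do {
			gettimeofday(&now, NULL);
			calls++;
		} while (now.tv_sec - start.tv_sec < 5);

		/* each loop iteration is one syscall */
		printf("%ld calls/sec (~%.1f ns/call)\n",
		       calls / 5, 5e9 / calls);
		return 0;
	}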
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2000-11-01 05:20:58

by Peter Samuelson

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


[me]
> > So the real question is, how many gettimeofday() per sec can Linux
> > do?

[Larry McVoy]
> Oh, about 3,531,073 on a 1Ghz AM thunderbird running
> Linux disks.bitmover.com 2.4.0-test5.

So, at two "context switches" (Jeff's term) per syscall, we're
somewhere around half the speed of Netware's longjmp.

Not bad. (:

Peter

2000-11-01 05:39:16

by Juri Haberland

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

"Jeff V. Merkey" wrote:
> A "context" is usually assued to be a "stack". The simplest of all
> context switches is:
>
> mov x, esp
> mov esp, y

Presumably you'd immediately do a ret to some address, and there pop a
base address off the stack to get some global memory. Is that right?
Your context switches would be inline, and you'd have hardcoded which
process to execute next in most cases.

I'll buy the concept that changing stacks amounts to changing contexts,
so long as you follow certain rules. Obviously, rules are what define a
context. What are the two instructions that precede and the two
instructions that follow? I'd guess something like this:

	push bp		; save the caller's frame
	push $1		; resumption address for this context
	mov x, esp	; park the outgoing stack
	mov esp, y	; adopt the incoming stack
	ret		; pops the new context's resumption address
$1:	pop bp		; ...runs when something switches back here

--
Daniel

2000-11-01 09:52:15

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

At 01:30 01/11/2000, Andrea Arcangeli wrote:
> > Larry McVoy wrote:
> >> Are there processes with virtual memory?
>On Tue, Oct 31, 2000 at 03:38:00PM -0700, Jeff V. Merkey wrote:
> > Yes.
>
>If that stack switch is your context switch then you share the same VM for
>all tasks. I think the above answer "yes" just means you have pagetables so
>you can swap, but you _must_ miss memory protection across different
>processes. That also means any program can corrupt the memory of all the
>other programs. Even on the Palm that's a showstopper limitation (and on the
>Palm that's a hardware limitation, not a software deficiency of PalmOS).
>
>That will never happen in linux, nor in windows, nor internally to kde2. It
>happens in uclinux to deal with hardware without an MMU. And in fact the
>Agenda uses a MIPS CPU with memory protection even on an organizer, with
>obvious advantages.
>
>Just think: kde2 could have all the kde apps sharing the same VM, skipping
>all the tlb flushes, by simply using clone instead of fork. Guess why they
>aren't doing that? And even if they did, the first bug would _only_
>destabilize kde, so kill it and restart it and everything else will keep
>running fine (you don't even need to kill X). With your ring 0 linux
>_everything_ will crash, not just kde.

No need for imagination. Reality shows it: in my experience Netware is very
unstable as an OS. - We are running 5 Netware servers here in College (used
to be 3.12, now 4.11) and whenever we do any upgrades (e.g. a new service
pack) the servers start crashing every day or so until we find, one by one,
all the modules that are not SMP capable (I assume that this is the reason it
is crashing?) and take them out / replace them with 3rd party equivalent
modules. - Just to name some things: the Novell FTP server, Xconsole and/or
associated modules, the "Apple desktop rebuilding thing" module... - All of
those would cause the server to crash into the debugger when running SMP,
usually with a page fault. - Admittedly Netware is great at file &
application serving, so we use it, but it gets nowhere near the stability
of Linux. The number of times Linux production systems in College have
crashed can be counted on the fingers of one hand, while I lost count of
the Novell crashes a long time ago.

IMHO stability is more important than anything else. - I prefer to run 20
Linux servers which will result in no phone calls at midnight calling me
into College to reboot them, compared to a Netware server which runs as fast
as the 20 Linux servers but disturbs my out-of-working-hours time!

I agree that having a ring 0 OS will improve performance, no doubt about
that, but at what price?

Just my 2p.

Anton

>And on sane architectures like alpha you don't even need to flush the TLB
>during "real" context switching, so all your worry about sharing the same VM
>for everything is almost irrelevant there, since it happens all the time
>anyway (until you overflow the available ASN bits, which takes a lot of
>forks to happen).
>
>So IMHO it's much saner for you to move all your performance-critical code
>into kernel space (that will be only as stability-risky as khttpd and tux
>are). In 2.4.x that will avoid all the cr3 reloads, and that will be enough,
>since what you really care about during fileserving is avoiding the copies.
>
>Andrea

--
"Education is what remains after one has forgotten everything he
learned in school." - Albert Einstein
--
Anton Altaparmakov Voice: +44-(0)1223-333541(lab) / +44-(0)7712-632205(mobile)
Christ's College eMail: [email protected] / [email protected]
Cambridge CB2 3BU ICQ: 8561279
United Kingdom WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2000-11-01 11:15:13

by David Woodhouse

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!


[email protected] said:
> If that stack switch is your context switch then you share the same VM
> for all tasks.

> That will never happen in linux,

Isn't that _exactly_ what happens with Linux kernel threads, with lazy mm
switching?

--
dwmw2


2000-11-01 14:58:33

by Horst von Brand

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

Jesse Pollard <[email protected]> said:
> On Tue, 31 Oct 2000, Horst von Brand wrote:
> >Jesse Pollard <[email protected]> said:
> >
> >[...]
> >
> >> Also pay attention to the security aspects of a true "zero copy" TCP stack.
> >> It means that SOMETIMES a user buffer will receive data that is destined
> >> for a different process.

> >Why? AFAIKS, given proper handling of the issues involved, this can't
> >happen (sure can get tricky, but can be done in principle. Or am I
> >off-base?)

> As I understand the current implementation, this can't happen. One of the
> optimizations I had read about (for a linux test) used zero copy to/from
> the user buffer as well as zero copy in the kernel. I believe the DMA went
> directly to the user's memory.

Right. This means you have to ensure (somehow blocking the process(es) with
access to the buffer(s) involved) that nobody can see half-filled buffers.
Tricky, but not impossible, at least not in principle. Or play VM games and
switch the areas underneath atomically. The VM games, we have been told, are
costlier than the average copy on "typical" machines (PCs, presumably ;-),
plus you'd have to either ensure aligned buffers (how?) or keep two copies
of whatever surrounds them (where is the advantage then?).
--
Horst von Brand [email protected]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616

2000-11-01 15:01:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Wed, Nov 01, 2000 at 11:13:16AM +0000, David Woodhouse wrote:
> Isn't that _exactly_ what happens with Linux kernel threads, with lazy mm
> switching?

Sure. In fact the whole kernel (modules included) runs in ring 0 sharing the
same part of the VM and - as everybody knows - a bug in a driver (or in
khttpd or tux) can crash the kernel.

But you can't destabilize the whole system when a bug in apache triggers (that
would happen with a ring 0 linux instead, and yes, with "linux" Jeff meant the
whole system, not just the kernel).

Andrea

2000-11-01 15:41:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Tue, Oct 31, 2000 at 06:38:09PM -0700, Jeff V. Merkey wrote:
> [..] It's all rather complicated, and I think alien
> to Unix folks. [..]

That has _nothing_ to do with software. It only has to do with the IA32
hardware.

If you only switch the stack during context switching then you _can't_ provide
memory protection between different tasks. Period.

Andrea

2000-11-01 17:29:48

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Andrea Arcangeli wrote:
>
> On Tue, Oct 31, 2000 at 06:38:09PM -0700, Jeff V. Merkey wrote:
> > [..] It's all rather complicated, and I think alien
> > to Unix folks. [..]
>
> That has _nothing_ to do with software. It only has to do with the IA32
> hardware.
>
> If you only switch the stack during context switching then you _can't_ provide
> memory protection between different tasks. Period.
>

Andrea,

I am writing up a complete description of NetWare internals, and of what we
are doing to the 2.2.X code base to create a NetWare Linux hybrid. I will
post it, and if Alan wants to fork his own personal NetWare 2.2.18pre in his
/people area, I would have Andre and my folks maintain it and make it
available for everyone to use.

I've been deluged with emails from folks on the list telling me to go
for it, and from Novell customers who think it's a valid path to preserve
their investments in Novell technologies, and I know it will be a
lucrative market and put Linux in high-level enterprise accounts for
high-capacity file and print. When 2.4 is out the door, we'll start
looking at that one.

It would also allow the Linux companies to go from 20 million a year in
revenues to 100 million in revenues on boxed software sales alone.
Novell is bringing in 1 billion dollars a year. If Caldera, Suse, and
RedHat split this three ways, that's 33 million dollars more each year
than they are making now. People will pay it, and since it's ring 0,
they have to get it from a vendor to know that all the components are
stable together (this is how Novell stays in the accounts and is able to
demand a high price for this product).

I am finishing the NWFS 2.4.4 post, so I will finish the write-up after I
finish these auto-repair tools for the FS.

:-)

Jeff


> Andrea

2000-11-01 17:33:29

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Anton Altaparmakov wrote:
>
>
> IMHO stability is more important than anything else. - I prefer to run 20
> Linux servers which will result in no phone calls at midnight calling me
> into College to reboot them, compared to a Netware server which runs as fast
> as the 20 Linux servers but disturbs my out-of-working-hours time!
>
> I agree that having a ring 0 OS will improve performance, no doubt about
> that, but at what price?
>

It depends on how well we do our job. I guess that's the real debate.
Welcome back, how are things?

:-)

Jeff

> Just my 2p.
>
> Anton
>
> >And on sane architectures like alpha you don't even need to flush the TLB
> >during "real" context switching, so all your worry about sharing the same
> >VM for everything is almost irrelevant there, since it happens all the
> >time anyway (until you overflow the available ASN bits, which takes a lot
> >of forks to happen).
> >
> >So IMHO it's much saner for you to move all your performance-critical code
> >into kernel space (that will be only as stability-risky as khttpd and tux
> >are). In 2.4.x that will avoid all the cr3 reloads, and that will be
> >enough, since what you really care about during fileserving is avoiding
> >the copies.
> >
> >Andrea
>
> --
> "Education is what remains after one has forgotten everything he
> learned in school." - Albert Einstein
> --
> Anton Altaparmakov Voice: +44-(0)1223-333541(lab) / +44-(0)7712-632205(mobile)
> Christ's College eMail: [email protected] / [email protected]
> Cambridge CB2 3BU ICQ: 8561279
> United Kingdom WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2000-11-01 17:39:51

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



"Jeff V. Merkey" wrote:
>
> Andrea Arcangeli wrote:
> >
> > On Tue, Oct 31, 2000 at 06:38:09PM -0700, Jeff V. Merkey wrote:
> > > [..] It's all rather complicated, and I think alien
> > > to Unix folks. [..]
> >
> > That has _nothing_ to do with software. It only has to do with the IA32
> > hardware.
> >
> > If you only switch the stack during context switching then you _can't_ provide
> > memory protection between different tasks. Period.
> >
>
> Andrea,
>
> I am writing up a complete description of NetWare internals, and of what
> we are doing to the 2.2.X code base to create a NetWare Linux hybrid. I
> will post it, and if Alan wants to fork his own personal NetWare 2.2.18pre
> in his /people area, I would have Andre and my folks maintain it and make
> it available for everyone to use.
>
> I've been deluged with emails from folks on the list telling me to go
> for it, and from Novell customers who think it's a valid path to preserve
> their investments in Novell technologies, and I know it will be a
> lucrative market and put Linux in high-level enterprise accounts for
> high-capacity file and print. When 2.4 is out the door, we'll start
> looking at that one.
>
> It would also allow the Linux companies to go from 20 million a year in
> revenues to 100 million in revenues on boxed software sales alone.
> Novell is bringing in 1 billion dollars a year. If Caldera, Suse, and
> RedHat split this three ways, that's 33 million dollars more each year


Wrong math. That's 330 million dollars more for each company each year to
fund more Linux development and make us all rich...

Jeff

> than they are making now. People will pay it, and since it's ring 0,
> they have to get it from a vendor to know that all the components are
> stable together (this is how Novell stays in the accounts and is able to
> demand a high price for this product).
>
> I am finishing the NWFS 2.4.4 post, so I will finish the write-up after I
> finish these auto-repair tools for the FS.
>
> :-)
>
> Jeff
>
> > Andrea

2000-11-01 18:07:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Wed, Nov 01, 2000 at 10:35:50AM -0700, Jeff V. Merkey wrote:
> Wrong math. That's 330 million dollars more for each company each year to
> fund more Linux development and make us all rich...

Speaking only for myself: on the technical side I don't think you can be much
faster than moving the performance-critical services into the kernel and
skipping the copies (in fact I also think that for fileserving, skipping the
copies and making sendfile work, and work zero-copy, will be enough).
So I don't think losing robustness this way can be justified in any technical
way, and no, it's not by showing me money that you'll convince me it's a good
idea.
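
For reference, the sendfile() path Andrea mentions looks roughly like this
from userspace (a hedged sketch; the helper and its error handling are made
up, but sendfile(2) itself is real):

	#include <fcntl.h>
	#include <sys/sendfile.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* serve one file over an already-connected socket */
	ssize_t serve_file(int sock, const char *path)
	{
		struct stat st;
		off_t off = 0;
		ssize_t sent = -1;
		int fd = open(path, O_RDONLY);

		if (fd >= 0 && fstat(fd, &st) == 0)
			/* the kernel pushes page-cache data straight to
			 * the socket; no read()/write() copy through a
			 * user buffer */
			sent = sendfile(sock, fd, &off, st.st_size);
		if (fd >= 0)
			close(fd);
		return sent;
	}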

Andrea

2000-11-01 18:38:16

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Andrea Arcangeli wrote:
>
> On Wed, Nov 01, 2000 at 10:35:50AM -0700, Jeff V. Merkey wrote:
> > Wrong math. That's 330 million dollars for each compat more each year to
> > fund more Linux development and make us all rich...
>
> Speaking only for myself: on the technical side I don't think you can be
> much faster than moving the performance-critical services into the kernel
> and skipping the copies (in fact I also think that for fileserving,
> skipping the copies and making sendfile work, and work zero-copy, will be
> enough). So I don't think losing robustness this way can be justified in
> any technical way, and no, it's not by showing me money that you'll
> convince me it's a good idea.
>
> Andrea

This would help, but not as much as full ring 0.

Jeff


2000-11-01 21:14:18

by Juri Haberland

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

"Jeff V. Merkey" wrote:
>
> Andrea Arcangeli wrote:
> >
> > Speaking only for myself: on the technical side I don't think you can be
> > much faster than moving the performance-critical services into the kernel
> > and skipping the copies (in fact I also think that for fileserving,
> > skipping the copies and making sendfile work, and work zero-copy, will be
> > enough). So I don't think losing robustness this way can be justified in
> > any technical way, and no, it's not by showing me money that you'll
> > convince me it's a good idea.
>
> This would help, but not as much as full ring 0.

My experience is that I can get pretty much the same performance in ring
3 as in ring 0 as long as I don't reload segment registers or CR3. Is
this right, or am I missing some fundamental kind of ring 3 overhead?

Even in ring 0, you can mostly protect processes from each other using
segments: if you don't reload the segments you can restrict damage to
your own segment. It's not 100% safe but it is an enormous
improvement over running in the same address space as the OS kernel. I
don't have any problem at all with the idea of running a lot of parallel
tasks in the same address space: the safety of this comes down to the
compiler you use to compile the processes. If the compiler doesn't have
ops that let processes damage each other then you won't get damage,
assuming no bugs in your underlying implementation.

BTW, let me add my 'me too': go for it, there is obviously a pot of gold
there, just don't let Sauron^H^H^H^H^H^H Bill get to it first.

--
Daniel

2000-11-01 21:36:28

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Daniel Phillips wrote:
>
> "Jeff V. Merkey" wrote:
> >
> > Andrea Arcangeli wrote:
> > >
> > > Speaking only for myself: on the technical side I don't think you can
> > > be much faster than moving the performance-critical services into the
> > > kernel and skipping the copies (in fact I also think that for
> > > fileserving, skipping the copies and making sendfile work, and work
> > > zero-copy, will be enough). So I don't think losing robustness this way
> > > can be justified in any technical way, and no, it's not by showing me
> > > money that you'll convince me it's a good idea.
> >
> > This would help, but not as much as full ring 0.
>
> My experience is that I can get pretty much the same performance in ring
> 3 as in ring 0 as long as I don't reload segment registers or CR3. Is
> this right, or am I missing some fundamental kind of ring 3 overhead?
>
> Even in ring 0, you can mostly protect processes from each other using
> segments: if you don't reload the segments you can restrict damage to
> your own segment. It's not 100% safe but it is an enormous
> improvement over running in the same address space as the OS kernel. I
> don't have any problem at all with the idea of running a lot of parallel
> tasks in the same address space: the safety of this comes down to the
> compiler you use to compile the processes. If the compiler doesn't have
> ops that let processes damage each other then you won't get damage,
> assuming no bugs in your underlying implementation.
>
> BTW, let me add my 'me too': go for it, there is obviously a pot of gold
> there, just don't let Sauron^H^H^H^H^H^H Bill get to it first.


Amen to that one. BTW, the package we mailed out to you from Brian
went out yesterday. Let me know when it arrives. I sent it to the address
in Berlin you provided.

:-)

Jeff

>
> --
> Daniel

2000-11-02 21:59:31

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

[recipients list shortened]
At 17:28 01/11/2000, Jeff V. Merkey wrote:
>Anton Altaparmakov wrote:
> > IMHO stability is more important than anything else. - I prefer to run 20
> > Linux servers which will result in no phone calls at midnight calling me
> > into College to reboot them, compared to a Netware server which runs as
> > fast as the 20 Linux servers but disturbs my out-of-working-hours time!
> >
> > I agree that having a ring 0 OS will improve performance, no doubt about
> > that, but at what price?
>
>It depends on how well we do our job. I guess that's the real debate.

That's very true. (-:

I was just assuming that we live in an imperfect world and hence have
imperfect programs no matter how hard we try to keep them perfect. /-: But
that argument belongs in alt.philosophy.life or something like that... (-;

>Welcome back, how are things?

Fine. Thanks. Just very busy with other things, so I haven't gotten any
coding done in a few weeks now. )-:

Anton

--
"Education is what remains after one has forgotten everything he
learned in school." - Albert Einstein
--
Anton Altaparmakov Voice: +44-(0)1223-333541(lab) / +44-(0)7712-632205(mobile)
Christ's College eMail: [email protected] / [email protected]
Cambridge CB2 3BU ICQ: 8561279
United Kingdom WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2000-11-02 22:50:48

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



"Jeff V. Merkey" wrote:

In the example of an AGI-generating code fragment, while I described the
sequence that creates the AGI with immediate address usage correctly
(i.e. if you load an address into a register and then immediately attempt
to use it, it will generate an AGI), I failed to put the register in the
coding example.

A couple of folks are testing the gcc compiler for AGI problems as a
result of this post, and I am posting the corrected code for their
tests.

This code fragment will generate an AGI condition:

mov eax, addr
mov [eax].offset, ebx

You can do it with any register combination, BTW; eax and ebx are
provided as examples. For those who are monitoring the code produced by
gcc, this is the example to use to generate an AGI correctly for testing.
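
For those watching gcc output, a hedged C fragment that typically compiles
to the same load-then-use pair (the struct is hypothetical, purely for
illustration):

	struct node {
		struct node *next;
		int value;
	};

	int chase(struct node *n)
	{
		/* the pointer loaded by the first mov is used for address
		 * generation on the very next instruction, e.g. roughly
		 * "movl (%eax),%eax ; movl 4(%eax),%eax" in AT&T syntax */
		return n->next->value;
	}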

:-)

Jeff

2000-11-02 22:58:08

by Davide Libenzi

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Thu, 02 Nov 2000, Jeff V. Merkey wrote:
> "Jeff V. Merkey" wrote:
> This code fragment will generate an AGI condition:
>
> mov eax, addr
> mov [eax].offset, ebx

I had already posted the correction.
It was clear that you had forgotten something, because your old code
fragment did not generate an AGI.


- Davide

2000-11-02 23:04:09

by Jeff Merkey

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!



Davide Libenzi wrote:
>
> On Thu, 02 Nov 2000, Jeff V. Merkey wrote:
> > "Jeff V. Merkey" wrote:
> > This code fragment will generate an AGI condition:
> >
> > mov eax, addr
> > mov [eax].offset, ebx
>
> I had already posted the correction.
> It was clear that You had forgot something coz Your old code fragment did not
> generate AGI.

Typing too fast late at night. I described it correctly; I just forgot
to add the register indirection.

:-)

Jeff

>
> - Davide

2000-11-03 06:42:57

by Juri Haberland

[permalink] [raw]
Subject: Re: 2.2.18Pre Lan Performance Rocks!

"Jeff V. Merkey" wrote:
> A "context" is usually assued to be a "stack". The simplest of all
> context switches is:
>
> mov x, esp
> mov esp, y

Is that your two-instruction context switch? The problem is, it doesn't
transfer control anywhere. Maybe it doesn't need to. I guess you could
break your tasks up into lots of little chunks, compile each chunk
inline, and use actual calls to take you off the fast path. The stack
changes are actually doing some useful work here: you might, for instance,
be processing a network packet whose address is on the stack. But
somehow I don't think this is your two-instruction context switch. The
only halfway flexible two-instruction context switch I can think of is:

mov esp, y
ret

where you already know the stack depth where you are, so you don't have
to store it, and the task execution order is predetermined. This
switches the *two* essential ingredients of a context: control+data.
But there's a big fat AGI there and all the overhead of a jump, so it
doesn't get you superscalar performance.
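
A runnable analogue of this kind of minimal switch, for anyone who wants to
play with it, is setjmp/longjmp (presumably why it got called "Netware's
longjmp" earlier in the thread). A hedged sketch, saving and restoring only
a handful of registers including the stack pointer:

	#include <setjmp.h>
	#include <stdio.h>

	static jmp_buf ctx;

	static void task(void)
	{
		printf("in task, switching back\n");
		longjmp(ctx, 1);	/* restore saved registers + stack */
	}

	int main(void)
	{
		if (setjmp(ctx) == 0)	/* park esp, eip and friends */
			task();
		printf("back in main\n");
		return 0;
	}

No MM switch, no FPU save, no scheduler: exactly the baggage the rest of the
thread says a real pre-emptive context switch cannot drop.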

Now my stupid question: why on earth do you need a billion context
switches a second?

--
Daniel