LinuxLists.cc - 2.2.18Pre Lan Performance Rocks!

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > It's still got some problems with NFS (I am seeing a few RPC timeout
> > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > but it's most impressive.
>
> Can you send a summary of the NFS reports to [email protected]

Yes. I just went home, so I am emailing from my house. I'll post late
tonight or in the morning. Performance on 100Mbit with NFS going
Linux->Linux is getting better throughput than IPX NetWare Client ->
NetWare 5.x on the same network by @ 3%. When you start loading up a
Linux server, it drops off sharply and NetWare keeps scaling, however,
this does indicate that the LAN code paths are equivalent relative to
latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
(but we'll fix this in Linux next). I think the ring transitions to
user space daemons are what are causing the scaling problems
Linux vs. NetWare.

:-)

Jeff

2000-10-30 06:47:31

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > but it's most impressive.
> >
> > Can you send a summary of the NFS reports to [email protected]
>
> Yes. I just went home, so I am emailing from my house. I'll post late
> tonight or in the morning. Performance on 100Mbit with NFS going
> Linux->Linux is getting better throughput than IPX NetWare Client ->
> NetWare 5.x on the same network by @ 3%. When you start loading up a
> Linux server, it drops off sharply and NetWare keeps scaling, however,
> this does indicate that the LAN code paths are equivalent relative to
> latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> (but we'll fix this in Linux next). I think the ring transitions to
> user space daemons are what are causing the scaling problems
> Linux vs. NetWare.

There are no user space daemons involved in the knfsd fast path, only in slow paths
like mounting.
The main problem I think in knfsd are the numerous copies of the data (e.g. 2+checksumming for
RX with fragments, upto 4 in some specific configurations). They're unfortunately
not trivial to fix. TX is a bit better, it does only one copy usually out of
the page cache. For RX it also helps to have a network card that supports hardware
checksumming.

-Andi

2000-10-30 07:02:07

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > but it's most impressive.
> > >
> > > Can you send a summary of the NFS reports to [email protected]
> >
> > Yes. I just went home, so I am emailing from my house. I'll post late
> > tonight or in the morning. Performance on 100Mbit with NFS going
> > Linux->Linux is getting better throughput than IPX NetWare Client ->
> > NetWare 5.x on the same network by @ 3%. When you start loading up a
> > Linux server, it drops off sharply and NetWare keeps scaling, however,
> > this does indicate that the LAN code paths are equivalent relative to
> > latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> > (but we'll fix this in Linux next). I think the ring transitions to
> > user space daemons are what are causing the scaling problems
> > Linux vs. NetWare.
>
> There are no user space daemons involved in the knfsd fast path, only in slow paths
> like mounting.

So why is it spawning off nfsd servers in user space? I did notice that all
the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
looks to be pretty fast. I know about the checksumming problem with TCPIP.
But cycles are cheap on todays processors, so even with this overhead, it
could still get faster. IPX uses small packets that are less wire efficient
since the ratio of header size to payload is larger than what NFS in Linux
is doing, plus there's more of them for an equivalent data transfer, even
with packet burst. IPX would tend to be faster if there were multiple
routers involved, since the latency of smaller packets would be less and
IPX never has to deal with the problem of fragmentation.

Is there any CR3 reloading or tasking switching going on for each interrupt
that comes in you know of, BTW? EMON stats show the server's bus utilization
to get disproportiantely higher in Linux than NetWare 5.x and when it hits
60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
the bus utilitization, which would suggest clocks are being used for something
other than pushing packets in and out of the box (the checksumming is done
in the tcp code when it copies the fragments, which is very low overhead).

Jeff

> The main problem I think in knfsd are the numerous copies of the data (e.g. 2+checksumming for
> RX with fragments, upto 4 in some specific configurations). They're unfortunately
> not trivial to fix. TX is a bit better, it does only one copy usually out of
> the page cache. For RX it also helps to have a network card that supports hardware
> checksumming.
>
>
>
> -Andi

2000-10-30 07:09:18

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Sun, Oct 29, 2000 at 11:58:21PM -0700, Jeff V. Merkey wrote:
> On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> > On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > > but it's most impressive.
> > > >
> > > > Can you send a summary of the NFS reports to [email protected]
> > >
> > > Yes. I just went home, so I am emailing from my house. I'll post late
> > > tonight or in the morning. Performance on 100Mbit with NFS going
> > > Linux->Linux is getting better throughput than IPX NetWare Client ->
> > > NetWare 5.x on the same network by @ 3%. When you start loading up a
> > > Linux server, it drops off sharply and NetWare keeps scaling, however,
> > > this does indicate that the LAN code paths are equivalent relative to
> > > latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> > > (but we'll fix this in Linux next). I think the ring transitions to
> > > user space daemons are what are causing the scaling problems
> > > Linux vs. NetWare.
> >
> > There are no user space daemons involved in the knfsd fast path, only in slow paths
> > like mounting.
>
> So why is it spawning off nfsd servers in user space? I did notice that all

They just provide a process context, the actual work is only done in kernel mode.

> the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
> looks to be pretty fast. I know about the checksumming problem with TCPIP.
> But cycles are cheap on todays processors, so even with this overhead, it
> could still get faster. IPX uses small packets that are less wire efficient
> since the ratio of header size to payload is larger than what NFS in Linux
> is doing, plus there's more of them for an equivalent data transfer, even
> with packet burst. IPX would tend to be faster if there were multiple
> routers involved, since the latency of smaller packets would be less and
> IPX never has to deal with the problem of fragmentation.
>
> Is there any CR3 reloading or tasking switching going on for each interrupt
> that comes in you know of, BTW? EMON stats show the server's bus utilization

No, interrupts do not change CR3.

One problem in Linux 2.2 is that kernel threads reload their VM on context switch
(that would include the nfsd thread), this should be fixed in 2.4 with lazy mm.
Hmm actually it should be only fixed for true kernel threads that have been started
with kernel_thread(), the "pseudo kernel threads" like nfsd uses probably do not
get that optimization because they don't set their MM to init_mm.

> to get disproportiantely higher in Linux than NetWare 5.x and when it hits
> 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8

I think that can be explained by the copying.

To be sure you could e.g. use the ktrace patch from IKD, it will give you
cycle accurate traces of execution of functions.

> the bus utilitization, which would suggest clocks are being used for something
> other than pushing packets in and out of the box (the checksumming is done
> in the tcp code when it copies the fragments, which is very low overhead).

No, unfortunately not. NFS uses UDP and the UDP defragmenting doesn't do copy-checksum
currently.

-Andi

2000-10-30 07:17:32

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Andi Kleen wrote:

> One problem in Linux 2.2 is that kernel threads reload their VM on
> context switch (that would include the nfsd thread), this should be
> fixed in 2.4 with lazy mm. Hmm actually it should be only fixed for
> true kernel threads that have been started with kernel_thread(), the
> "pseudo kernel threads" like nfsd uses probably do not get that
> optimization because they don't set their MM to init_mm.

yes, but for this there is an explicit mechanizm to lazy-MM during lengthy
system calls, an example is in buffer.c:

user_mm = start_lazy_tlb();
error = sync_old_buffers();
end_lazy_tlb(user_mm);

> > to get disproportiantely higher in Linux than NetWare 5.x and when it hits
> > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
>
> I think that can be explained by the copying.

yes. Constant copying contaminates the L1/L2 caches and creates dirty
cachelines all around the place. Fixed in 2.4 + TUX ;-)

Ingo

2000-10-30 07:20:32

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 08:08:58AM +0100, Andi Kleen wrote:
> On Sun, Oct 29, 2000 at 11:58:21PM -0700, Jeff V. Merkey wrote:
> > On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> > > On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > > > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > > > but it's most impressive.
> > > > >
> > > > > Can you send a summary of the NFS reports to [email protected]
> > > >
> > > > Yes. I just went home, so I am emailing from my house. I'll post late
> > > > tonight or in the morning. Performance on 100Mbit with NFS going
> > > > Linux->Linux is getting better throughput than IPX NetWare Client ->
> > > > NetWare 5.x on the same network by @ 3%. When you start loading up a
> > > > Linux server, it drops off sharply and NetWare keeps scaling, however,
> > > > this does indicate that the LAN code paths are equivalent relative to
> > > > latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> > > > (but we'll fix this in Linux next). I think the ring transitions to
> > > > user space daemons are what are causing the scaling problems
> > > > Linux vs. NetWare.
> > >
> > > There are no user space daemons involved in the knfsd fast path, only in slow paths
> > > like mounting.
> >
> > So why is it spawning off nfsd servers in user space? I did notice that all
>
> They just provide a process context, the actual work is only done in kernel mode.
>
> > the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
> > looks to be pretty fast. I know about the checksumming problem with TCPIP.
> > But cycles are cheap on todays processors, so even with this overhead, it
> > could still get faster. IPX uses small packets that are less wire efficient
> > since the ratio of header size to payload is larger than what NFS in Linux
> > is doing, plus there's more of them for an equivalent data transfer, even
> > with packet burst. IPX would tend to be faster if there were multiple
> > routers involved, since the latency of smaller packets would be less and
> > IPX never has to deal with the problem of fragmentation.
> >
> > Is there any CR3 reloading or tasking switching going on for each interrupt
> > that comes in you know of, BTW? EMON stats show the server's bus utilization
>
> No, interrupts do not change CR3.
>
> One problem in Linux 2.2 is that kernel threads reload their VM on context switch
> (that would include the nfsd thread), this should be fixed in 2.4 with lazy mm.

When you say it reloads it's VM, you mean it reloads the CR3 register?
This will cost you about 15 clocks up front, and about 150 clocks over
time as each TLB is reloaded (plus is will do LOCK# assertions invisibly
underneath for PTE fetches) One major difference is that in NetWare,
CR3 does not ever get changed in any network I/O fast paths, and none of
the TCP or IPX processes exist in user space, so there's never this
background overhead with page tables being loaded in and out. CR3 only
gets mucked with when someone allocates memory and pages in an address
frame (when memory is alloced and freed).

The user space activity is what's causing this, plus the copying. Is there
an easy way to completely disable multiple PTE/PDE address spaces and just
map the entire address space linear (like NetWare) with a compile option
so I could do a true apples to apples comparison?

Jeff

> Hmm actually it should be only fixed for true kernel threads that have been started
> with kernel_thread(), the "pseudo kernel threads" like nfsd uses probably do not
> get that optimization because they don't set their MM to init_mm.
>
> > to get disproportiantely higher in Linux than NetWare 5.x and when it hits
> > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
>
> I think that can be explained by the copying.

It's more. I am seeing tons of segment register reloads, which are very
heavy, and atomic reads of PTE entries.

>
> To be sure you could e.g. use the ktrace patch from IKD, it will give you
> cycle accurate traces of execution of functions

I'll look at this.

> > the bus utilitization, which would suggest clocks are being used for something
> > other than pushing packets in and out of the box (the checksumming is done
> > in the tcp code when it copies the fragments, which is very low overhead).
>
> No, unfortunately not. NFS uses UDP and the UDP defragmenting doesn't do copy-checksum

Correct. You're right -- it's in the tcp code, however, I remember going over
this when I was reviewing Alan's code -- that's what I was think of.

Jeff

> currently.
>
>
> -Andi

2000-10-30 07:24:12

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 09:26:59AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Andi Kleen wrote:
>
> > One problem in Linux 2.2 is that kernel threads reload their VM on
> > context switch (that would include the nfsd thread), this should be
> > fixed in 2.4 with lazy mm. Hmm actually it should be only fixed for
> > true kernel threads that have been started with kernel_thread(), the
> > "pseudo kernel threads" like nfsd uses probably do not get that
> > optimization because they don't set their MM to init_mm.
>
> yes, but for this there is an explicit mechanizm to lazy-MM during lengthy
> system calls, an example is in buffer.c:
>
> user_mm = start_lazy_tlb();
> error = sync_old_buffers();
> end_lazy_tlb(user_mm);
>
> > > to get disproportiantely higher in Linux than NetWare 5.x and when it hits
> > > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
> >
> > I think that can be explained by the copying.
>
> yes. Constant copying contaminates the L1/L2 caches and creates dirty
> cachelines all around the place. Fixed in 2.4 + TUX ;-)
>

Ingo, we need a build option to completely disable multiple address spaces
for a start, and just map everything to a linear address space. This
will eliminate the overhead of the CR3 activity. The use of segment registers
for copy_to_user, etc. causes segment register reloads, which are very
heavyweight on Intel.

Is there an option to map Linux into a flat address space like NetWare so
I can do an apples to apples comparison of raw LAN I/O scaling?

> Ingo

2000-10-30 07:30:13

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> Is there an option to map Linux into a flat address space [...]

nope, Linux is fundamentally multitasked.

what you can do to hack around this is to not switch to the idle thread
after having done work in nfsd. Some simple & stupid thing in schedule:

if (next == idle_task) {
while (nr_running)
barrier();
goto repeat_schedule;
}

(provided you are testing this on a UP system.) This way we do not destroy
the TLB cache when we wait a few microseconds for the next network
interrupt.

we do this in 2.4 already - ie. nfsd doesnt have to mark itself lazy-MM,
the idle thread will automatically 'inherit' the MM of nfsd, and is going
to switch CR3 only if the next process is not nfsd. So you can get an
apples to apples comparison by using 2.4.

Ingo

2000-10-30 07:38:43

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:16:46AM -0700, Jeff V. Merkey wrote:
> On Mon, Oct 30, 2000 at 08:08:58AM +0100, Andi Kleen wrote:
> > On Sun, Oct 29, 2000 at 11:58:21PM -0700, Jeff V. Merkey wrote:
> > > On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> > > > On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > > > > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > > > > but it's most impressive.
> > > > > >
> > > > > > Can you send a summary of the NFS reports to [email protected]
> > > > >
> > > > > Yes. I just went home, so I am emailing from my house. I'll post late
> > > > > tonight or in the morning. Performance on 100Mbit with NFS going
> > > > > Linux->Linux is getting better throughput than IPX NetWare Client ->
> > > > > NetWare 5.x on the same network by @ 3%. When you start loading up a
> > > > > Linux server, it drops off sharply and NetWare keeps scaling, however,
> > > > > this does indicate that the LAN code paths are equivalent relative to
> > > > > latency vs. MSM/TSM/HSM in NetWare. NetWare does better caching
> > > > > (but we'll fix this in Linux next). I think the ring transitions to
> > > > > user space daemons are what are causing the scaling problems
> > > > > Linux vs. NetWare.
> > > >
> > > > There are no user space daemons involved in the knfsd fast path, only in slow paths
> > > > like mounting.
> > >
> > > So why is it spawning off nfsd servers in user space? I did notice that all
> >
> > They just provide a process context, the actual work is only done in kernel mode.
> >
> > > the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
> > > looks to be pretty fast. I know about the checksumming problem with TCPIP.
> > > But cycles are cheap on todays processors, so even with this overhead, it
> > > could still get faster. IPX uses small packets that are less wire efficient
> > > since the ratio of header size to payload is larger than what NFS in Linux
> > > is doing, plus there's more of them for an equivalent data transfer, even
> > > with packet burst. IPX would tend to be faster if there were multiple
> > > routers involved, since the latency of smaller packets would be less and
> > > IPX never has to deal with the problem of fragmentation.
> > >
> > > Is there any CR3 reloading or tasking switching going on for each interrupt
> > > that comes in you know of, BTW? EMON stats show the server's bus utilization
> >
> > No, interrupts do not change CR3.
> >
> > One problem in Linux 2.2 is that kernel threads reload their VM on context switch
> > (that would include the nfsd thread), this should be fixed in 2.4 with lazy mm.
>
> When you say it reloads it's VM, you mean it reloads the CR3 register?

Yes.

> This will cost you about 15 clocks up front, and about 150 clocks over
> time as each TLB is reloaded (plus is will do LOCK# assertions invisibly
> underneath for PTE fetches) One major difference is that in NetWare,
> CR3 does not ever get changed in any network I/O fast paths, and none of
> the TCP or IPX processes exist in user space, so there's never this
> background overhead with page tables being loaded in and out. CR3 only
> gets mucked with when someone allocates memory and pages in an address
> frame (when memory is alloced and freed).
>
> The user space activity is what's causing this, plus the copying. Is there
> an easy way to completely disable multiple PTE/PDE address spaces and just
> map the entire address space linear (like NetWare) with a compile option
> so I could do a true apples to apples comparison?

No. In 2.4 you could probably use the on demand lazy vm mechanism ingo described for
the nfsd processes. In 2.2 it is a bit more tricky, if I remember right lazy mm needed
quite a few changes.

But before doing too many changes i would first verify if that is really the problem.

> > Hmm actually it should be only fixed for true kernel threads that have been started
> > with kernel_thread(), the "pseudo kernel threads" like nfsd uses probably do not
> > get that optimization because they don't set their MM to init_mm.
> >
> > > to get disproportiantely higher in Linux than NetWare 5.x and when it hits
> > > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is 1/8
> >
> > I think that can be explained by the copying.
>
> It's more. I am seeing tons of segment register reloads, which are very
> heavy, and atomic reads of PTE entries.

PTEs are read for aging on memory pressure in vmscan.

segment register reload happen on interrupt entry, to load the kernel cs/ds (are they that
costly?). If it was really costly you could probably check in interrupt entry if you're
already running in kernel space and skip it.

>
> > > the bus utilitization, which would suggest clocks are being used for something
> > > other than pushing packets in and out of the box (the checksumming is done
> > > in the tcp code when it copies the fragments, which is very low overhead).
> >
> > No, unfortunately not. NFS uses UDP and the UDP defragmenting doesn't do copy-checksum
>
> Correct. You're right -- it's in the tcp code, however, I remember going over
> this when I was reviewing Alan's code -- that's what I was think of.

2.2 does not use checksum-copy-to-user in TCP RX, csum is either done separate or in hardware.
It does csum-copy-fragment for TX, as well as UDP.

-Andi

2000-10-30 08:08:23

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

> >
> > When you say it reloads it's VM, you mean it reloads the CR3 register?
>
> Yes.
>
> No. In 2.4 you could probably use the on demand lazy vm mechanism ingo described for
> the nfsd processes. In 2.2 it is a bit more tricky, if I remember right lazy mm needed
> quite a few changes.
>
> But before doing too many changes i would first verify if that is really the problem.

We will never beat NetWare on scaling if this is the case, even in 2.4.
Andre and my first job will be to create an arch port with MANOS that
disables this and restructures the VM.

> PTEs are read for aging on memory pressure in vmscan.
>
> segment register reload happen on interrupt entry, to load the kernel cs/ds (are they that
> costly?). If it was really costly you could probably check in interrupt entry if you're
> already running in kernel space and skip it.

They are. segment register reloads will trigger the following:

IDT table atomic fetch to verify (LOCK#) (if triggered by task gate from INTR)
GDT table atomic fetch to verify (LOCK#)
LDT table atomic fetch to verify (LOCK#) (if present)
PDE table atomic fetch to verify (LOCK#)

The process has to verify that the loaded segment descriptor is valid, and
it will fetch from all these tables to do it, with up to 4 (LOCK#)
assertions occurring invisibly in the hardware underneath (which will
generate 4 non-cacheable memory references, in addition to wrecking
havoc on the affected L1/L2 cache lines). Oink. It only does this
when you load one, not when you save one, like pushing it on the
stack. Since you look at MANOS code, you'll note that in
CONTEXT.386, I do and add esp, 3 * 4 instead of poppping the segment
registers off the stack if they are in the kernel address space.

Linux should do the same, if possible as an optimization.

>
> >
>
> 2.2 does not use checksum-copy-to-user in TCP RX, csum is either done separate or in hardware.
> It does csum-copy-fragment for TX, as well as UDP.

Yes, I know, this is what I was referrring to. Received are always ugly,
there's always at least one copy into cache for the write, sometimes
more...

Jeff

>
> -Andi
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/

2000-10-30 08:12:04

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 09:39:37AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > Is there an option to map Linux into a flat address space [...]
>
> nope, Linux is fundamentally multitasked.
>
> what you can do to hack around this is to not switch to the idle thread
> after having done work in nfsd. Some simple & stupid thing in schedule:
>
> if (next == idle_task) {
> while (nr_running)
> barrier();
> goto repeat_schedule;
> }
>
> (provided you are testing this on a UP system.) This way we do not destroy
> the TLB cache when we wait a few microseconds for the next network
> interrupt.
>
> we do this in 2.4 already - ie. nfsd doesnt have to mark itself lazy-MM,
> the idle thread will automatically 'inherit' the MM of nfsd, and is going
> to switch CR3 only if the next process is not nfsd. So you can get an
> apples to apples comparison by using 2.4.

Ingo, I will attempt this, but I doubt seriously it will allow
Linux to defeat NetWare 5.x on LAN I/O scaling. ALl protection
has to go away in all LAN paths for this to happen, and user space
apps set to ring 0. NetWare 5.cx does support ring 3 applications,
but the model is different than Linux. I will look at a potential
compromise between the two with the MANOS /arch merge. It may
allow some incarnation of Linux to smoke NetWare 5.x on LAN
performance. Doing this would move MARS-NWE and SAMBA into the kernel
without changing a line of code in either...

Jeff

>
> Ingo

2000-10-30 08:16:34

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 01:04:34AM -0700, Jeff V. Merkey wrote:
> > >
> > > When you say it reloads it's VM, you mean it reloads the CR3 register?
> >
> > Yes.
> >
> > No. In 2.4 you could probably use the on demand lazy vm mechanism ingo described for
> > the nfsd processes. In 2.2 it is a bit more tricky, if I remember right lazy mm needed
> > quite a few changes.
> >
> > But before doing too many changes i would first verify if that is really the problem.
>
> We will never beat NetWare on scaling if this is the case, even in 2.4.
> Andre and my first job will be to create an arch port with MANOS that
> disables this and restructures the VM.

I just guess the end result will be as crash prone as Netware when you install any third
party software ;)

lazy mm is probably a better path, as long as you stay in kernel threads and a single user mm
it'll never switch VMs.

>
> > PTEs are read for aging on memory pressure in vmscan.
> >
> > segment register reload happen on interrupt entry, to load the kernel cs/ds (are they that
> > costly?). If it was really costly you could probably check in interrupt entry if you're
> > already running in kernel space and skip it.
>
> They are. segment register reloads will trigger the following:
>
> IDT table atomic fetch to verify (LOCK#) (if triggered by task gate from INTR)
> GDT table atomic fetch to verify (LOCK#)
> LDT table atomic fetch to verify (LOCK#) (if present)
> PDE table atomic fetch to verify (LOCK#)
>
> The process has to verify that the loaded segment descriptor is valid, and
> it will fetch from all these tables to do it, with up to 4 (LOCK#)
> assertions occurring invisibly in the hardware underneath (which will
> generate 4 non-cacheable memory references, in addition to wrecking
> havoc on the affected L1/L2 cache lines). Oink. It only does this
> when you load one, not when you save one, like pushing it on the
> stack. Since you look at MANOS code, you'll note that in
> CONTEXT.386, I do and add esp, 3 * 4 instead of poppping the segment
> registers off the stack if they are in the kernel address space.
>
> Linux should do the same, if possible as an optimization.

Interesting. You could do easily the same in Linux by changing (in 2.2) the SAVE_ALL macro
in arch/i386/kernel/irq.h and doing the same in ret_from_intr's RESTORE_ALL macro (after
changing it to a RESTORE_ALL_INT or you'll break system calls)

Actually I don't even know why irq.h's SAVE_ALL even loads __KERNEL_CS, it should be already
set by the interrupt gate. __KERNEL_DS probably needs to be still loaded, but only when
you came from user space.

-Andi

2000-10-30 08:40:27

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

> >
> > I just guess the end result will be as crash prone as Netware when you install any third
> > party software ;)
> >
> > lazy mm is probably a better path, as long as you stay in kernel threads and a single user mm
> > it'll never switch VMs.
>
> It's not that bad. Some code has been running for years and will be OK.
> Also, the number of Ooops people get anyway on Linux today is about the
> same state as Abends were in NetWare in their frequency in Live customers
> accounts. I've also seen miscreant user apps in Linux wreck havoc
> from user space as deadly as an Oops.
>
> It's not the activity in the lan path that's the problem, it's the switching
> period that's the problem. The fix goes in the loader -- trusted apps get
> loaded in the kernel, non-trusted apps get loaded in a ring 3 address space.
> The WTD thing I described stops them from context switching until the
> system goes ile, since LAN I/O always gets moved to the front of the
> work queue, if you remember my post describing it. NetWare does this
> trick by holding off CR3 reloads to apps until the server goes idle
> relative to incoming or outging I/O. Linus just lets both happen
> all over the place. I'll show you end of December with the port. Ring 3
> apps will still work - you'll just be able to load some of them at ring 0,
> and run like bat out of hell.
>
> You are right though, Linux is like a tank going 25 miles an hour crushing
> everything in it's path, while NetWare is an aluminum frame speed racer
> that weighs 20 lbs., gets 1000 miles to the gallon, runs at MACH VI, and
> will explode in flames if it hits a tack in the road...
>
> :-)
>
> > > stack. Since you look at MANOS code, you'll note that in
> > > CONTEXT.386, I do and add esp, 3 * 4 instead of poppping the segment
> > > registers off the stack if they are in the kernel address space.
> > >
> > > Linux should do the same, if possible as an optimization.
> >
>
>
>
> > Interesting. You could do easily the same in Linux by changing (in 2.2) the SAVE_ALL macro
> > in arch/i386/kernel/irq.h and doing the same in ret_from_intr's RESTORE_ALL macro (after
> > changing it to a RESTORE_ALL_INT or you'll break system calls)
> >
> > Actually I don't even know why irq.h's SAVE_ALL even loads __KERNEL_CS, it should be already
> > set by the interrupt gate. __KERNEL_DS probably needs to be still loaded, but only when
> > you came from user space.
>
> Sounds like a fun patch. I'll get to work on it. Might want to ping Linus,
> and let him put this one in, since it's his code, and would be a very easy
> fix. The more lipo-suction we could do, the better.
>
> :-)
>
> Jeff
>
>
>
> >
> > -Andi

2000-10-30 08:42:37

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> [...] All protection has to go away in all LAN paths for this to
> happen, and user space apps set to ring 0. [...]

i found that this is not a requirement for good network scalability. We do
not do a syscall for every packet, so the cost evens out. Sure, it does
not hurt to not eat ~1 microsecond per system-call, but it causes no
overhead or scalability limit otherwise. In the TUX webserver we have
user-space modules doing context-switch-less webserving, and it scales
quite well, and is generic.

Ingo

2000-10-30 08:59:43

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 10:52:08AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > [...] All protection has to go away in all LAN paths for this to
> > happen, and user space apps set to ring 0. [...]
>
> i found that this is not a requirement for good network scalability. We do
> not do a syscall for every packet, so the cost evens out. Sure, it does
> not hurt to not eat ~1 microsecond per system-call, but it causes no
> overhead or scalability limit otherwise. In the TUX webserver we have
> user-space modules doing context-switch-less webserving, and it scales
> quite well, and is generic.
>
> Ingo

No argument here, but the overhead of reloading CR3 period will kill
performance. 2.4 does not beat NetWare, BTW, it gets a little further,
but still hits the wall, and Netware keeps going strong, and scales
up to 5000 users on file and print, with a SINGLE processor at less <
40% utilization. Linux hits the wall at a few hundred file and print
users. It's the overhead of CR3 switching at all that is causing this
and the heavy usage of Intel's protection model.

For example, if you put a MOV EAX, CR3; MOV CR3, EAX; in a context switching
path, on a PPro 200, you can do about 35,000 context switches/second
(I did this in NetWare 4.x in 1994) with the same code without the
CR3 reload, I could do over 2,000,000 context switches/second without
the CR3 load. INVL Page[0] in the same path is less heavy, but still
reduced context switches to 135,000/second. It's the CR3 activity period
that kills performance. There's also the use of segment registers
all over the place to copy from kernel to user and user to kernel
space memory. Having the fast paths you mention does help a lot,
but it's the fact that this goes on at all that will make it tough
to walk into a NetWare shop with Linux and rip out NetWare servers
and replace them unless we look at a NetWare vs. NetLinux (that's
what we call it! a NetWare-like Linux platform).

If we had this, MS would be running for the trenches. Coupling the
internet application base of Linux with the speed of NetWare would
be Mr. Gate's worst nightmare. They've been trying to kill off NetWare
for almost 15 years, and it's still out there and they are still
slower, and everyone knows it. It's not Linux or NT that's killing
off NetWare, right now, it's the incompetent management in San Jose
who are trying to rape Novell's rich bank accounts
(Schmidt and Sonsini and friends), and the fact they have no IA64
NetWare (they would be shipping it not if I were still there ....).

It could be as easy as a compile option in .config (NETWARE_SERVER_MODE [Y/N])
with WTD (which is a more sophistacted method for doing lazy VM and
controlling context switching activity for fast path I/O) and
mapping of te Linux address space as linear -- you can still have
protection with this model, just some apps would be able to load
in the kernel address space, and run at ring 0).

:-)

Jeff

2000-10-30 09:04:34

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> No argument here, but the overhead of reloading CR3 period will kill
> performance. [...]

2.4 does not reload CR3, unless you are using multiple user-space
processes.

> 2.4 does not beat NetWare, BTW, it gets a little further, but still
> hits the wall, [...]

as i told you in the previous mail, the main overhead is not CR3, it's the
copying & dirtying of all data, and the subsequent DMA-initiated dirty
cacheline writeback. I can serve 100 MB/sec web content with 2.4 & TUX
just fine - it relies on a zero-copying infrastructure.

Ingo

2000-10-30 09:14:57

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:13:58AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > No argument here, but the overhead of reloading CR3 period will kill
> > performance. [...]
>
> 2.4 does not reload CR3, unless you are using multiple user-space
> processes.
>
> > 2.4 does not beat NetWare, BTW, it gets a little further, but still
> > hits the wall, [...]
>
> as i told you in the previous mail, the main overhead is not CR3, it's the
> copying & dirtying of all data, and the subsequent DMA-initiated dirty
> cacheline writeback. I can serve 100 MB/sec web content with 2.4 & TUX
> just fine - it relies on a zero-copying infrastructure.
>
> Ingo

Great Ingo, you've got the web server covered. What about file and print.
I think this is great, but most web servers are connected to a T1 or T3
line, and all the fancy optimization means absolutely squat since about
99.999999% of the planet has a max baandwidth of a T1, ADSL, or T3 Line,
this is a far cry from Gigabit ethernet, or even 100Mbit ethernet.

How many users can you put on the web server? Web servers are also
read only data, the easiest of all LAN cases to deal with. It's
incoming writes that present all the tough problems, as Andi Klein
wisely observed. Not to knock Tux, I think it's great, but it does
not solve the file and print scaling problem, no does it?

Jeff

2000-10-30 09:17:47

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> For example, if you put a MOV EAX, CR3; MOV CR3, EAX; in a context
> switching path, on a PPro 200, you can do about 35,000 context
> switches/second

in 2.4 & Xeons we can do more than 100,000 context switches/second, and
that is more than enough. But the point is: network IO performance does
not depend on context switching speed too much. Also, in Linux we are
using global pages which makes kernel-space TLBs persistent even across
CR3 flushes.

> [...] There's also the use of segment registers all over the place to
> copy from kernel to user and user to kernel space memory. [...]

we do not use the fs segment register for user-space copies anymore,
neither in 2.2, nor in 2.4. You must be reading old books and probably
forgot to cross-check with the kernel source? :-)

> [...] Having the fast paths you mention does help a lot, but it's the
> fact that this goes on at all that will make it tough to walk into a
> NetWare shop with Linux and rip out NetWare servers and replace them
> unless we look at a NetWare vs. NetLinux (that's what we call it! a
> NetWare-like Linux platform).

the worst thing you can do is to mis-identify performance problems and
spend braincells on the wrong problem. The problems limiting Linux network
scalability have been identified during the last 12 months by a small
team, and solved in TUX. TUX is a fileserver, it shouldnt be alot of work
to enable it for (TCP-only?) netware serving. It's *done*, Jeff, it's not
a hypotetical thing, it's here, it works and it performs.

Ingo

2000-10-30 09:24:18

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:27:04AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > For example, if you put a MOV EAX, CR3; MOV CR3, EAX; in a context
> > switching path, on a PPro 200, you can do about 35,000 context
> > switches/second
>
> in 2.4 & Xeons we can do more than 100,000 context switches/second, and
> that is more than enough. But the point is: network IO performance does
> not depend on context switching speed too much. Also, in Linux we are
> using global pages which makes kernel-space TLBs persistent even across
> CR3 flushes.

This is putrid. NetWare does 353,00,000/second on a Xenon, pumping out
gobs of packets in between them. MANOS does 857,000,000/second. This
is terrible. No wonder it's so f_cking slow!!!

>
> > [...] There's also the use of segment registers all over the place to
> > copy from kernel to user and user to kernel space memory. [...]
>
> we do not use the fs segment register for user-space copies anymore,
> neither in 2.2, nor in 2.4. You must be reading old books and probably
> forgot to cross-check with the kernel source? :-)

ds: and es: are both used in copy-to-user and copy-from-user and they get
reloaded.

>
> > [...] Having the fast paths you mention does help a lot, but it's the
> > fact that this goes on at all that will make it tough to walk into a
> > NetWare shop with Linux and rip out NetWare servers and replace them
> > unless we look at a NetWare vs. NetLinux (that's what we call it! a
> > NetWare-like Linux platform).
>
> the worst thing you can do is to mis-identify performance problems and
> spend braincells on the wrong problem. The problems limiting Linux network
> scalability have been identified during the last 12 months by a small
> team, and solved in TUX. TUX is a fileserver, it shouldnt be alot of work
> to enable it for (TCP-only?) netware serving. It's *done*, Jeff, it's not
> a hypotetical thing, it's here, it works and it performs.
>

NetWare is here too, and it handles 5000+ file and print users, Linux does not.
Let's fix it. I know why NetWare is fast. Let's apply some of the same
principles and see what happens. Love to have you involved.

> Ingo

2000-10-30 09:32:09

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> [...] you've got the web server covered. What about file and print.

a web server, as you probably know, is a read-mostly fileserver that
serves files via the HTTP protocol. The rest is only protocol fluff.

> I think this is great, but most web servers are connected to a T1 or
> T3 line, and all the fancy optimization means absolutely squat since
> about 99.999999% of the planet has a max baandwidth of a T1, ADSL, or
> T3 Line, this is a far cry from Gigabit ethernet, or even 100Mbit
> ethernet.

Your argument is curious - first you cry for performance, then you say
'nobody wants that much bandwidth'. Of course, if the network is bandwidth
limited then we cannot scale above that bandwidth. But thats not the
point. The point is to put 10 cards into a server and still being able to
saturate them. The point is also to spend less cycles on saturating
available bandwidth. The point is also to not flush the L1 just because
someone requested a 10K webpage.

> How many users can you put on the web server? [...]

tens of thousands, on a single CPU. Can probably handle over 100 thousand
users as well, with IP aliasing so the socket space is spread out.

> Web servers are also read only data, the easiest of all LAN cases to
> deal with. It's incoming writes that present all the tough problems,

reads dominate writes in almost all workloads, thats common wisdom. Why
write if nobody reads the data? And while web servers are mostly read only
data, they can write data as well, see POST and PUT. The fact that
incoming writes are hard should not let you distract from the fact that
reads are also extremely important.

Ingo

2000-10-30 09:34:59

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> This is putrid. NetWare does 353,00,000/second on a Xenon, pumping out
> gobs of packets in between them. MANOS does 857,000,000/second. This
> is terrible. No wonder it's so f_cking slow!!!

(no need to get emotional.) And please check your numbers, 857 million
context switches per second means that on a 1 GHZ CPU you do one context
switch per 1.16 clock cycles. Wow!

Ingo

2000-10-30 09:37:29

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:41:35AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > [...] you've got the web server covered. What about file and print.
>
> a web server, as you probably know, is a read-mostly fileserver that
> serves files via the HTTP protocol. The rest is only protocol fluff.
>
> > I think this is great, but most web servers are connected to a T1 or
> > T3 line, and all the fancy optimization means absolutely squat since
> > about 99.999999% of the planet has a max baandwidth of a T1, ADSL, or
> > T3 Line, this is a far cry from Gigabit ethernet, or even 100Mbit
> > ethernet.
>
> Your argument is curious - first you cry for performance, then you say
> 'nobody wants that much bandwidth'. Of course, if the network is bandwidth
> limited then we cannot scale above that bandwidth. But thats not the
> point. The point is to put 10 cards into a server and still being able to
> saturate them. The point is also to spend less cycles on saturating
> available bandwidth. The point is also to not flush the L1 just because
> someone requested a 10K webpage.

It's not curious, it's not about bandwidth, it's about latency, and getting
packets in and out of the server as fast as possible, and ahead of everything
else. Cache affinity and all the Tux stuff is a great piece of work. Let's
talk about file and print, that's the present problem, not how to
pump out read-only data from a web server.

>
> > How many users can you put on the web server? [...]
>
> tens of thousands, on a single CPU. Can probably handle over 100 thousand
> users as well, with IP aliasing so the socket space is spread out.
>
> > Web servers are also read only data, the easiest of all LAN cases to
> > deal with. It's incoming writes that present all the tough problems,
>
> reads dominate writes in almost all workloads, thats common wisdom. Why
> write if nobody reads the data? And while web servers are mostly read only
> data, they can write data as well, see POST and PUT. The fact that
> incoming writes are hard should not let you distract from the fact that
> reads are also extremely important.
>
> Ingo

Web servers don't do writes, unless a CGI script is running somewhere
or some Java or Perl or something, then this stuff goes through a
wrapper, which is slow, or did I miss something.

Jeff

2000-10-30 09:41:09

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> ds: and es: are both used in copy-to-user and copy-from-user and they
> get reloaded.

And they all share the same segment descriptor. Whats your point? ES is
the default target segment for string operations. DS is the default data
segment. Have you ever profiled how many cycles it takes to do a "mov
__KERNEL_DS, %es" in entry.S, before making your (ridiculous) claim? I
have.

Ingo

2000-10-30 09:41:59

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:44:26AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > This is putrid. NetWare does 353,00,000/second on a Xenon, pumping out
> > gobs of packets in between them. MANOS does 857,000,000/second. This
> > is terrible. No wonder it's so f_cking slow!!!

And please check your numbers, 857 million
> context switches per second means that on a 1 GHZ CPU you do one context
> switch per 1.16 clock cycles. Wow!

Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
a second -- on a PII system at 350 Mhz. It's due to AGI optimization.
Download MANOS and verify for yourself, it has a built in EMON in monitor.
After I complete the port, not even NetWare will be able to touch it.

Your Tux web server will also run on it, at significantly increased
performance.

Jeff

>
> Ingo

2000-10-30 09:44:29

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:50:24AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > ds: and es: are both used in copy-to-user and copy-from-user and they
> > get reloaded.
>
> And they all share the same segment descriptor. Whats your point? ES is
> the default target segment for string operations. DS is the default data
> segment. Have you ever profiled how many cycles it takes to do a "mov
> __KERNEL_DS, %es" in entry.S, before making your (ridiculous) claim? I
> have.
>

No. I used a hardware analyzer to show me how many LOCK# assertions it does
invisible to your software tools underneath. Try using EMON to profile,
it gives hardware numbers and let's you watch the cache controllers
issue non-cacheable memory references to fetch the descriptors.

Jeff

> Ingo

2000-10-30 09:46:39

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> > reads dominate writes in almost all workloads, thats common wisdom. Why
> > write if nobody reads the data? And while web servers are mostly read only
> > data, they can write data as well, see POST and PUT. The fact that
> > incoming writes are hard should not let you distract from the fact that
> > reads are also extremely important.
>
> Web servers don't do writes, unless a CGI script is running somewhere
> or some Java or Perl or something, then this stuff goes through a
> wrapper, which is slow, or did I miss something.

yes, you missed TUX modules.

Ingo

2000-10-30 09:49:39

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 11:56:06AM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > > reads dominate writes in almost all workloads, thats common wisdom. Why
> > > write if nobody reads the data? And while web servers are mostly read only
> > > data, they can write data as well, see POST and PUT. The fact that
> > > incoming writes are hard should not let you distract from the fact that
> > > reads are also extremely important.
> >
> > Web servers don't do writes, unless a CGI script is running somewhere
> > or some Java or Perl or something, then this stuff goes through a
> > wrapper, which is slow, or did I miss something.
>
> yes, you missed TUX modules.
>

Great. I can load a TUX module and use it with my T1 line. If I could
spare an extra $100,000/month, perhaps I can lease an SDM-172 or TAT-8, or
even an OC-172 then I would be able to take advantage of it in the real
world.

Jeff

> Ingo

2000-10-30 09:51:50

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> > And please check your numbers, 857 million
> > context switches per second means that on a 1 GHZ CPU you do one context
> > switch per 1.16 clock cycles. Wow!
>
> Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
> a second -- on a PII system at 350 Mhz. [...]

so it does 1.3 context switches per clock cycle? Wow! And i can type
100000000000000000000 characters a second, just measured it. Really!

> Your Tux web server will also run on it, at significantly increased
> performance.

as i told you in the previous mails, TUX does not depend on schedule()
performance. schedule() cost does not even show up in the top 20 entries
of the profiler.

Ingo

2000-10-30 09:55:11

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> It's not curious, it's not about bandwidth, it's about latency, and
> getting packets in and out of the server as fast as possible, and
> ahead of everything else. [...]

TUX prepares a HTTP reply in about 30 microseconds (plus network latency),
good enough? Network latency is the limit, even on gigabit - not to talk
about T1 lines.

Ingo

2000-10-30 09:58:20

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:01:08PM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > > And please check your numbers, 857 million
> > > context switches per second means that on a 1 GHZ CPU you do one context
> > > switch per 1.16 clock cycles. Wow!
> >
> > Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
> > a second -- on a PII system at 350 Mhz. [...]
>
> so it does 1.3 context switches per clock cycle? Wow! And i can type
> 100000000000000000000 characters a second, just measured it. Really!

Go download it and try it, then come back with that smirk wiped off your face.
I'll enjoy it.....

:-)

>
> > Your Tux web server will also run on it, at significantly increased
> > performance.
>
> as i told you in the previous mails, TUX does not depend on schedule()
> performance. schedule() cost does not even show up in the top 20 entries
> of the profiler.

And as I told, you, your code has nothing to do with it, it's the fact it
goes on at all. Ingo, go get a copy of NetWare 3.12 (I'll even send you
one -- I've got extra licensed copies), install it, put a load of 5000
connections on it, with 4 adapters. Dual boot Linux on it, and attempt
the same with SAMBA or MARS-NWE, and watch it oink.

Go do it. What's your address to I can ship you the CD's for 3.12. Then come
back and tell me how TUX is going to solve the File and Print performance
issues in Linux.

:-)

Jeff

>
> Ingo

2000-10-30 10:00:00

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:04:43PM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > It's not curious, it's not about bandwidth, it's about latency, and
> > getting packets in and out of the server as fast as possible, and
> > ahead of everything else. [...]
>
> TUX prepares a HTTP reply in about 30 microseconds (plus network latency),
> good enough? Network latency is the limit, even on gigabit - not to talk
> about T1 lines.

Great. Now how do we get the smae numbers on SAMBA and MARS-NWE? THat's
the question, not whether your baby TUX is pretty. I already said it
was pretty, focus on the other issue.

Jeff

>
> Ingo
>

2000-10-30 10:03:20

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> > > Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
> > > a second -- on a PII system at 350 Mhz. [...]

> Go download it and try it, then come back with that smirk wiped off
> your face. I'll enjoy it.....

so in 0.53 clock cycles you are implementing things like address space
separation, process priorities, fairness and other essential scheduling
features? Truly awesome ...

Ingo

2000-10-30 10:04:20

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> > TUX prepares a HTTP reply in about 30 microseconds (plus network latency),
> > good enough? Network latency is the limit, even on gigabit - not to talk
> > about T1 lines.
>
> Great. Now how do we get the smae numbers on SAMBA and MARS-NWE? [...]

simple, write a TUX protocol module for it. FTP protocol module is on its
way. Stay tuned.

Ingo

2000-10-30 10:10:11

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:12:44PM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > > > Excuse me, 857,000,000 instructions executed and 460,000,000 context switches
> > > > a second -- on a PII system at 350 Mhz. [...]
>
> > Go download it and try it, then come back with that smirk wiped off
> > your face. I'll enjoy it.....
>
> so in 0.53 clock cycles you are implementing things like address space
> separation, process priorities, fairness and other essential scheduling
> features? Truly awesome ...

Ingo, This original thread was regarding Linux vs. NetWare 5.x performance
metrics and responses from Linux folks about how to affect and
improve them, not a diatribe on the features of TUX.

I wrote the kernel in NetWare 4.x that later became 5.x. To date, my
NOS kernel has grossed over 8 billion dollars. This is more money
than all the Linux companies have made combined in their entire
history, on your work, Linus's work and everyone else's. I don't have
anything to prove to myself, or anyone else, which is why I do as I
please in life, and no one's comments or snipes dissuade me from moving
forward. I also am not "someone's employee".

If you have some ideas on how to improve file and print or help me
get a linux incarnation that can stomp NetWare, I'd love to hear
your ideas. I think TUX is great, BTW. Otherwise, end-of-line...

:-)

Jeff

>
> Ingo

2000-10-30 10:12:21

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 12:13:52PM +0100, Ingo Molnar wrote:
>
> On Mon, 30 Oct 2000, Jeff V. Merkey wrote:
>
> > > TUX prepares a HTTP reply in about 30 microseconds (plus network latency),
> > > good enough? Network latency is the limit, even on gigabit - not to talk
> > > about T1 lines.
> >
> > Great. Now how do we get the smae numbers on SAMBA and MARS-NWE? [...]
>
> simple, write a TUX protocol module for it. FTP protocol module is on its
> way. Stay tuned.
>
> Ingo

I do not believe this approach will allow Linux to match NetWare's
file and print performance, but I am willing to give it a whirl. Where
is the TUX module for MARS. Let's start with this one. What help
would you require. You understand these TUX modules, so you should
take the lead.

I am listening. Instruct me....

Jeff

2000-10-30 10:22:21

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, 30 Oct 2000, Jeff V. Merkey wrote:

> Ingo, This original thread was regarding Linux vs. NetWare 5.x
> performance metrics and responses from Linux folks about how to affect
> and improve them, not a diatribe on the features of TUX.

oh, i believe you misunderstand. TUX itself is quite simple. But it
extended the Linux TCP stack and scalability to new levels (which has
nothing to do with TUX itself, it's the scalability of the Linux
networking stack that evolved gradually over the past 10 years - we spend
95% of the time outside of TUX!). And if you claim that Linux needs this
and that for scalability, then i'd like to point you humbly towards those
existing, generic and well-performing results. TUX is just a 5% 'HTTP
protocol fluff' around the generic stuff.

Ingo

2000-10-30 10:57:14

by john slee

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

On Mon, Oct 30, 2000 at 03:06:25AM -0700, Jeff V. Merkey wrote:
> Ingo, This original thread was regarding Linux vs. NetWare 5.x performance
> metrics and responses from Linux folks about how to affect and
> improve them, not a diatribe on the features of TUX.

while beating netware in certain areas is certainly a noble goal,
it is far from the only one.

[snip]

> If you have some ideas on how to improve file and print or help me
> get a linux incarnation that can stomp NetWare, I'd love to hear
> your ideas. I think TUX is great, BTW. Otherwise, end-of-line...

if netware users are happy with netware, i can't see why they'd want to
switch to linux or a linux-netware-ish equivalent while netware are
still around to provide support. "ain't broke, don't fix," etc.

j.

2000-10-30 12:47:35

by Alan

[permalink] [raw]

Subject: Re: 2.2.18Pre Lan Performance Rocks!

> We will never beat NetWare on scaling if this is the case, even in 2.4.
> Andre and my first job will be to create an arch port with MANOS that
> disables this and restructures the VM.

In the 2.4 case if you are just running NFS daemons then there are no tlb
reloads going on at all. Whats murdering you is mostly memory copies. I would
suspect if you rerun the profiles on a box with a much lower memory bandwidth
that the effect will be even more visible

2000-10-30 12:51:27