2000-12-21 21:18:05

by Jeff V. Merkey

Subject: bigphysarea support in 2.2.19 and 2.4.0 kernels



A question related to bigphysarea support in the native Linux
2.2.19 and 2.4.0 kernels.

I know there are patches for this support, but is it planned to be
rolled into the kernel by default to support Dolphin SCI and
some of the NUMA clustering adapters? I see it there for some
of the video adapters.

Is this planned for the kernel proper, or will it remain a patch?
At the rate the VM and mm subsystems tend to get updated, I am
wondering if there's a current version out for this.

Jeff


2000-12-21 22:01:33

by Alan

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

> A question related to bigphysarea support in the native Linux
> 2.2.19 and 2.4.0 kernels.
>
> I know there are patches for this support, but is it planned to be
> rolled into the kernel by default to support Dolphin SCI and
> some of the NUMA clustering adapters? I see it there for some
> of the video adapters.

bigphysarea is the wrong model for 2.4. The bootmem allocator means that
drivers could do early claims via the bootmem interface during boot up. That
would avoid all the cruft.
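Something like this, as an illustration only (the function name and the
size are made up here, not taken from any real driver), run from the
early setup path before the buddy allocator takes over:

#include <linux/bootmem.h>
#include <linux/init.h>

#define SCI_WINDOW_SIZE (4UL * 1024 * 1024)	/* size is arbitrary */

static void *sci_window;	/* contiguous, never handed back to the VM */

void __init sci_claim_window(void)
{
	/* alloc_bootmem_pages() returns page-aligned, physically
	 * contiguous memory carved out before mem_init(). */
	sci_window = alloc_bootmem_pages(SCI_WINDOW_SIZE);
}

The driver proper then just maps or parcels out pieces of sci_window.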

For 2.2, bigphysarea is a hack, but a necessary add-on patch and not one you
can redo cleanly, as we don't have bootmem

I believe Pauline Middelink had a patch implementing bigphysarea in terms of
bootmem

Alan

2000-12-21 22:20:11

by Jeff V. Merkey

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Thu, Dec 21, 2000 at 09:32:46PM +0000, Alan Cox wrote:
> > A question related to bigphysarea support in the native Linux
> > 2.2.19 and 2.4.0 kernels.
> >
> > I know there are patches for this support, but is it planned to be
> > rolled into the kernel by default to support Dolphin SCI and
> > some of the NUMA clustering adapters? I see it there for some
> > of the video adapters.
>
> bigphysarea is the wrong model for 2.4. The bootmem allocator means that
> drivers could do early claims via the bootmem interface during boot up. That
> would avoid all the cruft.
>
> For 2.2, bigphysarea is a hack, but a necessary add-on patch and not one you
> can redo cleanly, as we don't have bootmem
>
> I believe Pauline Middelink had a patch implementing bigphysarea in terms of
> bootmem
>
> Alan

Alan,

Thanks for the prompt response. I am merging the Dolphin SCI high-speed
interconnect drivers into 2.2.18 and 2.4.0 for our M2FS project, and I
am reviewing the big, ugly, nasty patch they have, which is current only
as of 2.2.13 (really old). I will be looking over the 2.4 tree for a
cleaner way to do what they want.

The patch alters the /proc filesystem and the VM code. I will submit a
patch against 2.2.19 and 2.4.0 adding this support for their SCI
adapters after I get a handle on it.

:-)

Jeff

2000-12-21 22:29:01

by Jeff V. Merkey

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels


Alan,

I am looking over the 2.4 bigphysarea patch, and I agree there
needs to be a better approach. It's a messy hack.

:-)

Jeff


2000-12-22 09:10:21

by Pauline Middelink

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote:
>
> Alan,
>
> I am looking over the 2.4 bigphysarea patch, and I agree there
> needs to be a better approach. It's a messy hack.

Please explain further.
Just leaving it at that is not nice. What is messy?
The implementation? The API?

If you have a better solution for allocating big chunks of
physically contiguous memory at different stages during the
runtime of the kernel, I would be very interested.

(Alan: bootmem allocation just won't do. I need that memory
in modules which get potentially loaded/unloaded, hence a
wrapper interface for allowing access to a bootmem allocated
piece of memory)

And the API? That API was set a long time ago, luckily not by me :)
Though I don't see the real problem. It allows allocation and
freeing of chunks of memory. Period. That's all it's supposed to do.
Or do you want it rolled into kmalloc? So GFP_DMA with size>128K
would take memory from this? That would mean a much more intrusive
patch in very sensitive and rapidly changing parts of the kernel
(2.2->2.4 speaking)...
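For the archives, roughly how a driver uses it (a sketch only; the exact
prototypes differ a little between versions of the patch):

#include <linux/bigphysarea.h>
#include <linux/errno.h>

#define GRAB_SIZE (2 * 1024 * 1024)	/* size is arbitrary */

static caddr_t grab_buf;

int grab_init(void)
{
	/* carve a contiguous buffer out of the area reserved at boot
	 * with the bigphysarea= command line option */
	grab_buf = bigphysarea_alloc(GRAB_SIZE);
	return grab_buf ? 0 : -ENOMEM;
}

void grab_cleanup(void)
{
	bigphysarea_free(grab_buf, GRAB_SIZE);
}

Allocation and freeing, nothing more, which is exactly the point.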

Met vriendelijke groet,
Pauline Middelink
--
GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2
For more details look at my website http://www.polyware.nl/~middelink

2000-12-22 09:27:34

by Alan

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

> (Alan: bootmem allocation just won't do. I need that memory
> in modules which get potentially loaded/unloaded, hence a
> wrapper interface for allowing access to a bootmem allocated
> piece of memory)

Yes, I pointed him at you for 2.4test because you had the code sitting on
top of bootmem which is the right way to do it.

> Or do you want it rolled into kmalloc? So GFP_DMA with size>128K
> would take memory from this? That would mean a much more intrusive
> patch in very sensitive and rapidly changing parts of the kernel
> (2.2->2.4 speaking)...

bigmem is 'last resort' stuff. I'd much rather it is, as now, a separate
allocator so you actually have to sit and think and decide to give up on
kmalloc/vmalloc/better algorithms and only use it when the hardware sucks

2000-12-22 17:46:14

by Jeff V. Merkey

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Fri, Dec 22, 2000 at 09:39:28AM +0100, Pauline Middelink wrote:
> On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote:
> >
> > Alan,
> >
> > I am looking over the 2.4 bigphysarea patch, and I agree there
> > needs to be a better approach. It's a messy hack.
>
> Please explain further.
> Just leaving it at that is not nice. What is messy?
> The implementation? The API?
>
> If you have a better solution for allocating big chunks of
> physically contiguous memory at different stages during the
> runtime of the kernel, I would be very interested.
>
> (Alan: bootmem allocation just won't do. I need that memory
> in modules which get potentially loaded/unloaded, hence a
> wrapper interface for allowing access to a bootmem allocated
> piece of memory)
>
> And the API? That API was set a long time ago, luckily not by me :)
> Though I don't see the real problem. It allows allocation and
> freeing of chunks of memory. Period. That's all it's supposed to do.
> Or do you want it rolled into kmalloc? So GFP_DMA with size>128K
> would take memory from this? That would mean a much more intrusive
> patch in very sensitive and rapidly changing parts of the kernel
> (2.2->2.4 speaking)...
>
> Met vriendelijke groet,
> Pauline Middelink

Pauline,

Can we put together a patch that meets Alan's requirements and get it into
the kernel proper? We have taken on a project from Dolphin to merge the
high speed Dolphin SCI interconnect drivers into the kernel proper, and
obviously, it's not possible to do so if the drivers are dependent on
this patch. I can send you the driver sources for the SCI cards, at
least the portions that depend on this patch, and would appreciate
any guidance you could provide on a better way to allocate memory.

SCI allows machines to create windows of shared memory across a cluster
of nodes at 1 gigabyte per second (gigabyte, not gigabit). I am
putting a sockets interface into the drivers so Apache, LVS, and
Piranha can use these very high speed adapters for a clustered web
server. Our M2FS clustered file system is also being architected
to use these cards.

I will post the source code for the SCI cards at vger.timpanogas.org
and if you have time, please download this code and take a look at
how we are using the bigphysarea APIs to create these windows across
machines. The current NUMA support in Linux is somewhat slim, and
I would like to use established APIs to do this if possible.

:-)

Jeff



> --
> GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2
> For more details look at my website http://www.polyware.nl/~middelink

2000-12-22 19:14:00

by Jeff V. Merkey

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
> On Fri, Dec 22, 2000 at 09:39:28AM +0100, Pauline Middelink wrote:
> > On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote:
>

Pauline/Alan,

I have been studying the SCI code and I think I may have a workaround that
won't need the patch, but it will require pinning large chunks of memory
with the existing __get_free_pages() functions. I will need to make the
changes, and they will require significant testing. I will ping you guys
if I have questions. If we can reach a compromise on the bigphysarea
patch, that would be great, but absent that, I will be looking at this
alternate solution.

The real question is how to guarantee that these pages will be contiguous
in memory. The slab allocator may also work, but I think there are size
constraints on how much I can get in one pass.
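Roughly what I have in mind (a sketch only, names made up; note the buddy
allocator caps the order, so this is a couple of MB at best on i386, and
large orders get scarce once memory has fragmented):

#include <linux/mm.h>
#include <linux/errno.h>
#include <asm/page.h>

static unsigned long sci_region;	/* kernel virtual address */
static int sci_order;

int sci_pin_region(unsigned long size)
{
	/* round the request up to a power-of-two number of pages */
	sci_order = 0;
	while ((PAGE_SIZE << sci_order) < size)
		sci_order++;

	/* contiguous if it succeeds, so do this as early as possible */
	sci_region = __get_free_pages(GFP_KERNEL, sci_order);
	return sci_region ? 0 : -ENOMEM;
}

void sci_unpin_region(void)
{
	free_pages(sci_region, sci_order);
}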

:-)

Jeff

>
>
>
> > --
> > GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2
> > For more details look at my website http://www.polyware.nl/~middelink

2000-12-22 19:52:48

by Andi Kleen

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Fri, Dec 22, 2000 at 11:35:30AM -0700, Jeff V. Merkey wrote:
> The real question is how to guarantee that these pages will be contiguous
> in memory. The slab allocator may also work, but I think there are size
> constraints on how much I can get in one pass.

You cannot guarantee it after the system has left bootup stage. That's the
whole reason why bigphysarea exists.

-Andi

2000-12-22 20:01:27

by Jeff V. Merkey

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Fri, Dec 22, 2000 at 08:21:37PM +0100, Andi Kleen wrote:
> On Fri, Dec 22, 2000 at 11:35:30AM -0700, Jeff V. Merkey wrote:
> > The real question is how to guarantee that these pages will be contiguous
> > in memory. The slab allocator may also work, but I think there are size
> > constraints on how much I can get in one pass.
>
> You cannot guarantee it after the system has left bootup stage. That's the
> whole reason why bigphysarea exists.
>
> -Andi

I am wondering why the drivers need such a big contiguous chunk of memory.
For message passing operations, they should not. Some of
the user space libraries appear to need this support. I am going through
this code today attempting to determine if there's a way to reduce this
requirement or map the memory differently. I am not using these cards
for a ccNUMA implementation, although they have versions of these
adapters that can provide this capability, but for message passing with
small windows of coherence between machines with push/pull DMA-style
behavior for high speed data transfers. 99.9% of the clustering
stuff on Linux uses this model, so this requirement perhaps can be
restructured to be a better fit for Linux.

Just having the patch in the kernel for bigphysarea support would solve
this issue if it could be structured into a form Alan finds acceptable.
Absent this, we need a workaround that's tailored to the
requirements of Linux apps.

Jeff

2000-12-22 20:08:37

by Tim Wright

Subject: NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]

Hi Jeff,

On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
[...]
> SCI allows machines to create windows of shared memory across a cluster
> of nodes at 1 gigabyte per second (gigabyte, not gigabit). I am
> putting a sockets interface into the drivers so Apache, LVS, and
> Piranha can use these very high speed adapters for a clustered web
> server. Our M2FS clustered file system is also being architected
> to use these cards.

You're probably aware of this, but SCI allows a lot more than the creation
of windows of shared memory. The IBM NUMA-Q machines (what was Sequent) use
the SCI interconnect to build a single-system image machine with all memory
visible from all "nodes". In fact, all the commercial NUMA machines of which
I am aware have this property (all nodes see and can address all memory). The
non-uniform part of NUMA comes from the potentially differing latency and
speed of different parts of memory (local vs remote in this case).
AFAIK, the work that Kanoj Sarcar has been doing is to enable such machines.

It sounds like you have a different requirement of very high-speed shared
memory between different nodes that can be mapped and unmapped as required.
Do I understand this correctly? That would make your requirements somewhat
orthogonal to the requirements those of us with NUMA architectures have.

> I will post the source code for the SCI cards at vger.timpanogas.org
> and if you have time, please download this code and take a look at
> how we are using the bigphysarea APIs to create these windows across
> machines. The current NUMA support in Linux is somewhat slim, and
> I would like to use established APIs to do this if possible.

See above. It may be that you need different APIs anyway.

Regards,

Tim

--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

2000-12-22 20:14:37

by Jeff V. Merkey

Subject: Re: NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]

On Fri, Dec 22, 2000 at 11:37:29AM -0800, Tim Wright wrote:

I have been working with SCI since 1994. The people who own
Dolphin and the SCI chipsets also own TRG. We dropped work on
the P6 ccNUMA cards several years back because Intel was
convinced that shared-nothing was the way to go (and it is).
However, SCI's ability to create explicit sharing makes it
the fastest shared-nothing interface around for message passing
(go figure).

I think we do need some better APIs. Grab the source at my FTP server,
and I'd love any input you could provide.

Thanks,

:-)

Jeff


> Hi Jeff,
>
> On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
> [...]
> > SCI allows machines to create windows of shared memory across a cluster
> > of nodes at 1 gigabyte per second (gigabyte, not gigabit). I am
> > putting a sockets interface into the drivers so Apache, LVS, and
> > Piranha can use these very high speed adapters for a clustered web
> > server. Our M2FS clustered file system is also being architected
> > to use these cards.
>
> You're probably aware of this, but SCI allows a lot more than the creation
> of windows of shared memory. The IBM NUMA-Q machines (what was Sequent) use
> the SCI interconnect to build a single-system image machine with all memory
> visible from all "nodes". In fact, all the commercial NUMA machines of which
> I am aware have this property (all nodes see and can address all memory). The
> non-uniform part of NUMA comes from the potentially differing latency and
> speed of different parts of memory (local vs remote in this case).
> AFAIK, the work that Kanoj Sarcar has been doing is to enable such machines.
>
> It sounds like you have a different requirement of very high-speed shared
> memory between different nodes that can be mapped and unmapped as required.
> Do I understand this correctly? That would make your requirements somewhat
> orthogonal to the requirements those of us with NUMA architectures have.
>
> > I will post the source code for the SCI cards at vger.timpanogas.org
> > and if you have time, please download this code and take a look at
> > how we are using the bigphysarea APIs to create these windows across
> > machines. The current NUMA support in Linux is somewhat slim, and
> > I would like to use established APIs to do this if possible.
>
> See above. It may be that you need different APIs anyway.
>
> Regards,
>
> Tim
>
> --
> Tim Wright - [email protected] or [email protected] or [email protected]
> IBM Linux Technology Center, Beaverton, Oregon
> "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

2000-12-22 20:21:57

by Pauline Middelink

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Fri, 22 Dec 2000 around 13:25:41 -0700, Jeff V. Merkey wrote:
> On Fri, Dec 22, 2000 at 08:21:37PM +0100, Andi Kleen wrote:
> > On Fri, Dec 22, 2000 at 11:35:30AM -0700, Jeff V. Merkey wrote:
> > > The real question is how to guarantee that these pages will be contiguous
> > > in memory. The slab allocator may also work, but I think there are size
> > > constraints on how much I can get in one pass.
> >
> > You cannot guarantee it after the system has left bootup stage. That's the
> > whole reason why bigphysarea exists.
> >
> > -Andi
>
> I am wondering why the drivers need such a big contiguous chunk of memory.
> For message passing operations, they should not. Some of
> the user space libraries appear to need this support. I am going through
> this code today attempting to determine if there's a way to reduce this
> requirement or map the memory differently. I am not using these cards
> for a ccNUMA implementation, although they have versions of these
> adapters that can provide this capability, but for message passing with
> small windows of coherence between machines with push/pull DMA-style
> behavior for high speed data transfers. 99.9% of the clustering
> stuff on Linux uses this model, so this requirement perhaps can be
> restructured to be a better fit for Linux.

Well, to be frank, I'm only aware of my zoran driver (and the buz?)
needing it. The only reason is that the ZR36120 framegrabber wants
to load a complete field from a single DMA base address. Needless
to say, TV frames are large in RGB24 mode, and going fast at that.
So it's just a deficiency of the chip which made me use bigphys.
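(Rough numbers: a PAL frame in RGB24 at, say, 768x576 is 768 * 576 * 3
bytes, about 1.3 Mbyte, arriving 25 times a second; even a single field
is far beyond what kmalloc will hand out.)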

And yes, I tried a version (which is in the kernel) which must be compiled
in, so it can find a large enough area to work with. The only problem is
that when the driver has a problem, one cannot easily reload the module
(but who am I telling, right? :) )

> Just having the patch in the kernel for bigphysarea support would solve
> this issue if it could be structured into a form Alan finds acceptable.
> Absent this, we need a workaround that's tailored to the
> requirements of Linux apps.

Is there a solution then? As long as the hardware which needs it is
not a common good, Linus will oppose it (at least that's what I figured
from his messages back in the old days). Oh well, maybe his opinion
has changed; one may hope :)

Met vriendelijke groet,
Pauline Middelink
--
GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2
For more details look at my website http://www.polyware.nl/~middelink

2000-12-22 20:59:32

by Alan

Subject: Re: NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]

> I think we do need some better APIs. Grab the source at my FTP server,
> and I'd love any input you could provide.

Pure message passing drivers for the Dolphinics cards already exist. Ron
Minnich wrote some.

http://www.acl.lanl.gov/~rminnich/

Alan

2000-12-22 21:16:53

by Jeff V. Merkey

Subject: Re: NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]

On Fri, Dec 22, 2000 at 08:30:07PM +0000, Alan Cox wrote:
> > I think we do need some better APIs. Grab the source at my FTP server,
> > and I'd love any input you could provide.
>
> Pure message passing drivers for the Dolphinics cards already exist. Ron
> Minnich wrote some.
>
> http://www.acl.lanl.gov/~rminnich/
>
> Alan

Not for the newer cards. I will look over his code, and see what's there.

Jeff


2000-12-22 21:28:05

by Jeff V. Merkey

Subject: Re: NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]

On Fri, Dec 22, 2000 at 02:40:59PM -0700, Jeff V. Merkey wrote:
> On Fri, Dec 22, 2000 at 08:30:07PM +0000, Alan Cox wrote:
> > > I think we do need some better APIs. Grab the source at my FTP server,
> > > and I'd love any input you could provide.
> >
> > Pure message passing drivers for the Dolphinics cards already exist. Ron
> > Minnich wrote some.
> >
> > http://www.acl.lanl.gov/~rminnich/
> >
> > Alan
>
> Not for the newer cards. I will look over his code, and see what's there.
>
> Jeff
>

Alan,

I reviewed his code. It's only current as of 2.2.5 and guess what? It also
requires the bigphysarea patch. The code I am using includes support for
all the versions of the Dolphin SCI cards, so it is larger than what's
there, but his versions are heavily dependent on this patch as well.

Jeff


2000-12-22 21:30:26

by Jeff V. Merkey

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Fri, Dec 22, 2000 at 08:51:03PM +0100, Pauline Middelink wrote:
> On Fri, 22 Dec 2000 around 13:25:41 -0700, Jeff V. Merkey wrote:
> > On Fri, Dec 22, 2000 at 08:21:37PM +0100, Andi Kleen wrote:
> > > On Fri, Dec 22, 2000 at 11:35:30AM -0700, Jeff V. Merkey wrote:
> > > > The real question is how to guarantee that these pages will be contiguous
> > > > in memory. The slab allocator may also work, but I think there are size
> > > > constraints on how much I can get in one pass.
> > >
> > > You cannot guarantee it after the system has left bootup stage. That's the
> > > whole reason why bigphysarea exists.
> > >
> > > -Andi
> >
> > I am wondering why the drivers need such a big contiguous chunk of memory.
> > For message passing operations, they should not. Some of
> > the user space libraries appear to need this support. I am going through
> > this code today attempting to determine if there's a way to reduce this
> > requirement or map the memory differently. I am not using these cards
> > for a ccNUMA implementation, although they have versions of these
> > adapters that can provide this capability, but for message passing with
> > small windows of coherence between machines with push/pull DMA-style
> > behavior for high speed data transfers. 99.9% of the clustering
> > stuff on Linux uses this model, so this requirement perhaps can be
> > restructured to be a better fit for Linux.
>
> Well, to be frank, I'm only aware of my zoran driver (and the buz?)
> needing it. The only reason is that the ZR36120 framegrabber wants
> to load a complete field from a single DMA base address. Needless
> to say, TV frames are large in RGB24 mode, and going fast at that.
> So it's just a deficiency of the chip which made me use bigphys.
>
> And yes, I tried a version (which is in the kernel) which must be compiled
> in, so it can find a large enough area to work with. The only problem is
> that when the driver has a problem, one cannot easily reload the module
> (but who am I telling, right? :) )
>
> > Just having the patch in the kernel for bigphysarea support would solve
> > this issue if it could be structured into a form Alan finds acceptable.
> > Absent this, we need a workaround that's tailored to the
> > requirements of Linux apps.
>
> Is there a solution then? As long as the hardware which needs it is
> not a common good, Linus will oppose it (at least that's what I figured
> from his messages back in the old days). Oh well, maybe his opinion
> has changed; one may hope :)
>
> Met vriendelijke groet,
> Pauline Middelink


Having a 1 gigabyte per second fat pipe that runs over a parallel bus
fabric, with a standard PCI card that costs about $500 and can run LVS
and TUX at high speeds, would be for the common good, particularly since
NT and W2K both have implementations of Dolphin SCI that allow them
to exploit this hardware.

Jeff


> --
> GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2
> For more details look at my website http://www.polyware.nl/~middelink

2000-12-22 22:11:52

by Erik Mouw

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Fri, Dec 22, 2000 at 02:54:50PM -0700, Jeff V. Merkey wrote:
> Having a 1 gigabyte per second fat pipe that runs over a parallel bus
> fabric, with a standard PCI card that costs about $500 and can run LVS
> and TUX at high speeds, would be for the common good, particularly since
> NT and W2K both have implementations of Dolphin SCI that allow them
> to exploit this hardware.

I'm just wondering how you are going to do 1 Gbyte per second when you
still have to get the data through a PCI bus to that card. In theory,
standard PCI can do 133 Mbyte/s, but only when you're very lucky to be
able to burst large chunks of data. OK, 64 bit PCI at 66 MHz should
quadruple the throughput, but that's still not enough for 1 Gbyte/s.
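(Roughly: 33 MHz * 4 bytes = 133 Mbyte/s for 32-bit/33 MHz PCI, and
66 MHz * 8 bytes = 533 Mbyte/s for 64-bit/66 MHz, so even a perfect
64/66 burst is only about half of 1 Gbyte/s.)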


Erik

--
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031, 2600 GA Delft, The Netherlands
Phone: +31-15-2783635 Fax: +31-15-2781843 Email: [email protected]
WWW: http://www-ict.its.tudelft.nl/~erik/

2000-12-22 23:32:07

by Jeff V. Merkey

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

On Fri, Dec 22, 2000 at 10:39:43PM +0100, Erik Mouw wrote:
> On Fri, Dec 22, 2000 at 02:54:50PM -0700, Jeff V. Merkey wrote:
> > Having a 1 gigabyte per second fat pipe that runs over a parallel bus
> > fabric, with a standard PCI card that costs about $500 and can run LVS
> > and TUX at high speeds, would be for the common good, particularly since
> > NT and W2K both have implementations of Dolphin SCI that allow them
> > to exploit this hardware.
>
> I'm just wondering how you are going to do 1 Gbyte per second when you
> still have to get the data through a PCI bus to that card. In theory,
> standard PCI can do 133 Mbyte/s, but only when you're very lucky to be
> able to burst large chunks of data. OK, 64 bit PCI at 66 MHz should
> quadruple the throughput, but that's still not enough for 1 Gbyte/s.

The fabric supports this data rate. PCI cards are limited to about 130 Mbyte/s,
but multiple nodes all running at the same time could generate this much
traffic.

Jeff

>
>
> Erik
>
> --
> J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
> of Electrical Engineering, Faculty of Information Technology and Systems,
> Delft University of Technology, PO BOX 5031, 2600 GA Delft, The Netherlands
> Phone: +31-15-2783635 Fax: +31-15-2781843 Email: [email protected]
> WWW: http://www-ict.its.tudelft.nl/~erik/

2000-12-22 23:43:51

by Albert D. Cahalan

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

> bigmem is 'last resort' stuff. I'd much rather it is, as now, a
> separate allocator so you actually have to sit and think and
> decide to give up on kmalloc/vmalloc/better algorithms and
> only use it when the hardware sucks

It isn't just for sucky hardware. It is for performance too.

1. Linux isn't known for cache coloring ability. Even if it was,
users want to take advantage of large pages or BAT registers
to reduce TLB miss costs. (that is, mapping such areas into
a process is needed... never mind security for now)

2. Programming a DMA controller with multiple addresses isn't
as fast as programming it with one.

Consider what happens when you have the ability to make one
compute node DMA directly into the physical memory of another.
With a large block of physical memory, you only need to have
the destination node give the writer a single physical memory
address to send the data to. With loose pages, the destination
has to transmit a great big list. That might be 30 thousand!
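(Rough numbers: a 128 Mbyte buffer in 4 KB pages means 32768 page
addresses to hand across, versus one base address and one length for a
contiguous region.)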

The point of all this is to crunch data as fast as possible,
with Linux mostly getting out of the way. Perhaps you want
to generate real-time high-resolution video of a human heart
as it beats inside somebody. You process raw data (audio, X-ray,
magnetic resonance, or whatever) on one group of processors,
then hand off the data to another group of processors for the
rendering task. Actually there might be many stages. Playing
games with individual pages will cut into your performance.
The data stream is fat and relentless.

2000-12-23 09:49:03

by Eric W. Biederman

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

"Albert D. Cahalan" <[email protected]> writes:

> > bigmem is 'last resort' stuff. I'd much rather it is, as now, a
> > separate allocator so you actually have to sit and think and
> > decide to give up on kmalloc/vmalloc/better algorithms and
> > only use it when the hardware sucks
>
> It isn't just for sucky hardware. It is for performance too.

> 1. Linux isn't known for cache coloring ability.
Most hardware doesn't need it. It might help a little
but not much.
> Even if it was,
> users want to take advantage of large pages or BAT registers
> to reduce TLB miss costs. (that is, mapping such areas into
> a process is needed... never mind security for now)

I think the minor cost incurred by uniform size is well made up
for by reliable memory management, avoidance of swapping, and
needing less total RAM. Besides, I don't see large
physical areas of memory being more than a marginal performance gain.

> 2. Programming a DMA controller with multiple addresses isn't
> as fast as programming it with one.

Garbage collection is theoretically more efficient than explicit
memory management too. But seriously, I doubt that several pages
have significantly more overhead per transfer than a giant burst.

> Consider what happens when you have the ability to make one
> compute node DMA directly into the physical memory of another.
> With a large block of physical memory, you only need to have
> the destination node give the writer a single physical memory
> address to send the data to. With loose pages, the destination
> has to transmit a great big list. That might be 30 thousand!

Hmm, queuing up enough data for a second at a time seems a little
excessive. And with a 128M chunk... your system can't do good
memory management at all.

> The point of all this is to crunch data as fast as possible,
> with Linux mostly getting out of the way. Perhaps you want
> to generate real-time high-resolution video of a human heart
> as it beats inside somebody. You process raw data (audio, X-ray,
> magnetic resonance, or whatever) on one group of processors,
> then hand off the data to another group of processors for the
> rendering task. Actually there might be many stages. Playing
> games with individual pages will cut into your performance.

If you are doing a real-time task, you don't want to be very close
to your performance envelope. If you are hitting the performance
envelope, any small hiccup will cause you to miss your deadline,
and close to your performance envelope, hiccups are virtually certain.

Pushing the machine just 5% slower should get everything going
with multiple pages, and you wouldn't be pushing the performance
envelope so your machine can compensate for the occasional hiccup.

> The data stream is fat and relentless.

So you add another node if your current nodes can't handle the load
without using giant physical areas of memory, rather than attempt to
redesign the operating system. Much more cost effective.

Eric

2000-12-23 20:43:03

by Jes Sorensen

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

>>>>> "Albert" == Albert D Cahalan <[email protected]> writes:

>> bigmem is 'last resort' stuff. I'd much rather it is, as now, a
>> separate allocator so you actually have to sit and think and decide
>> to give up on kmalloc/vmalloc/better algorithms and only use it
>> when the hardware sucks

Albert> It isn't just for sucky hardware. It is for performance too.

Albert> 1. Linux isn't known for cache coloring ability. Even if it
Albert> was, users want to take advantage of large pages or BAT
Albert> registers to reduce TLB miss costs. (that is, mapping such
Albert> areas into a process is needed... never mind security for now)

Albert> 2. Programming a DMA controller with multiple addresses isn't
Albert> as fast as programming it with one.

LOL

Consider that allocating the larger block of memory is going to take a
lot longer than it will take for the DMA engine to read the
scatter/gather table entries and fetch a new address word now and
then.

Jes

2000-12-24 08:46:51

by Albert D. Cahalan

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

Jes Sorensen writes:
> Albert D Cahalan <[email protected]> writes:

[about using huge physical allocations for number crunching]

>> 2. Programming a DMA controller with multiple addresses isn't
>> as fast as programming it with one.
>
> LOL
>
> Consider that allocating the larger block of memory is going
> to take a lot longer than it will take for the DMA engine to
> read the scatter/gather table entries and fetch a new address
> word now and then.

Say it takes a whole minute to allocate the memory. It wouldn't,
of course, because you'd allocate the memory at boot, but anyway...
Then the app runs, using that memory, for a multi-hour surgery.
The allocation happens once; the inter-node DMA transfers occur
dozens or hundreds of times per second.

2000-12-24 09:30:10

by Albert D. Cahalan

Subject: Re: bigphysarea support in 2.2.19 and 2.4.0 kernels

Eric W. Biederman writes:

> If you are doing a real-time task, you don't want to be very close
> to your performance envelope. If you are hitting the performance
> envelope, any small hiccup will cause you to miss your deadline,
> and close to your performance envelope, hiccups are virtually certain.
>
> Pushing the machine just 5% slower should get everything going
> with multiple pages, and you wouldn't be pushing the performance
> envelope so your machine can compensate for the occasional hiccup.
>
>> The data stream is fat and relentless.
>
> So you add another node if your current nodes can't handle the load
> without using giant physical areas of memory, rather than attempt to
> redesign the operating system. Much more cost effective.

Nodes can be wicked expensive. :-)

Pushing the performance envelope is important when you want to
sell lots of systems. Radar is a similar computational task,
with the added need to reduce space and weight requirements.
It's not OK to be 5% more expensive, bulky, and heavy.

Also the Airplane Principle: more nodes means more big failures.