2001-10-03 12:03:47

by Vladimir V. Saveliev

Subject: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2

Hi

It looks like something goes wrong when writing/reading to a block
device using the generic read/write functions when one does:

mke2fs /dev/hda1 (blocksize is 4096)
mount /dev/hda1
umount /dev/hda1
mke2fs /dev/hda1 - FAILS with
Warning: could not write 8 blocks in inode table starting at 492004:
Attempt to write block from filesystem resulted in short write

(note that /dev/hda1 should be big enough - 3GB is enough, for example)


Explanation of what happens (could be wrong and unclear):

The blocksize of /dev/hda1 was 1024, so /dev/hda1's inode->i_blkbits was set
to 10.
Mounting used set_blocksize() to change the blocksize to 4096 in
blk_size[][].
But the inode of /dev/hda1 still has the old i_blkbits, which makes
block_prepare_write create buffers of 1024 bytes and call
blkdev_get_block for each of them.
max_block in fs/block_dev.c calculates the number of blocks on the device
using blk_size[][] and thinks there are 4 times fewer blocks on the
device.

Thanks,
vs

PS: thanks to Elena <[email protected]> for finding that



2001-10-03 13:16:38

by Alexander Viro

Subject: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2



On Wed, 3 Oct 2001, Vladimir V. Saveliev wrote:

> Hi
>
> It looks like something goes wrong when writing/reading to a block
> device using the generic read/write functions when one does:
>
> mke2fs /dev/hda1 (blocksize is 4096)
> mount /dev/hda1
> umount /dev/hda1
> mke2fs /dev/hda1 - FAILS with
> Warning: could not write 8 blocks in inode table starting at 492004:
> Attempt to write block from filesystem resulted in short write
>
> (note that /dev/hda1 should be big enough - 3GB is enough, for example)

Ehh... Linus, both blkdev_get() and blkdev_open() should set ->i_blkbits.
Vladimir, see if the patch below helps:

--- S11-pre2/fs/block_dev.c Mon Oct 1 17:56:00 2001
+++ /tmp/block_dev.c Wed Oct 3 09:12:31 2001
@@ -549,36 +549,23 @@
return res;
}

-int blkdev_get(struct block_device *bdev, mode_t mode, unsigned flags, int kind)
+static int do_open(struct block_device *bdev, struct inode *inode, struct file *file)
{
- int ret = -ENODEV;
- kdev_t rdev = to_kdev_t(bdev->bd_dev); /* this should become bdev */
- down(&bdev->bd_sem);
+ int ret = -ENXIO;
+ kdev_t dev = to_kdev_t(bdev->bd_dev);

+ down(&bdev->bd_sem);
lock_kernel();
if (!bdev->bd_op)
- bdev->bd_op = get_blkfops(MAJOR(rdev));
+ bdev->bd_op = get_blkfops(MAJOR(dev));
if (bdev->bd_op) {
- /*
- * This crockload is due to bad choice of ->open() type.
- * It will go away.
- * For now, block device ->open() routine must _not_
- * examine anything in 'inode' argument except ->i_rdev.
- */
- struct file fake_file = {};
- struct dentry fake_dentry = {};
- ret = -ENOMEM;
- fake_file.f_mode = mode;
- fake_file.f_flags = flags;
- fake_file.f_dentry = &fake_dentry;
- fake_dentry.d_inode = bdev->bd_inode;
ret = 0;
if (bdev->bd_op->open)
- ret = bdev->bd_op->open(bdev->bd_inode, &fake_file);
+ ret = bdev->bd_op->open(inode, file);
if (!ret) {
bdev->bd_openers++;
- bdev->bd_inode->i_size = blkdev_size(rdev);
- bdev->bd_inode->i_blkbits = blksize_bits(block_size(rdev));
+ bdev->bd_inode->i_size = blkdev_size(dev);
+ bdev->bd_inode->i_blkbits = blksize_bits(block_size(dev));
} else if (!bdev->bd_openers)
bdev->bd_op = NULL;
}
@@ -589,9 +576,26 @@
return ret;
}

+int blkdev_get(struct block_device *bdev, mode_t mode, unsigned flags, int kind)
+{
+ /*
+ * This crockload is due to bad choice of ->open() type.
+ * It will go away.
+ * For now, block device ->open() routine must _not_
+ * examine anything in 'inode' argument except ->i_rdev.
+ */
+ struct file fake_file = {};
+ struct dentry fake_dentry = {};
+ fake_file.f_mode = mode;
+ fake_file.f_flags = flags;
+ fake_file.f_dentry = &fake_dentry;
+ fake_dentry.d_inode = bdev->bd_inode;
+
+ return do_open(bdev, bdev->bd_inode, &fake_file);
+}
+
int blkdev_open(struct inode * inode, struct file * filp)
{
- int ret;
struct block_device *bdev;

/*
@@ -604,29 +608,8 @@

bd_acquire(inode);
bdev = inode->i_bdev;
- down(&bdev->bd_sem);
-
- ret = -ENXIO;
- lock_kernel();
- if (!bdev->bd_op)
- bdev->bd_op = get_blkfops(MAJOR(inode->i_rdev));

- if (bdev->bd_op) {
- ret = 0;
- if (bdev->bd_op->open)
- ret = bdev->bd_op->open(inode,filp);
- if (!ret) {
- bdev->bd_openers++;
- bdev->bd_inode->i_size = blkdev_size(inode->i_rdev);
- } else if (!bdev->bd_openers)
- bdev->bd_op = NULL;
- }
-
- unlock_kernel();
- up(&bdev->bd_sem);
- if (ret)
- bdput(bdev);
- return ret;
+ return do_open(bdev, inode, filp);
}

int blkdev_put(struct block_device *bdev, int kind)

2001-10-03 16:18:41

by Linus Torvalds

Subject: Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2


On Wed, 3 Oct 2001, Alexander Viro wrote:
>
> Ehh... Linus, both blkdev_get() and blkdev_open() should set ->i_blkbits.

Duh. I couldn't even _imagine_ that we'd be so stupid to have duplicated
that code twice instead of just having blkdev_open() call blkdev_get().

Thanks.

Linus

2001-10-03 21:09:35

by Eric Whiting

Subject: Buffer cache confusion? Re: [reiserfs-list] bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2



I see a similar odd failure with jfs in 2.4.11-pre1. Is this related to
the 2.4.11-preX buffer cache improvements?

eric


# uname -a
2.4.11-pre1 #1 SMP Tue Oct 2 12:28:07 MDT 2001 i686
# mkfs.jfs /dev/hdc3
mkfs.jfs development version: $Name: v1_0_6 $
Warning! All data on device /dev/hdc3 will be lost!
Continue? (Y/N) Y
Format completed successfully.
10241436 kilobytes total disk space.
# mount -t jfs /dev/hdc3 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/hdc3,
or too many mounted file systems


"Vladimir V. Saveliev" wrote:
>
> Hi
>
> It looks like something goes wrong when writing/reading to a block
> device using the generic read/write functions when one does:
>
> mke2fs /dev/hda1 (blocksize is 4096)
> mount /dev/hda1
> umount /dev/hda1
> mke2fs /dev/hda1 - FAILS with
> Warning: could not write 8 blocks in inode table starting at 492004:
> Attempt to write block from filesystem resulted in short write
>
> (note that /dev/hda1 should be big enough - 3GB is enough, for example)
>
> Explanation of what happens (could be wrong and unclear):
>
> The blocksize of /dev/hda1 was 1024, so /dev/hda1's inode->i_blkbits was set
> to 10.
> Mounting used set_blocksize() to change the blocksize to 4096 in
> blk_size[][].
> But the inode of /dev/hda1 still has the old i_blkbits, which makes
> block_prepare_write create buffers of 1024 bytes and call
> blkdev_get_block for each of them.
> max_block in fs/block_dev.c calculates the number of blocks on the device
> using blk_size[][] and thinks there are 4 times fewer blocks on the
> device.
>
> Thanks,
> vs
>
> PS: thanks to Elena <[email protected]> for finding that

2001-10-03 21:43:18

by Alexander Viro

Subject: Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2



On Wed, 3 Oct 2001, Linus Torvalds wrote:

>
> On Wed, 3 Oct 2001, Alexander Viro wrote:
> >
> > Ehh... Linus, both blkdev_get() and blkdev_open() should set ->i_blkbits.
>
> Duh. I couldn't even _imagine_ that we'd be so stupid to have duplicated
> that code twice instead of just having blkdev_open() call blkdev_get().

Notice that (inode, file) is a bogus API for block_device ->open().
I've checked all instances of that method in 2.4.11-pre2. Results:
The _only_ part of the inode they are using is ->i_rdev. Read-only.
They use file->f_flags and file->f_mode (also read-only).

There are 3 exceptions:
1) initrd sets file->f_op. The whole thing is a dirty hack - it
should become a character device in 2.5.
2) drivers/s390/char/tapeblock.c does bogus (and useless) stuff with
the file, including putting a pointer to it into global structures. Since the
file can be fake (allocated on the caller's stack) that's hardly a good idea.
Fortunately, the driver doesn't ever look at that pointer. Ditto for the rest
of the bogus stuff done there - it's dead code.
3) drivers/block/floppy.c calls permission(inode) and caches the result
in file->private_data.

Summary of the floppy case: Alain uses "we have write permissions on
/dev/fd<n>" as a security check in several ioctls. The reason we can't
just check that the file was opened for write is that floppy_open()
will refuse to open the thing for write if it's write-protected.

Notice that we could trivially move the check into fd_ioctl() itself -
permission() is fast in all relevant cases and it's definitely much faster
than the operations themselves (we are talking about an honest-to-$DEITY
PC floppy controller here). That wouldn't require any userland changes.

In other words, for all we care it's (block_device, flags, mode). And
that makes a lot of sense, since we don't _have_ a file in quite a few
cases. Moreover, we don't care what inode is used for the open - access
control is done in generic code, the same way as for _any_ open(). Notice
that even floppy_open()'s extra checks do not affect the success of open() -
we just cache them for future calls of ioctl().

Moreover, ->release() for block_device also doesn't care about the junk
we pass - it only uses inode->i_rdev. In all cases. And I'd rather
see them as
int (*open)(struct block_device *bdev, int flags, int mode);
int (*release)(struct block_device *bdev);
int (*check_media_change)(struct block_device *bdev);
int (*revalidate)(struct block_device *bdev);
- that would make more sense than the current variant. They are block_device
methods, not file or inode ones, after all.

2001-10-03 22:00:11

by Christoph Hellwig

Subject: Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2

Hi Al,

In article <[email protected]> you wrote:
> Moreover, ->release() for block_device also doesn't care about the junk
> we pass - it only uses inode->i_rdev. In all cases. And I'd rather
> see them as
> int (*open)(struct block_device *bdev, int flags, int mode);
> int (*release)(struct block_device *bdev);
> int (*check_media_change)(struct block_device *bdev);
> int (*revalidate)(struct block_device *bdev);
> - that would make more sense than the current variant. They are block_device
> methods, not file or inode ones, after all.

How about starting 2.5 with that patch once 2.4.11 is done? Linus?

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

2001-10-03 22:51:09

by Alexander Viro

Subject: Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2



On Wed, 3 Oct 2001, Christoph Hellwig wrote:

> Hi Al,
>
> In article <[email protected]> you wrote:
> > Moreover, ->release() for block_device also doesn't care about the junk
> > we pass - it only uses inode->i_rdev. In all cases. And I'd rather
> > see them as
> > int (*open)(struct block_device *bdev, int flags, int mode);
> > int (*release)(struct block_device *bdev);
> > int (*check_media_change)(struct block_device *bdev);
> > int (*revalidate)(struct block_device *bdev);
> > - that would make more sense than the current variant. They are block_device
> > methods, not file or inode ones, after all.
>
> How about starting 2.5 with that patch once 2.4.11 is done? Linus?

I don't think that it's a good idea. Such a patch is trivial - it can be
done at any point in 2.5. Moreover, while it does clean up some of the
mess, I don't see a lot of other stuff that would depend on it.

2001-10-03 23:55:50

by Rob Landley

Subject: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2)

On Wednesday 03 October 2001 18:51, Alexander Viro wrote:
> On Wed, 3 Oct 2001, Christoph Hellwig wrote:
> > How about starting 2.5 with that patch once 2.4.11 is done? Linus?
>
> I don't think that it's a good idea. Such a patch is trivial - it can be
> done at any point in 2.5. Moreover, while it does clean up some of the
> mess, I don't see a lot of other stuff that would depend on it.

I think he's just trolling for anything that might bud off 2.5 at this point.
Can you blame him? (Yes, Al, you may flame me now. :)

Out of morbid curiosity, when 2.5 does finally fork off (a purely academic
question, I know), which VM will it use? I'm guessing Alan will still
inherit the "stable" codebase, but the -ac and -linus trees are breaking new
ground on divergence here. Which tree becomes 2.4 once Alan inherits it?
(Is this part of what's holding up 2.5?)

Are we waiting for Andrea's shiny new VM to get into Alan's tree first? I
think Alan said something about somewhere freezing over, but I don't quite
recall. Is someone else (Andrea?) likely to become 2.4 maintainer?

What exactly still needs to happen before 2.4 can be locked down, encased in
lucite, and put into bugfix-only mode? (Anybody who's tried to use 3D
acceleration with Red Hat 7.1 and >= 2.4.9 is unlikely to be convinced that
it's currently in bugfix-only mode. The DRI part, anyway.)

On a technical level, 2.4.10's VM is working fine for me on my laptop. (And
I've seen "not working" - ANYTHING would have been an improvement over
~2.4.5-2.4.7 or so. I often went for a soda while it swapped trying to open
its twentieth Konqueror window. Yeah, I know, a bad habit I picked up years
ago under OS/2... At least THAT problem is now history.) Then again I did
buy 256 megabytes of RAM for my laptop, so it might not have been the new
kernel that fixed it. :)

What else is left? I'm curious.

Rob

(Oh, and what's the deal with "classzones"? Linus told Andrea classzones
were a dumb idea, and that we'd regret it when we tried to inflict NUMA
architecture on 2.5, but then went with Andrea's VM anyway, which I thought
was based on classzones... Was that ever resolved? Was the problem
avoided? What IS a classzone, anyway? I'd be happy to RTFM, if anybody
could tell me where TF the M is hiding...)

Gotta go, Star Trek: The Previous Generation is about to come on...

2001-10-04 00:38:21

by Rik van Riel

Subject: Re: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2)

On Wed, 3 Oct 2001, Rob Landley wrote:

> (Oh, and what's the deal with "classzones"? Linus told Andrea
> classzones were a dumb idea, and we'd regret it when we tried to
> inflict NUMA architecture on 2.5, but then went with Andrea's VM
> anyway, which I thought was based on classzones... Was that ever
> resolved? Was the problem avoided? What IS a classzone, anyway?
> I'd be happy to RTFM, if anybody could tell me where TF the M is
> hiding...)

Classzones used to be a superset of the memory zones, so
if you have memory zones A, B and C you'd have classzone
Ac consisting of memory zone A, classzone Bc = {A + B}
and Cc = {A + B + C}.

This gives obvious problems for NUMA, suppose you have 4
nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A,
4B and 4C. Putting together classzones for these isn't
quite obvious and memory balancing will be complex ;)

Of course, nobody knows the exact definitions of classzones
in the new 2.4 VM since it's completely undocumented; let's
hope Andrea will document his code or we'll see a repeat of
the development chaos we had with the 2.2 VM...

cheers,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ (volunteers needed)

http://www.surriel.com/ http://distro.conectiva.com/

2001-10-04 02:28:37

by Rob Landley

Subject: Re: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2)

On Wednesday 03 October 2001 20:38, Rik van Riel wrote:
> On Wed, 3 Oct 2001, Rob Landley wrote:
> > (Oh, and what's the deal with "classzones"? Linus told Andrea
> > classzones were a dumb idea, and we'd regret it when we tried to
> > inflict NUMA architecture on 2.5, but then went with Andrea's VM
> > anyway, which I thought was based on classzones... Was that ever
> > resolved? Was the problem avoided? What IS a classzone, anyway?
> > I'd be happy to RTFM, if anybody could tell me where TF the M is
> > hiding...)
>
> Classzones used to be a superset of the memory zones, so
> if you have memory zones A, B and C you'd have classzone
> Ac consisting of memory zone A, classzone Bc = {A + B}
> and Cc = {A + B + C}.

Ah. Cumulative zones. A class being a collection of zones, the class-zone
patch. Right. That makes a lot more sense...

> This gives obvious problems for NUMA, suppose you have 4
> nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A,
> 4B and 4C.

Is there really a NUMA machine out there where you can DMA out of another
node's 16-bit ISA space? So far the differences in the zones seem to be
purely a question of capabilities (what you can use this RAM for), not
performance. Now I know NUMA changes that, but I'm wondering how many
performance-degraded memory zones we're likely to have that still have
capabilities like "we can DMA straight out of this". Or better yet, "we WANT
to DMA straight out of this". Zones where we wouldn't be better off having
the capability in question invoked from whichever node is "closest" to that
resource. Perhaps some kind of processor-specific tasklet.

So how often does node 1 care about the difference between DMAable and
non-DMAable memory in node 2? And more importantly, should the kernel care
about this difference, or have the function invoked over on the other
processor?

Especially since, discounting straightforward memory access latency
variations, it SEEMS like this is largely a driver question. Device X can
DMA to/from these zones of memory. The memory isn't different to the
processors, it's different to the various DEVICES. So it's not just a
processor question, but an association between processors, memory, and
devices. (Back to the concept of nodes.) Meaning drivers could be supplying
zone lists, which is just going to be LOADS of fun...

<uninformed rant>

I thought a minimalistic approach to NUMA optimization was to think in terms
of nodes, and treat each node as one or more processors with a set of
associated "cheap" resources (memory, peripherals, etc). Multiple tiers of
decreasing locality for each node sounds like a lot of effort for a first
attempt at NUMA support. That's where the "hideously difficult to calculate"
bits come in. A problem which could increase exponentially with the number of
nodes...

I always think of NUMA as the middle of a continuum. Zillion-way SMP with
enormous L1 caches on each processor starts acting a bit like NUMA (you don't
wanna go out of cache and fight the big evil memory bus if you can at all
avoid it, and we're already worrying about process locality (processor
affinity) to preserve cache state...). Shared memory Beowulf clusters that
page fault through the network with a relatively low-latency interconnect
like Myrinet would act a bit like NUMA too. (Obviously, I haven't played
with the monster SGI hardware or the high-end stuff IBM's so proud of.)

In a way, swap space on the drives could be considered a
performance-delimited physical memory zone. One that the processor can't
access directly, and which involves the allocation of DRAM bounce buffers.
Between that and actual bounce buffers we ALREADY handle problems a lot like
page migration between zones (albeit not in a generic, unified way)...

So I thought the cheap and easy way out is to have each node know what
resources it considers "local", what resources are a pain to access (possibly
involving a tasklet on another node), and a way to determine when tasks
requiring a lot of access to them might be better migrated directly to a
node where they're significantly cheaper, to the point where the cost of
migration gets paid back. This struck me as the 90% "duct tape" solution to
NUMA.

</uninformed rant> (Hopefully, anyway...)

Of course there's bound to be something fundamentally wrong with my
understanding of the situation that invalidates all of the above, and I'd
appreciate anybody willing to take the time letting me know what it is...

So what hardware inherently requires a multi-tier NUMA approach beyond "local
stuff" and "everything else"? (I suppose there's bound to be some linearly
arranged system with a long gradual increase in memory access latency as you
go down the row, and of course a node in the middle which has a unique
resource everybody's fighting for. Is this a common setup in NUMA systems?)

And then, of course, there's the whole question of 3D accelerated video card
texture memory, and trying to stick THAT into a zone. :) (Eew! Eew! Eew!)
Yeah, it IS a can of worms, isn't it?

But class/zone lists still seem fine for processors. It's just a question of
doing the detective work for memory allocation up front, as it were. If you
can't figure it out up front, how the heck are you supposed to do it
efficiently at allocation time?

It's just that a lot of DEVICES (like 128 megabyte video cards, and
limited-range DMA controllers) need their own class/zone lists, too. This
chunk of physical memory can be used as DMA buffers for this PCI bridge,
which can only be addressed directly by this group of processors anyway
because they share the IO-APIC it's wired to... Which involves challenging a
LOT of assumptions about the global nature of system resources previous
kernels used to make, I know. (Memory for DMA needs the specific device in
question, but we already do that for ISA vs PCI dma... The user level stuff
is just hinting to avoid bounce buffers...)

Um, can a bounce buffer become a permanent page migration to another zone?
(Since we have to allocate the page ANYWAY, we might as well leave it there
till it's evicted, unless of course we're very likely to evict it again
pronto, in which case we want to avoid bouncing it back... Hmmm... Then under
NUMA there would be the "processor X can't easily access the page in its new
location to fill it with new data to DMA out" problem... Fun fun fun...)

> Putting together classzones for these isn't
> quite obvious and memory balancing will be complex ;)

And this differs from normal in what way?

It seems like Andrea's approach is just changing where the work is done:
moving deductive work from allocation time to boot time. Assembling
class/zone lists is an init-time problem (boot time or hot-pluggable-hardware
swap time). Having zones linked together into lists of "this pool of memory
can be used for these tasks", possibly as linked lists in order of preference
for allocations or some such optimization, doesn't strike me as unreasonable.
(It is ENTIRELY possible I'm wrong about this. Bordering on "likely", I'll
admit...)

Making sure that a list arrangement is OPTIMAL is another matter, but
whatever method gets chosen to do that people are probably going to be
arguing it for years. You can't swap to disk perfectly without being able to
see the future, either...

The balancing issue is going to be fun, but that's true whatever happens.
You inherently have multiple nodes (collections of processors with clear and
conflicting preferences about resources) disagreeing with each other about
allocation decisions during the course of operation. That's part of the
reason the "cheap bucket" and "non-cheap bucket" approach always appealed to
me (for zillion way SMP and shared memory clusters, anyway, where they're
pretty much the norm anyway). Of course where cheap buckets overlap, there
might need to be some variant of weighting to reduce thrashing... Hmmm.

Wouldn't you need weighting for non-class zones anyway? Classing zones
doesn't necessarily make weighting undoable. The ability to make decisions
about a class doesn't mean ALL decisions have to be just about the class.
It's just that you quickly know what "world" you're starting with, and can
narrow down from there. (I'll have to look more closely at Andrea's
implementation now that I know what the heck it's supposed to be doing. Now
that I THINK I know, anyway...)

> Of course, nobody knows the exact definitions of classzones
> in the new 2.4 VM since it's completely undocumented; let's

I'd noticed.

> hope Andrea will document his code or we'll see a repeat of
> the development chaos we had with the 2.2 VM...

Or, for that matter, early 2.4 up until the start of the use-once thread.
For me, anyway.

Since 2.4 isn't supposed to handle NUMA anyway, I don't see what difference
it makes. Just use ANYTHING that stops the swap storms, lockups, zone
starvation, zero order allocation failures, bounce buffer shortages, and
other such fun we were having a few versions back. (Once again, this part
now seems to be in the "it works for me"(tm) stage.)

Then rip it out and start over in 2.5 if there's stuff it can't do.

> cheers,
>
> Rik

thingy,

Rob Landley,
master of stupid questions.

2001-10-04 20:48:25

by Alan

Subject: Re: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2)

> Is there really a NUMA machine out there where you can DMA out of another
> node's 16 bit ISA space? So far the differences in the zones seem to be

DMA engines are tied to the node the device is attached to, not to the
processor in question, in most NUMA systems.

2001-10-04 20:58:33

by Alan

Subject: Re: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2)

> question, I know), which VM will it use? I'm guessing Alan will still
> inherit the "stable" codebase, but the -ac and -linus trees are breaking new
> ground on divergence here. Which tree becomes 2.4 once Alan inherits it?
> (Is this part of what's holding up 2.5?)

For the moment I plan to maintain the 2.4.*-ac tree. I don't know what will
happen with 2.4 longer term - that is a Linus question. Looking at past VM
history, I don't think we will eliminate enough "2.4.10+ oopses on my box"
and "on this load the VM sucks" cases from 2.4.10 to fairly review Andrea's
VM until Linus has done another 5 or 6 releases and the VM has been tuned,
bugs removed, and other oops cases proven not to be VM-triggered.

Alan

2001-10-04 23:44:12

by Martin J. Bligh

Subject: Re: NUMA & classzones (was Whining about 2.5)

I'll preface this by saying I know a little about the IBM NUMA-Q
(aka Sequent) hardware, but not very much about VM (or anyone
else's NUMA hardware).

>> Classzones used to be a superset of the memory zones, so
>> if you have memory zones A, B and C you'd have classzone
>> Ac consisting of memory zone A, classzone Bc = {A + B}
>> and Cc = {A + B + C}.
>
> Ah. Cumulative zones. A class being a collection of zones, the class-zone
> patch. Right. That makes a lot more sense...
>
>> This gives obvious problems for NUMA, suppose you have 4
>> nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A,
>> 4B and 4C.
>
> Is there really a NUMA machine out there where you can DMA out of another
> node's 16 bit ISA space? So far the differences in the zones seem to be

If I understand your question (and my hardware) correctly, then yes. I think
we (IBM NUMA-Q) can DMA from anywhere to anywhere (using PCI cards,
not ISA, but we could still use the ISA DMA zone).

But it probably doesn't make sense to define A, B, and C for each node. For
a start, we don't use ISA DMA (and probably no other NUMA box does either).
If HIGHMEM is the stuff above 900Mb or so (and assuming for a moment
that we have 1Gb per node), then we probably don't need NORMAL+HIGHMEM
for each node either.

0-900Mb = NORMAL (1A)
900-1Gb = HIGHMEM_NODE1 (1B)
1G-2Gb = HIGHMEM_NODE2 (2)
2G-3Gb = HIGHMEM_NODE3 (3)
3Gb-4Gb = HIGHMEM_NODE4 (4)

If we have less than 1Gb per node, then one of the other nodes will have 2
zones - whichever contains the transition point from NORMAL -> HIGHMEM.
Thus the number of zones = number of nodes + 1.
(To my mind, if we're frigging with the zone patterns for NUMA, getting rid
of the DMA zone probably isn't too hard.)

If I were allowed to define classzones as a per-processor concept (and I
don't know enough about VM to know if that's possible), it would seem to
fit nicely. Taking the map above, the classzones for a processor on node 3
would be:

{3}, {1A + 1B + 2 + 3 + 4}

> Especially since, discounting straightforward memory access latency
> variations, it SEEMS like this is largely a driver question. Device X can
> DMA to/from these zones of memory. The memory isn't different to the
> processors, it's different to the various DEVICES. So it's not just a
> processor question, but an association between processors, memory, and
> devices. (Back to the concept of nodes.) Meaning drivers could be supplying
> zone lists, which is just going to be LOADS of fun...

If possible, I'd like to avoid making every single driver NUMA-aware. Partly
because I'm lazy, but also because I think it can be simpler than this. The
mem subsystem should just be able to allocate something that's as good
as possible for that card, without the driver worrying explicitly about zones
(though it may have to specify whether it can do 32/64-bit DMA).

see http://lse.sourceforge.net/numa - there should be some NUMA API
proposals there for explicit stuff.

> <uninformed rant>
>
> I thought a minimalistic approach to numa optimization was to think in terms
> of nodes, and treat each node as one or more processors with a set of
> associated "cheap" resources (memory, peripherals, etc). Multiple tiers of
> decreasing locality for each node sounds like a lot of effort for a first
> attempt at NUMA support. That's where the "hideously difficult to calculate"
> bits come in. A problem which could increase exponentially with the number of
> nodes...

Going back to the memory map above, say that nodes 1-2 are tightly coupled,
and 3-4 are tightly coupled, but 1-3, 1-4, 2-3, 2-4 are loosely coupled.
This gives us a possible hierarchical NUMA situation. So now the classzones
for procs on node 3 would be:

{3}, {3 + 4}, {1A + 1B + 2 + 3 + 4}

which would make hierarchical NUMA easy enough.

> I always think of numa as the middle of a continuum. Zillion-way SMP with
> enormous L1 caches on each processor starts acting a bit like NUMA (you don't
> wanna go out of cache and fight the big evil memory bus if you can at all
> avoid it, and we're already worrying about process locality (processor
> affinity) to preserve cache state...).

Kind of, except you can explicitly specify which bits of memory you want to
use, rather than the hardware working it out for you.

> Shared memory beowulf clusters that
> page fault through the network with a relatively low-latency interconnect
> like myrinet would act a bit like NUMA too.

Yes.

> (Obviously, I haven't played
> with the monster SGI hardware or the high-end stuff IBM's so proud of.)

There's a 16-way NUMA box (4x4) at OSDL (http://www.osdlab.org) that's
running Linux and available for anyone to play with, if you're so inclined.
It doesn't understand very much of its NUMA-ness, but it works. (This is the
IBM NUMA-Q hardware ... I presume that's what you're referring to.)

> In a way, swap space on the drives could be considered a
> performance-delimited physical memory zone. One that the processor can't
> access directly, and which involves the allocation of DRAM bounce buffers.
> Between that
> and actual bounce buffers we ALREADY handle problems a lot like page
> migration between zones (albeit not in a generic, unified way)...

I don't think it's quite that simple. For swap, you always want to page stuff
back in before using it. For NUMA memory on remote nodes, it may or may
not be worth migrating the page. If we chose to migrate a process between
nodes, we could indeed set up a system where we'd page fault pages in from
the remote node as we used them, or we could just migrate the working set
with the process.

Incidentally, swapping on NUMA will need per-zone swapping even more,
so I don't see how we could do anything sensible for this without a physical
to virtual mem map. But maybe someone knows how.

> So I thought the cheap and easy way out is to have each node know what
> resources it considers "local", what resources are a pain to access (possibly
> involving a tasklet on another node), and a way to determine when tasks
> requiring a lot of access to them might better be migrated directly to a
> node where they're significantly cheaper to the point where the cost of
> migration gets paid back. This struck me as the 90% "duct tape" solution to
> NUMA.

Pretty much. I don't know of any situation when we need a tasklet on another
node - that's a pretty horrible thing to have to do.

> So what hardware inherently requires a multi-tier NUMA approach beyond "local
> stuff" and "everything else"? (I suppose there's bound to be some linearlly
> arranged system with a long gradual increase in memory access latency as you
> go down the row, and of course a node in the middle which has a unique
> resource everybody's fighting for. Is this a common setup in NUMA systems?)

The next generation of hardware/chips will have more hierarchical stuff.
The shorter / smaller a bus is, the faster it can go, so we can tightly couple
small sets faster than big sets.

> And then, of course, there's the whole question of 3D accelerated video card
> texture memory, and trying to stick THAT into a zone. :) (Eew! Eew! Eew!)
> Yeah, it IS a can of worms, isn't it?

Your big powerful NUMA server is going to be used to play Quake on? ;-)
Same issue for net cards, etc though I guess.

> But class/zone lists still seem fine for processors. It's just a question of
> doing the detective work for memory allocation up front, as it were. If you
> can't figure it out up front, how the heck are you supposed to do it
> efficiently at allocation time?

If I understand what you mean correctly, we should be able to lay out
the topology at boot time, and work out which phys mem locations will
be faster / slower from any given resource (proc, PCI, etc).

> This
> chunk of physical memory can be used as DMA buffers for this PCI bridge,
> which can only be addressed directly by this group of processors anyway
> because they share the IO-APIC it's wired to...

Hmmm ... at least in the hardware I'm familiar with, we can access any PCI
bridge or any IO-APIC from any processor. Slower, but functional.

> Um, can bounce buffers permanent page migration to another zone? (Since we
> have to allocate the page ANYWAY, might as well leave it there till it's
> evicted, unless of course we're very likely to evict it again pronto in which
> case we want to avoid bouncing it back...

As I understand zones, they're physical, therefore pages don't migrate
between them. The data might be copied from the bounce buffer to a
page in another zone, but ...

Not sure if we're using quite the same terminology. Feel free to correct me.

> Hmmm... Then under NUMA there
> would be the "processor X can't access page in new location easily to fill it
> with new data to DMA out..." Fun fun fun...)

On the machines I'm used to, there's no problem with "can't access", just
slower or faster.

> Since 2.4 isn't supposed to handle NUMA anyway, I don't see what difference
> it makes. Just use ANYTHING that stops the swap storms, lockups, zone
> starvation, zero order allocation failures, bounce buffer shortages, and
> other such fun we were having a few versions back. (Once again, this part
> now seems to be in the "it works for me"(tm) stage.)
>
> Then rip it out and start over in 2.5 if there's stuff it can't do.

I'm not convinced that changing directions all the time is the most
efficient way to operate - it would be nice to keep building on work
already done in 2.4 (on whatever subsystem that is) rather than rework
it all, but maybe that'll happen anyway, so ....

Martin.

2001-10-05 06:37:13

by Rob Landley

[permalink] [raw]
Subject: Re: NUMA & classzones (was Whining about 2.5)

On Thursday 04 October 2001 19:39, Martin J. Bligh wrote:
> I'll preface this by saying I know a little about the IBM NUMA-Q
> (aka Sequent) hardware, but not very much about VM (or anyone
> else's NUMA hardware).

I saw the IBM guys in Austin give a talk on it last year, which A) had more
handwaving than star wars episode zero, B) had FAR more info about politics
in the AIX division than about NUMA, C) involved the main presenter letting
us know he was leaving IBM at the end of the week...

Kind of like getting details about CORBA out of IBM. And I worked there when
I was trying to do that. (I was once in charge of implementing CORBA
compliance for a project, and all they could find me to define it at the time
was a marketing brochure. Sigh...)

> >> This gives obvious problems for NUMA, suppose you have 4
> >> nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A,
> >> 4B and 4C.
> >
> > Is there really a NUMA machine out there where you can DMA out of another
> > node's 16 bit ISA space? So far the differences in the zones seem to be
>
> If I understand your question (and my hardware) correctly, then yes. I
> think we (IBM NUMA-Q) can DMA from anywhere to anywhere (using PCI cards,
> not ISA, but we could still use the ISA DMA zone).

Somebody made a NUMA machine with an ISA bus? Wow. That's perverse. I'm
impressed. (It was more a "when do we care" question...)

Two points:

1) If you can DMA from anywhere to anywhere, it's one big zone, isn't it?
Where does the NUMA come in? (I guess it's more expensive to DMA between
certain devices/memory pages? Or are we talking sheer processor access
latency here, nothing to do with devices at all...?)

2) A processor-centric view of memory zones is not the whole story. Look at
the zones we have now. The difference between the ISA zone, the PCI zone,
and high memory has nothing to do with the processor*. It's a question of
which devices (which bus/bridge really) can talk to which pages. In current
UP/SMP systems, the processor can talk to all of them pretty much equally.

* modulo the intel 36 bit extension stuff, which I must admit I haven't
looked closely at. Don't have the hardware. Then again that's sort of the
traditional numa problem of "some memory is a bit funky for the processor to
access". Obviously I'm not saying I/O is the ONLY potential difference
between memory zones...

So we need zones defined relative not just to processors (or groups of
processors that have identical access profiles), but also defined relative to
i/o devices and busses. Meaning zones may become a driver issue.

This gets us back to the concept of "nodes". Groups of processors and
devices that collectively have a similar view of the world, memory-wise. Is
this a view of the problem that current NUMA thinking is using, or not?

> But it probably doesn't make sense to define A,B, and C for each node. For
> a start, we don't use ISA DMA (and probably no other NUMA box does
> either). If HIGHMEM is the stuff above 900Mb or so (and assuming for a
> moment that we have 1Gb per node), then we probably don't need
> NORMAL+HIGHMEM for each node either.
>
> 0-900Mb = NORMAL (1A)
> 900-1Gb = HIGHMEM_NODE1 (1B)
> 1G-2Gb = HIGHMEM_NODE2 (2)
> 2G-3Gb = HIGHMEM_NODE3 (3)
> 3Gb-4Gb = HIGHMEM_NODE4 (4)

By highmem you mean memory our I/O devices can't DMA out of?

Will all the I/O devices in the system share a single pool of buffer memory,
or will devices be attached to nodes?

(My thinking still turns to making shared memory beowulf clusters act like
one big system. The hardware for that will continue to be cheap: rackmount a
few tyan thunder dual athlon boards. You can distribute drives for storage
and swap space (even RAID them if you like), and who says such a cluster has
to put all external access through a single node?)

> If we have less than 1Gb per node, then one of the other nodes will have 2
> zones - whichever contains the transition point from NORMAL-> HIGHMEM.

So "normal" belongs to a specific node, so all devices basically belong to
that node?

> Thus number of zones = number of nodes + 1.
> (to my mind, if we're frigging with the zone patterns for NUMA, getting rid
> of DMA zone probably isn't too hard).

You still have the problem of doing DMA. Now this is a separable problem
boiling down to either allocation and locking of DMAable buffers the
processor can directly access, or setting up bounce buffers when the actual
I/O is kicked off. (Or doing memory mapped I/O, or PIO. But all that is
still a bit like one big black box, I'd think. And to do it right, you need
to know which device you're doing I/O to, because I really wouldn't assume
every I/O device on the system shares the same pool of DMAable memory. Or
that we haven't got stuff like graphics cards that have their own RAM we map
into our address space. Or, for that matter, that physical memory mapping in
one node makes anything directly accessible from another node.)
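As a toy model of "you need to know which device you're doing I/O to": suppose 1Gb of contiguous RAM per node, and each device advertises how many address bits it can drive. Everything here is made up for illustration; no real kernel interface looks like this:

```c
#include <assert.h>

#define NODE_SPAN (1ULL << 30)   /* assume 1Gb of contiguous RAM per node */

/* Pick which node's memory a DMA buffer for this device should come
 * from: the device's own node if the device's address mask reaches
 * that high, otherwise fall back to low memory on node 0. */
static int dma_node_for(int dev_node, unsigned long long dma_mask)
{
    unsigned long long top = (unsigned long long)(dev_node + 1) * NODE_SPAN - 1;
    return top <= dma_mask ? dev_node : 0;
}
```

A 32-bit PCI card on node 2 can use its local memory (top address just under 3Gb), while an ISA-style 24-bit device anywhere has to bounce through node 0's low memory.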

> If I were allowed to define classzones as a per-processor concept (and I
> don't know enough about VM to know if that's possible), it would seem to
> fit nicely. Taking the map above, the classzones for a processor on node 3
> would be:
>
> {3} , {1A + 1B + 2+ 3 + 4}


Not just per processor. Think about a rackmount shared memory beowulf
system, page faulting through the network. With quad-processor boards in
each 1U, and BLAZINGLY FAST interconnects in the cluster. Now teach that to
act like NUMA.

Each 1U has four processors with identical performance (and probably one set
of page tables if they share a northbridge). Assembling NUMA systems out of
closely interconnected SMP systems.

> If possible, I'd like to avoid making every single driver NUMA aware.

It may not be a driver issue. It may be a bus issue. If there are two PCI
busses in the system, do they HAVE to share one set of physical memory
mappings? (NUMA sort of implies we have more than one northbridge. Dontcha
think we might have more than one southbridge, too?)

> Partly because I'm lazy, but also because I think it can be simpler than
> this. The mem subsystem should just be able to allocate something that's as
> good as possible for that card, without the driver worrying explicitly
> about zones (though it may have to specify if it can do 32/64 bit DMA).

It's not just 32/64 bit DMA. You're assuming every I/O device in the system
is talking to exactly the same pool of memory. The core assumption of NUMA
is that the processors aren't doing that, so I don't know why the I/O devices
necessarily should. (Maybe they do, what do I know. It would be nice to
hear from somebody with actual information...)

And if they ARE all talking to one pool of memory, then the whole NUMA
question becomes a bit easier, actually... The flood of zones we were so
worried about (Node 18's processor sending packets through a network card
living on node 13) can't really happen, can it?

> see http://lse.sourceforge.net/numa - there should be some NUMA API
> proposals there for explicit stuff.

Thanks for the link. :)

> > I always think of numa as the middle of a continuum. Zillion-way SMP
> > with enormous L1 caches on each processor starts acting a bit like NUMA
> > (you don't wanna go out of cache and fight the big evil memory bus if you
> > can at all avoid it, and we're already worrying about process locality
> > (processor affinity) to preserve cache state...).
>
> Kind of, except you can explicitly specify which bits of memory you want to
> use, rather than the hardware working it out for you.

Ummm...

Is the memory bus somehow physically reconfiguring itself to make some chunk
of memory lower or higher latency when talking to a given processor? I'm
confused...

> > Shared memory beowulf clusters that
> > page fault through the network with a relatively low-latency interconnect
> > like myrinet would act a bit like NUMA too.
>
> Yes.

But that's the bit that CLEARLY works in terms of nodes, and also which has
devices attached to different nodes, requiring things like remote tasklets to
access remote devices, and page migration between nodes to do repeated access
on remote pages. (Not that this is much different than sending a page back
and forth between processor caches in SMP. Hence the continuum I was talking
about...)

The multiplicative complexity I've heard fears about on this list seems to
stem from an interaction between "I/O zones" and "processor access zones"
creating an exponential number of gradations when the two qualities apply to
the same page. But in a node setup, you don't have to worry about it. A
node has its local memory, and its local I/O, and it inflicts work on remote
zones when it needs to deal with their resources. There may be one big
shared pool of I/O memory or some such (IBM's NUMA-Q), but in that case it's
the same for all processors. Each node has one local pool, one remote pool,
and can just talk to a remote node when it needs to (about like SMP).

I THOUGHT NUMA had a gradient, of "local, not as local, not very local at
all, darn expensive" pages that differed from node to node, which would be a
major pain to optimize for, yes. (I was thinking motherboard trace length and
relaying stuff several hops down a bus...) But I haven't seen it yet. And
even so, "not local=remote" seems to cover the majority of the cases without
exponential complexity...

I am still highly confused.

> > (Obviously, I haven't played
> > with the monster SGI hardware or the high-end stuff IBM's so proud of.)
>
> There's a 16-way NUMA (4x4) at OSDL (http://www.osdlab.org) that's running
> linux and available for anyone to play with, if you're so inclined. It
> doesn't understand very much of its NUMA-ness, but it works. This is the
> IBM NUMA-Q hardware (I presume that's what you're referring to).

That's what I've heard the most about. I'm also under the impression that
SGI was working on NUMA stuff up around the origin line, and that sun had
some monsters in the works as well...

It still seems to me that either clustering or zillion-way SMP is the most
interesting area of future supercomputing, though. Sheer price to
performance. For stuff that's not very easily separable into chunks, they've
got 64 way SMP working in the lab. For stuff that IS chunkable, thousand box
clusters are getting common. If the interconnects between boxes are a
bottleneck, 10gigE is supposed to be out in late 2003, last I heard, meaning
gigE will get cheap... And for just about everything else, there's Moore's
Law...

Think about big fast-interconnect shared memory clusters. Resources are
either local or remote through the network, you don't care too much about
gradients. So the "symmetrical" part of SMP applies to decisions between
nodes. There's another layer of decisions in that a node may be an SMP box
in and of itself (probably will), but there's only really two layers to worry
about, not an exponential amount of complexity where each node has a
potentially unique relationship with every other node...

People wanting to run straightforward multithreaded programs using shared
memory and semaphores on big clusters strikes me as an understandable goal,
and the drive for fast (low latency) interconnects to make that feasible is
something I can see a good bang for the buck coming out of. Here's the
hardware that's widely/easily/cheaply available, here's what programmers want
to do with it. I can see that.

The drive to support monster mainframes which are not only 1% of the market
but which get totally redesigned every three or four years to stay ahead of
Moore's law... I'm not quite sure what's up there. How much of the market
can throw that kind of money to constantly offset massive depreciation?

Is the commodity hardware world going to inherit NUMA (via department level
shared memory beowulf clusters, or just plain the hardware to do it getting
cheap enough), or will it remain a niche application?

As I said: master of stupid questions. The answers are taking a bit more
time...

> > In a way, swap space on the drives could be considered a
> > performance-delimited physical memory zone. One the processor can't
> > access directly, which involves the allocation of DRAM bounce buffers.
> > Between that and actual bounce buffers we ALREADY handle problems a lot
> > like page migration between zones (albeit not in a generic, unified
> > way)...
>
> I don't think it's quite that simple. For swap, you always want to page
> stuff back in before using it. For NUMA memory on remote nodes, it may or
> may not be worth migrating the page.

Bounce buffers. This is new? Seems like the same locking issues, even...

> If we chose to migrate a process
> between nodes, we could indeed set up a system where we'd page fault pages
> in from the remote node as we used them, or we could just migrate the
> working set with the process.

Yup. This is a problem I've heard discussed a lot: deciding when to migrate
resources. (Pages, processes, etc.) It also seems to be a separate layer of
the problem, one that isn't too closely tied to the initial allocation
strategy (although it may feed back into it, but really that seems to be just
free/alloc and maybe adjusting weighting/ageing whatever. Am I wrong?)

I.E. migration strategy and allocation strategy aren't necessarily the same
thing...

> Incidentally, swapping on NUMA will need per-zone swapping even more,
> so I don't see how we could do anything sensible for this without a
> physical to virtual mem map. But maybe someone knows how.

There you got me. I DO know that you can have multiple virtual mappings for
each physical page, so it's not as easy as the other way around, but this
could be why the linked list was invented...

(I believe Rik is working on patches that cover this bit. Haven't looked at
them yet.)
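The linked-list idea is roughly what a physical-to-virtual reverse map looks like: each physical page carries a chain of the PTEs that map it, so per-zone swapping can unmap a frame without walking every address space. A minimal userspace sketch (names invented here, not Rik's actual code):

```c
#include <assert.h>
#include <stdlib.h>

struct pte_chain {
    unsigned long *pte;        /* one virtual mapping of the frame */
    struct pte_chain *next;
};

struct phys_page {
    struct pte_chain *rmap;    /* all known mappings, possibly several */
};

/* Record that *pte now maps this physical page. */
static void rmap_add(struct phys_page *page, unsigned long *pte)
{
    struct pte_chain *c = malloc(sizeof *c);
    c->pte = pte;
    c->next = page->rmap;
    page->rmap = c;
}

/* How many virtual mappings does the page have? */
static int rmap_count(const struct phys_page *page)
{
    int n = 0;
    for (const struct pte_chain *c = page->rmap; c; c = c->next)
        n++;
    return n;
}
```

Evicting a page then means walking its chain and knocking out each PTE, instead of scanning every process's page tables looking for the frame.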

> > So I thought the cheap and easy way out is to have each node know what
> > resources it considers "local", what resources are a pain to access
> (possibly involving a tasklet on another node), and a way to determine
> when tasks requiring a lot of access to them might better be migrated
> > directly to a node where they're significantly cheaper to the point where
> > the cost of migration gets paid back. This struck me as the 90% "duct
> > tape" solution to NUMA.
>
> Pretty much. I don't know of any situation when we need a tasklet on
> another node - that's a pretty horrible thing to have to do.

Think shared memory beowulf.

My node has a hard drive. Some other node wants to read and write to my hard
drive, because it's part of a larger global file system or storage area
network or some such.

My node has a network card. There are three different connections to the
internet, and they're on separate nodes to avoid single point of failure
syndrome.

My node has a video capture card. The cluster as a whole is doing realtime
video acquisition and streaming for a cable company that saw the light and
switched over to MP4 with a big storage cluster. Incoming signals from cable
(or movies fed into the system for pay per view) get converted to mp4
(processor intensive, cluster needed to keep up with HDTV, especially
multiple channels) and saved in the storage area network part, and subscriber
channels get fetched and fed back out. (Probably not as video, probably as a
TCP/IP stream to a set top box. The REAL beauty of digital video isn't
trying to do "movies on demand", it's having a cluster stuffed with old
episodes of Mash, ER, The West Wing, Star Trek, The Incredible Hulk, Dark
Shadows, and Dr. Who which you can call up and play at will. Syndicated
content on demand. EASY task for a cluster to do. Doesn't NEED to think
NUMA, that could be programmed as beowulf. But we could also be using the
Mach microkernel on SMP boxes, it makes about as much sense. Beowulf is
message passing, microkernels are message passing, CORBA is message
passing... Get fast interconnects, message passing becomes less and less of
a good idea...)

> > So what hardware inherently requires a multi-tier NUMA approach beyond
> > "local stuff" and "everything else"? (I suppose there's bound to be some
> linearly arranged system with a long gradual increase in memory access
> > latency as you go down the row, and of course a node in the middle which
> > has a unique resource everybody's fighting for. Is this a common setup
> > in NUMA systems?)
>
> The next generation of hardware/chips will have more hierarchical stuff.
> The shorter / smaller a bus is, the faster it can go, so we can tightly
> couple small sets faster than big sets.

Sure. This is electronics 101, the speed of light is not your friend.
(Intel fought and lost this battle with the pentium 4's pipeline, more haste
less speed...)

But the question of how much of a gradient we care about remains. It's
either local, or it's not local.

The question is latency, not throughput. (Rambus did this too, more
throughput less latency...) Lots of things use loops in an attempt to get
fixed latency: stuff wanders by at known intervals so it's easy to fill up
slots on the bus because you know when your slot will be coming by...

NUMA is also a question of latency. Gimme high end fiber stuff and I could
have a multi-gigabit pipe between two machines in different buildings.
Latency will still make it less fun to try to page-access DRAM through than
your local memory bus, regardless of relative throughput.

> > And then, of course, there's the whole question of 3D accelerated video
> > card texture memory, and trying to stick THAT into a zone. :) (Eew!
> > Eew! Eew!) Yeah, it IS a can of worms, isn't it?
>
> Your big powerful NUMA server is going to be used to play Quake on? ;-)
> Same issue for net cards, etc though I guess.

Not quake, video capture and streaming. Big market there, which beowulf
clusters can address today, but in a fairly clumsy way. (The sane way to
program that is to have one node dispatching/accepting frames to other nodes,
so beowulf isn't so bad. But message passing is not a way to control
latency, and latency is your real problem when you want to avoid dropping
frames. Buffering helps this, though. Five seconds of buffer space covers a
multitude of sins...)

> > But class/zone lists still seem fine for processors. It's just a
> > question of doing the detective work for memory allocation up front, as
> > it were. If you can't figure it out up front, how the heck are you
> > supposed to do it efficiently at allocation time?
>
> If I understand what you mean correctly, we should be able to lay out
> the topology at boot time, and work out which phys mem locations will
> be faster / slower from any given resource (proc, PCI, etc).

Ask andrea. I THINK so, but I'm not the expert. (And Linus seems to
disagree, and he tends to have good reasons. :)
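One way the boot-time layout could work, as a sketch: probe (or read from firmware tables) a node-to-node distance table, then sort each node's allocation fallback order once at boot. The distance units and values are invented for illustration:

```c
#include <assert.h>

#define NR_NODES 4

/* Pairwise access cost, filled in by boot-time probing or firmware
 * tables.  Units and values are made up for this sketch. */
static const int node_distance[NR_NODES][NR_NODES] = {
    { 10, 100, 100, 100 },
    { 100, 10, 100, 100 },
    { 100, 100, 10, 100 },
    { 100, 100, 100, 10 },
};

/* Fill order[] with node ids sorted nearest-first as seen from
 * `from`.  Selection sort is fine: this runs once at boot. */
static void build_fallback(int from, int order[NR_NODES])
{
    int used = 0;                       /* bitmask of placed nodes */
    for (int i = 0; i < NR_NODES; i++) {
        int best = -1;
        for (int n = 0; n < NR_NODES; n++) {
            if (used & (1 << n))
                continue;
            if (best < 0 || node_distance[from][n] < node_distance[from][best])
                best = n;
        }
        order[i] = best;
        used |= 1 << best;
    }
}
```

Allocation then never consults the matrix again; it just walks order[] and takes the first node with free pages.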

> > This
> > chunk of physical memory can be used as DMA buffers for this PCI bridge,
> > which can only be addressed directly by this group of processors anyway
> > because they share the IO-APIC it's wired to...
>
> Hmmm ... at least in the hardware I'm familiar with, we can access any PCI
> bridge or any IO-APIC from any processor. Slower, but functional.

Is the speed difference along a noticeably long gradient, or more "this group
is fast, the rest is not so fast"?

And do the bridges and IO-APICS cluster with processors into something that
looks like nodes, or do they overlap in a less well defined way?

> > Um, can bounce buffers permanent page migration to another zone? (Since
> > we have to allocate the page ANYWAY, might as well leave it there till
> > it's evicted, unless of course we're very likely to evict it again pronto
> > in which case we want to avoid bouncing it back...
>
> As I understand zones, they're physical, therefore pages don't migrate
> between them.

And processors are physical, so tasks don't migrate between them?

> The data might be copied from the bounce buffer to a
> page in another zone, but ...

Virtual page, physical page...

> Not sure if we're using quite the same terminology. Feel free to correct
> me.

I'm more likely to receive correction. I'm trying to learn and understand
the problem...

> > Hmmm... Then under NUMA there
> > would be the "processor X can't access page in new location easily to
> > fill it with new data to DMA out..." Fun fun fun...)
>
> On the machines I'm used to, there's no problem with "can't access", just
> slower or faster.

Well, with shared memory beowulf clusters you could have a tasklet on the
other machine lock the page and spit you a copy of the data, so "can't"
doesn't work there either. That's where the word "easily" came in...

But an attempt to DMA into or out of that page from another node would
involve bounce buffers on the other node...

> > Since 2.4 isn't supposed to handle NUMA anyway, I don't see what
> > difference it makes. Just use ANYTHING that stops the swap storms,
> > lockups, zone starvation, zero order allocation failures, bounce buffer
> > shortages, and other such fun we were having a few versions back. (Once
> > again, this part now seems to be in the "it works for me"(tm) stage.)
> >
> > Then rip it out and start over in 2.5 if there's stuff it can't do.
>
> I'm not convinced that changing directions all the time is the most
> efficient way to operate

No comment on the 2.4.0-2.4.10 VM development process will be made by me at
this time.

> - it would be nice to keep building on work
> already done in 2.4 (on whatever subsystem that is) rather than rework
> it all, but maybe that'll happen anyway, so ....

At one point I thought the purpose of a stable series was to stabilize,
debug, and tweak what you'd already done, and architectural changes went in
development series. (Except for the occasional new driver.) As I said, I
tend to be wrong about stuff...

> Martin.

Rob

2001-10-05 06:37:13

by Rob Landley

[permalink] [raw]
Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

On Thursday 04 October 2001 16:53, Alan Cox wrote:
> > Is there really a NUMA machine out there where you can DMA out of another
> > node's 16 bit ISA space? So far the differences in the zones seem to be
>
> DMA engines are tied to the node the device is tied to not to the processor
> in question in most NUMA systems.

Oh good. I'd sort of guessed that part, but wasn't quite sure. (I've seen
hardware people do some amazingly odd things before. Luckily not recently,
though...)

So would a workable (if naive) attempt to use Andrea's
memory-zones-grouped-into-classes approach on NUMA just involve making a
class/zone list for each node? (Okay, you've got to identify nodes, and
group together processors, bridges, DMAable devices, etc, but it seems like
that has to be done anyway, class/zone or not.) How does what people want to
do for NUMA improve on that?

Is a major goal of NUMA figuring out how to borrow resources from adjacent
nodes (and less-adjacent nodes) in a "second choice, third choice, twelfth
choice" kind of way? Or is a "this resource set is local to this node, and
allocating beyond this group is some variant of swapping behavior" approach
an acceptable first approximation?

If class/zone is so evil for NUMA, what's the alternative that's being
considered? (Pointer to paper?)

I'm wondering how the class/zone approach is more evil than the alternative
of having lots and lots of little zones which have different properties for
each processor and DMAable device on the system, and then trying to figure
out what to do from there at allocation time or during each attempt to
inflict I/O upon buffers.

Rob

P.S. Rik pointed out in email (replying to my "master of stupid questions"
signoff) that I am indeed confused about a great many things, but didn't
elaborate. Of course I agree with this, but I do try to make it up on volume
:).

2001-10-05 14:47:54

by Alan

[permalink] [raw]
Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

> So would a workable (if naive) attempt to use Andrea's
> memory-zones-grouped-into-classes approach on NUMA just involve making a
> class/zone list for each node? (Okay, you've got to identify nodes, and
> group together processors, bridges, DMAable devices, etc, but it seems like
> that has to be done anyway, class/zone or not.) How does what people want to
> do for NUMA improve on that?

I fear it becomes an N! problem.

I'd like to hear what Andrea has planned since without docs it's hard to
speculate on how the 2.4.10 VM works anyway

2001-10-05 17:46:30

by Martin J. Bligh

[permalink] [raw]
Subject: Re: NUMA & classzones (was Whining about 2.5)

>> I'll preface this by saying I know a little about the IBM NUMA-Q
>> (aka Sequent) hardware, but not very much about VM (or anyone
>> else's NUMA hardware).
>
> I saw the IBM guys in Austin give a talk on it last year, which A) had more
> handwaving than star wars episode zero, B) had FAR more info about politics
> in the AIX division than about NUMA, C) involved the main presenter letting
> us know he was leaving IBM at the end of the week...

Oops. I disclaim all knowledge. I gave a brief presentation at OLS. The slides
are somewhere .... but they probably don't make much sense without words.
http://lse.sourceforge.net/numa/mtg.2001.07.25/minutes.html under
"Porting Linux to NUMA-Q".

> Kind of like getting details about CORBA out of IBM. And I worked there
> when I was trying to do that. (I was once in charge of implementing CORBA
> compliance for a project, and all they could find me to define it at the time
> was a marketing brochure. Sigh...)

IBM is huge - don't tar us all with the same brush ;-) There are good parts
and bad parts ...

>> >> This gives obvious problems for NUMA, suppose you have 4
>> >> nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A,
>> >> 4B and 4C.
>> >
>> > Is there really a NUMA machine out there where you can DMA out of another
>> > node's 16 bit ISA space? So far the differences in the zones seem to be
>>
>> If I understand your question (and my hardware) correctly, then yes. I
>> think we (IBM NUMA-Q) can DMA from anywhere to anywhere (using PCI cards,
>> not ISA, but we could still use the ISA DMA zone).
>
> Somebody made a NUMA machine with an ISA bus? Wow. That's perverse. I'm
> impressed. (It was more a "when do we care" question...)

No, that's not what I said (though we do have some perverse bus in there ;-))
I said we can DMA out of the first physical 16Mb of RAM (known as the ISA DMA
zone) on any node into any other node (using a PCI card). Or at least that's what
I meant ;-)

> Two points:
>
> 1) If you can DMA from anywhere to anywhere, it's one big zone, isn't it?
> Where does the NUMA come in? (I guess it's more expensive to DMA between
> certain devices/memory pages? Or are we talking sheer processor access
> latency here, nothing to do with devices at all...?)

It takes about 10 times longer to DMA or read memory from a remote node's
RAM than from the local node's RAM. It's "Non-uniform" in terms of access speed,
not capability. I don't have exact latency / bandwidth ratios to hand, but it's
on the order of 10:1 on our boxes.

> 2) A processor-centric view of memory zones is not the whole story. Look at
> the zones we have now. The difference between the ISA zone, the PCI zone,
> and high memory has nothing to do with the processor*.

True (for the ISA and PCI). I'd say that the difference between NORMAL and
HIGHMEM has everything to do with the processor. Above 4Gb you're talking
about 36 bit stuff, but HIGHMEM is often just stuff from 900Mb or so to 4Gb.

> It's a question of
> which devices (which bus/bridge really) can talk to which pages. In current
> UP/SMP systems, the processor can talk to all of them pretty much equally.

That's true of the NUMA systems I know too (though maybe not all NUMA
systems).

> So we need zones defined relative not just to processors (or groups of
> processors that have identical access profiles), but also defined relative to
> i/o devices and busses. Meaning zones may become a driver issue.
>
> This gets us back to the concept of "nodes". Groups of processors and
> devices that collectively have a similar view of the world, memory-wise. Is
> this a view of the problem that current NUMA thinking is using, or not?

More or less. A node may not internally be symmetric - some processors may
be closer to each other than others. I guess we can redefine the definition
of node at that point to be the more tightly coupled groups of processors,
but those procs may still have uniform access to the same physical memory,
so the definition gets looser.

>> But it probably doesn't make sense to define A,B, and C for each node. For
>> a start, we don't use ISA DMA (and probably no other NUMA box does
>> either). If HIGHMEM is the stuff above 900Mb or so (and assuming for a
>> moment that we have 1Gb per node), then we probably don't need
>> NORMAL+HIGHMEM for each node either.
>>
>> 0-900Mb = NORMAL (1A)
>> 900-1Gb = HIGHMEM_NODE1 (1B)
>> 1G-2Gb = HIGHMEM_NODE2 (2)
>> 2G-3Gb = HIGHMEM_NODE3 (3)
>> 3Gb-4Gb = HIGHMEM_NODE4 (4)
>
> By highmem you mean memory our I/O devices can't DMA out of?

No. I mean stuff that's not permanently mapped to virtual memory (as I
understand it, that's anything over about 900Mb, but then I don't understand
it all that well, so ... ;-) )
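The zone layout quoted above can be sketched as a toy address-to-zone mapping. This is purely illustrative: the 900Mb NORMAL cutoff and 1Gb-per-node assumption come from the quoted example, and the function name is made up.

```python
# Toy sketch of the quoted layout: NORMAL below 900Mb, then one HIGHMEM
# zone per node, assuming 1Gb of RAM per node. Illustrative only.

MB = 1024 * 1024
NORMAL_END = 900 * MB   # NORMAL cutoff from the example above
NODE_SIZE = 1024 * MB   # assumed 1Gb per node

def zone_for(phys_addr):
    """Map a physical address to a zone name from the quoted layout."""
    if phys_addr < NORMAL_END:
        return "NORMAL"
    node = phys_addr // NODE_SIZE + 1  # nodes numbered from 1
    return "HIGHMEM_NODE%d" % node

print(zone_for(500 * MB))   # -> NORMAL
print(zone_for(950 * MB))   # -> HIGHMEM_NODE1
print(zone_for(1500 * MB))  # -> HIGHMEM_NODE2
print(zone_for(3500 * MB))  # -> HIGHMEM_NODE4
```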

> Will all the I/O devices in the system share a single pool of buffer memory,
> or will devices be attached to nodes?
>
> (My thinking still turns to making shared memory beowulf clusters act like
> one big system. The hardware for that will continue to be cheap: rackmount a
> few tyan thunder dual athlon boards. You can distribute drives for storage
> and swap space (even RAID them if you like), and who says such a cluster has
> to put all external access through a single node?

In the case of the hardware I know about (and the shared mem beowulf clusters),
you can attach I/O devices to any node (we have 2 PCI buses & 2 I/O APICs
per node). OK, so at the moment the released code only uses the buses on
the first node, but I have code in development to fix that (it works, but it's
crude).

What's really nice is to multi-drop connect a SAN using fibre-channel cards to
each and every node (normally 2 cards per node for redundancy). A disk access
on any node then gets routed through the local SAN interface, rather than across
the interconnect. Much faster. Outbound net traffic is the same, inbound is harder.

>> If we have less than 1Gb per node, then one of the other nodes will have 2
>> zones - whichever contains the transition point from NORMAL-> HIGHMEM.
>
> So "normal" belongs to a specific node, so all devices basically belong to
> that node?

NORMAL = <900Mb. If I have >900Mb of mem in the first node, then NORMAL
belongs to that node. There's nothing to stop any device DMAing into things
outside the NORMAL zone (see Jens' patches to reduce bounce-buffer use) -
they use physaddrs - with 32-bit PCI that means the first 4Gb, with 64-bit,
pretty much anywhere. Or at least that's how I understand it until somebody
tells me different.

And no, that doesn't mean that all devices belong to that node. Even if you say
I can only DMA into the normal zone, a device on node 3, with no local memory
in the normal zone just DMAs over the interconnect. And, yes, that takes some
hardware support - that's why NUMA boxes ain't cheap ;-)

> You still have the problem of doing DMA. Now this is a separable problem
> boiling down to either allocation and locking of DMAable buffers the
> processor can directly access, or setting up bounce buffers when the actual
> I/O is kicked off. (Or doing memory mapped I/O, or PIO. But all that is
> still a bit like one big black box, I'd think. And to do it right, you need
> to know which device you're doing I/O to, because I really wouldn't assume
> every I/O device on the system shares the same pool of DMAable memory. Or

Yes it does (for me at least). To access some things, I take the PCI dev pointer,
work out which bus the PCI card is attached to, do a bus->node map, and that
gives me the answer.
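The dev -> bus -> node lookup Martin describes amounts to a small table walk. The sketch below is hypothetical: on real hardware the bus-to-node table would come from firmware, and the names here are made up (2 PCI buses per node, as mentioned earlier in the thread).

```python
# Hypothetical sketch of the bus -> node map described above. The table
# is hard-coded for illustration; real code would build it from firmware.

BUS_TO_NODE = {0: 0, 1: 0, 2: 1, 3: 1}  # bus number -> node number

def node_for_device(pci_bus_number):
    """Return the node a PCI device is local to, given its bus number."""
    return BUS_TO_NODE[pci_bus_number]

print(node_for_device(3))  # -> 1: so allocate its DMA buffers on node 1
```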

> that we haven't got stuff like graphics cards that have their own RAM we map
> into our address space. Or, for that matter, that physical memory mapping in
> one node makes anything directly accessible from another node.)

For me at least, all memory, whether mem-mapped to a card or not, is accessible
from everywhere. Port I/O I have to play some funky games for, but that's a
different story.

> Not just per processor. Think about a rackmount shared memory beowulf
> system, page faulting through the network. With quad-processor boards in
> each 1U, and BLAZINGLY FAST interconnects in the cluster. Now teach that to
> act like NUMA.

The interconnect would need hardware support for NUMA to do the cache
coherency and transparent remote memory access.

> Each 1U has four processors with identical performance (and probably one set
> of page tables if they share a northbridge). Assembling NUMA systems out of
> closely interconnected SMP systems.

That's exactly what the NUMA-Q systems are. More or less standard 4-way
Intel boxes, with an interconnect doing something like 10Gb/second with
reasonably low latency. Up to 16 quads = 64 processors. Current code is
limited to 32 procs on Linux.

> It may not be a driver issue. It may be a bus issue. If there are two PCI
> busses in the system, do they HAVE to share one set of physical memory
> mappings? (NUMA sort of implies we have more than one northbridge. Dontcha
> think we might have more than one southbridge, too?)

I fear I don't understand this. Not remembering what north vs south did again
(I did know once) probably isn't helping ;-) But yes, everyone shares the same
physical memory map, at least on NUMA-Q.

> It's not just 32/64 bit DMA. You're assuming every I/O device in the system
> is talking to exactly the same pool of memory. The core assumption of NUMA
> is that the processors aren't doing that, so I don't know why the I/O devices
> necessarily should. (Maybe they do, what do I know. It would be nice to
> hear from somebody with actual information...)

No, the core assumption of NUMA isn't that everyone's not talking to the same
pool of memory, it's that talking to different parts of the pool isn't done at a
uniform speed.

> And if they ARE all talking to one pool of memory, then the whole NUMA
> question becomes a bit easier, actually... The flood of zones we were so
> worried about (Node 18's processor sending packets through a network card
> living on node 13) can't really happen, can it?

You still want to allocate memory locally for performance reasons, even though
it'll work. My current port to NUMA-Q doesn't do that, and the performance
will probably get a lot better when it does. We still need to split different nodes'
memory into different zones (or find another similar solution).
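The allocate-locally-first policy described here can be modelled as a simple fallback search. This is a toy model, not kernel code: the free-page counts and node numbering are invented, and the point is only that each node tries its own memory before going over the interconnect.

```python
# Toy model of node-local allocation with remote fallback, as discussed
# above. All numbers are made up for illustration.

free_pages = {0: 10, 1: 0, 2: 5}  # node -> free page count

def alloc_page(local_node):
    """Prefer the local node; fall back to remote nodes if it is empty."""
    order = [local_node] + [n for n in free_pages if n != local_node]
    for node in order:
        if free_pages[node] > 0:
            free_pages[node] -= 1
            return node
    return None  # out of memory everywhere

print(alloc_page(1))  # -> 0: node 1 is full, page comes from a remote node
```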

>> > I always think of numa as the middle of a continuum. Zillion-way SMP
>> > with enormous L1 caches on each processor starts acting a bit like NUMA
>> > (you don't wanna go out of cache and fight the big evil memory bus if you
>> > can at all avoid it, and we're already worrying about process locality
>> > (processor affinity) to preserve cache state...).
>>
>> Kind of, except you can explicitly specify which bits of memory you want to
>> use, rather than the hardware working it out for you.
>
> Ummm...

I mean if you have a process running on node 1, you can tell it to allocate
memory on node 1 (or you could if the code was there ;-) ). Processes on
node 3 get memory on node 3, etc. In an "enormous L1 cache" the hardware
works out where to put things in the cache, not the OS.

> Is the memory bus somehow physically reconfiguring itself to make some chunk
> of memory lower or higher latency when talking to a given processor? I'm
> confused...

Each node has its own bank of RAM. If I access the RAM in another node,
I go over the interconnect, which is a damned sight slower than just going
over the local memory bus. The interconnect plugs into the local memory
bus of each node, and transparently routes requests around to other nodes
for you.

Think of it like a big IP-based network. Each node is a LAN, with its own
subnet. The interconnect is the router connecting the LANs. I can do 100Mbps
over the local LAN, but only 10Mbps through the router to remote LANs.

This email is far too big, and I have to go to a meeting. I'll reply to the rest
of it later ;-)

M.

2001-10-06 01:45:04

by Jesse Barnes

Subject: Re: NUMA & classzones (was Whining about 2.5)

These are some long messages... I'll do my best to reply w/respect to SGI
boxes. Maybe it's time I got it together and put one of our NUMA machines
on the 'net. A little background first though, for those that aren't
familiar with our NUMA arch. All this info should be available at
http://techpubs.sgi.com I think, but I'll briefly go over it here.

Our newest systems are made of 'bricks'. There are a few different types
of brick: c, r, i, p, d, etc. C-bricks are simply a collection of 0-4
cpus and some amount of memory. R-bricks are simply a collection of
NUMAlink ports. I-bricks have a few PCI busses and a couple of disks.
P-bricks have a bunch of PCI busses, and D-bricks have a bunch of disks.
Each brick has at least one IO port, C-bricks also have a NUMAlink port
that can be connected to other C-bricks or R-bricks. So remote memory
accesses have to go out a NUMAlink to an R-brick and in through another
NUMAlink on another C-brick on all but the smallest systems. P, and I
bricks are connected directly to C-bricks, while D-bricks are connected
via fibrechannel to SCSI cards on either P or I bricks.

On Thu, 4 Oct 2001, Rob Landley wrote:
> Somebody made a NUMA machine with an ISA bus? Wow. That's perverse. I'm
> impressed. (It was more a "when do we care" question...)

I certainly hope not!! It's bad enough that people have Pentium NUMA
machines; I don't envy the people that had to bring those up.

> Two points:
>
> 1) If you can DMA from anywhere to anywhere, it's one big zone, isn't it?
> Where does the NUMA come in? (I guess it's more expensive to DMA between
> certain devices/memory pages? Or are we talking sheer processor access
> latency here, nothing to do with devices at all...?)

On SGI, yes. We've got both MIPS and IPF (formerly known as IA64) NUMA
machines. Since both are 64 bit, things are much easier than with say, a
Pentium. PCI cards that are 64 bit DMA capable can read/write any memory
location on the machine. 32 bit cards can as well, with the help of our
PCI bridge. Ideally though, you'd like to DMA to/from devices that are
directly connected to the memory on their associated C-brick, otherwise
you've got to hop over other nodes to get to your memory (=higher latency,
possible bandwidth contention). I hope this answers your question.

> 2) A processor-centric view of memory zones is not the whole story. Look at
> the zones we have now. The difference between the ISA zone, the PCI zone,
> and high memory has nothing to do with the processor*. It's a question of
> which devices (which bus/bridge really) can talk to which pages. In current
> UP/SMP systems, the processor can talk to all of them pretty much equally.

You can think of all the NUMA systems I know of that way as well, but you
get higher performance if you're careful about which pages you talk to.

> So we need zones defined relative not just to processors (or groups of
> processors that have identical access profiles), but also defined relative to
> i/o devices and busses. Meaning zones may become a driver issue.
>
> This gets us back to the concept of "nodes". Groups of processors and
> devices that collectively have a similar view of the world, memory-wise. Is
> this a view of the problem that current NUMA thinking is using, or not?

Yup. We have pg_data_t for just that purpose, although it currently only
has information about memory, not total system topology (i.e. I/O devices,
CPUs, etc.).

> > But it probably doesn't make sense to define A,B, and C for each node. For
> > a start, we don't use ISA DMA (and probably no other NUMA box does
> > either). If HIGHMEM is the stuff above 900Mb or so (and assuming for a
> > moment that we have 1Gb per node), then we probably don't need
> > NORMAL+HIGHMEM for each node either.
> >
> > 0-900Mb = NORMAL (1A)
> > 900-1Gb = HIGHMEM_NODE1 (1B)
> > 1G-2Gb = HIGHMEM_NODE2 (2)
> > 2G-3Gb = HIGHMEM_NODE3 (3)
> > 3Gb-4Gb = HIGHMEM_NODE4 (4)
>
> By highmem you mean memory our I/O devices can't DMA out of?

On the NUMA-Q platform, probably (but I'm not sure since I've never
worked on one).

> Will all the I/O devices in the system share a single pool of buffer memory,
> or will devices be attached to nodes?

Both. At least for us.

> Not just per processor. Think about a rackmount shared memory beowulf
> system, page faulting through the network. With quad-processor boards in
> each 1U, and BLAZINGLY FAST interconnects in the cluster. Now teach that to
> act like NUMA.

Sounds familiar. It's much easier when your memory controllers are aware
of this fact though...

> It's not just 32/64 bit DMA. You're assuming every I/O device in the system
> is talking to exactly the same pool of memory. The core assumption of NUMA
> is that the processors aren't doing that, so I don't know why the I/O devices
> necessarily should. (Maybe they do, what do I know. It would be nice to
> hear from somebody with actual information...)

I think the core assumption is exactly the opposite, but you're correct if
you're talking about simple clusters.

> And if they ARE all talking to one pool of memory, then the whole NUMA
> question becomes a bit easier, actually... The flood of zones we were so
> worried about (Node 18's processor sending packets through a network card
> living on node 13) can't really happen, can it?

I guess it depends on the machine. On our machine, you could do that, but
it looks like the IBM machine would need bounce buffers for such
things. You might also need bounce buffers for 32 bit PCI cards for some
machines, since they might only be able to DMA to/from the first 4 GB of
memory.
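The bounce-buffer condition Jesse mentions for 32-bit PCI cards is just a range check: if any byte of the I/O buffer lies above what the card's DMA addressing can reach, the transfer must be staged through a low bounce buffer. A minimal sketch, with the 4Gb limit taken from the text and everything else illustrative:

```python
# Sketch of the bounce-buffer decision for a 32-bit PCI card, which can
# only address the first 4Gb of physical memory. Names are made up.

GB = 1 << 30
DMA32_LIMIT = 4 * GB  # highest physical address a 32-bit card can reach

def needs_bounce(buf_phys, buf_len, dma_limit=DMA32_LIMIT):
    """True if any byte of the buffer lies beyond the card's DMA reach."""
    return buf_phys + buf_len > dma_limit

print(needs_bounce(1 * GB, 4096))  # -> False, card can reach this buffer
print(needs_bounce(5 * GB, 4096))  # -> True, must bounce through low memory
```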

> I THOUGHT numa had a gradient, of "local, not as local, not very local at
> all, darn expensive" pages that differed from node to node, which would be a

That's pretty much right. But for most things it's not *too* bad to
optimize for, until you get into *huge* machines (e.g. 1024p, lots of
memory).

> major pain to optimize for yes. (I was thinking motherboard trace length and
> relaying stuff several hops down a bus...) But I haven't seen it yet. And
> even so, "not local=remote" seems to cover the majority of the cases without
> exponential complexity...

Yeah, luckily, you can assume local==remote and things will work, albeit
slowly (ask Ralf about forgetting to turn on CONFIG_NUMA on one of our
MIPS machines).

> That's what I've heard the most about. I'm also under the impression that
> SGI was working on NUMA stuff up around the origin line, and that sun had
> some monsters in the works as well...

AFAIK, Sun just has SMP machines, but they might have a NUMA one in the
pipe. And yes, we've had NUMA stuff for awhile, and recently got a 1024p
system running. There were some weird bottlenecks exposed by that one.

> People wanting to run straightforward multithreaded programs using shared
> memory and semaphores on big clusters strikes me as an understandable goal,
> and the drive for fast (low latency) interconnects to make that feasible is
> something I can see a good bang for the buck coming out of. Here's the
> hardware that's widely/easily/cheaply available, here's what programmers want
> to do with it. I can see that.
>
> The drive to support monster mainframes which are not only 1% of the market
> but which get totally redesigned every three or four years to stay ahead of
> moore's law... I'm not quite sure what's up there. How much of the market
> can throw that kind of money to constantly offset massive depreciation?
>
> Is the commodity hardware world going to inherit NUMA (via department level
> shared memory beowulf clusters, or just plain the hardware to do it getting
> cheap enough), or will it remain a niche application?

Maybe?

> > Incidentally, swapping on NUMA will need per-zone swapping even more,
> > so I don't see how we could do anything sensible for this without a
> > physical to virtual mem map. But maybe someone knows how.

I know Kanoj was talking about this awhile back; don't know if he ever
came up with any code though...

> The question is latency, not throughput. (Rambus did this too, more

Bingo! Latency is absolutely key to NUMA. If you have really bad
latency, you've basically got a cluster. The programming model is greatly
simplified though, as you mentioned above (i.e. shared mem multithreading
vs. MPI).

> Is the speed difference along a noticeably long gradient, or more "this group
> is fast, the rest is not so fast"?

Depends on the machine. I think IBM's machines have something like a 10:1
ratio of remote vs. local memory access latency, while SGI's latest have
something like 1.5:1. That's per-hop though, so big machines can be
pretty non-uniform.

I hope I've at least refrained from muddying the NUMA waters further with
my ramblings. I'll keep an eye on subjects with 'NUMA' or 'zone' though,
just so I can be more informed about these things.

Jesse

2001-10-08 18:01:21

by Martin J. Bligh

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

>> So would a workable (if naive) attempt to use Andrea's
>> memory-zones-grouped-into-classes approach on NUMA just involve making a
>> class/zone list for each node? (Okay, you've got to identify nodes, and
>> group together processors, bridges, DMAable devices, etc, but it seems like
>> that has to be done anyway, class/zone or not.) How does what people want to
>> do for NUMA improve on that?
>
> I fear it becomes an N! problem.
>
> I'd like to hear what Andrea has planned since without docs it's hard to
> speculate on how the 2.4.10 vm works anyway

Can you describe why it's N! ? Are you talking about the worst possible case,
or a two level local / non-local problem?

Thanks,

Martin.

2001-10-08 18:05:21

by Alan

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

> > speculate on how the 2.4.10 vm works anyway
>
> Can you describe why it's N! ? Are you talking about the worst possible case,
> or a two level local / non-local problem?

Worst possible. I don't think in reality it will be nearly that bad.

2001-10-08 18:24:10

by Martin J. Bligh

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

>> > speculate on how the 2.4.10 vm works anyway
>>
>> Can you describe why it's N! ? Are you talking about the worst possible case,
>> or a two level local / non-local problem?
>
> Worst possible. I don't think in reality it will be nearly that bad

The worst possible case I can conceive (in the future architectures
that I know of) is 4 different levels. I don't think the number of access
speed levels is ever related to the number of processors ?
(users of other NUMA architectures feel free to slap me at this point).

So I *think* the worst possible case is still linear (to number of nodes)
in terms of how many classzone type things we'd need? And the number
of classzone type things any given access would have to search through
for an access is constant? The number of zones searched would be
(worst case) linear to number of nodes?

As we're intending to code this real soon now, this is more than just idle
speculation for my own amusement ;-)

Thanks,

Martin.


2001-10-08 18:26:00

by Alan

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

> The worst possible case I can conceive (in the future architectures
> that I know of) is 4 different levels. I don't think the number of access
> speed levels is ever related to the number of processors ?
> (users of other NUMA architectures feel free to slap me at this point).

The classzone code seems to deal in combinations of memory zones, not in
specific zones. It lacks docs and the comments seem at best bogus and
from the old code, so I may be wrong. So it's relative weightings for
each combination of memory we might want to consider for each case.

Andrea ?

Alan

2001-10-08 18:36:30

by Jesse Barnes

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

On Mon, 8 Oct 2001, Martin J. Bligh wrote:

> The worst possible case I can conceive (in the future architectures
> that I know of) is 4 different levels. I don't think the number of access
> speed levels is ever related to the number of processors ?
> (users of other NUMA architectures feel free to slap me at this point).

So you're saying that at most any given node is 4 hops away from any
other for your arch?

> So I *think* the worst possible case is still linear (to number of nodes)
> in terms of how many classzone type things we'd need? And the number
> of classzone type things any given access would have to search through
> for an access is constant? The number of zones searched would be
> (worst case) linear to number of nodes?

That's how we have our stuff coded at the moment, but with classzones you
might be able to get that down even further. For instance, you could have
classzones that correspond to the number of hops a set of nodes is from a
given node. Having such classzones might make finding nearby memory
easier.
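Jesse's hop-count classzones can be sketched from a node-distance matrix: for one node, group all nodes by their hop distance, so "find nearby memory" becomes a walk through the groups in increasing-distance order. The matrix and function names below are invented for illustration.

```python
# Hypothetical sketch of hop-count classzones. The distance matrix is
# made up; real systems would get this from the platform topology.

HOPS = [  # HOPS[a][b] = hops from node a to node b
    [0, 1, 2, 2],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [2, 2, 1, 0],
]

def classzones_for(node):
    """Return {hop_count: [nodes at that distance]} for one node."""
    groups = {}
    for other, hops in enumerate(HOPS[node]):
        groups.setdefault(hops, []).append(other)
    return groups

print(classzones_for(0))  # -> {0: [0], 1: [1], 2: [2, 3]}
```

An allocator could then try the hop-0 group first, then hop-1, and so on.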

Jesse

2001-10-08 18:58:36

by Martin J. Bligh

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

>> The worst possible case I can conceive (in the future architectures
>> that I know of) is 4 different levels. I don't think the number of access
>> speed levels is ever related to the number of processors ?
>> (users of other NUMA architectures feel free to slap me at this point).
>
> So you're saying that at most any given node is 4 hops away from any
> other for your arch?

For the current architecture (well, for NUMA-Q) it's 0 or 1. For future
architectures, there will be more (forgive me for deliberately not being
specific ... I'd have to ask for more blessing first). Up to about 4. Ish.

Depending on how much extra latency each hop introduces, it may well
not be worth adding the complexity of differentiating beyond local vs
remote? At least at first ...

Do you know how many hops SGI can get, and how much extra latency
you introduce? I know we're something like 10:1 ratio at the moment
between local and remote.

I guess my main point was that the number of levels was more like constant
than linear. Maybe for large interconnected switched systems with small
switches, it's n log n, but in practice I think log n is small enough to be
considered constant (the number of levels of switches).

>> So I *think* the worst possible case is still linear (to number of nodes)
>> in terms of how many classzone type things we'd need? And the number
>> of classzone type things any given access would have to search through
>> for an access is constant? The number of zones searched would be
>> (worst case) linear to number of nodes?
>
> That's how we have our stuff coded at the moment, but with classzones you
> might be able to get that down even further. For instance, you could have
> classzones that correspond to the number of hops a set of nodes is from a
> given node. Having such classzones might make finding nearby memory easier.

That's what I was planning on ... we'd need m x n classzones, where m
was the number of levels, and n the number of nodes. Each search would
obviously be through m classzones. I'll go poke at the current code some more.

M.

2001-10-08 19:10:58

by Marcelo Tosatti

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]



On Mon, 8 Oct 2001, Martin J. Bligh wrote:

> >> The worst possible case I can conceive (in the future architectures
> >> that I know of) is 4 different levels. I don't think the number of access
> >> speed levels is ever related to the number of processors ?
> >> (users of other NUMA architectures feel free to slap me at this point).
> >
> > So you're saying that at most any given node is 4 hops away from any
> > other for your arch?
>
> For the current architecture (well, for NUMA-Q) it's 0 or 1. For future
> architectures, there will be more (forgive me for deliberately not being
> specific ... I'd have to ask for more blessing first). Up to about 4. Ish.
>
> Depending on how much extra latency each hop introduces, it may well
> not be worth adding the complexity of differentiating beyond local vs
> remote? At least at first ...
>
> Do you know how many hops SGI can get, and how much extra latency
> you introduce? I know we're something like 10:1 ratio at the moment
> between local and remote.
>
> I guess my main point was that the number of levels was more like constant
> than linear. Maybe for large interconnected switched systems with small
> switches, it's n log n, but in practice I think log n is small enough to be
> considered constant (the number of levels of switches).
>
> >> So I *think* the worst possible case is still linear (to number of nodes)
> >> in terms of how many classzone type things we'd need? And the number
> >> of classzone type things any given access would have to search through
> >> for an access is constant? The number of zones searched would be
> >> (worst case) linear to number of nodes?
> >
> > That's how we have our stuff coded at the moment, but with classzones you
> > might be able to get that down even further. For instance, you could have
> > classzones that correspond to the number of hops a set of nodes is from a
> > given node. Having such classzones might make finding nearby memory easier.
>
> That's what I was planning on ... we'd need m x n classzones, where m
> was the number of levels, and n the number of nodes. Each search would
> obviously be through m classzones. I'll go poke at the current code some more.

You say "numbers of levels" as in each level being a given number of nodes
on that "level" distance ?

2001-10-08 19:13:08

by Jesse Barnes

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

On Mon, 8 Oct 2001, Martin J. Bligh wrote:

> Depending on how much extra latency each hop introduces, it may well
> not be worth adding the complexity of differentiating beyond local vs
> remote? At least at first ...

Well, there's already some code to do that (mm/numa.c), but I'm not sure
how applicable it will be to your arch.

> Do you know how many hops SGI can get, and how much extra latency
> you introduce? I know we're something like 10:1 ratio at the moment
> between local and remote.

I think we're something like 1.5:1, and we have machines with up to 256
nodes at the moment, so there can be quite a few hops in the worst case.

> I guess my main point was that the number of levels was more like constant
> than linear. Maybe for large interconnected switched systems with small
> switches, it's n log n, but in practice I think log n is small enough to be
> considered constant (the number of levels of switches).

Depends on how big your node count gets I guess.

> That's what I was planning on ... we'd need m x n classzones, where m
> was the number of levels, and n the number of nodes. Each search would
> obviously be through m classzones. I'll go poke at the current code some more.

Yeah, classzones is one way to go about this. There are some other simple
ways to do nearest node allocation though, given the current
codebase. I'm still trying to figure out which is the most flexible.

Jesse

2001-10-08 19:24:51

by Martin J. Bligh

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

>> That's what I was planning on ... we'd need m x n classzones, where m
>> was the number of levels, and n the number of nodes. Each search would
>> obviously be through m classzones. I'll go poke at the current code some more.
>
> You say "numbers of levels" as in each level being a given number of nodes
> on that "level" distance ?

Yes.

For example, if the only different access speeds you have were "on the local
node" vs "on another node", and access times to all *other* nodes were the
same, you'd have 2 levels.

If you have "on the local node" (10 ns) vs "on any node 1 hop away" (100ns),
"on any node 2 hops away" (110ns), that'd be 3 levels. (latency numbers picked
out of my portable random number generator ;-) ).

If the latencies on a 4-level system turn out to be 10, 100, 101, 102 then it's only
going to be worth defining 2 levels. If they turn out to be 10, 100, 1000, 10000,
then it'll (probably) be worth doing 4 ....
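Martin's rule of thumb here amounts to collapsing measured hop latencies into levels only where the gap is big enough to matter. A sketch, using his example numbers; the 1.5x merge threshold is an arbitrary illustrative cutoff, not something from the thread:

```python
# Collapse a sorted list of per-hop latencies into distinct "levels",
# merging consecutive latencies within `ratio` of each other. The 1.5x
# cutoff is an assumption made for illustration.

def count_levels(latencies, ratio=1.5):
    """Count latency levels worth distinguishing."""
    levels = 1
    for prev, cur in zip(latencies, latencies[1:]):
        if cur / prev >= ratio:
            levels += 1
    return levels

print(count_levels([10, 100, 101, 102]))     # -> 2: only local vs remote
print(count_levels([10, 100, 1000, 10000]))  # -> 4: every hop matters
```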

M.

2001-10-08 19:39:03

by Peter Rival

Subject: Re: Whining about NUMA. :) [Was whining about 2.5...]

Just to put in my $0.02 on this... Compaq systems will span the range
on this. The current Wildfire^WGS Series systems have two levels -
either "local" or "remote", which is just under 3:1 latency vs. local.
This is all public knowledge, if you care to dig through all the docs. ;)

With the new EV7 systems coming out soon (next year?) every CPU has a
switch and memory controller built in, so as you add CPUs (up to 64) you
potentially add levels of latency. I can't say what they are, but the
numbers I've been given so far are _much_ better than that. Just
another data point. :)

- Pete