Are there some i686 SMP systems with more than 12 GB RAM out there?
Is there a known problem with the 2.4.x kernel and such systems?
I have the following problem:
An Intel SRPM8 server with 32 GB RAM, RH 7.2 and kernel 2.4.17 on it.
After a lot of disc access the system slows down and the OOM
killer begins its work (only after running some cp processes),
because the system is running out of low memory.
Disabling the OOM killer has no effect; the low memory goes to 0
and the system dies.
It looks as if the buffer_heads fill up the low memory, regardless
of how much memory is available overall, as long as there is
sufficient high memory for caching.
Any ideas?
Here is some information from /proc/meminfo and /proc/slabinfo a few
seconds before dying:
total: used: free: shared: buffers: cached:
Mem: 33781227520 10787348480 22993879040 0 21807104 10447962112
Swap: 0 0 0
MemTotal: 32989480 kB
MemFree: 22454960 kB
MemShared: 0 kB
Buffers: 21296 kB
Cached: 10203088 kB
SwapCached: 0 kB
Active: 27248 kB
Inact_dirty: 10206312 kB
Inact_clean: 0 kB
Inact_target: 2046712 kB
HighTotal: 32636928 kB
HighFree: 22424128 kB
LowTotal: 352552 kB
LowFree: 30832 kB
SwapTotal: 0 kB
SwapFree: 0 kB
slabinfo - version: 1.1 (statistics) (SMP)
kmem_cache 112 112 284 8 8 1 : 112 112 8 0 0 : 124 62 : 12 9 0 0
ip_fib_hash 145 145 24 1 1 1 : 145 145 1 0 0 : 252 126 : 8 3 0 0
urb_priv 0 0 56 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
journal_head 755 11591 56 49 173 1 : 11591 1255589 173 0 0 : 252 126 : 1271927 10219 1272001 9959
revoke_table 169 169 20 1 1 1 : 169 169 1 0 0 : 252 126 : 0 3 0 0
revoke_record 126 145 24 1 1 1 : 126 126 1 0 0 : 252 126 : 2 2 3 0
clip_arp_cache 0 0 124 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
ip_mrt_cache 0 0 84 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
tcp_tw_bucket 0 0 128 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
tcp_bind_bucket 561 580 24 4 4 1 : 561 561 4 0 0 : 252 126 : 4 11 0 0
tcp_open_request 252 252 92 6 6 1 : 252 252 6 0 0 : 252 126 : 0 12 6 0
inet_peer_cache 0 0 48 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
ip_dst_cache 176 176 176 8 8 1 : 176 176 8 0 0 : 252 126 : 40 16 22 0
arp_cache 64 64 120 2 2 1 : 64 64 2 0 0 : 252 126 : 0 4 1 0
blkdev_requests 1056 1056 88 24 24 1 : 1056 1056 24 0 0 : 252 126 : 1128 48 256 0
dnotify cache 0 0 28 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
file lock cache 312 312 100 8 8 1 : 312 312 8 0 0 : 252 126 : 19870 16 19876 0
fasync cache 0 0 24 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
uid_cache 339 339 32 3 3 1 : 339 339 3 0 0 : 252 126 : 1 6 1 0
skbuff_head_cache 2438 2438 168 106 106 1 : 2438 3950 106 0 0 : 252 126 : 3943 224 3035 12
sock 129 129 1280 43 43 1 : 129 189 43 0 0 : 60 30 : 435 87 443 2
sigqueue 252 252 140 9 9 1 : 252 252 9 0 0 : 252 126 : 1111 18 1118 0
cdev_cache 702 702 48 9 9 1 : 702 702 9 0 0 : 252 126 : 145 18 5 0
bdev_cache 354 354 64 6 6 1 : 354 354 6 0 0 : 252 126 : 20 12 23 0
mnt_cache 224 224 68 4 4 1 : 224 224 4 0 0 : 252 126 : 7 7 2 0
inode_cache 2224 2224 488 278 278 1 : 3872 4638 486 0 0 : 124 62 : 73933 984 72421 23
dentry_cache 2732 3828 116 116 116 1 : 5181 7701 157 0 0 : 252 126 : 76421 333 74349 21
dquot 0 0 112 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
filp 578 578 112 17 17 1 : 578 578 17 0 0 : 252 126 : 397 34 0 0
names_cache 18 18 4096 18 18 1 : 18 18 18 0 0 : 60 30 : 66979 36 66997 0
buffer_head 2557740 2559882 104 69151 69186 1 : 2559882 3919926 69186 0 0 : 252 126 : 5029666 149166 2542612 10811
mm_struct 216 216 144 8 8 1 : 216 216 8 0 0 : 252 126 : 1345 16 1317 0
vm_area_struct 1724 1850 76 37 37 1 : 1850 7772 37 0 0 : 252 126 : 53909 121 53030 48
fs_cache 588 588 44 7 7 1 : 588 588 7 0 0 : 252 126 : 1347 13 1320 0
files_cache 171 171 424 19 19 1 : 171 171 19 0 0 : 124 62 : 1335 37 1320 0
signal_act 162 162 1312 54 54 1 : 162 162 54 0 0 : 60 30 : 1305 103 1317 0
pae_pgd 791 791 32 7 7 1 : 791 791 7 0 0 : 252 126 : 1346 14 1317 0
size-131072(DMA) 0 0 131072 0 0 32 : 0 0 0 0 0 : 0 0 : 0 0 0 0
size-131072 0 0 131072 0 0 32 : 0 0 0 0 0 : 0 0 : 0 0 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0 0 0 0 : 0 0 : 0 0 0 0
size-65536 0 0 65536 0 0 16 : 0 0 0 0 0 : 0 0 : 0 0 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0 0 0 0 : 0 0 : 0 0 0 0
size-32768 1 1 32768 1 1 8 : 1 2 1 0 0 : 0 0 : 0 0 0 0
size-16384(DMA) 1 1 16384 1 1 4 : 1 1 1 0 0 : 0 0 : 0 0 0 0
size-16384 1 1 16384 1 1 4 : 1 1 1 0 0 : 0 0 : 0 0 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0 0 0 0 : 0 0 : 0 0 0 0
size-8192 2 3 8192 2 3 2 : 3 41 3 0 0 : 0 0 : 0 0 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 0 0 0 0 0 : 60 30 : 0 0 0 0
size-4096 207 237 4096 207 237 1 : 237 597 237 0 0 : 60 30 : 750 486 946 13
size-2048(DMA) 0 0 2048 0 0 1 : 0 0 0 0 0 : 60 30 : 0 0 0 0
size-2048 368 428 2048 194 214 1 : 428 2198 214 0 0 : 60 30 : 3169 487 3326 61
size-1024(DMA) 0 0 1024 0 0 1 : 0 0 0 0 0 : 124 62 : 0 0 0 0
size-1024 448 448 1024 112 112 1 : 448 448 112 0 0 : 124 62 : 1012 157 890 0
size-512(DMA) 0 0 512 0 0 1 : 0 0 0 0 0 : 124 62 : 0 0 0 0
size-512 520 520 512 65 65 1 : 520 520 65 0 0 : 124 62 : 460 115 116 0
size-256(DMA) 0 0 264 0 0 1 : 0 0 0 0 0 : 124 62 : 0 0 0 0
size-256 630 630 264 42 42 1 : 630 940 42 0 0 : 124 62 : 5914 82 5810 5
size-128(DMA) 28 28 136 1 1 1 : 28 28 1 0 0 : 252 126 : 0 2 0 0
size-128 868 868 136 31 31 1 : 868 869 31 0 0 : 252 126 : 2589 45 2374 0
size-64(DMA) 0 0 72 0 0 1 : 0 0 0 0 0 : 252 126 : 0 0 0 0
size-64 583 583 72 11 11 1 : 583 583 11 0 0 : 252 126 : 1200 19 975 0
size-32(DMA) 92 92 40 1 1 1 : 92 92 1 0 0 : 252 126 : 16 2 0 0
size-32 1384 2392 40 22 26 1 : 2392 2526 26 0 0 : 252 126 : 2569647 52 2568729 9
Regards,
Harald Holzer
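As a rough cross-check of the buffer_head theory (reading the slabinfo columns as active objects, total objects and object size, as in slabinfo version 1.1), the buffer_head cache alone accounts for most of the ~352 MB of LowTotal reported above:

/* buffer_head line above: 2559882 total objects of 104 bytes each */
#include <stdio.h>

int main(void)
{
    unsigned long total_objs = 2559882;
    unsigned long obj_size   = 104;

    printf("buffer_head slab: ~%lu MB\n",
           total_objs * obj_size >> 20);   /* prints roughly 253 MB */
    return 0;
}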
> Are there some i686 SMP systems with more than 12 GB RAM out there?
Very very few.
> Is there a known problem with the 2.4.x kernel and such systems?
Several 8)
Hardware limits:
- 36bit addressing mode on x86 processors is slower
- Many device drivers can't handle >32-bit DMA
- The CPU can't efficiently map all that memory at once
Software:
- The block I/O layer doesn't cleanly handle large systems
- The page struct is too big, which puts undue load on the
memory that the CPU can map
- We don't discard page tables when we can and should
- We should probably switch to a larger virtual page size
on big machines.
The ones that actually bite hard are the block I/O layer and the page
struct size. Making the block layer handle its part well is a 2.5 thing.
> It looks as if the buffer_heads fill up the low memory, regardless
> of how much memory is available overall, as long as there is
> sufficient high memory for caching.
That may well be happening. The Red Hat-supplied 7.2 and 7.2 errata kernels
were tested on 8GB; I don't know about anything larger.
Because much of the memory cannot be used for kernel objects there is an
imbalance in available resources and it's very hard to balance them sanely.
I'm not sure how many 8Gb+ machines Andrea has handy to tune the VM on
either.
Alan
On Saturday, 29. December 2001 18:45, Alan cox wrote:
> [snip]
> Because much of the memory cannot be used for kernel objects there is an
> imbalance in available resources and its very hard to balance them sanely.
> I'm not sure how many 8Gb+ machines Andrea has handy to tune the VM on
> either.
I think Andrea has access to some. Maybe SAP?
Have you tried with 2.4.17rc2aa2?
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17rc2aa2.bz2
The 10_vm-21 part applies clean to 2.4.17 (final), too.
I have it running without a glitch, but sadly on a way smaller (much
smaller) system... ;-)
Regards,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
On Sat, 29 Dec 2001, Alan Cox wrote:
> Because much of the memory cannot be used for kernel objects there is
> an imbalance in available resources and its very hard to balance them
> sanely. I'm not sure how many 8Gb+ machines Andrea has handy to tune
> the VM on either.
Along those lines -- I have in front of me the source to
"/linux/mm/page_alloc.c" (2.4.17 kernel) which reads (partially)
lines 29-32:
static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
static int zone_balance_ratio[MAX_NR_ZONES] __initdata = { 128, 128, 128, };
static int zone_balance_min[MAX_NR_ZONES] __initdata = { 20 , 20, 20, };
static int zone_balance_max[MAX_NR_ZONES] __initdata = { 255 , 255, 255, };
lines 718-725:
mask = (realsize / zone_balance_ratio[j]);
if (mask < zone_balance_min[j])
mask = zone_balance_min[j];
else if (mask > zone_balance_max[j])
mask = zone_balance_max[j];
zone->pages_min = mask;
zone->pages_low = mask*2;
zone->pages_high = mask*3;
What it *looks* like the programmer (Andrea??) intended was to make the
watermarks proportional to the amount of memory in each zone. So for the
dma, normal and highmem zones, one would have 1/128th of the amount of
memory as "min", 1/64th as "low" and 3/128th as "high". Leaving aside
any debate over whether these are appropriate values or not and whether
or not "free memory is wasted memory", what in fact appears to be
happening is that the "else if" clause is limiting "min" to 255 pages
(about a megabyte on i386), and "low" and "high" to 2 and 3 megabytes
respectively.
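To make the clamp concrete, here is a minimal user-space re-run of the arithmetic quoted above, fed with the LowTotal figure from the /proc/meminfo dump earlier in the thread (only an approximation of the zone's realsize):

#include <stdio.h>

int main(void)
{
    unsigned long realsize = 352552UL / 4;  /* LowTotal 352552 kB -> 4 kB pages */
    unsigned long mask = realsize / 128;    /* zone_balance_ratio */

    if (mask < 20)                          /* zone_balance_min */
        mask = 20;
    else if (mask > 255)                    /* zone_balance_max: the clamp */
        mask = 255;

    printf("pages_min  = %lu (%lu kB)\n", mask,     mask * 4);
    printf("pages_low  = %lu (%lu kB)\n", mask * 2, mask * 2 * 4);
    printf("pages_high = %lu (%lu kB)\n", mask * 3, mask * 3 * 4);
    return 0;
}

Without the "else if" clamp, mask would come out at 688 pages (about 2.7 MB) instead of 255.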
Could someone with a big box and a benchmark that drives it out of free
memory please try commenting out the "else if" clause and see if it
makes a difference? I tried this on my puny 512 MB Athlon and verified
that the right values were there with "sysrq", but I don't have anything
bigger to try it on and I don't have a benchmark to test it with either.
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
On Sat, 29 Dec 2001, M. Edward (Ed) Borasky wrote:
> On Sat, 29 Dec 2001, Alan Cox wrote:
>
> > Because much of the memory cannot be used for kernel objects there
> > is an imbalance in available resources and its very hard to balance
> > them sanely. I'm not sure how many 8Gb+ machines Andrea has handy
> > to tune the VM on either.
>
> Along those lines -- I have in front of me the source to
> "/linux/mm/page_alloc.c" (2.4.17 kernel) which reads (partially)
[snip]
> Could someone with a big box and a benchmark that drives it out of
> free memory please try commenting out the "else if" clause and see if
> it makes a difference? I tried this on my puny 512 MB Athlon and
> verified that the right values were there with "sysrq", but I don't
> have anything bigger to try it on and I don't have a benchmark to test
> it with either.
And here it is as a patch against 2.4.17:
diff -ur linux/mm/page_alloc.c linux-2.4.17znmeb/mm/page_alloc.c
--- linux/mm/page_alloc.c Mon Nov 19 16:35:40 2001
+++ linux-2.4.17znmeb/mm/page_alloc.c Sat Dec 29 16:04:25 2001
@@ -718,8 +718,13 @@
mask = (realsize / zone_balance_ratio[j]);
if (mask < zone_balance_min[j])
mask = zone_balance_min[j];
+ /* else if clause commented out for testing
+ * M. Edward Borasky, Borasky Research
+ * 2001-12-29
+ *
else if (mask > zone_balance_max[j])
mask = zone_balance_max[j];
+ */
zone->pages_min = mask;
zone->pages_low = mask*2;
zone->pages_high = mask*3;
Apologies if pine with vim as the editor messes this puppy up :-).
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
I brought my inner child to "Take Your Child To Work Day."
I tested your suggestion with a 2.4.17rc2aa2 kernel,
but it didn't help.
The system dies after copying more than 6-8 GB of data.
Here are the last lines from /var/log/messages:
Dec 30 01:47:32 localhost kernel: __alloc_pages: 0-order allocation failed (gfp=0x70/0)
Dec 30 01:47:32 localhost kernel: __alloc_pages: 0-order allocation failed (gfp=0x70/0)
Dec 30 01:47:32 localhost kernel: __alloc_pages: 0-order allocation failed (gfp=0x1f0/0)
Dec 30 01:47:32 localhost kernel: __alloc_pages: 0-order allocation failed (gfp=0x70/0)
Dec 30 01:47:32 localhost kernel: __alloc_pages: 0-order allocation failed (gfp=0xf0/0)
Dec 30 01:47:32 localhost kernel: __alloc_pages: 0-order allocation failed (gfp=0x70/0)
Dec 30 01:47:32 localhost last message repeated 3 times
Dec 30 01:47:32 localhost kernel: __alloc_pages: 0-order allocation failed (gfp=0x1f0/0)
I started searching in linux/fs/buffer.c and found the
following interesting lines (around line 1450):
void create_empty_buffers(struct page *page, kdev_t dev, unsigned long blocksize)
{
struct buffer_head *bh, *head, *tail;
/* FIXME: create_buffers should fail if there's no enough memory */
head = create_buffers(page, blocksize, 1);
if (page->buffers)
BUG();
bh = head;
Could the create_buffers function cause this problem?
Harald Holzer
On Sun, 2001-12-30 at 01:25, M. Edward (Ed) Borasky wrote:
> [snip]
Hmmm ... 0 order allocation failures ... is that new *after* my patch or
were you getting those before? Maybe we've *moved* the problem??
--
Take Your Trading to the Next Level!
M. Edward Borasky, Meta-Trading Coach
[email protected]
http://www.meta-trading-coach.com
http://groups.yahoo.com/group/meta-trading-coach
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Harald Holzer
> Sent: Saturday, December 29, 2001 6:15 PM
> To: M. Edward (Ed) Borasky; [email protected]
> Subject: Re: i686 SMP systems with more then 12 GB ram with 2.4.x kernel
> ?
>
>
> [snip]
>
>
> Because much of the memory cannot be used for kernel objects there is an
> imbalance in available resources and its very hard to balance them sanely.
As I understand it, in a Linux / i686 system, there are three zones: DMA
(0 - 2^24-1), low (2^24 - 2^30-1) and high (2^30 and up). And the hardware
(PAE) apparently distinguishes memory addresses above 2^32-1 as well.
Questions:
1. Shouldn't there be *four* zones: (DMA, low, high and PAE)?
2. Isn't the boundary at 2^30 really irrelevant and the three "correct"
zones are (0 - 2^24-1), (2^24 - 2^32-1) and (2^32 - 2^36-1)?
3. On a system without ISA DMA devices, can DMA and low be merged into a
single zone?
4. It's pretty obvious exactly which functions require memory under 2^24 --
ISA DMA. But exactly which functions require memory under 2^30 and which
functions require memory under 2^32? It seems relatively easy to write a
Perl script to truck through the kernel source and figure this out; has
anyone done it? It would seem to me a valuable piece of information -- what
the demands are for the relatively precious areas of memory under 1 GB and
under 4 GB.
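For illustration only, a hedged sketch of the 2.4-era allocator flags (kernel context, not a standalone program; the function name here is made up for the example): the "which functions need which zone" question largely reduces to which GFP flags a caller passes.

#include <linux/mm.h>
#include <linux/slab.h>

static void zone_flag_examples(void)
{
	/* Must come from the <16 MB ISA DMA zone. */
	void *isa_buf = kmalloc(512, GFP_KERNEL | GFP_DMA);

	/* Ordinary kernel object: must live in directly mapped low memory. */
	void *low_obj = kmalloc(512, GFP_KERNEL);

	/* User/page-cache page: may come from highmem, mapped only on demand. */
	struct page *page = alloc_page(GFP_HIGHUSER);

	if (isa_buf)
		kfree(isa_buf);
	if (low_obj)
		kfree(low_obj);
	if (page)
		__free_page(page);
}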
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
> 1. Shouldn't there be *four* zones: (DMA, low, high and PAE)?
Probably not. PAE isn't special. With PAE you pay the page table penalties
for all RAM.
> 2. Isn't the boundary at 2^30 really irrelevant and the three "correct"
> zones are (0 - 2^24-1), (2^24 - 2^32-1) and (2^32 - 2^36-1)?
Nope. The limit for directly mapped memory is 2^30.
> 3. On a system without ISA DMA devices, can DMA and low be merged into a
> single zone?
Rarely. PCI vendors are not exactly angels when it comes to implementing all
32 bits of a DMA transfer.
> > 2. Isn't the boundary at 2^30 really irrelevant and the three "correct"
> > zones are (0 - 2^24-1), (2^24 - 2^32-1) and (2^32 - 2^36-1)?
>
> Nope. The limit for directly mapped memory is 2^30.
Ouch! That makes low memory *extremely* precious. Intuitively, the demand
for directly mapped memory will be proportional to the demand for all
memory, with a proportionality constant depending on the purpose for the
system and the efficiency of the application set. We've seen (apparently --
I haven't looked at any data, just messages on the list) cases where we can
force this to happen with benchmarks designed to embarrass the VM :)) but
have we seen it in real applications?
Thanks for taking the time to answer these questions. I'm struggling to
understand where the performance walls are in large i686 systems, in both
Linux and Windows. In the end, though, relentless application of Moore's Law
to the IA64 must be the correct answer :)).
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
Followup to: <[email protected]>
By author: Alan Cox <[email protected]>
In newsgroup: linux.dev.kernel
>
> > 2. Isn't the boundary at 2^30 really irrelevant and the three "correct"
> > zones are (0 - 2^24-1), (2^24 - 2^32-1) and (2^32 - 2^36-1)?
>
> Nope. The limit for directly mapped memory is 2^30.
>
2^30-2^27 to be exact (assuming a 3:1 split and 128MB vmalloc zone.)
-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>
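For anyone following along, the 2^30 - 2^27 figure falls out of the numbers mentioned in this subthread (3:1 split and a 128 MB vmalloc reserve; both taken here as assumptions, not read from a particular tree):

#include <stdio.h>

int main(void)
{
    unsigned long long page_offset     = 0xC0000000ULL;        /* 3:1 split */
    unsigned long long vmalloc_reserve = 128ULL << 20;         /* 128 MB    */
    unsigned long long kernel_va  = (1ULL << 32) - page_offset;
    unsigned long long direct_map = kernel_va - vmalloc_reserve;

    printf("kernel virtual space : %llu MB\n", kernel_va >> 20);   /* 1024 */
    printf("directly mapped limit: %llu MB\n", direct_map >> 20);  /*  896 */
    return 0;
}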
The OSDL can get you access to this size of machine.
Tim
On Sat, 2001-12-29 at 10:18, Harald Holzer wrote:
> [snip]
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)
Today I checked some memory configurations and noticed that the low
memory decreases when I add more memory to the system,
and the size of reserved memory increases:
at 1GB ram, 16,936kB low mem are reserved.
4GB ram, 72,824kB reserved
8GB ram, 142,332kB reserved
16GB ram, 269,424kB reserved
32GB ram, 532,080kB reserved, usable low mem: 352 MB
64GB ram ??
Which function does the reserved memory fulfill ?
Is it all for paging ?
Harald Holzer
Memory related startup messages at 32GB:
Jan 1 15:56:05 localhost kernel: Linux version 2.4.17-64g (root@bigbox) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #1 SMP Tue Jan 1 14:19:36 CET 2002
Jan 1 15:56:05 localhost kernel: BIOS-provided physical RAM map:
Jan 1 15:56:05 localhost kernel: BIOS-e820: 0000000000000000 - 000000000009d800 (usable)
Jan 1 15:56:05 localhost kernel: BIOS-e820: 000000000009d800 - 00000000000a0000 (reserved)
Jan 1 15:56:05 localhost kernel: BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
Jan 1 15:56:05 localhost kernel: BIOS-e820: 0000000000100000 - 0000000003ff8000 (usable)
Jan 1 15:56:05 localhost kernel: BIOS-e820: 0000000003ff8000 - 0000000003fffc00 (ACPI data)
Jan 1 15:56:05 localhost kernel: BIOS-e820: 0000000003fffc00 - 0000000004000000 (ACPI NVS)
Jan 1 15:56:05 localhost kernel: BIOS-e820: 0000000004000000 - 00000000f0000000 (usable)
Jan 1 15:56:05 localhost kernel: BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
Jan 1 15:56:05 localhost kernel: BIOS-e820: 0000000100000000 - 0000000810000000 (usable)
Jan 1 15:56:05 localhost kernel: 32128MB HIGHMEM available.
Jan 1 15:56:05 localhost kernel: found SMP MP-table at 000f65d0
Jan 1 15:56:05 localhost kernel: hm, page 000f6000 reserved twice.
Jan 1 15:56:05 localhost kernel: hm, page 000f7000 reserved twice.
Jan 1 15:56:05 localhost kernel: hm, page 0009d000 reserved twice.
Jan 1 15:56:05 localhost kernel: hm, page 0009e000 reserved twice.
Jan 1 15:56:05 localhost kernel: On node 0 totalpages: 8454144
Jan 1 15:56:05 localhost kernel: zone(0): 4096 pages.
Jan 1 15:56:05 localhost kernel: zone(1): 225280 pages.
Jan 1 15:56:05 localhost kernel: zone(2): 8224768 pages.
Jan 1 15:56:05 localhost kernel: Intel MultiProcessor Specification v1.4
Jan 1 15:56:05 localhost kernel: Virtual Wire compatibility mode.
Jan 1 15:56:05 localhost kernel: OEM ID: INTEL Product ID: SPM8 APIC at: 0xFEE00000
Jan 1 15:56:05 localhost kernel: Processor #7 Pentium(tm) Pro APIC version 17
Jan 1 15:56:05 localhost kernel: Processor #0 Pentium(tm) Pro APIC version 17
Jan 1 15:56:05 localhost kernel: Processor #1 Pentium(tm) Pro APIC version 17
Jan 1 15:56:05 localhost kernel: Processor #2 Pentium(tm) Pro APIC version 17
Jan 1 15:56:05 localhost kernel: Processor #3 Pentium(tm) Pro APIC version 17
Jan 1 15:56:05 localhost kernel: Processor #4 Pentium(tm) Pro APIC version 17
Jan 1 15:56:05 localhost kernel: Processor #5 Pentium(tm) Pro APIC version 17
Jan 1 15:56:05 localhost kernel: Processor #6 Pentium(tm) Pro APIC version 17
Jan 1 15:56:05 localhost kernel: I/O APIC #8 Version 19 at 0xFEC00000.
Jan 1 15:56:05 localhost kernel: Processors: 8
Jan 1 15:56:05 localhost kernel: Kernel command line: BOOT_IMAGE=linux-17-64g ro root=802 BOOT_FILE=/boot/vmlinuz-2.4.17-64g console=ttyS0,38400
Jan 1 15:56:05 localhost kernel: Initializing CPU#0
Jan 1 15:56:05 localhost kernel: Detected 700.082 MHz processor.
Jan 1 15:56:05 localhost kernel: Console: colour VGA+ 80x25
Jan 1 15:56:05 localhost kernel: Calibrating delay loop... 1395.91 BogoMIPS
-->Jan 1 15:56:05 localhost kernel: Memory: 33021924k/33816576k available (1081k kernel code, 532080k reserved, 290k data, 248k init, 32636928k highmem)
Jan 1 15:56:05 localhost kernel: Dentry-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Jan 1 15:56:05 localhost kernel: Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Jan 1 15:56:05 localhost kernel: Mount-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Jan 1 15:56:05 localhost kernel: Buffer-cache hash table entries: 524288 (order: 9, 2097152 bytes)
Jan 1 15:56:05 localhost kernel: Page-cache hash table entries: 524288 (order: 9, 2097152 bytes)
> 16GB ram, 269,424kB reserved
> 32GB ram, 532,080kB reserved, usable low mem: 352 MB
> 64GB ram ??
64GB you can basically forget.
> Which function does the reserved memory fulfill ?
> Is it all for paging ?
A lot of it is the page structs (64 bytes per page, which really should be
nearer the 32 that some rival Unix OSes achieve on x86).
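That matches the 32GB boot log above quite closely, assuming one 64-byte struct page per 4kB page frame:

#include <stdio.h>

int main(void)
{
    unsigned long totalpages = 8454144;           /* "On node 0 totalpages" */
    unsigned long mem_map_kb = totalpages * 64 / 1024;

    /* ~528384 kB of the 532080 kB the kernel reports as reserved */
    printf("mem_map[] at 64 bytes/page: %lu kB\n", mem_map_kb);
    return 0;
}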
On Thu, Jan 03, 2002 at 12:50:50AM +0100, Harald Holzer wrote:
> Today i checked some memory configurations and noticed that the low
> memory decreases, when i add more memory to the system,
> and the size of reserved memory increases:
>
> at 1GB ram, are 16,936kB low mem reserved.
> 4GB ram, 72,824kB reserved
> 8GB ram, 142,332kB reserved
> 16GB ram, 269,424kB reserved
> 32GB ram, 532,080kB reserved, usable low mem: 352 MB
> 64GB ram ??
>
> Which function does the reserved memory fulfill ?
> Is it all for paging ?
Yeah, mostly page tables; they have to be kept in low mem. The kernel is
also making struct pages for every single page in the system, and those
must be kept in kernel memory...
I doubt you could get the system to boot at 64 gigs.
--
Mark Zealey
[email protected]
[email protected]
UL++++>$ G!>(GCM/GCS/GS/GM) dpu? s:-@ a16! C++++>$ P++++>+++++$ L+++>+++++$
!E---? W+++>$ N- !o? !w--- O? !M? !V? !PS !PE--@ PGP+? r++ !t---?@ !X---?
!R- b+ !tv b+ DI+ D+? G+++ e>+++++ !h++* r!-- y--
(http://www.geekcode.com)
On 3 Jan 2002, Harald Holzer wrote:
> at 1GB ram, are 16,936kB low mem reserved.
> 4GB ram, 72,824kB reserved
> 8GB ram, 142,332kB reserved
> 16GB ram, 269,424kB reserved
> 32GB ram, 532,080kB reserved, usable low mem: 352 MB
> 64GB ram ??
>
> Which function does the reserved memory fulfill ?
> Is it all for paging ?
The kernel stores various data structures there, in particular
the mem_map[] array, which has one data structure for each
page.
In the standard kernel, that is 52 bytes per page, giving you
a space usage of 416 MB for the mem_map[] array.
I'm currently integrating a patch into my VM tree which removes
the wait queue from the page struct, bringing the size down to
36 bytes per page, or 288 MB, giving a space saving of 128 MB.
Another item to look into is removing the page cache hash table
and replacing it by a radix tree or hash trie, in the hopes of
improving scalability while at the same time saving some space.
As for page table overhead, on machines like yours we really
should be using 4 MB pages for the larger data segments, which
will cut down the page table size by a factor of 512 ;)
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On Thu, 3 Jan 2002, Alan Cox wrote:
> > Which function does the reserved memory fulfill ?
> > Is it all for paging ?
>
> A lot of it is the page structs (64bytes per page - which really
> should be nearer the 32 some rival Unix OS's achieve on x86)
The 2.4 kernel has the page struct at 52 bytes in size,
William Lee Irwin and I have brought this down to 36.
Expect to see this integrated into the rmap VM soon ;)
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On Thu, 3 Jan 2002 11:28:45 -0200 (BRST)
Rik van Riel <[email protected]> wrote:
> Another item to look into is removing the page cache hash table
> and replacing it by a radix tree or hash trie, in the hopes of
> improving scalability while at the same time saving some space.
Ah, didn't we see such a patch lately on LKML? If I remember correctly, I saw
some comparison charts too, and some people testing it were happy with it.
I just searched through the list: 24 Dec :-), by Momchil Velikov. Can someone
with big mem have a look at the savings? How about 18-pre?
Regards,
Stephan
> Today i checked some memory configurations and noticed that the low
> memory decreases, when i add more memory to the system,
> and the size of reserved memory increases:
>
> at 1GB ram, are 16,936kB low mem reserved.
> 4GB ram, 72,824kB reserved
> 8GB ram, 142,332kB reserved
> 16GB ram, 269,424kB reserved
> 32GB ram, 532,080kB reserved, usable low mem: 352 MB
> 64GB ram ??
If you need 64G of RAM and decent performance you dont want an x86.
Use a sparc64, alpha or ppc64 linux machine.
Anton
On Thu, 3 Jan 2002, Stephan von Krawczynski wrote:
> On Thu, 3 Jan 2002 11:28:45 -0200 (BRST)
> Rik van Riel <[email protected]> wrote:
>
> > Another item to look into is removing the page cache hash table
> > and replacing it by a radix tree or hash trie, in the hopes of
> > improving scalability while at the same time saving some space.
>
> Ah, didn't we see such a patch lately in LKML? If I remember correct I
> saw some comparison charts too and some people testing it were happy
> with it. Just searched through the list: 24. dec :-) by Momchil
> Velikov Can someone with big mem have a look at the saving? How about
> 18-pre?
From what velco told me on IRC, he is still tuning his work
and looking at further improvements.
One thing to keep in mind is that most pages are in the
page cache; we wouldn't want to reduce space in one data
structure just to use more space elsewhere, this is
something to look at very carefully...
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On Thu, 3 Jan 2002, Rik van Riel wrote:
> On Thu, 3 Jan 2002, Alan Cox wrote:
> > A lot of it is the page structs (64bytes per page - which really
> > should be nearer the 32 some rival Unix OS's achieve on x86)
>
> The 2.4 kernel has the page struct at 52 bytes in size,
> William Lee Irwin and I have brought this down to 36.
Please restate those numbers, Rik: I share Alan's belief that the
current standard 2.4 kernel has page struct at 64 bytes in size.
Hugh
On Fri, 4 Jan 2002, Hugh Dickins wrote:
> On Thu, 3 Jan 2002, Rik van Riel wrote:
> > On Thu, 3 Jan 2002, Alan Cox wrote:
> > > A lot of it is the page structs (64bytes per page - which really
> > > should be nearer the 32 some rival Unix OS's achieve on x86)
> >
> > The 2.4 kernel has the page struct at 52 bytes in size,
> > William Lee Irwin and I have brought this down to 36.
>
> Please restate those numbers, Rik: I share Alan's belief that the
> current standard 2.4 kernel has page struct at 64 bytes in size.
Indeed, I counted wrong ... subtracted the waitqueue when counting
the first time, then subtracted it again ;)
The struct page in the current kernel is indeed 64 bytes. In
the rmap VM it's also 64 bytes (60 bytes if highmem is disabled).
After removal of the waitqueue, that'll be 52 bytes, or 48 if
highmem is disabled.
kind regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
In message <[email protected]>, Alan Cox writes:
> > 2. Isn't the boundary at 2^30 really irrelevant and the three "correct"
> > zones are (0 - 2^24-1), (2^24 - 2^32-1) and (2^32 - 2^36-1)?
>
> Nope. The limit for directly mapped memory is 2^30.
The limit *per L1 Page Table Base Pointer*, that is. You could
in theory have a different L1 Page Table base pointer for each
task (including each proc 0 in linux). You can also pull a few
tricks such as instantiating a 4 GB kernel virtual address space
while in kernel mode (using a virtual windowing mechanism as is used
for high mem today to map in user space for copying in data from
user space if/when needed). The latter takes some tricky code to
get mapping correct but it wasn't a lot of code in PTX. Just needed
a lot of careful thought, review, testing, etc.
I don't know if there are real examples of large memory systems
exhausting the ~1 GB of kernel virtual address space on machines
with > 12-32 GB of physical memory (we had this problem in PTX which
created the need for a larger kernel virtual address space in some
contexts).
> > 3. On a system without ISA DMA devices, can DMA and low be merged into a
> > single zone?
>
> Rarely. PCI vendors are not exactly angels when it comes to implementing
> all 32bits of a DMA transfer
Would be nice to have a config option like "CONFIG_PCI_36" to imply
that all devices on a PAE system were able to access all of memory,
globally removing the need for bounce buffering and allowing a native
PCI setup for mapping memory addresses...
gerrit
On Wed, Jan 02, 2002 at 01:17:59PM -0800, Gerrit Huizenga wrote:
> I don't know if there are real examples of large memory systems
> exhausting the ~1 GB of kernel virtual address space on machines
> with > 12-32 GB of physical memory (we had this problem in PTX which
> created the need for a larger kernel virtual address space in some
> contexts).
The ~800MB or so of kernel address space is exhausted by struct page
entries at around 48GB of physical memory or so.
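That ballpark checks out if one assumes 64-byte struct page entries and roughly 800 MB of kernel address space available for mem_map[]:

#include <stdio.h>

int main(void)
{
    unsigned long long ram   = 48ULL << 30;   /* 48 GB */
    unsigned long long pages = ram / 4096;

    printf("struct page array for 48 GB: %llu MB\n",
           pages * 64 >> 20);                 /* ~768 MB */
    return 0;
}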
SGI's original highmem patch switched page tables on entry to kernel
space, so there is code already tested that we can borrow. But I'm
not sure if it's worth it as the overhead it adds makes life really
suck: we would lose the ability to use global pages, as well as always
encounter tlb misses on the kernel<->userspace transition. PAE shows
up as a 5% performance loss on normal loads, and this would make it
worse. We're probably better off implementing PSE. Of course, making
these kinds of choices is hard without actual statistics of the
usage patterns we're targeting.
> Would be nice to have a config option like "CONFIG_PCI_36" to imply
> that all devices on a PAE system were able to access all of memory,
> globally removing the need for bounce buffering and allowing a native
> PCI setup for mapping memory addresses...
That would be neat.
-ben
--
Fish.
> up as a 5% performance loss on normal loads, and this would make it
> worse. We're probably better off implementing PSE. Of course, making
> these kinds of choices is hard without actual statistics of the
> usage patterns we're targetting.
You don't necessarily need PSE. Migrating to an option to support > 4K
_virtual_ page size is more flexible for x86, although it would need
glibc getpagesize() fixing I think, and might mean a few apps wouldn't
run in that configuration.
Alan
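The getpagesize() point matters because plenty of user code sizes and aligns its mappings from that value; a quick illustration of where the number surfaces in userland:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("getpagesize()         = %d\n", getpagesize());
    printf("sysconf(_SC_PAGESIZE) = %ld\n", sysconf(_SC_PAGESIZE));
    return 0;
}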
On Jan 01 2002, H. Peter Anvin ([email protected]) wrote:
> By author: Alan Cox <[email protected]>
> >
> > > 2. Isn't the boundary at 2^30 really irrelevant and the three "correct"
> > > zones are (0 - 2^24-1), (2^24 - 2^32-1) and (2^32 - 2^36-1)?
> >
> > Nope. The limit for directly mapped memory is 2^30.
> >
>
> 2^30-2^27 to be exact (assuming a 3:1 split and 128MB vmalloc zone.)
>
> -hpa
For my better understanding, where's the 128MB vmalloc zone assumption
defined, please?
I'm pretty sure I understand that the 3:1 split you refer to is
defined by PAGE_OFFSET in asm-i386/page.h
But when I tried to find the answer in the source for the vmalloc
zone, I looked in linux/mm.h, linux/mmzone.h, linux/vmalloc.h, and
mm/vmalloc.c, but couldn't find anything there or in O'Reilly's kernel
book that I could follow/understand.
Thanks for any pointers.
Take care,
Daniel
--
Daniel A. Freedman
Laboratory for Atomic and Solid State Physics
Department of Physics
Cornell University
Is this what you're looking for? Just below the definition of PAGE_OFFSET in
page.h:
/*
* This much address space is reserved for vmalloc() and iomap()
* as well as fixmap mappings.
*/
#define __VMALLOC_RESERVE (128 << 20)
On Sunday 06 January 2002 12:39 pm, Daniel Freedman wrote:
> [snip]
Hi Marvin,
Thanks for the quick reply.
On Sun, Jan 06, 2002, Marvin Justice wrote:
> Is this what your looking for? Just below the definition of PAGE_OFFSET in
> page.h:
>
> /*
> * This much address space is reserved for vmalloc() and iomap()
> * as well as fixmap mappings.
> */
> #define __VMALLOC_RESERVE (128 << 20)
However, while it does seem to be exactly the definition for 128MB
vmalloc offset that I was looking for, I don't seem to have this
definition in my source tree (2.4.16):
freedman@planck:/usr/src/linux$ grep -r __VMALLOC_RESERVE *
freedman@planck:/usr/src/linux$
Any idea why this is so?
Thanks again,
Daniel
--
Daniel A. Freedman
Laboratory for Atomic and Solid State Physics
Department of Physics
Cornell University
On Sunday 06 January 2002 01:45 pm, Daniel Freedman wrote:
> [snip]
> However, while it does seem to be exactly the definition for 128MB
> vmalloc offset that I was looking for, I don't seem to have this
> definition in my source tree (2.4.16):
>
> freedman@planck:/usr/src/linux$ grep -r __VMALLOC_RESERVE *
> freedman@planck:/usr/src/linux$
>
> Any idea why this is so?
>
> Thanks again,
>
> Daniel
>
Hmmm. Looks like it was moved sometime between 2.4.16 and 2.4.18pre1. In my
2.4.16 tree it's located in arch/i386/kernel/setup.c and without the leading
underscores.
-M
On Sun, Jan 06, 2002 at 04:16:07PM +0000, Alan Cox wrote:
> You don't neccessarily need PSE. Migrating to an option to support > 4K
> _virtual_ page size is more flexible for x86, although it would need
> glibc getpagesize() fixing I think, and might mean a few apps wouldnt
> run in that configuration.
Perhaps, but if the majority of people using 64GB of ram are served well
by PSE, then it's worth getting that 5% of performance back.
-ben
--
Fish.
On Sun, Jan 06, 2002 at 04:16:07PM +0000, Alan Cox wrote:
> You don't necessarily need PSE. Migrating to an option to support > 4K
> _virtual_ page size is more flexible for x86, although it would need
> glibc getpagesize() fixing I think, and might mean a few apps wouldn't
> run in that configuration.
If someone has a minute or so, can someone briefly explain the
difference(s) between PSE and PAE?
--cw
The COUGAR project is something I've been thinking about the past few
months. Unfortunately I no longer have *any* free time to devote to it. So
I'm releasing the proposal on the web, in the hopes that someone in the
kernel community will pick it up and make a project out of it. See
http://www.borasky-research.net/Cougar.htm.
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
> If someone has a minute or so, can someone briefly explain the
> difference(s) between PSE and PAE?
>
Here's my (probably simple-minded) understanding. With the PSE bit turned on
in one of the x86 control registers (CR4), page sizes are 4MB instead of the
usual 4KB. One advantage of large pages is that there are fewer page tables
and struct pages to store.
PAE is turned on by setting a different bit. It allows for the possibility of
up to 64GB of physical RAM on i686. Actual addresses are still just 32 bits,
however, so any given process is limited to 4GB (actually Linux puts a limit
of 3GB). But by using a 3-level paging scheme it's possible to map process
A's 32-bit address space to a different region of physical RAM than process
B's which, in turn, is mapped to a different physical region than process C,
etc.
As far as I know, it's possible to set both bits simultaneously.
-Marvin
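A small x86/GCC sketch (illustrative only) of checking whether a CPU advertises the two features under discussion; CPUID leaf 1 reports PSE in EDX bit 3 and PAE in EDX bit 6:

#include <stdio.h>

int main(void)
{
    unsigned int eax = 1, ebx, ecx, edx;

    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx));

    printf("PSE (large pages)       : %s\n", (edx >> 3) & 1 ? "yes" : "no");
    printf("PAE (36-bit phys addrs) : %s\n", (edx >> 6) & 1 ? "yes" : "no");
    return 0;
}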
On Sun, Jan 06, 2002 at 08:18:33PM -0600, Marvin Justice wrote:
> Here's my (probably simple-minded) understanding. With the PSE bit
> turned on in one of the x86 control registers (CR4), page sizes
> are 4MB instead of the usual 4KB. One advantage of large pages is
> that there are fewer page tables and struct pages to store.
Ah, I knew 4MB pages were possible... I was under the impression _all_
pages had to be 4MB which would seem to suck badly as they would be
too coarse for many applications (but for certain large sci. apps. I'm
sure this would be perfect, less TLB thrashing too with sparse
data-sets).
On the whole, I'm not sure I can see how 4MB pages _everywhere_ in
user-space would be a win for many people at all...
--cw
There are 2MB pages as well; they would probably be a better choice than
4MB. They also need only a 2-tier paging walk instead of the 3-tier one
used when paging with 4KB pages, which should help take care of the current
slowdown with highmem.
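For reference, this is how a 32-bit linear address is carved up under PAE once a directory entry maps a 2MB page: 2 bits of PDPT index, 9 bits of page directory index and a 21-bit offset, i.e. only two lookup levels and no page table level. A tiny decoder:

#include <stdio.h>

int main(void)
{
    unsigned int va = 0xC0801234;                      /* arbitrary example */

    printf("PDPT index : %u\n",    va >> 30);          /* bits 31-30 */
    printf("PD index   : %u\n",   (va >> 21) & 0x1FF); /* bits 29-21 */
    printf("page offset: 0x%X\n",  va & 0x1FFFFF);     /* bits 20-0  */
    return 0;
}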
----- Original Message -----
From: "Chris Wedgwood" <[email protected]>
To: "Marvin Justice" <[email protected]>
Cc: "Alan Cox" <[email protected]>; "Benjamin LaHaise"
<[email protected]>; "Gerrit Huizenga" <[email protected]>; "M. Edward
Borasky" <[email protected]>; "Harald Holzer" <[email protected]>;
<[email protected]>
Sent: Sunday, January 06, 2002 9:38 PM
Subject: Re: i686 SMP systems with more then 12 GB ram with 2.4.x kernel ?
> [snip]
On Sun, Jan 06, 2002, Marvin Justice wrote:
> [snip]
> Hmmm. Looks like it was moved sometime between 2.4.16 and 2.4.18pre1. In my
> 2.4.16 tree it's located in arch/i386/kernel/setup.c and without the leading
> underscores.
>
> -M
<sheepishly buries head in sand> Oops... Sorry about missing that.
Thanks for the help and take care,
Daniel
--
Daniel A. Freedman
Laboratory for Atomic and Solid State Physics
Department of Physics
Cornell University
On Sun, 6 Jan 2002, Alan Cox wrote:
>
> You don't necessarily need PSE. Migrating to an option to support > 4K
> _virtual_ page size is more flexible for x86, although it would need
> glibc getpagesize() fixing I think, and might mean a few apps wouldn't
> run in that configuration.
Larger kernel PAGE_SIZE can work, still presenting 4KB page size to user
space for compat. The interesting part is holding anon pages together,
not fragmenting to use PAGE_SIZE for each MMUPAGE_SIZE of user space.
I have patches against 2.4.6 and 2.4.7 which did that; but didn't keep
them up to date because there's a fair effort going through drivers
deciding which PAGE_s need to be MMUPAGE_s. I intend to resurrect
that work against 2.5 later on (or sooner if there's interest).
Hugh
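Purely hypothetical arithmetic, not taken from Hugh's patches: if the kernel's PAGE_SIZE were, say, 16kB built out of 4kB hardware MMUPAGEs, mem_map[] on the 32GB box in this thread would shrink by the same factor (64 bytes per struct page assumed):

#include <stdio.h>

int main(void)
{
    unsigned long long ram     = 32ULL << 30;
    unsigned long long mmupage = 4096;             /* hardware page size  */
    unsigned long long page    = 4 * mmupage;      /* software 16 kB page */

    printf("mem_map at  4 kB pages: %llu MB\n", (ram / mmupage) * 64 >> 20);
    printf("mem_map at 16 kB pages: %llu MB\n", (ram / page)    * 64 >> 20);
    return 0;
}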