2002-12-06 00:07:49

by Norman Gaywood

Subject: Maybe a VM bug in 2.4.18-18 from RH 8.0?

I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18

The system is a 4 processor, 16GB memory Dell PE6600 running RH8.0 +
errata. More details at the end of this message.

By doing a large copy I can trigger this problem in about 30-40 minutes. At
the end of that time, kswapd will start to get a larger % of CPU and
the system load will be around 2-3. The system will feel sluggish at an
interactive shell and it will take several seconds before a command like
top would start to display. If I let it go for another 30 minutes the
system is unusable, where it can take 10 minutes or more to run simple
commands. If I let it go for several hours after that, the following
messages can appear on the console depending on the type of copy:

ENOMEM in journal_get_undo_access_Rsmp_df5dec49, retrying.

or

ENOMEM in do_get_write_access, retrying.

The problem can be triggered by almost any type of copy command. In
particular, this command can trigger it:

tar cf /dev/tape .

for . large enough. Unfortunately this was how I was intending to back up
the system.

"Large enough" is several gigabytes. It also seems to depend on how much
memory is used. In particular, how much memory is used by cache. Also in
the equation is the number of files. Copying one big file does not seem
to trigger the problem. I initially discovered the problem when doing an
rsync copy over a network of the user home directories.

Can it be stopped? Yes. On the [email protected] mailing list,
Stephan Wonczak suggested that I should put the system under some memory
pressure while doing the copy. The program he supplied simply allocated
about 750 megabytes to create some memory pressure. I tried running this
at 10-second
intervals while doing a copy but it did not help. Since the system has
16 Gig of memory, I tried to give it some real memory pressure and ran
7 processes that used 1.8G each like this:

#!/bin/sh
SLEEP=600
COUNT=20

while [ `expr $COUNT - 1` != 0 ]
do
COUNT=`expr $COUNT - 1`
date
# 2000 by 1_000_000 seems to be a 1.8G process
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }'
sleep $SLEEP
done

This brought the cache down to about 3-4 Gig used after it ran. With this
running, the system performed the copy with no problems! No doubt there
is a happy medium between these two extremes.

There is a suggestion that I may not see this problem when the system is
under real load. Since I am only setting up the system at the moment there
are no users giving the system something to do. The copy is the only real
work during these tests. I find it difficult to say "she'll be right"
(as we do in Aus) and throw the system into production hoping that it
will just work.

So what do I do now? I have what I believe is a trigger for a VM problem
in a widely used version of Linux. Anyone have some patches for me to
try that won't take me too far from the RH 8.0 base system.

Here are the system details:

PE6600 running RH 8.0 with latest errata. Note that I have upgraded to
kernel 2.4.18-19.7.tg3.120bigmem which I understand to be the latest
RH8 errata kernel + patches to stop the tg3 hanging problem. This came
from http://people.redhat.com/jgarzik/tg3/. I have also tried the latest
RH errata kernel using the bcm5700 driver and it has the same problem.

HW includes:
Adaptec AIC-7892 SCSI BIOS v25704
3 Adaptec SCSI Card 39160 BIOS v2.57.2S2
8 HITACHI DK32DJ-72MC 160 drives
2 Quantum ATLAS10K3-73-SCA 160 drives

uname -a
Linux alan.une.edu.au 2.4.18-19.7.tg3.120bigmem #1 SMP Mon Nov 25 15:15:29 EST 2002 i686 i686 i386 GNU/Linux

cat /proc/meminfo
total: used: free: shared: buffers: cached:
Mem: 16671522816 444915712 16226607104 0 136830976 56520704
Swap: 34365202432 0 34365202432
MemTotal: 16280784 kB
MemFree: 15846296 kB
MemShared: 0 kB
Buffers: 133624 kB
Cached: 55196 kB
SwapCached: 0 kB
Active: 249984 kB
Inact_dirty: 18088 kB
Inact_clean: 480 kB
Inact_target: 53708 kB
HighTotal: 15597504 kB
HighFree: 15434932 kB
LowTotal: 683280 kB
LowFree: 411364 kB
SwapTotal: 33559768 kB
SwapFree: 33559768 kB
Committed_AS: 177044 kB

df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md2 8254136 2825112 5009736 37% /
/dev/md0 101018 25627 70175 27% /boot
/dev/md6 211671024 88323536 112595200 44% /home
/dev/md1 16515968 1785024 13891956 12% /opt
none 8140392 0 8140392 0% /dev/shm
/dev/md4 4126976 149944 3767392 4% /tmp
/dev/md3 16515968 168172 15508808 2% /var
/dev/md5 8522932 1596520 6493468 20% /var/spool/mail
/dev/sdh1 70557052 32832 66940124 1% /.automount/alan/disks/alan/h1
/dev/sdi1 70557052 22856784 44116172 35% /.automount/alan/disks/alan/i1
/dev/sdj1 70557052 13619440 53353516 21% /.automount/alan/disks/alan/j1

df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/md2 1048576 167838 880738 17% /
/dev/md0 26104 59 26045 1% /boot
/dev/md6 26886144 1941926 24944218 8% /home
/dev/md1 2101152 49285 2051867 3% /opt
none 2035098 1 2035097 1% /dev/shm
/dev/md4 524288 26 524262 1% /tmp
/dev/md3 2101152 4877 2096275 1% /var
/dev/md5 1082720 2535 1080185 1% /var/spool/mail
/dev/sdh1 8962048 12 8962036 1% /.automount/alan/disks/alan/h1
/dev/sdi1 8962048 712400 8249648 8% /.automount/alan/disks/alan/i1
/dev/sdj1 8962048 10497 8951551 1% /.automount/alan/disks/alan/j1

--
Norman Gaywood -- School of Mathematical and Computer Sciences
University of New England, Armidale, NSW 2351, Australia
[email protected] http://turing.une.edu.au/~norm
Phone: +61 2 6773 2412 Fax: +61 2 6773 3312


2002-12-06 00:28:33

by Pete Zaitcev

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

> I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18

> By doing a large copy I can trigger this problem in about 30-40 minutes. At
> the end of that time, kswapd will start to get a larger % of CPU and
> the system load will be around 2-3. The system will feel sluggish at an
> interactive shell and it will take several seconds before a command like
> top would start to display. [...]

Check your /proc/slabinfo, just in case, to rule out a leak.
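
For example (a quick sketch, file names arbitrary), snapshot it now and
again once the box turns sluggish, and diff the two:

( date; cat /proc/slabinfo ) > /var/tmp/slabinfo.good
# ...later, when the machine has gone sluggish...
( date; cat /proc/slabinfo ) > /var/tmp/slabinfo.sluggish
diff /var/tmp/slabinfo.good /var/tmp/slabinfo.sluggish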

> cat /proc/meminfo
> total: used: free: shared: buffers: cached:
> Mem: 16671522816 444915712 16226607104 0 136830976 56520704
> Swap: 34365202432 0 34365202432
> MemTotal: 16280784 kB
> MemFree: 15846296 kB
> MemShared: 0 kB
> Buffers: 133624 kB
> Cached: 55196 kB
> SwapCached: 0 kB
> Active: 249984 kB
> Inact_dirty: 18088 kB
> Inact_clean: 480 kB
> Inact_target: 53708 kB
> HighTotal: 15597504 kB
> HighFree: 15434932 kB
> LowTotal: 683280 kB
> LowFree: 411364 kB
> SwapTotal: 33559768 kB
> SwapFree: 33559768 kB
> Committed_AS: 177044 kB

This is not interesting. Get it _after_ the box becomes sluggish.
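
One low-tech way to be sure you catch one from the bad period (a sketch;
the interval and log path are arbitrary):

while :; do date; cat /proc/meminfo; echo; sleep 60; done >> /var/tmp/meminfo.log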

Remember, the 2.4.18 stream in RH does not have its own VM, distinct
from Marcelo+Riel. So, you can come to linux-kernel for advice,
but first, get it all reproduced with Marcelo's tree with
Riel's patches all the same.

-- Pete

2002-12-06 00:52:47

by Andrew Morton

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

Norman Gaywood wrote:
>
> I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18
>
> 16GB
> ...
> tar cf /dev/tape .
>

This machine will die due to buffer_heads which are attached
to highmem pagecache, and due to inodes which are pinned by
highmem pagecache.

> ...
> while [ `expr $COUNT - 1` != 0 ]
> do
> date
> # 2000 by 1_000_000 seems to be a 1.8G process
> perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
> ...

This will evict the highmem pagecache. That frees the buffer_heads
and unpins the inodes.
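
While the pressure script runs, something like this (a rough sketch, the
interval is arbitrary) will show low memory coming back as the highmem
pagecache gets evicted:

watch -n 10 'grep -E "LowFree|HighFree" /proc/meminfo; grep -E "^(buffer_head|inode_cache) " /proc/slabinfo'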

> So what do I do now?

I guess talk to Red Hat. These are well-known problems and there
should be fixes for them in a "bigmem" kernel.

Otherwise, the -aa kernels have patches to address these problems.
One option would be to roll your own kernel, based on a kernel.org
kernel and a matching patch from
http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/

> ...
> Anyone have some patches for me to
> try that won't take me too far from the RH 8.0 base system.

Hard. The relevant patches are:

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/05_vm_16_active_free_zone_bhs-1
and
http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/10_inode-highmem-2

The first one will not come vaguely close to applying to an
RH 2.4.18 kernel.

The second one may well apply, and will probably fix the problem.
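
Something like this will tell you quickly, without touching the tree (the
source path is illustrative; point it at wherever the RH kernel source
lives):

cd /usr/src/linux-2.4.18-19.7
wget http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/10_inode-highmem-2
patch -p1 --dry-run < 10_inode-highmem-2    # report rejects without applying anything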

2002-12-06 01:00:24

by Andrea Arcangeli

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 11:13:26AM +1100, Norman Gaywood wrote:
> I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18
>
> The system is a 4 processor, 16GB memory Dell PE6600 running RH8.0 +
> errata. More details at the end of this message.

Thanks to lots of feedback from users in the last months, I have fixed all
known VM bugs to date that can be reproduced on those big machines.
They're all included in my tree and in the current UL/SuSE releases.
Over time I should have posted all of them to the kernel list in one way
or another. The most critical ones are now pending for merging in
2.4.21pre. So in the meantime you may want to try to reproduce on top of
2.4.20aa1 or the UL kernel, and (unless your problem is a tape driver bug ;)
I'm pretty sure it will fix the problems on your big machine.

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1.gz
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/

Hope this helps,

Andrea

2002-12-06 01:09:52

by Andrea Arcangeli

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 05:00:15PM -0800, Andrew Morton wrote:
> Hard. The relevant patches are:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/05_vm_16_active_free_zone_bhs-1
> and
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/10_inode-highmem-2

yep, those are the two I had in mind when I said they're pending for
2.4.21pre inclusion. He may still suffer other known problems besides
the above two critical highmem fixes (for example if
lower_zone_reserve_ratio is not applied and there's no other fix around
it IMHO, that's a generic OS problem not only for Linux, and that was my
only sensible solution to fix it, the approach in mainline is way too
weak to make a real difference), though whatever other problem remains
would probably need something more complicated than a tar to reproduce.

Andrea

2002-12-06 01:20:24

by Norman Gaywood

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 07:35:49PM -0500, Pete Zaitcev wrote:
> > I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18
>
> > By doing a large copy I can trigger this problem in about 30-40 minutes. At
> > the end of that time, kswapd will start to get a larger % of CPU and
> > the system load will be around 2-3. The system will feel sluggish at an
> > interactive shell and it will take several seconds before a command like
> > top would start to display. [...]
>
> Check your /proc/slabinfo, just in case, to rule out a leak.

Here is a /proc/slabinfo diff of a good system and a very sluggish one:

1c1
< Mon Nov 25 17:13:04 EST 2002
---
> Mon Nov 25 22:35:58 EST 2002
6c6
< nfs_inode_cache 6 6 640 1 1 1 : 124 62
---
> nfs_inode_cache 1 6 640 1 1 1 : 124 62
8,11c8,11
< ip_fib_hash 224 224 32 2 2 1 : 252 126
< journal_head 3101 36113 48 69 469 1 : 252 126
< revoke_table 250 250 12 1 1 1 : 252 126
< revoke_record 672 672 32 6 6 1 : 252 126
---
> ip_fib_hash 10 224 32 2 2 1 : 252 126
> journal_head 12 154 48 2 2 1 : 252 126
> revoke_table 7 250 12 1 1 1 : 252 126
> revoke_record 0 0 32 0 0 1 : 252 126
14,20c14,20
< tcp_tw_bucket 210 210 128 7 7 1 : 252 126
< tcp_bind_bucket 896 896 32 8 8 1 : 252 126
< tcp_open_request 180 180 128 6 6 1 : 252 126
< inet_peer_cache 0 0 64 0 0 1 : 252 126
< ip_dst_cache 105 105 256 7 7 1 : 252 126
< arp_cache 90 90 128 3 3 1 : 252 126
< blkdev_requests 16548 17430 128 561 581 1 : 252 126
---
> tcp_tw_bucket 0 0 128 0 0 1 : 252 126
> tcp_bind_bucket 28 784 32 7 7 1 : 252 126
> tcp_open_request 0 0 128 0 0 1 : 252 126
> inet_peer_cache 1 58 64 1 1 1 : 252 126
> ip_dst_cache 40 105 256 7 7 1 : 252 126
> arp_cache 4 90 128 3 3 1 : 252 126
> blkdev_requests 16384 16410 128 547 547 1 : 252 126
22c22
< file_lock_cache 328 328 92 8 8 1 : 252 126
---
> file_lock_cache 3 82 92 2 2 1 : 252 126
24,27c24,27
< uid_cache 672 672 32 6 6 1 : 252 126
< skbuff_head_cache 1107 2745 256 77 183 1 : 252 126
< sock 270 270 1280 90 90 1 : 60 30
< sigqueue 870 870 132 30 30 1 : 252 126
---
> uid_cache 9 448 32 4 4 1 : 252 126
> skbuff_head_cache 816 1110 256 74 74 1 : 252 126
> sock 81 129 1280 43 43 1 : 60 30
> sigqueue 29 29 132 1 1 1 : 252 126
29,33c29,33
< cdev_cache 498 2262 64 12 39 1 : 252 126
< bdev_cache 290 290 64 5 5 1 : 252 126
< mnt_cache 232 232 64 4 4 1 : 252 126
< inode_cache 543337 553490 512 79070 79070 1 : 124 62
< dentry_cache 373336 554430 128 18481 18481 1 : 252 126
---
> cdev_cache 16 290 64 5 5 1 : 252 126
> bdev_cache 27 174 64 3 3 1 : 252 126
> mnt_cache 19 174 64 3 3 1 : 252 126
> inode_cache 305071 305081 512 43583 43583 1 : 124 62
> dentry_cache 418 2430 128 81 81 1 : 252 126
35,43c35,43
< filp 930 930 128 31 31 1 : 252 126
< names_cache 48 48 4096 48 48 1 : 60 30
< buffer_head 831810 831810 128 27727 27727 1 : 252 126
< mm_struct 510 510 256 34 34 1 : 252 126
< vm_area_struct 4488 4740 128 158 158 1 : 252 126
< fs_cache 696 696 64 12 12 1 : 252 126
< files_cache 469 469 512 67 67 1 : 124 62
< signal_act 388 418 1408 38 38 4 : 60 30
< pae_pgd 696 696 64 12 12 1 : 252 126
---
> filp 1041 1230 128 41 41 1 : 252 126
> names_cache 7 8 4096 7 8 1 : 60 30
> buffer_head 3431966 3432150 128 114405 114405 1 : 252 126
> mm_struct 198 315 256 21 21 1 : 252 126
> vm_area_struct 5905 5970 128 199 199 1 : 252 126
> fs_cache 204 464 64 8 8 1 : 252 126
> files_cache 204 217 512 31 31 1 : 124 62
> signal_act 246 286 1408 26 26 4 : 60 30
> pae_pgd 198 638 64 11 11 1 : 252 126
51c51
< size-16384 16 24 16384 16 24 4 : 0 0
---
> size-16384 20 20 16384 20 20 4 : 0 0
53c53
< size-8192 5 11 8192 5 11 2 : 0 0
---
> size-8192 9 9 8192 9 9 2 : 0 0
55c55
< size-4096 287 407 4096 287 407 1 : 60 30
---
> size-4096 56 56 4096 56 56 1 : 60 30
57c57
< size-2048 426 666 2048 213 333 1 : 60 30
---
> size-2048 281 314 2048 157 157 1 : 60 30
59c59
< size-1024 1024 1272 1024 256 318 1 : 124 62
---
> size-1024 659 712 1024 178 178 1 : 124 62
61c61
< size-512 3398 3584 512 445 448 1 : 124 62
---
> size-512 2782 2856 512 357 357 1 : 124 62
63c63
< size-256 777 1155 256 67 77 1 : 252 126
---
> size-256 101 255 256 17 17 1 : 252 126
65c65
< size-128 4836 19200 128 244 640 1 : 252 126
---
> size-128 2757 3750 128 125 125 1 : 252 126
67c67
< size-64 8958 20550 128 356 685 1 : 252 126
---
> size-64 178 510 128 17 17 1 : 252 126
69c69
< size-32 23262 43674 64 433 753 1 : 252 126
---
> size-32 711 1218 64 21 21 1 : 252 126


> > cat /proc/meminfo
> This is not interesting. Get it _after_ the box becomes sluggish.

I don't have one of those, but here is top output from a sluggish system:

3:51pm up 43 min, 3 users, load average: 1.69, 1.28, 0.92
109 processes: 108 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 0.0% user, 0.3% system, 0.0% nice, 99.2% idle
CPU1 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
CPU2 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
CPU3 states: 0.0% user, 1.4% system, 0.0% nice, 98.0% idle
CPU4 states: 0.0% user, 58.2% system, 0.0% nice, 41.2% idle
CPU5 states: 0.0% user, 96.4% system, 0.0% nice, 3.0% idle
CPU6 states: 0.0% user, 0.5% system, 0.0% nice, 99.0% idle
CPU7 states: 0.0% user, 0.3% system, 0.0% nice, 99.2% idle
Mem: 16280784K av, 15747124K used, 533660K free, 0K shrd, 20952K buff
Swap: 33559768K av, 0K used, 33559768K free 15037240K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
19 root 25 0 0 0 0 SW 96.7 0.0 1:52 kswapd
1173 root 21 0 10592 10M 424 D 58.2 0.0 3:30 cp
202 root 15 0 0 0 0 DW 1.9 0.0 0:04 kjournald
205 root 15 0 0 0 0 DW 0.9 0.0 0:10 kjournald
21 root 15 0 0 0 0 SW 0.5 0.0 0:01 kupdated
1121 root 16 0 1056 1056 836 R 0.5 0.0 0:09 top
1 root 15 0 476 476 424 S 0.0 0.0 0:04 init
2 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU0
3 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU1
4 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU2
5 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU3
6 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU4
7 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU5
8 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU6
9 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU7
10 root 15 0 0 0 0 SW 0.0 0.0 0:00 keventd

> Remember, the 2.4.18 stream in RH does not have its own VM, distinct
> from Marcelo+Riel. So, you can come to linux-kernel for advice,
> but first, get it all reproduced with Marcelo's tree with
> Riel's patches all the same.

Yep, I understand that. I just thought this might be of interest
however. It's pretty hard to find a place to talk about this problem
with someone who might know something! I've got a service request in
with RH but no answer yet, but it's only been 1.5 days.

While I've been writing this it looks like Andrew Morton and Andrea
Arcangeli have given me some great answers and have declared this a
"well known problem". Looks like I've got something to try.

--
Norman Gaywood -- School of Mathematical and Computer Sciences
University of New England, Armidale, NSW 2351, Australia
[email protected] http://turing.une.edu.au/~norm
Phone: +61 2 6773 2412 Fax: +61 2 6773 3312

2002-12-06 01:27:06

by Andrew Morton

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

Andrea Arcangeli wrote:
>
> ...
> He may still suffer other known problems besides
> the above two critical highmem fixes (for example if
> lower_zone_reserve_ratio is not applied and there's no other fix around
> it IMHO, that's generic OS problem not only for linux, and that was my
> only sensible solution to fix it, the approch in mainline is way too
> weak to make a real difference)

argh. I hate that one ;) Giving away 100 megabytes of memory
hurts.

I've never been able to find the workload which makes this
necessary. Can you please describe an "exploit" against
2.4.20 which demonstrates the need for this?

Thanks.

2002-12-06 01:36:44

by Andrea Arcangeli

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 05:34:34PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > ...
> > He may still suffer other known problems besides
> > the above two critical highmem fixes (for example if
> > lower_zone_reserve_ratio is not applied and there's no other fix around
> > it IMHO, that's generic OS problem not only for linux, and that was my
> > only sensible solution to fix it, the approch in mainline is way too
> > weak to make a real difference)
>
> argh. I hate that one ;) Giving away 100 megabytes of memory
> hurts.

100M hurts on a 4G box? No way ;)

it hurts when those 100M of the normal zone are mlocked
by a highmem-capable user and you can't allocate one more inode but
you still have 3G of highmem free (google is doing this, they even drop
a check so they can mlock > half of the ram).

Or it hurts when you can't allocate an inode because such 100M are in
pagetables on a 64G box and you still have 60G free of highmem.

> I've never been able to find the workload which makes this
> necessary. Can you please describe an "exploit" against

ask google...

> 2.4.20 which demonstrates the need for this?

even simpler, swapoff -a and malloc and have fun! ;) (again ask google,
they run w/o swap for obvious good reasons)

Or if you have enough time, wait for those 100M to be filled by pagetables
on a 64G box.

Andrea

2002-12-06 02:08:48

by William Lee Irwin III

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 02:44:29AM +0100, Andrea Arcangeli wrote:
> Or it hurts when you can't allocate an inode because such 100M are in
> pagetables on a 64G box and you still have 60G free of highmem.

This is the zone vs. zone watermark stuff that penalizes/fails
allocations made with a given GFP mask from being satisfied by
fallback. This is largely old news wrt. various kinds of inability
to pressure those ZONE_NORMAL (maybe also ZONE_DMA) consumers.

Admission control for fallback is valuable, sure. I suspect the
question akpm raised is about memory utilization. My own issues are
centered around allocations targeted directly at ZONE_NORMAL,
which fallback prevention does not address, so the watermark patch
is not something I'm personally very concerned about.

64GB isn't getting any testing that I know of; I'd hold off until
someone's actually stood up and confessed to attempting to boot
Linux on such a beast. Or until I get some more RAM. =)


Bill

2002-12-06 02:21:11

by Andrea Arcangeli

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote:
> On Fri, Dec 06, 2002 at 02:44:29AM +0100, Andrea Arcangeli wrote:
> > Or it hurts when you can't allocate an inode because such 100M are in
> > pagetables on a 64G box and you still have 60G free of highmem.
>
> This is the zone vs. zone watermark stuff that penalizes/fails
> allocations made with a given GFP mask from being satisfied by
> fallback. This is largely old news wrt. various kinds of inability
> to pressure those ZONE_NORMAL (maybe also ZONE_DMA) consumers.
>
> Admission control for fallback is valuable, sure. I suspect the
> question akpm raised is about memory utilization. My own issues are
> centered around allocations targeted directly at ZONE_NORMAL,
> which fallback prevention does not address, so the watermark patch
> is not something I'm personally very concerned about.

you must be very concerned about it too.

If you don't have the fallback prevention all your efforts around the
allocations targeted directly at zone-normal will be completely worthless.

Either that or you want to drop ZONE_NORMAL entirely because it means
nothing uses zone-normal dynamically anymore (ZONE_NORMAL seen as a
place that is directly mapped, not necessarily always 32bit dma
capable).

> 64GB isn't getting any testing that I know of; I'd hold off until
> someone's actually stood up and confessed to attempting to boot
> Linux on such a beast. Or until I get some more RAM. =)

64GB is an example, a good example for this thing, but a 16G machine or
a 4G machine can run into the very same issues. As said, just swapoff -a
and malloc(1G), and that 1G is all in ZONE_NORMAL before you could allocate
enough inodes for your workload. Or alloc 1G of pagetables by setting
everything protnone, and that 1G of pagetables goes in zone-normal
because the highmem is filled by cache. Choose whatever is your
preferred example of a real-life bug fixed by the lowmem-reservation patch
that is absolutely necessary to run stable on a big box with normal zone
and highmem (not only a 64G box).

The only place where you must not be concerned about these fixes are the
64bit archs.

Andrea

2002-12-06 02:34:20

by William Lee Irwin III

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote:
>> Admission control for fallback is valuable, sure. I suspect the
>> question akpm raised is about memory utilization. My own issues are
>> centered around allocations targeted directly at ZONE_NORMAL,
>> which fallback prevention does not address, so the watermark patch
>> is not something I'm personally very concerned about.

On Fri, Dec 06, 2002 at 03:28:53AM +0100, Andrea Arcangeli wrote:
> you must be very concerned about it too.
> If you don't have the fallback prevention all your efforts around the
> allocations targeted directoy zone normal will be completely worthless.
> Either that or you want to drop ZONE_NORMAL enterely because it means
> nothing uses zone-normal dynamically anymore (ZONE_NORMAL seen as a
> place that is directly mapped, not necessairly always 32bit dma
> capable).

Yes, it's necessary; no, I've never directly encountered the issue it
fixes. Sorry about the miscommunication there.


On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote:
>> 64GB isn't getting any testing that I know of; I'd hold off until
>> someone's actually stood up and confessed to attempting to boot
>> Linux on such a beast. Or until I get some more RAM. =)

On Fri, Dec 06, 2002 at 03:28:53AM +0100, Andrea Arcangeli wrote:
> 64GB is an example, a good example for this thing, but a 16G machine or
> a 4G machine can run in the very same issues. As said just swapoff -a
> and malloc(1G) and such 1G is all ZONE_NORMAL before you could allocate
> enough inodes for your workload. Or alloc 1G of pagetables by setting
> everything protnone, and sugh 1G of pagetables goes in zone-normal
> because the highmem is filled by cache. Choose whatever is your
> preferred example of real life bug fixed by the lowmem-reservation patch
> that is absolutely necessary to run stable on a big box with normal zone
> and highmem (not only a 64G box).
> The only place where you must not be concerned about these fixes are the
> 64bit archs.

64GB on 32-bit is in the territory where it's dead, either literally,
performance-wise, or by virtue of dropping hardware on the floor (as
it's basically no longer 64GB) due to deeper design limitations.

No idea why there's not more support behind or interest in page
clustering. It's an optimization (not required) for 64-bit/saner arches.


Bill

2002-12-06 05:17:51

by Andrew Morton

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

William Lee Irwin III wrote:
>
> Yes, it's necessary; no, I've never directly encountered the issue it
> fixes. Sorry about the miscommunication there.

The google thing.

The basic problem is in allowing allocations which _could_ use
highmem to use the normal zone as anon memory or pagecache.

Because the app could mlock that memory. So for a simple
demonstration:

- mem=2G
- read a 1.2G file
- malloc 800M, now mlock it.

Those 800M will be in ZONE_NORMAL, simply because that was where the
free memory was. And you're dead, even though you've only mlocked
800M. The same thing happens if you have lots of anon memory in the
normal zone and there is no swapspace available.
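
A rough way to see the swapless variant on a test box (a sketch only; the
file path is a placeholder and the sizes are illustrative, reusing the perl
idiom from earlier in the thread):

swapoff -a
# fill highmem with pagecache first
dd if=/path/to/some/large/file of=/dev/null bs=1M
# now allocate ~800M of anonymous memory and hold it; with no swap it can
# never be reclaimed, so if it landed in the normal zone watch LowFree
# in /proc/meminfo stay pinned down
perl -e '$i=800; while ($i--) { $a[$i] = "x" x 1_000_000; } sleep' &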

Linus's approach was to raise the ZONE_NORMAL pages_min limit for
allocations which _could_ use highmem. So a GFP_HIGHUSER allocation
has a pages_min limit of (say) 4M when considering the normal zone,
but a GFP_KERNEL allocation has a limit of 2M.

Andrea's patch does the same thing, via a separate table. He has
set the threshold much higher (100M on a 4G box). AFAICT, the
algorithms are identical - I was planning on just adding a multiplier
to set Linus's ratio - it is currently hardwired to "1". Search for
"mysterious" in mm/page_alloc.c ;)

It's not clear to me why -aa defaults to 100 megs when the problem
only occurs with no swap or when the app is using mlock. The default
multiplier (of variable local_min) should be zero. Swapless machines
or heavy mlock users can crank it up.

But mlocking 700M on a 4G box would kill it as well. The google
application, IIRC, mlocks 1G on a 2G machine. Daniel put them
onto the 2G+2G split and all was well.

Anyway, thanks. I'll take another look at Andrea's implementation.

Now, regarding mlock(mmap(open(/dev/hda1))) ;)

2002-12-06 05:40:23

by Andrea Arcangeli

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 09:25:15PM -0800, Andrew Morton wrote:
> William Lee Irwin III wrote:
> >
> > Yes, it's necessary; no, I've never directly encountered the issue it
> > fixes. Sorry about the miscommunication there.
>
> The google thing.
>
> The basic problem is in allowing allocations which _could_ use
> highmem to use the normal zone as anon memory or pagecache.
>
> Because the app could mlock that memory. So for a simple
> demonstration:
>
> - mem=2G
> - read a 1.2G file
> - malloc 800M, now mlock it.
>
> Those 800M will be in ZONE_NORMAL, simply because that was where the
> free memory was. And you're dead, even though you've only mlocked
> 800M. The same thing happens if you have lots of anon memory in the
> normal zone and there is no swapspace available.
>
> Linus's approach was to raise the ZONE_NORMAL pages_min limit for
> allocations which _could_ use highmem. So a GFP_HIGHUSER allocation
> has a pages_min limit of (say) 4M when considering the normal zone,
> but a GFP_KERNEL allocation has a limit of 2M.
>
> Andrea's patch does the same thing, via a separate table. He has
> set the threshold much higher (100M on a 4G box). AFAICT, the
> algorithms are identical - I was planning on just adding a multiplier
> to set Linus's ratio - it is currently hardwired to "1". Search for
> "mysterious" in mm/page_alloc.c ;)
>
> It's not clear to me why -aa defaults to 100 megs when the problem
> only occurs with no swap or when the app is using mlock. The default
> multiplier (of variable local_min) should be zero. Swapless machines
> or heavy mlock users can crank it up.
>
> But mlocking 700M on a 4G box would kill it as well. The google
> application, IIRC, mlocks 1G on a 2G machine. Daniel put them
> onto the 2G+2G split and all was well.
>
> Anyway, thanks. I'll take another look at Andrea's implementation.

you should because it seems you didn't realize how my code works. the
algorithm is autotuned at boot and depends on the zone sizes, and it
applies to the dma zone too with respect to the normal zone, the highmem
case is just one of the cases that the fix for the general problem
resolves, and you're totally wrong saying that mlocking 700m on a 4G box
could kill it. I call it the per-classzone point of view watermark. If
you are capable of highmem (mlock users are) you must leave 100M or 10M
or 10G free on the normal zone (depends on the watermark setting tuned
at boot that is calculated as a function of the zone sizes) etc... so it
doesn't matter if you mlock 700M or 700G, it can't kill it. The split
doesn't matter at all. 2.5 misses this important fix too btw.

If you ignore this bugfix people will notice and there's no other way
to fix it completely (unless you want to drop the zone-normal and
zone-dma entirely, actually zone-dma matters much less because even if
it exists basically nobody uses it).

>
> Now, regarding mlock(mmap(open(/dev/hda1))) ;)


Andrea

2002-12-06 05:53:28

by William Lee Irwin III

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

William Lee Irwin III wrote:
>> Yes, it's necessary; no, I've never directly encountered the issue it
>> fixes. Sorry about the miscommunication there.

On Thu, Dec 05, 2002 at 09:25:15PM -0800, Andrew Morton wrote:
> Linus's approach was to raise the ZONE_NORMAL pages_min limit for
> allocations which _could_ use highmem. So a GFP_HIGHUSER allocation
> has a pages_min limit of (say) 4M when considering the normal zone,
> but a GFP_KERNEL allocation has a limit of 2M.
> Andrea's patch does the same thing, via a separate table. He has
> set the threshold much higher (100M on a 4G box). AFAICT, the
> algorithms are identical - I was planning on just adding a multiplier
> to set Linus's ratio - it is currently hardwired to "1". Search for
> "mysterious" in mm/page_alloc.c ;)

There's no mystery here aside from a couple of magic numbers and a
not-very-well-explained admission control policy.

Tweaking magic numbers a la 2.4.x-aa until more infrastructure is
available (2.7) sounds good to me.

Thanks,
Bill

2002-12-06 06:07:23

by William Lee Irwin III

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 06:48:04AM +0100, Andrea Arcangeli wrote:
> you should because it seems you didn't realize how my code works. the
> algorithm is autotuned at boot and depends on the zone sizes, and it
> applies to the dma zone too with respect to the normal zone, the highmem
> case is just one of the cases that the fix for the general problem
> resolves, and you're totally wrong saying that mlocking 700m on a 4G box
> could kill it. I call it the per-claszone point of view watermark. If
> you are capable of highmem (mlock users are) you must left 100M or 10M
> or 10G free on the normal zone (depends on the watermark setting tuned
> at boot that is calculated in function of the zone sizes) etc... so it
> doesn't matter if you mlock 700M or 700G, it can't kill it. The split
> doesn't matter at all. 2.5 misses this important fix too btw.
> If you ignore this bugfix people will notice and there's no other way
> to fix it completely (unless you want to drop the zone-normal and
> zone-dma enterely, actually zone-dma matters much less because even if
> it exists basically nobody uses it).

This problem is not universal; pure GFP_KERNEL allocations are the main
problem here. The fix is necessary for anti-google bits but not a
panacea for all workloads. The issue here is basically forkbombs (i.e.
databases) with potentially high cross-process sharing.

Bill

2002-12-06 06:48:25

by Andrew Morton

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

Andrea Arcangeli wrote:
>
> the
> algorithm is autotuned at boot and depends on the zone sizes, and it
> applies to the dma zone too with respect to the normal zone, the highmem
> case is just one of the cases that the fix for the general problem
> resolves,

Linus's incremental min will protect ZONE_DMA in the same manner.

> and you're totally wrong saying that mlocking 700m on a 4G box
> could kill it.

It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
I can't immediately think of anything apart from vma's which will
make it fall over, but it will run like crap.

> 2.5 misses this important fix too btw.

It does not appear to be an important fix at all. There have been
zero reports of it on any mailing list which I read since the google
days.

Yes, it needs to be addressed. But it is not worth taking 100 megabytes
of pagecache away from everyone. That is just a matter of choosing
the default value.

2.5 has much bigger problems than this - radix_tree nodes and pte_chains
in particular.

2002-12-06 07:06:56

by GrandMasterLee

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, 2002-12-06 at 00:55, Andrew Morton wrote:
> Andrea Arcangeli wrote:
[...]
> > and you're totally wrong saying that mlocking 700m on a 4G box
> > could kill it.
>
> It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
> I can't immediately think of anything apart from vma's which will
> make it fall over, but it will run like crap.


Just curious, but how long would it take a system with 8GB RAM, using a 4G
or 64G kernel to fall over? One thing I've noticed is that 2.4.19aa2
runs great on a box with 8GB when I don't allocate all that much, but
seems to run into issues after a large DB has been running on it for
several days (i.e. the system gets generally a little slower, less
responsive, and in some cases crashes after 7 days).

Yes, I know, it sounds like a memory leak in something, but aside from
patching Oracle from 8.1.7.4 (the DBAs can't find any new patches ATM), I've
tried everything except changing my kernel.

Could this be similar behaviour?

--The GrandMaster

2002-12-06 07:18:09

by Andrew Morton

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

GrandMasterLee wrote:
>
> On Fri, 2002-12-06 at 00:55, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> [...]
> > > and you're totally wrong saying that mlocking 700m on a 4G box
> > > could kill it.
> >
> > It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
> > I can't immediately think of anything apart from vma's which will
> > make it fall over, but it will run like crap.
>
> Just curious, but how long would it take a system with 8GB RAM, using 4G
> or 64G kernel to fall over?

A few seconds if you ran the wrong thing. Never if you ran something
else.

> One thing I've noticed, is that 2.4.19aa2
> runs great on a box with 8GB when I don't allocate all that much, but
> seems to run into issues after a large DB has been running on it for
> several days. (i.e. the system get's generally a little slower, less
> responsive, and in some cases crashes after 7 days).

"crashes"? kernel, or application? What additional info is
available?

> Yes, I know, sounds like a memory leak in something, but aside from
> patching Oracle from 8.1.7.4(dba's can't find any new patches ATM), I've
> tried everything except changing my kernel.
>
> Could this be similar behaviour?

No, it's something else. Possibly a leak, possibly vma structures.

You should wait until the machine is sluggish, then capture
the output of:

vmstat 1
cat /proc/meminfo
cat /proc/slabinfo
ps aux
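
Something like the following captures all four in one go (a rough helper;
the output location and the 60-second vmstat window are arbitrary):

OUT=/var/tmp/sluggish-`date +%Y%m%d-%H%M%S`
mkdir -p $OUT
vmstat 1 60 > $OUT/vmstat &
cat /proc/meminfo > $OUT/meminfo
cat /proc/slabinfo > $OUT/slabinfo
ps aux > $OUT/ps
wait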

2002-12-06 07:26:38

by GrandMasterLee

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, 2002-12-06 at 01:25, Andrew Morton wrote:
> GrandMasterLee wrote:
> >
[...]
> > Just curious, but how long would it take a system with 8GB RAM, using 4G
> > or 64G kernel to fall over?
>
> A few seconds if you ran the wrong thing. Never if you ran something
> else.
>
> > One thing I've noticed, is that 2.4.19aa2
> > runs great on a box with 8GB when I don't allocate all that much, but
> > seems to run into issues after a large DB has been running on it for
> > several days. (i.e. the system get's generally a little slower, less
> > responsive, and in some cases crashes after 7 days).
>
> "crashes"? kernel, or application? What additional info is
> available?

Machine will panic. I've actually captured some and sent them to this
list, but I've been told that my stack was corrupt. Problem is, ATM, I
can't find a memory problem. Memtest86 locks up on test 4 (as in, the machine
needs hard booting), no matter whether 8GB or 4GB of RAM is installed, and no
matter whether *known good* RAM is being tested. So I don't think
it's that per se.

> > Yes, I know, sounds like a memory leak in something, but aside from
> > patching Oracle from 8.1.7.4(dba's can't find any new patches ATM), I've
> > tried everything except changing my kernel.
> >
> > Could this be similar behaviour?
>
> No, it's something else. Possibly a leak, possibly vma structures.

Could that yield a corrupt stack?

> You should wait until the machine is sluggish, then capture
> the output of:
>
> vmstat 1
> cat /proc/meminfo
> cat /proc/slabinfo
> ps aux

I shall gather the information sometime 12/06/2002. TIA

--The GrandMaster

2002-12-06 07:43:45

by Andrew Morton

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

GrandMasterLee wrote:
>
> ...
> > "crashes"? kernel, or application? What additional info is
> > available?
>
> Machine will panic. I've actually captured some and sent them to this
> list, but I've been told that my stack was corrupt.

OK. In your second oops trace the `swapper' process had used 5k of its
8k kernel stack processing an XFS IO completion interrupt. And I don't
think `swapper' uses much stack of its own.

If some other process happens to be using 3k of stack when the same
interrupt hits it, it's game over.

So at a guess, I'd say you're being hit by excessive stack use in
the XFS filesystem. I think the XFS team have done some work on that
recently so an upgrade may help.

Or it may be something completely different ;)

2002-12-06 10:29:04

by Arjan van de Ven

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?


> 64GB isn't getting any testing that I know of; I'd hold off until
> someone's actually stood up and confessed to attempting to boot
> Linux on such a beast. Or until I get some more RAM. =)

United Linux at least has tested this, according to
http://www.unitedlinux.com/en/press/pr111902.html:
"Hardware functionality is exploited through advanced features such as
large memory support for up to 64 GB of RAM"

so I'm sure Andrea's VM deals with it gracefully

2002-12-06 11:30:06

by Christoph Hellwig

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 11:51:10PM -0800, Andrew Morton wrote:
> So at a guess, I'd say you're being hit by excessive stack use in
> the XFS filesystem. I think the XFS team have done some work on that
> recently so an upgrade may help.

Yes, XFS 1.1 used a lot of stack. XFS 1.2pre (and the stuff in 2.5)
uses much less. He's also using the qla2xxx drivers that aren't exactly
stack-friendly either.

2002-12-06 12:41:45

by Rik van Riel

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, 6 Dec 2002, Norman Gaywood wrote:
> On Thu, Dec 05, 2002 at 07:35:49PM -0500, Pete Zaitcev wrote:

> > Check your /proc/slabinfo, just in case, to rule out a leak.
>
> Here is a /proc/slabinfo diff of a good system and a very sluggish one:

> > inode_cache 305071 305081 512 43583 43583 1 : 124 62
> > buffer_head 3431966 3432150 128 114405 114405 1 : 252 126

Guess what? 120 MB in the inode cache and 450 MB in buffer heads,
or 570 MB of zone_normal eaten by just these two items.

Looks like the RH kernel needs Stephen Tweedie's patch to
reclaim the buffer heads once IO is done ;)
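
A rough way to put numbers on it yourself, assuming the 2.4 slabinfo layout
(name, active objects, total objects, object size, active slabs, total
slabs, pages per slab):

awk '$1 == "inode_cache" || $1 == "buffer_head" { printf "%-12s %5.0f MB\n", $1, $6 * $7 * 4 / 1024 }' /proc/slabinfo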

regards,

Rik
--
A: No.
Q: Should I include quotations after my reply?
http://www.surriel.com/ http://guru.conectiva.com/

2002-12-06 14:15:48

by William Lee Irwin III

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

At some point in the past, I wrote:
>> 64GB isn't getting any testing that I know of; I'd hold off until
>> someone's actually stood up and confessed to attempting to boot
>> Linux on such a beast. Or until I get some more RAM. =)

On Fri, Dec 06, 2002 at 11:36:15AM +0100, Arjan van de Ven wrote:
> United Linux at least has tested this according to
> http://www.unitedlinux.com/en/press/pr111902.html
> Hardware functionality is exploited through advanced features such as
> large memory support for up to 64 GB of RAM
> so I'm sure Andrea's VM deals with it gracefully

I'm not convinced of grace even if I were to take it from this that it
were directly tested, which seems doubtful given the nature of the page.
This page sounds more like CONFIG_HIGHMEM64G is an option.

And besides, the report is useless unless it's got actual technical
content and descriptions reported by a kernel hacker.


Bill

2002-12-06 14:49:56

by Andrea Arcangeli

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 10:55:53PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > the
> > algorithm is autotuned at boot and depends on the zone sizes, and it
> > applies to the dma zone too with respect to the normal zone, the highmem
> > case is just one of the cases that the fix for the general problem
> > resolves,
>
> Linus's incremental min will protect ZONE_DMA in the same manner.

of how many bytes?

>
> > and you're totally wrong saying that mlocking 700m on a 4G box
> > could kill it.
>
> It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
> I can't immediately think of anything apart from vma's which will
> make it fall over, but it will run like crap.

you're missing the whole point. The vmas are zone-normal users. You're
saying that you can run out of ZONE_NORMAL if you run
alloc_page(GFP_KERNEL) some hundred thousand times. Yeah, that's not
big news.

I'm saying you *can't* run out of zone-normal due to highmem allocations,
i.e. if you run alloc_pages(GFP_HIGHMEM), period.

that's a completely different thing.

I thought you understood what the problem is; I'm not sure why you say you
can run out of zone-normal by running alloc_page(GFP_KERNEL) 100000 times.
That has *nothing* to do with the bug we're discussing here; if you
don't want to run out of zone-normal after 100000 GFP_KERNEL page
allocations you can only drop zone-normal.

The bug we're discussing here is that w/o my fix you will run out of
zone-normal despite not having started allocating from zone-normal yet and
despite still having 60G free in the highmem zone. This is what the
patch prevents, nothing more nothing less.

And it's not so much specific to google, they were just unlucky in
triggering it. As said, just allocate plenty of pagetables (they are
highmem capable in my tree and 2.5) or swapoff -a, and you'll run into the
very same scenario that needs my fix in any normal workload that
allocates more than a few hundred megabytes of ram.

And this is definitely a generic problem, not even specific to Linux;
it's an OS-wide design problem in dealing with the balancing of
different zones that have overlapping but not equivalent capabilities.
It even applies to zone-dma with respect to zone-normal and zone-highmem,
and there's no other fix around it at the moment.

Mainline fixes it in a very weak way, it reserves a few megs only, and
that's not nearly enough if you need to allocate more than one more
inode etc... The lowmem reservation must allow the machine to do
interesting workloads for the whole uptime, not just defer the failure by
a few seconds. A few megs aren't nearly enough.

If interesting workloads need a huge zone-normal, just reserve more of it
at boot and they will work. If all of zone-normal isn't enough you fall
into a totally different problem, that is, the existence of zone-normal in
the first place, and it has nothing to do with this bug; you can fix
the other problem only by dropping zone-normal (of course if you do
that you will in turn fix this problem too, but the problems are
different).

The only alternate fix is to be able to migrate pagetables (1st level
only, pte) and all the other highmem capable allocations at runtime
(pagecache, shared memory etc..). Which is clearly not possible in 2.5
and 2.4.

Once that is possible/implemented my fix can go away and you can
simply migrate the highmem capable allocations from zone-normal to
highmem. That would be the only alternate and also dynamic/superior fix
but it's not feasible at the moment, at the very least not in 2.4. It
would also have some performance implications; I'm sure lots of people
prefer to throw away 500M of ram in a 32G machine rather than risking
spending the cpu time in memcopies, so it would not be *that* superior, it
would be inferior in some ways.

Reserving 500M of ram on a 32G machine doesn't really matter at all, so
the current fix is certainly the best thing we can do for 2.4, and for
2.5 too unless you want to implement highmem migration for all highmem
capable kernel objects (which would work fine too).

Also your possible multiplier via sysctl remains much inferior to
my fix, which is able to cleanly enforce classzone-point-of-view
watermarks (not fixed watermarks); you would need to change the
multiplier depending on zone size and depending on the zone to make
it equivalent, so yes, you could implement it equivalently but it would be
much less clean and readable than my current code (and harder to
tune with a kernel parameter at boot the way my current fix is).

> > 2.5 misses this important fix too btw.
>
> It does not appear to be an important fix at all. There have been

well if you ignore it people can use my tree; I personally need that fix
for myself on big boxes so I'm going to retain it in one form or
another (the form in mainline is too weak as said, and just adding a
multiplier would not be equivalent, as said above).

> 2.5 has much bigger problems than this - radix_tree nodes and pte_chains
> in particular.

I'm not saying there aren't bigger problems in 2.5, but I don't classify
this one as a minor one. In fact it was a showstopper for a long time in
2.4 (one of the last ones), until I fixed it, and it still is a problem
because the 2.4 fix is way too weak (a few megs aren't enough to
guarantee that big workloads succeed).

Andrea

2002-12-06 15:05:07

by William Lee Irwin III

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 03:57:19PM +0100, Andrea Arcangeli wrote:
> The only alternate fix is to be able to migrate pagetables (1st level
> only, pte) and all the other highmem capable allocations at runtime
> (pagecache, shared memory etc..). Which is clearly not possible in 2.5
> and 2.4.

Actually it should not be difficult for 2.5, though it's not done now.
Shared pagetables would complicate the implementation slightly. I've
gotten 100% backlash from my proposals in this area, so I'm not
touching it at all out of aggravation or whatever.


Bill

2002-12-06 15:05:21

by William Lee Irwin III

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 11:36:15AM +0100, Arjan van de Ven wrote:
>> United Linux at least has tested this according to
>> http://www.unitedlinux.com/en/press/pr111902.html
>> Hardware functionality is exploited through advanced features such as
>> large memory support for up to 64 GB of RAM
>> so I'm sure Andrea's VM deals with it gracefully

On Fri, Dec 06, 2002 at 06:23:02AM -0800, William Lee Irwin III wrote:
> I'm not convinced of grace even if I were to take it from this that it
> were directly tested, which seems doubtful given the nature of the page.
> This page sounds more like CONFIG_HIGHMEM64G is an option.
> And besides, the report is useless unless it's got actual technical
> content and descriptions reported by an kernel hacker.

Well, since I've not seen recent attempts at the Right Way To Do It (TM),
there's also a remote possibility of someone changing the user/kernel
split just to get a bloated mem_map to fit. Many of the smaller apps,
e.g. /bin/sh etc. are indifferent to the ABI violation.


Bill

2002-12-06 16:12:34

by GrandMasterLee

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, 2002-12-06 at 01:51, Andrew Morton wrote:
> GrandMasterLee wrote:
> >
> > ...
> > > "crashes"? kernel, or application? What additional info is
> > > available?
> >
> > Machine will panic. I've actually captured some and sent them to this
> > list, but I've been told that my stack was corrupt.
>
> OK. In your second oops trace the `swapper' process had used 5k of its
> 8k kernel stack processing an XFS IO completion interrupt. And I don't
> think `swapper' uses much stack of its own.

The second Oops is the *best* one IMO. I got it at just over 7 days (like
7 days 6 hours or something). I've still been testing the crud out of
this kernel on like hardware, and can't reproduce it. I'd love to know a
method for reproducing this in my beta environment.

> If some other process happens to be using 3k of stack when the same
> interrupt hits it, it's game over.
>
> So at a guess, I'd say you're being hit by excessive stack use in
> the XFS filesystem. I think the XFS team have done some work on that
> recently so an upgrade may help.

Since we run ~1TB dbs on the systems, and a LOT of IO, and Qlogic
drivers, I think that's the culprit. Will swapper use less stack in more
recent kernels? (XFS will be updated as part of a plan for the new year
I'm putting together. Till then, it's a reboot every 7 days.)


> Or it may be something completely different ;)

I hope not. :)

--The GrandMaster

2002-12-06 22:21:10

by Andrea Arcangeli

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote:
> No idea why there's not more support behind or interest in page
> clustering. It's an optimization (not required) for 64-bit/saner arches.

softpagesize sounds like a good idea to try for archs with a page size < 8k
indeed, modulo a few places where the 4k pagesize is part of the
userspace abi, for that reason on x86-64 Andi recently suggested to
change the abi to assume a bigger page size and I suggested to assume
it to be 2M and not a smaller thing as originally suggested, that way we
waste some more virtual space (not an issue on 64bit) and some cache
color (not a big deal either, those caches are multiway associative even
if not fully associative), so eventually in theory we could even switch
the page size to 2M ;)

however don't mistake softpagesize with the PAGE_CACHE_SIZE (the latter
I think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE
is a broken idea (i'm talking about the PAGE_CACHE_SIZE at large, not
about the implementation that may even be fine with Hugh's patch
applied).

PAGE_CACHE_SIZE will never work well due to the fragmentation problems it
introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and to
experiment with a soft PAGE_SIZE, multiple of the hardware PAGE_SIZE.
That means the allocator minimal granularity will return 8k. on x86 that
breaks the ABI a bit. On x86-64 the softpagesize would break only the 32bit
compatibility mode abi a little, so it would be even less severe. And I
think the softpagesize should be a config option so it can be
experimented without breaking the default config even on x86.

the soft PAGE_SIZE will also decrease the page fault rate by an order of
magnitude; the number of ptes will be the same but we'll cluster the pte
refills all served from the same I/O anyway (readahead usually loads
the next pages too anyway). So it's a kind of quite obvious design
optimization to experiment with (maybe for 2.7?).

Andrea

2002-12-06 22:27:20

by Andrea Arcangeli

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 07:12:38AM -0800, William Lee Irwin III wrote:
> split just to get a bloated mem_map to fit. Many of the smaller apps,
> e.g. /bin/sh etc. are indifferent to the ABI violation.

the problem of the split is that it would reduce the address space
available to userspace, which is quite critical on big machines (one of
the big advantages of 64bit that can't be fixed on 32bit), but I wouldn't
classify it as an ABI violation. In fact the little I can remember about
the 2.0 kernels [I almost never read that code] is that they had a separate
address space and a tlb flush while entering/exiting the kernel, so I can bet
the user stack in 2.0 was put at 4G, not at 3G. 2.2 had to put it at 3G
because then the address space was shared, with the obvious performance
advantages, so while I didn't read any ABI, I deduce you can't say the
ABI got broken if the stack is put at 2G or 1G or 3.5G or 4G again with
x86-64 (of course x86-64 can give the full 4G to userspace because the
kernel runs in the negative part of the [64bit] address space, as 2.0
could too).

Andrea

2002-12-06 23:14:12

by William Lee Irwin III

Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote:
>> No idea why there's not more support behind or interest in page
>> clustering. It's an optimization (not required) for 64-bit/saner arches.

On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> softpagesize sounds a good idea to try for archs with a page size < 8k
> indeed, modulo a few places where the 4k pagesize is part of the
> userspace abi, for that reason on x86-64 Andi recently suggested to
> changed the abi to assume a bigger page size and I suggested to assume
> it to be 2M and not a smaller thing as originally suggested, that way we
> waste some more virtual space (not an issue on 64bit) and some cache
> color (not a big deal either, those caches are multiway associative even
> if not fully associative), so eventually in theory we could even switch
> the page size to 2M ;)

The patch I'm talking about introduces a distinction between the size
of an area mapped by a PTE or TLB entry (MMUPAGE_SIZE) and the kernel's
internal allocation unit (PAGE_SIZE), and does (AFAICT) properly
vectored PTE operations in the VM to support the system's native page
size, and does a whole kernel audit of drivers/ and fs/ PAGE_SIZE usage
so that the distinction between PAGE_SIZE and MMUPAGE_SIZE is understood.


On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> however don't mistake softpagesize with the PAGE_CACHE_SIZE (the latter
> I think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE
> is a broken idea (i'm talking about the PAGE_CACHE_SIZE at large, not
> about the implementation that may even be fine with Hugh's patch
> applied).

PAGE_CACHE_SIZE is mostly an fs thing, so there's not much danger of
confusion, at least not in my mind.


On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> PAGE_CACHE_SIZE will never work well due the fragmentation problems it
> introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and to
> experiment with a soft PAGE_SIZE, multiple of the hardware PAGE_SIZE.
> That means the allocator minimal granularity will return 8k. on x86 that
> breaks a bit the ABI. on x86-64 the softpagesize would breaks only the 32bit
> compatibilty mode abi a little so it would be even less severe. And I
> think the softpagesize should be a config option so it can be
> experimented without breaking the default config even on x86.

Hmm, from the appearances of the patch (my ability to test the patch
is severely hampered by its age) it should actually maintain hardware
pagesize mmap() granularity, ABI compatibility, etc.


On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> the soft PAGE_SIZE will also decrease of an order of magnitude the page
> fault rate, the number of pte will be the same but we'll cluster the pte
> refills all served from the same I/O anyways (readhaead usually loads
> the next pages too anyways). So it's a kind of quite obvious design
> optimization to experiment with (maybe for 2.7?).

Sounds like the right timing for me.

A 16KB or 64KB kernel allocation unit would then annihilate
sizeof(mem_map) concerns on 3/1 splits. 720MB -> 180MB or 45MB.
Or on my home machine (768MB PC) 6MB -> 1.5MB or 384KB, which
is a substantial reduction in cache footprint and outright
memory footprint.
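
For anyone who wants to reproduce that arithmetic, a small standalone
calculation; sizeof(struct page) is an assumption here (it varies by
kernel version and config), so treat the output as ballpark figures in
the same spirit as the numbers above rather than exact matches:

#include <stdio.h>

int main(void)
{
    /* Assumed size of struct page; the real value depends on the kernel
     * version and config options, so these are ballpark numbers. */
    const unsigned long long page_struct_bytes = 48;
    const unsigned long long ram_bytes = 64ULL << 30;          /* 64GB box   */
    const unsigned long long units[]   = { 4096, 16384, 65536 }; /* alloc unit */

    for (int i = 0; i < 3; i++) {
        unsigned long long pages   = ram_bytes / units[i];
        unsigned long long mem_map = pages * page_struct_bytes;
        printf("%3lluKB unit: %9llu struct pages, mem_map ~%llu MB\n",
               units[i] / 1024, pages, mem_map >> 20);
    }
    return 0;
}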

I think this is a perfect example of how the increased awareness of
space consumption that highmem gives us helps us optimize all boxen.


Thanks,
Bill

2002-12-06 23:24:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 07:12:20AM -0800, William Lee Irwin III wrote:
> On Fri, Dec 06, 2002 at 03:57:19PM +0100, Andrea Arcangeli wrote:
> > The only alternate fix is to be able to migrate pagetables (1st level
> > only, pte) and all the other highmem capable allocations at runtime
> > (pagecache, shared memory etc..). Which is clearly not possible in 2.5
> > and 2.4.
>
> Actually it should not be difficult for 2.5, though it's not done now.

"difficult" is a relative word, nothing is difficult but everything is
difficult, depends the way you feel about it.

but note that even with rmap you don't know the pmd that points to the
pte you want to relocate, and for the anon pages you miss the
information about the mm and virtual address where those pages are
allocated, so basically rmap is useless for doing it; you need to do the
pagetable walking a la swap_out, which in turn is not easier at all in
2.5 than it could have been in 2.4 (of course this is a 2.5 thing only,
I just want to say that if it's not difficult in 2.5 it wasn't difficult
in 2.4 either).

Andrea

2002-12-06 23:38:05

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Sat, Dec 07, 2002 at 12:32:43AM +0100, Andrea Arcangeli wrote:
> but note that even with rmap you don't know the pmd that points to the
> pte that you want to relocate and for the anon pages you miss
> information about mm and virtual address where those pages are
> allocated, so basically rmap is useless for doing it, you need to do the
> pagetable walking ala swap_out, in turn it's not easier at all in 2.5
> than it could been in 2.4 (but of course this is a 2.5 thing only, I
> just want to say that if it's not difficult in 2.5 it wasn't difficult
> in 2.4 either).

Actually, we do. From include/asm-generic/rmap.h:

static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
{
#ifdef BROKEN_PPC_PTE_ALLOC_ONE
        /* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
        extern int mem_init_done;

        if (!mem_init_done)
                return;
#endif
        page->mapping = (void *)mm;
        page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
        inc_page_state(nr_page_table_pages);
}

So pagetable pages are tagged with the right information, and in
principle could even be tagged here with the pmd in page->private.

These fields are actually required for use by try_to_unmap_one(),
and something similar could be done for a try_to_move_one(). This
information remains intact with shared pagetables, where it is
generalized so that the PTE page is tagged with a list of mm's (the
mm_chain); in that case no unique pmd can be stored directly in the
page, but it can just as easily be derived from the mm's in the
mm_chain.

But there's no denying it would involve a substantial amount of work.


Bill

2002-12-06 23:42:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 03:21:25PM -0800, William Lee Irwin III wrote:
> On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote:
> >> No idea why there's not more support behind or interest in page
> >> clustering. It's an optimization (not required) for 64-bit/saner arches.
>
> On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> > softpagesize sounds a good idea to try for archs with a page size < 8k
> > indeed, modulo a few places where the 4k pagesize is part of the
> > userspace abi, for that reason on x86-64 Andi recently suggested to
> > changed the abi to assume a bigger page size and I suggested to assume
> > it to be 2M and not a smaller thing as originally suggested, that way we
> > waste some more virtual space (not an issue on 64bit) and some cache
> > color (not a big deal either, those caches are multiway associative even
> > if not fully associative), so eventually in theory we could even switch
> > the page size to 2M ;)
>
> The patch I'm talking about introduces a distinction between the size
> of an area mapped by a PTE or TLB entry (MMUPAGE_SIZE) and the kernel's
> internal allocation unit (PAGE_SIZE), and does (AFAICT) properly
> vectored PTE operations in the VM to support the system's native page
> size, and does a whole kernel audit of drivers/ and fs/ PAGE_SIZE usage
> so that the distinction between PAGE_SIZE and MMUPAGE_SIZE is understood.

My point is that making any distinction will lead to inevitable
fragmentation of memory.

Going to a higher kernel-wide PAGE_SIZE and avoiding the distinction
will even fix the 8k fragmentation issue with the kernel stack ;) Not to
mention allowing more workloads to be able to use all the ram of the
32bit 64G boxes.

> On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> > however don't mistake softpagesize with the PAGE_CACHE_SIZE (the latter
> > I think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE
> > is a broken idea (i'm talking about the PAGE_CACHE_SIZE at large, not
> > about the implementation that may even be fine with Hugh's patch
> > applied).
>
> PAGE_CACHE_SIZE is mostly an fs thing, so there's not much danger of
> confusion, at least not in my mind.

ok, I thought MMUPAGE_SIZE and PAGE_CACHE_SIZE were related, but of
course they don't need to be.

> On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> > PAGE_CACHE_SIZE will never work well due the fragmentation problems it
> > introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and to
> > experiment with a soft PAGE_SIZE, multiple of the hardware PAGE_SIZE.
> > That means the allocator minimal granularity will return 8k. on x86 that
> > breaks a bit the ABI. on x86-64 the softpagesize would breaks only the 32bit
> > compatibilty mode abi a little so it would be even less severe. And I
> > think the softpagesize should be a config option so it can be
> > experimented without breaking the default config even on x86.
>
> Hmm, from the appearances of the patch (my ability to test the patch
> is severely hampered by its age) it should actually maintain hardware
> pagesize mmap() granularity, ABI compatibility, etc.

If it only implements the MMUPAGE_SIZE, yes, it can.

You break the ABI as soon as you change the kernel-wide PAGE_SIZE; it is
allowed only for 64bit binaries running on an x86-64 kernel. The 32bit
binaries running in compatibility mode, as said, would suffer a bit, but
most things should run, and we can make hacks like using anon mappings if
the files are small, just for the sake of running some 32bit app (like we
use anon mappings for a.out binaries needing 1k offsets today).

That said, even MMUPAGE_SIZE alone would be useful, but I'd prefer the
kernel-wide PAGE_SIZE to be increased (with the disadvantage of breaking
the ABI, but it would be a config option; even the 2G/3.5G/1G split has
a chance of breaking some app, although I wouldn't classify that as an
ABI violation, for the reason explained in one of the last emails).

> On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> > the soft PAGE_SIZE will also decrease of an order of magnitude the page
> > fault rate, the number of pte will be the same but we'll cluster the pte
> > refills all served from the same I/O anyways (readhaead usually loads
> > the next pages too anyways). So it's a kind of quite obvious design
> > optimization to experiment with (maybe for 2.7?).
>
> Sounds like the right timing for me.
>
> A 16KB or 64KB kernel allocation unit would then annihilate
> sizeof(mem_map) concerns on 3/1 splits. 720MB -> 180MB or 45MB.
>
> Or on my home machine (768MB PC) 6MB -> 1.5MB or 384KB, which
> is a substantial reduction in cache footprint and outright
> memory footprint.

Yep.

>
> I think this is a perfect example of how the increased awareness of
> space consumption highmem gives us helps us optimize all boxen.

In this case funnily it has a chance to help some 64bit boxes too ;).

Andrea

2002-12-06 23:49:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 03:45:24PM -0800, William Lee Irwin III wrote:
> On Sat, Dec 07, 2002 at 12:32:43AM +0100, Andrea Arcangeli wrote:
> > but note that even with rmap you don't know the pmd that points to the
> > pte that you want to relocate and for the anon pages you miss
> > information about mm and virtual address where those pages are
> > allocated, so basically rmap is useless for doing it, you need to do the
> > pagetable walking ala swap_out, in turn it's not easier at all in 2.5
> > than it could been in 2.4 (but of course this is a 2.5 thing only, I
> > just want to say that if it's not difficult in 2.5 it wasn't difficult
> > in 2.4 either).
>
> Actually, we do. From include/asm-generic/rmap.h:
>
> static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
> {
> #ifdef BROKEN_PPC_PTE_ALLOC_ONE
> /* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
> extern int mem_init_done;
>
> if (!mem_init_done)
> return;
> #endif
> page->mapping = (void *)mm;
> page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
> inc_page_state(nr_page_table_pages);
> }
>
> So pagetable pages are tagged with the right information, and in
> principle could even be tagged here with the pmd in page->private.

sorry, I hadn't noticed the overloading of page->mapping to store the mm.
But yes, I should have realized that you had to, because otherwise you
wouldn't know how to flush the tlb ;) so without the mm and address rmap
would be useless. So via the address and mapping you can walk the
pagetables and reach it with lower complexity than without rmap. Doing
the pagetable walk still wouldn't be a huge increase in code complexity,
but it would increase the "computational" complexity of the algorithm.
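
A minimal sketch of that walk, assuming the tagging done by
pgtable_add_rmap() above (page->mapping holds the mm, page->index the
base virtual address); this is kernel-internal pseudocode with a made-up
helper name, 2.4-style accessors, and no locking or highmem handling:

/* Hypothetical helper: find the pmd entry that maps a given pte page,
 * using the mm and base virtual address recorded in page->mapping and
 * page->index by pgtable_add_rmap(). */
static pmd_t *pmd_for_pte_page(struct page *pte_page)
{
        struct mm_struct *mm = (struct mm_struct *)pte_page->mapping;
        unsigned long address = pte_page->index;
        pgd_t *pgd = pgd_offset(mm, address);

        if (pgd_none(*pgd) || pgd_bad(*pgd))
                return NULL;
        return pmd_offset(pgd, address);
}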

> These fields are actually required for use by try_to_unmap_one(),
> and something similar could be done for a try_to_move_one(). This
> information remains intact with shared pagetables, and is generalized
> so that the PTE page is tagged with a list of mm's (the mm_chain),
> and in that case no unique pmd could be directly stored in the page,
> but it could just as easily be derived from the mm's in the mm_chain.
>
> But there's no denying it would involve a substantial amount of work.
>
>
> Bill


Andrea

2002-12-06 23:53:53

by Andrew Morton

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

William Lee Irwin III wrote:
>
> ...
> A 16KB or 64KB kernel allocation unit would then annihilate

You want to be careful about this:

CPU: L1 I cache: 16K, L1 D cache: 16K

Because instantiating a 16k page into user pagetables in
one hit means that it must all be zeroed. With these large
pagesizes that means that the application is likely to get
100% L1 misses against the new page, whereas it currently
gets 100% hits.

I'd expect this performance dropoff to occur when going from 8k
to 16k. By the time you get to 32k it would be quite bad.

One way to address this could be to find a way of making the
pages present, but still cause a fault on first access. Then
have a special-case fastpath in the fault handler to really wipe
the page just before it is used. I don't know how though - maybe
_PAGE_USER?
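
As a rough userspace analogue of that idea (not how the kernel would do
it, and anonymous mmap memory is already zero-filled, so the memset below
merely stands in for the kernel-side wipe): keep the region inaccessible
and wipe each 4KB piece in the fault handler just before its first use:

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SUBPAGE  4096UL
#define SOFTPAGE (16 * 1024UL)     /* assumed 16KB allocation unit */

static void wipe_on_first_touch(int sig, siginfo_t *si, void *uctx)
{
    (void)sig; (void)uctx;
    /* The 4KB piece that was actually touched. */
    char *piece = (char *)((uintptr_t)si->si_addr & ~(SUBPAGE - 1));
    mprotect(piece, SUBPAGE, PROT_READ | PROT_WRITE);
    memset(piece, 0, SUBPAGE);     /* wipe just before it is used */
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = wipe_on_first_touch;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    char *buf = mmap(NULL, SOFTPAGE, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    buf[123] = 'x';  /* first touch faults; only its 4KB piece gets wiped */
    printf("touched 1 of %lu pieces\n", SOFTPAGE / SUBPAGE);
    return 0;
}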

get_user_pages() would need attention too - you don't want to
allow the user to perform O_DIRECT writes of uninitialised
pages to their files...

2002-12-07 00:14:11

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

> William Lee Irwin III wrote:
> >
> > ...
> > A 16KB or 64KB kernel allocation unit would then annihilate
>
On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> You want to be careful about this:
> CPU: L1 I cache: 16K, L1 D cache: 16K
> Because instantiating a 16k page into user pagetables in
> one hit means that it must all be zeroed. With these large
> pagesizes that means that the application is likely to get
> 100% L1 misses against the new page, whereas it currently
> gets 100% hits.

16K is reasonable; after that one might as well go all the way.
About the only way to cope is amortizing it by cacheing zeroed pages,
and that has other downsides.


Bill

2002-12-07 00:14:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> William Lee Irwin III wrote:
> >
> > ...
> > A 16KB or 64KB kernel allocation unit would then annihilate
>
> You want to be careful about this:
>
> CPU: L1 I cache: 16K, L1 D cache: 16K
>
> Because instantiating a 16k page into user pagetables in
> one hit means that it must all be zeroed. With these large
> pagesizes that means that the application is likely to get
> 100% L1 misses against the new page, whereas it currently
> gets 100% hits.
>
> I'd expect this performance dropoff to occur when going from 8k
> to 16k. By the time you get to 32k it would be quite bad.
>
> One way to address this could be to find a way of making the
> pages present, but still cause a fault on first access. Then
> have a special-case fastpath in the fault handler to really wipe
> the page just before it is used. I don't know how though - maybe
> _PAGE_USER?

I think taking the page fault itself is the biggest overhead, and that is
what would be nice to avoid on every second virtually consecutive page;
if we have to take a page fault on every page anyway, we may as well do
the rest of the work too, which should not be that big compared to the
overhead of entering/exiting the kernel and preparing to handle the
fault.

>
> get_user_pages() would need attention too - you don't want to
> allow the user to perform O_DIRECT writes of uninitialised
> pages to their files...


Andrea

2002-12-07 00:22:59

by Andrew Morton

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

William Lee Irwin III wrote:
>
> > William Lee Irwin III wrote:
> > >
> > > ...
> > > A 16KB or 64KB kernel allocation unit would then annihilate
> >
> On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> > You want to be careful about this:
> > CPU: L1 I cache: 16K, L1 D cache: 16K
> > Because instantiating a 16k page into user pagetables in
> > one hit means that it must all be zeroed. With these large
> > pagesizes that means that the application is likely to get
> > 100% L1 misses against the new page, whereas it currently
> > gets 100% hits.
>
> 16K is reasonable; after that one might as well go all the way.

16k will cause serious slowdowns.

> About the only way to cope is amortizing it by cacheing zeroed pages,
> and that has other downsides.

So will that. You've seen the kernbench profiles...

You will need to find a way to clear the page just before it
is used.

2002-12-07 00:23:42

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

At some point in the past, I wrote:
> My point is that making any distinction will lead to inevitable
> fragmentation of memory.

It's mostly userspace; the kernel is usually (hello drivers/ !) cautious
and uses slab.c's anti-internal fragmentation techniques for most structs.


At some point in the past, I wrote:
>> Hmm, from the appearances of the patch (my ability to test the patch
>> is severely hampered by its age) it should actually maintain hardware
>> pagesize mmap() granularity, ABI compatibility, etc.

On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote:
> If it only implements the MMUPAGE_SIZE, yes, it can.
> You break the ABI as soon as you change the kernel wide PAGE_SIZE. it is
> allowed only on 64bit binaries running on a x86-64 kernel. The 32bit
> binaries running in compatibility mode as said would suffer a bit, but
> most things should run and we can make hacks like using anon mappings if
> the files are small just for the sake of running some app 32bit (like we
> use anon mappings for a.out binaries needing 1k offsets today).

I'm not sure what to make of this. The distinction and the PTE vectoring
API (AFAICT) allow PTE's to map sub-PAGE_SIZE (MMUPAGE_SIZE, to be
exact) regions. Someone start screaming if I misread the patch.


On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote:
> Said that even the MMUPAGE_SIZE alone would be useful, but I'd prefer if
> the kernel wide PAGE_SIZE would be increased (with the disavantage of
> breaking the ABI, but it would be a config option, even the 2G/3.5G/1G
> split has the chance of breaking some app despite I wouldn't classify it
> as an ABI violation for the reason explained in one of the last emails).

Userspace is required to have >= 3GB of virtual space, according to the
SVR4 i386 ABI spec. But we don't follow that strictly anyway.


At some point in the past, I wrote:
>> I think this is a perfect example of how the increased awareness of
>> space consumption highmem gives us helps us optimize all boxen.

On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote:
> In this case funnily it has a chance to help some 64bit boxes too ;).

I've heard the sizeof(mem_map) footprint is worse on 64-bit because,
while PAGE_SIZE remains the same, pointers double in size. This
would help a bit there, too.


Bill

2002-12-07 00:28:14

by Andrew Morton

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

Andrea Arcangeli wrote:
>
> > One way to address this could be to find a way of making the
> > pages present, but still cause a fault on first access. Then
> > have a special-case fastpath in the fault handler to really wipe
> > the page just before it is used. I don't know how though - maybe
> > _PAGE_USER?
>
> I think taking the page fault itself is the biggest overhead that would
> be nice to avoid on every second virtually consecutive page, if we've to
> take the page fault on every page we could as well do the rest of the
> work that should not that big compared to the overhead of
> entering/exiting kernel and preparing to handle the fault.

Yes, 8k at a time would probably be OK. Say, L1-size/2.

I expect that anything bigger would cause 2x or worse slowdowns of a
range of apps.

2002-12-07 00:39:21

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> One way to address this could be to find a way of making the
> pages present, but still cause a fault on first access. Then
> have a special-case fastpath in the fault handler to really wipe
> the page just before it is used. I don't know how though - maybe
> _PAGE_USER?

All of the problems there have to do with accounting for which pieces of
the page are zeroed. The PTE's map the same size areas (MMUPAGE_SIZE
stays 4KB)... So after a partial zero we end up with a struct page
pointing at MMUPAGE_COUNT mmupages, a PTE pointing at the one that's
been zeroed, and not a whole lot of flag bits left to keep track of
which pieces are initialized. How about a single PG_zero flag, and a map
in page->private of which pieces of the thing are already zeroed?
(Basically the swapcache can be considered the owning fs, and it then
uses page->private for those shenanigans.)
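
A toy standalone model of that bookkeeping; PG_zero and page->private are
the names from above, everything else (the flag value, the 16KB/4KB
split) is assumed:

#include <stdbool.h>
#include <stdio.h>

#define MMUPAGE_COUNT 4            /* assumed: 16KB page / 4KB mmupages */
#define PG_zero       (1u << 0)    /* "some pieces of this page are zeroed" */

struct toy_page {
    unsigned int  flags;           /* stands in for page->flags   */
    unsigned long private;         /* stands in for page->private */
};

static void mark_zeroed(struct toy_page *p, int piece)
{
    p->flags   |= PG_zero;
    p->private |= 1UL << piece;
}

static bool piece_is_zeroed(const struct toy_page *p, int piece)
{
    return (p->flags & PG_zero) && (p->private & (1UL << piece));
}

int main(void)
{
    struct toy_page page = { 0, 0 };

    mark_zeroed(&page, 2);         /* only the third 4KB piece was wiped */
    for (int i = 0; i < MMUPAGE_COUNT; i++)
        printf("piece %d zeroed: %d\n", i, piece_is_zeroed(&page, i));
    return 0;
}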


On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> get_user_pages() would need attention too - you don't want to
> allow the user to perform O_DIRECT writes of uninitialised
> pages to their files...

Well, I'm not sure how that would happen. fs io should deal with
kernel PAGE_SIZE-sized units, so we're dealing with anonymous memory
only. An O_DIRECT write would only find the part of the page mapped by
a PTE, which must have been pre-zeroed prior to being mapped. Reads
seem to be in equally good shape. Perhaps it's more a case of "this is
yet another thing to audit when dealing with it"; I'll admit that the
audit needed for this thing is somewhat large.


Bill

2002-12-07 01:36:57

by Alan

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote:
> 16K is reasonable; after that one might as well go all the way.
> About the only way to cope is amortizing it by cacheing zeroed pages,
> and that has other downsides.

Some of the lower end CPUs only have about 12-16K of L1. I don't think
that's a big problem since those aren't going to be highmem or large
memory users.

2002-12-07 01:39:27

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote:
>> 16K is reasonable; after that one might as well go all the way.
>> About the only way to cope is amortizing it by cacheing zeroed pages,
>> and that has other downsides.

On Sat, Dec 07, 2002 at 02:19:49AM +0000, Alan Cox wrote:
> Some of the lower end CPU's only have about 12-16K of L1. I don't think
> thats a big problem since those aren't going to be highmem or large
> memory users

It's an arch parameter, so they'd probably just
#define MMUPAGE_SIZE PAGE_SIZE
Hugh's original patch did that for all non-i386 arches.

Bill

2002-12-07 01:48:41

by Alan

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Sat, 2002-12-07 at 01:46, William Lee Irwin III wrote:
> It's an arch parameter, so they'd probably just
> #define MMUPAGE_SIZE PAGE_SIZE
> Hugh's original patch did that for all non-i386 arches.

These are low end x86 - but we could do this based on

<= i586
i586
i686+

2002-12-07 01:48:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Fri, Dec 06, 2002 at 05:46:43PM -0800, William Lee Irwin III wrote:
> On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote:
> >> 16K is reasonable; after that one might as well go all the way.
> >> About the only way to cope is amortizing it by cacheing zeroed pages,
> >> and that has other downsides.
>
> On Sat, Dec 07, 2002 at 02:19:49AM +0000, Alan Cox wrote:
> > Some of the lower end CPU's only have about 12-16K of L1. I don't think
> > thats a big problem since those aren't going to be highmem or large
> > memory users
>
> It's an arch parameter, so they'd probably just
> #define MMUPAGE_SIZE PAGE_SIZE
> Hugh's original patch did that for all non-i386 arches.

I would say the most important thing to evaluate, even before the cpu
and cache size, is the amount of ram in the machine. The major downside
of going to 8k is the loss of granularity in the paging: a small machine
may not want to page in the next page too unless it's been explicitly
touched by the program, so as to make the best use of the little
available ram and to have the most fine-grained info possible about the
working set in the pagetables. The breakpoint depends on the workload;
it would probably make sense to keep all boxes with <= 64M at 4k, or
something along those lines.

Andrea

2002-12-07 02:01:51

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Sat, 2002-12-07 at 01:46, William Lee Irwin III wrote:
>> It's an arch parameter, so they'd probably just
>> #define MMUPAGE_SIZE PAGE_SIZE
>> Hugh's original patch did that for all non-i386 arches.

On Sat, Dec 07, 2002 at 02:31:37AM +0000, Alan Cox wrote:
> These are low end x86 - but we could this based on
> <= i586
> i586
> i686+

It's relatively flexible as to the choice of PAGE_SIZE (it's
MMUPAGE_SIZE that's defined by the hardware); about the only constraints
are that jacking it up to where PAGE_SIZE spans pmd's breaks the core
vectoring API, that PAGE_SIZE >= MMUPAGE_SIZE, that both are powers of 2,
that the vectors (which are of size MMUPAGE_COUNT*sizeof(pte_t *)) are
stack-allocated, and that arch code has to understand small bits of it.

It sounds like we could pick sane defaults based on CPU revision.
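
A hypothetical sketch of such defaults; the CONFIG_M* symbols are the
stock i386 CPU-family config options, but the shift values and the
policy are purely illustrative, not from any posted patch:

/* Hardware paging granularity on i386 is fixed. */
#define MMUPAGE_SHIFT   12
#define MMUPAGE_SIZE    (1UL << MMUPAGE_SHIFT)

/* Pick the kernel allocation unit from the configured CPU family:
 * small-L1 / small-memory parts keep 4KB, newer parts get 16KB. */
#if defined(CONFIG_M386) || defined(CONFIG_M486) || defined(CONFIG_M586)
#define PAGE_SHIFT      12
#else
#define PAGE_SHIFT      14
#endif

#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define MMUPAGE_COUNT   (PAGE_SIZE / MMUPAGE_SIZE)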



Bill

2002-12-07 10:48:15

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

On Sat, 2002-12-07 at 01:01, Andrew Morton wrote:
> William Lee Irwin III wrote:
> >
> > ...
> > A 16KB or 64KB kernel allocation unit would then annihilate
>
> You want to be careful about this:
>
> CPU: L1 I cache: 16K, L1 D cache: 16K
>
> Because instantiating a 16k page into user pagetables in
> one hit means that it must all be zeroed. With these large
> pagesizes that means that the application is likely to get
> 100% L1 misses against the new page, whereas it currently
> gets 100% hits.

If you really want you can cheat that 100% statistic into something much
lower by zeroing the page from back to front (based on the exact
faulting address even, because you know THAT one will get used) and/or
zeroing the second half while bypassing the cache. At least it's 50%
hits then ;)

Still not 100% and I still agree that the 8Kb number is much nicer for
16Kb L1 cache machines....
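
A standalone sketch of that zeroing order, assuming a 16KB, 16KB-aligned
soft page made of 4KB pieces (the cache-bypassing stores for the cold
pieces are noted but not implemented): wipe the cold pieces first and
the piece containing the faulting address last, so it is still warm in
L1 when the access retries:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MMUPAGE_SIZE  4096UL
#define SOFTPAGE_SIZE (16 * 1024UL)   /* assumed 16KB allocation unit */

/* 'page' points at a SOFTPAGE_SIZE-aligned soft page; 'fault_addr' is
 * the address that triggered the fault inside it. */
static void zero_softpage(char *page, uintptr_t fault_addr)
{
    size_t hot = (fault_addr & (SOFTPAGE_SIZE - 1)) / MMUPAGE_SIZE;

    /* Cold 4KB pieces first, back to front (non-temporal stores could
     * be used here to avoid polluting L1 at all)... */
    for (size_t i = SOFTPAGE_SIZE / MMUPAGE_SIZE; i-- > 0; )
        if (i != hot)
            memset(page + i * MMUPAGE_SIZE, 0, MMUPAGE_SIZE);

    /* ...then the piece that is about to be used, so it stays in L1. */
    memset(page + hot * MMUPAGE_SIZE, 0, MMUPAGE_SIZE);
}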

2002-12-07 18:21:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?

Andrea Arcangeli <[email protected]> writes:

> On Fri, Dec 06, 2002 at 07:12:38AM -0800, William Lee Irwin III wrote:
> > split just to get a bloated mem_map to fit. Many of the smaller apps,
> > e.g. /bin/sh etc. are indifferent to the ABI violation.
>
> the problem of the split is that it would reduce the address space
> available to userspace that is quite critical on big machines (one of
> the big advantages of 64bit that can't be fixed on 32bit) but I wouldn't
> classify it as an ABI violation, infact the little I can remember about
> the 2.0 kernels [I almost never read that code] is that it had shared
> address space and tlb flush while entering/exiting kernel, so I can bet
> the user stack in 2.0 was put at 4G, not at 3G. 2.2 had to put it at 3G
> because then the address space was shared with the obvious performance
> advantages, so while I didn't read any ABI, I deduce you can't say the
> ABI got broken if the stack is put at 2G or 1G or 3.5G or 4G again with
> x86-64 (of course x86-64 can give the full 4G to userspace because the
> kernel runs in the negative part of the [64bit] address space, as 2.0
> could too).

As I remember it, 2.0 used the 3/1 split; the difference was that the
segments had different base register values, so the kernel thought it
was running at 0. %fs, which retained a base address of 0, was used
when access to user space was desired.

Eric