2002-10-12 06:33:57

by Andrew Morton

Subject: 2.5.42-mm2


url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.42/2.5.42-mm2/

mm1 had a little problem in the compilation department - missing chunk
from fs/fcntl.c.

+fix-pgpgout.patch

Fix /proc/vmstat:pgpgin/pgpgout accounting for 512-byte IOs

+dio-fine-alignment.patch

Bring back the 512-byte alignment patch

+sard.patch

Keep sard ticking over

+remove-kiobufs.patch

Remove the kiobuf infrastructure.






kgdb.patch

oprofile-25.patch

misc.patch
misc

hugetlb-meminfo.patch
change hugetlbpage info in /proc/meminfo

dio-bio-add-fix-1.patch
Fix direct-io for bio_add_page()

net-loopback.patch
Disable second copy in the network loopback driver

swsusp-feature.patch
add shrink_all_memory() for swsusp

large-queue-throttle.patch
Improve writer throttling for small machines

exit-page-referenced.patch
Propagate pte referenced bit into pagecache during unmap

swappiness.patch
swappiness control

mapped-start-active.patch
start anonymous pages on the active list

rename-dirty_async_ratio.patch
rename dirty_async_ratio to dirty_ratio

auto-dirty-memory.patch
adaptive dirty-memory thresholding

batched-slab-asap.patch
batched slab shrinking and shrinker callback API

blkdev-o_direct-short-read.patch
Fix O_DIRECT blockdev reads at end-of-device

fix-pgpgout.patch
Fix block IO accounting for 512-byte requests

orlov-allocator.patch

blk-queue-bounce.patch
inline blk_queue_bounce

lseek-ext2_readdir.patch
remove lock_kernel() from ext2_readdir()

msync-correctness.patch
msync correctness fix

dio-fine-alignment.patch
Allow O_DIRECT to use 512-byte alignment

sard.patch
SARD disk accounting

write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock

rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)

spin-lock-check.patch
spinlock/rwlock checking infrastructure

hugetlb-prefault.patch
hugetlbpages: factor out some code for hugetlbfs

ramfs-aops.patch
Move ramfs address_space ops into libfs

hugetlb-header-split.patch
Move hugetlb declarations into their own header

hugetlbfs.patch
hugetlbfs file system

hugetlb-shm.patch
hugetlbfs backing for SYSV shared memory

page_reserved-accounting.patch
Global PageReserved accounting

use-page_reserved_accounting.patch
Use PG_reserved accounting in the VM

ramfs-prepare-write-speedup.patch
correctness fixes in libfs address_space ops

akpm-deadline.patch
deadline scheduler tweaks

intel-user-copy.patch
Faster copy_*_user for Intel ia32 CPUs

raid0-fix.patch
RAID0 fix

rmqueue_bulk.patch
bulk page allocator

free_pages_bulk.patch
Bulk page freeing function

hot_cold_pages.patch
Hot/Cold pages and zone->lock amortisation

readahead-cold-pages.patch
Use cache-cold pages for pagecache reads.

pagevec-hot-cold-hint.patch
hot/cold hints for truncate and page reclaim

page-reservation.patch
Page reservation API

o_streaming.patch
O_STREAMING support

remove-kiobufs.patch
Remove kiobufs and kiovecs

slab-split-01-rename.patch
slab cleanup: rename static functions

slab-split-02-SMP.patch
slab: enable the cpu arrays on uniprocessor

slab-split-03-tail.patch
slab: reduced internal fragmentation

slab-split-04-drain.patch
slab: take the spinlock in the drain function.

slab-split-05-name.patch
slab: remove spaces from /proc identifiers

slab-split-06-mand-cpuarray.patch
slab: cleanups and speedups

slab-split-07-inline.patch
slab: uninline poisoning checks

slab-split-08-reap.patch
slab: reap timers

cpucache_init-fix.patch
cpucache_init fix

slab-split-10-list_for_each_fix.patch
slab: fix for a list walking bug

shpte.patch

shpte-ifdef.patch
reduced ifdeffery in the shared pagetable code

shpte-mprotect-fix.patch
fix shared pagetable handling of mprotect

shpte-unmap-fix.patch
shared pagetable unmap fix

shmmap.patch
Proactively share page tables for shared memory

read_barrier_depends.patch
extended barrier primitives

rcu_ltimer.patch
RCU core

dcache_rcu.patch
Use RCU for dcache


2002-10-12 13:19:20

by Ed Tomlinson

Subject: Re: 2.5.42-mm2

Hi,

This builds fine but gets errors in depmod.

make -f arch/i386/lib/Makefile modules_install
if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.5.42-mm2; fi
depmod: *** Unresolved symbols in /lib/modules/2.5.42-mm2/kernel/fs/ext3/ext3.o
depmod: generic_file_aio_read
depmod: generic_file_aio_write
depmod: *** Unresolved symbols in /lib/modules/2.5.42-mm2/kernel/fs/nfs/nfs.o
depmod: generic_file_aio_read
depmod: generic_file_aio_write
depmod: *** Unresolved symbols in /lib/modules/2.5.42-mm2/kernel/fs/nfsd/nfsd.o
depmod: auth_domain_find
depmod: cache_fresh
depmod: unix_domain_find
depmod: auth_domain_put
depmod: cache_flush
depmod: cache_unregister
depmod: add_hex
depmod: cache_check
depmod: svcauth_unix_purge
depmod: get_word
depmod: cache_clean
depmod: cache_register
depmod: auth_unix_lookup
depmod: auth_unix_add_addr
depmod: cache_init
depmod: auth_unix_forget_old
depmod: add_word

Hope this helps,
Ed Tomlinson

2002-10-13 10:17:39

by William Lee Irwin III

Subject: Re: 2.5.42-mm2

On Fri, Oct 11, 2002 at 11:39:33PM -0700, Andrew Morton wrote:
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.42/2.5.42-mm2/

This patch does 5 things:

(1) when the OOM killer fails and the system panics, calls
show_free_areas()
(2) reorganizes show_free_areas() to use for_each_zone()
(3) adds per-cpu stats to show_free_areas()
(4) tags output from show_free_areas() with node and zone information
(5) initializes zone->per_cpu_pageset[cpu].pcp[temperature].reserved
in free_area_init_core()

The net effect is better reporting of where memory went, which was
essential to determining the cause of this failure, and that the
reserved page stuff can actually boot. Prior to this it was getting
total garbage in ->reserved after free_area_init_core():

Node 0, Zone DMA: per-cpu:
cpu 0 hot: low 32, high 96, batch 16, reserved 1683971840
cpu 0 cold: low 0, high 32, batch 16, reserved 1953719651
cpu 1 hot: low 32, high 96, batch 16, reserved 1702256479
cpu 1 cold: low 0, high 32, batch 16, reserved 825241951

And this caused a false bootmem OOM. It would have been impossible to
determine the cause of failure without show_free_areas() modifications,
and this is a box-killing bug that wipes out a significant fraction of
the high-end developer base from 2.5.x contributions as well as
preventing all i386 NUMA boxen, which are the highest volume high-end
configurations, from booting. Furthermore, it also cleans up
show_free_areas() in a very straightforward fashion.
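The garbage ->reserved values quoted above are what a field looks like when nothing ever writes it and the memory behind it was not cleared. A minimal userspace sketch of the idea follows. It assumes simplified stand-ins for the kernel structures: the struct layouts, the NR_CPUS value, and the helper names pageset_init() and format_pcp() are illustrative only and are not the kernel's code.

```c
#include <stdio.h>

#define NR_CPUS 2	/* assumption: small fixed CPU count for the sketch */

/* Simplified stand-in for the per-cpu page list bookkeeping. */
struct per_cpu_pages {
	int low;	/* refill threshold */
	int high;	/* drain threshold */
	int batch;	/* chunk size for refill/drain */
	int reserved;	/* the field that was left unset */
};

struct per_cpu_pageset {
	struct per_cpu_pages pcp[2];	/* 0: hot, 1: cold */
};

/*
 * Set every field explicitly, so the result does not depend on
 * whether the caller handed us zeroed memory.
 */
void pageset_init(struct per_cpu_pageset *ps)
{
	struct per_cpu_pages *pcp;

	pcp = &ps->pcp[0];	/* hot */
	pcp->low = 32;
	pcp->high = 96;
	pcp->batch = 16;
	pcp->reserved = 0;

	pcp = &ps->pcp[1];	/* cold */
	pcp->low = 0;
	pcp->high = 32;
	pcp->batch = 16;
	pcp->reserved = 0;
}

/* Format one line in the same shape as the per-cpu report above. */
int format_pcp(char *buf, size_t len, int cpu, int temperature,
	       const struct per_cpu_pages *pcp)
{
	return snprintf(buf, len,
			"cpu %d %s: low %d, high %d, batch %d, reserved %d",
			cpu, temperature ? "cold" : "hot",
			pcp->low, pcp->high, pcp->batch, pcp->reserved);
}
```

Calling pageset_init() on a deliberately scribbled-over struct and then formatting each entry should report reserved 0 for both the hot and cold lists, whatever the buffer held beforehand.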

Against 2.5.42-mm2.


diff -urpN mm-2.5.42/mm/oom_kill.c virgin-2.5.42/mm/oom_kill.c
--- mm-2.5.42/mm/oom_kill.c 2002-10-11 21:22:08.000000000 -0700
+++ virgin-2.5.42/mm/oom_kill.c 2002-10-13 01:35:51.000000000 -0700
@@ -172,8 +172,10 @@ static void oom_kill(void)
p = select_bad_process();

/* Found nothing?!?! Either we hang forever, or we panic. */
- if (p == NULL)
+ if (!p) {
+ show_free_areas();
panic("Out of memory and no killable processes...\n");
+ }

/* kill all processes that share the ->mm (i.e. all threads) */
do_each_thread(g, q)
diff -urpN mm-2.5.42/mm/page_alloc.c virgin-2.5.42/mm/page_alloc.c
--- mm-2.5.42/mm/page_alloc.c 2002-10-13 02:37:25.000000000 -0700
+++ virgin-2.5.42/mm/page_alloc.c 2002-10-13 02:05:12.000000000 -0700
@@ -830,11 +830,11 @@ void si_meminfo(struct sysinfo *val)
*/
void show_free_areas(void)
{
- pg_data_t *pgdat;
struct page_state ps;
- int type;
+ int cpu, temperature;
unsigned long active;
unsigned long inactive;
+ struct zone *zone;

get_page_state(&ps);
get_zone_counts(&active, &inactive);
@@ -843,26 +843,24 @@ void show_free_areas(void)
K(nr_free_pages()),
K(nr_free_highpages()));

- for (pgdat = pgdat_list; pgdat; pgdat = pgdat->pgdat_next)
- for (type = 0; type < MAX_NR_ZONES; ++type) {
- struct zone *zone = &pgdat->node_zones[type];
- printk("Zone:%s"
- " freepages:%6lukB"
- " min:%6lukB"
- " low:%6lukB"
- " high:%6lukB"
- " active:%6lukB"
- " inactive:%6lukB"
- "\n",
- zone->name,
- K(zone->free_pages),
- K(zone->pages_min),
- K(zone->pages_low),
- K(zone->pages_high),
- K(zone->nr_active),
- K(zone->nr_inactive)
- );
- }
+ for_each_zone(zone)
+ printk("Node %d, Zone:%s"
+ " freepages:%6lukB"
+ " min:%6lukB"
+ " low:%6lukB"
+ " high:%6lukB"
+ " active:%6lukB"
+ " inactive:%6lukB"
+ "\n",
+ zone->zone_pgdat->node_id,
+ zone->name,
+ K(zone->free_pages),
+ K(zone->pages_min),
+ K(zone->pages_low),
+ K(zone->pages_high),
+ K(zone->nr_active),
+ K(zone->nr_inactive)
+ );

printk("( Active:%lu inactive:%lu dirty:%lu writeback:%lu free:%u )\n",
active,
@@ -871,26 +869,49 @@ void show_free_areas(void)
ps.nr_writeback,
nr_free_pages());

- for (pgdat = pgdat_list; pgdat; pgdat = pgdat->pgdat_next)
- for (type = 0; type < MAX_NR_ZONES; type++) {
- struct list_head *elem;
- struct zone *zone = &pgdat->node_zones[type];
- unsigned long nr, flags, order, total = 0;
+ for_each_zone(zone) {
+ struct list_head *elem;
+ unsigned long nr, flags, order, total = 0;
+
+ printk("Node %d, Zone %s: ", zone->zone_pgdat->node_id, zone->name);
+ if (!zone->present_pages) {
+ printk("empty\n");
+ continue;
+ }

- if (!zone->present_pages)
- continue;
+ spin_lock_irqsave(&zone->lock, flags);
+ for (order = 0; order < MAX_ORDER; order++) {
+ nr = 0;
+ list_for_each(elem, &zone->free_area[order].free_list)
+ ++nr;
+ total += nr << order;
+ printk("%lu*%lukB ", nr, K(1UL) << order);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ printk("= %lukB)\n", K(total));
+ }

- spin_lock_irqsave(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
- nr = 0;
- list_for_each(elem, &zone->free_area[order].free_list)
- ++nr;
- total += nr << order;
- printk("%lu*%lukB ", nr, K(1UL) << order);
- }
- spin_unlock_irqrestore(&zone->lock, flags);
- printk("= %lukB)\n", K(total));
+ for_each_zone(zone) {
+ printk("Node %d, Zone %s: per-cpu:", zone->zone_pgdat->node_id, zone->name);
+
+ if (!zone->present_pages) {
+ printk(" empty\n");
+ continue;
+ } else
+ printk("\n");
+
+ for (cpu = 0; cpu < NR_CPUS; ++cpu) {
+ struct per_cpu_pageset *pageset = zone->pageset + cpu;
+ for (temperature = 0; temperature < 2; temperature++)
+ printk("cpu %d %s: low %d, high %d, batch %d, reserved %d\n",
+ cpu,
+ temperature ? "cold" : "hot",
+ pageset->pcp[temperature].low,
+ pageset->pcp[temperature].high,
+ pageset->pcp[temperature].batch,
+ pageset->pcp[temperature].reserved);
}
+ }

show_swap_cache_info();
}
@@ -1097,6 +1118,7 @@ static void __init free_area_init_core(s
pcp->low = 32;
pcp->high = 96;
pcp->batch = 16;
+ pcp->reserved = 0;
INIT_LIST_HEAD(&pcp->list);

pcp = &zone->pageset[cpu].pcp[1]; /* cold */
@@ -1104,6 +1126,7 @@ static void __init free_area_init_core(s
pcp->low = 0;
pcp->high = 32;
pcp->batch = 16;
+ pcp->reserved = 0;
INIT_LIST_HEAD(&pcp->list);
}
INIT_LIST_HEAD(&zone->active_list);

2002-10-13 17:41:36

by Andrew Morton

Subject: Re: 2.5.42-mm2

William Lee Irwin III wrote:
>
> @@ -1104,6 +1126,7 @@ static void __init free_area_init_core(s
> pcp->low = 0;
> pcp->high = 32;
> pcp->batch = 16;
> + pcp->reserved = 0;
> INIT_LIST_HEAD(&pcp->list);
> }
> INIT_LIST_HEAD(&zone->active_list);

OK. But that's been there since 2.5.40-mm2. Why did it suddenly
bite?

2002-10-13 19:50:31

by William Lee Irwin III

Subject: Re: 2.5.42-mm2

William Lee Irwin III wrote:
>> @@ -1104,6 +1126,7 @@ static void __init free_area_init_core(s
>> pcp->low = 0;
>> pcp->high = 32;
>> pcp->batch = 16;
>> + pcp->reserved = 0;
>> INIT_LIST_HEAD(&pcp->list);
>> }
>> INIT_LIST_HEAD(&zone->active_list);

On Sun, Oct 13, 2002 at 10:47:19AM -0700, Andrew Morton wrote:
> OK. But that's been there since 2.5.40-mm2. Why did it suddenly
> bite?

I must have been way too tired or something:

(1) It's embedded in struct zone, hence bootmem allocated, hence
already zeroed.

(2) The logs still show the show_free_areas() call immediately after
free_all_bootmem_core() seeing the garbage ->reserved values.

Bill

2002-10-13 19:58:27

by Rik van Riel

Subject: Re: 2.5.42-mm2

On Sun, 13 Oct 2002, William Lee Irwin III wrote:

> (1) It's embedded in struct zone, hence bootmem allocated, hence
> already zeroed.

The struct zone doesn't get automatically zeroed on all architectures.

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: [email protected]

2002-10-13 20:29:13

by William Lee Irwin III

Subject: Re: 2.5.42-mm2

On Sun, Oct 13, 2002 at 12:52:36PM -0700, William Lee Irwin III wrote:
> (2) The logs still show the show_free_areas() call immediately after
> free_all_bootmem_core() seeing the garbage ->reserved values.

Disregard this. I reread the logs too early in the morning.


Bill

2002-10-13 20:40:35

by William Lee Irwin III

Subject: Re: 2.5.42-mm2

On Sun, 13 Oct 2002, William Lee Irwin III wrote:
>> (1) It's embedded in struct zone, hence bootmem allocated, hence
>> already zeroed.

On Sun, Oct 13, 2002 at 06:04:02PM -0200, Rik van Riel wrote:
> The struct zone doesn't get automatically zeroed on all architectures.

It actually doesn't come out of bootmem. It's tacked onto min_low_pfn
because it's being dynamically allocated prior to init_bootmem().
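To illustrate why memory carved off by bumping a frame number is not zeroed unless someone clears it, here is a rough userspace model. Everything in it is an assumption made for the sketch: struct early_arena, early_carve(), early_carve_zeroed(), and the DEMO_* macros are hypothetical names, and the arena is an ordinary byte array rather than physical memory.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumption: 4kB pages */
#define DEMO_PFN_UP(x) (((x) + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE)

/* Hypothetical model of "memory handed out by advancing a pfn". */
struct early_arena {
	unsigned char *base;	/* start of the backing memory */
	unsigned long next_pfn;	/* next free page, relative to base */
	unsigned long npages;	/* total pages in the arena */
};

/*
 * Carve out enough whole pages for 'size' bytes by bumping next_pfn.
 * The contents are whatever was already there: nothing is cleared.
 */
void *early_carve(struct early_arena *a, size_t size)
{
	unsigned long pages = DEMO_PFN_UP(size);
	void *p;

	if (pages > a->npages - a->next_pfn)
		return NULL;
	p = a->base + a->next_pfn * DEMO_PAGE_SIZE;
	a->next_pfn += pages;
	return p;
}

/* Same as above, but clear the object explicitly before handing it out. */
void *early_carve_zeroed(struct early_arena *a, size_t size)
{
	void *p = early_carve(a, size);

	if (p)
		memset(p, 0, size);
	return p;
}
```

With an arena pre-filled with a non-zero byte pattern, an object from early_carve() still holds that pattern, while one from early_carve_zeroed() reads back as zero across its 'size' bytes. Only the object itself is cleared, not the rest of its last page.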


Bill

2002-10-13 21:20:42

by William Lee Irwin III

Subject: Re: 2.5.42-mm2

On Fri, Oct 11, 2002 at 11:39:33PM -0700, Andrew Morton wrote:
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.42/2.5.42-mm2/

To future-proof NUMA-Q vs. similar issues to pcp->reserved:


--- linux-2.5.42/arch/i386/mm/discontig.c 2002-10-11 21:22:09.000000000 -0700
+++ virgin-2.5.42/arch/i386/mm/discontig.c 2002-10-13 14:18:19.000000000 -0700
@@ -70,6 +70,7 @@ static void __init allocate_pgdat(int ni
 	node_datasz = PFN_UP(sizeof(struct pglist_data));
 	NODE_DATA(nid) = (pg_data_t *)(__va(min_low_pfn << PAGE_SHIFT));
 	min_low_pfn += node_datasz;
+	memset(NODE_DATA(nid), 0, sizeof(struct pglist_data));
 }
 
 /*