2002-09-16 07:10:48

by Andrew Morton

Subject: 2.5.35-mm1


url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/

Significant rework of the new sleep/wakeup code - make it look totally
different from the current APIs to avoid confusion, and to make it
simpler to use.

Also increase the number of places where this API is used in networking;
Alexey says that some of these may be negative improvements, but
performance testing will nevertheless be interesting. The relevant
patches are:

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/prepare_to_wait.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/tcp-wakeups.patch
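
For anyone who hasn't looked at the new API yet, the basic usage pattern is
roughly the following (a minimal sketch only; the waitqueue name, the wakeup
condition and the DEFINE_WAIT() declaration are illustrative assumptions, not
code lifted from the patch):

        DEFINE_WAIT(wait);

        for (;;) {
                prepare_to_wait(&my_waitqueue, &wait, TASK_INTERRUPTIBLE);
                if (my_condition)
                        break;
                schedule();
        }
        finish_wait(&my_waitqueue, &wait);

The waker side just does an ordinary wake_up(&my_waitqueue);
prepare_to_wait() and finish_wait() look after the task state and
waitqueue bookkeeping.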

A 4x performance regression in heavy dbench testing has been fixed. The
VM was accidentally being fair to the dbench instances in page reclaim.
It's better to be unfair so just a few instances can get ahead and submit
more contiguous IO. It's a silly thing, but it's what I meant to do anyway.

Since 2.5.34-mm4:

-readv-writev.patch
-aio-sync-iocb.patch
-llzpr.patch
-buffermem.patch
-lpp.patch
-lpp-update.patch
-reversemaps-leak.patch
-sharedmem.patch
-ext3-sb.patch
-pagevec_lru_add.patch
-oom-fix.patch
-tlb-cleanup.patch
-dump-stack.patch
-wli-cleanup.patch

Merged

+release_pages-speedup.patch

Avoid a couple of lock-takings.

-wake-speedup.patch
+prepare_to_wait.patch

Renamed, reworked

+swapoff-deadlock.patch
+dirty-and-uptodate.patch
+shmem_rename.patch
+dirent-size.patch
+tmpfs-trivia.patch

Various fixes and cleanups from Hugh Dickins


linus.patch
cset-1.552-to-1.564.txt.gz

scsi_hack.patch
Fix block-highmem for scsi

ext3-htree.patch
Indexed directories for ext3

spin-lock-check.patch
spinlock/rwlock checking infrastructure

rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)

madvise-move.patch
move madvise implementation into mm/madvise.c

split-vma.patch
VMA splitting patch

mmap-fixes.patch
mmap.c cleanup and lock ranking fixes

buffer-ops-move.patch
Move submit_bh() and ll_rw_block() into fs/buffer.c

slab-stats.patch
Display total slab memory in /proc/meminfo

writeback-control.patch
Cleanup and extension of the writeback paths

free_area_init-cleanup.patch
free_area_init() code cleanup

alloc_pages-cleanup.patch
alloc_pages cleanup and optimisation

statm_pgd_range-sucks.patch
Remove the pagetable walk from /proc/*/statm

remove-sync_thresh.patch
Remove /proc/sys/vm/dirty_sync_thresh

taka-writev.patch
Speed up writev

pf_nowarn.patch
Fix up the handling of PF_NOWARN

jeremy.patch
Spel Jermy's naim wright

release_pages-speedup.patch
Reduced locking in release_pages()

queue-congestion.patch
Infrastructure for communicating request queue congestion to the VM

nonblocking-ext2-preread.patch
avoid ext2 inode prereads if the queue is congested

nonblocking-pdflush.patch
non-blocking writeback infrastructure, use it for pdflush

nonblocking-vm.patch
Non-blocking page reclaim

prepare_to_wait.patch
New sleep/wakeup API

vm-wakeups.patch
Use the faster wakeups in the VM and block layers

sync-helper.patch
Speed up sys_sync() against multiple spindles

slabasap.patch
Early and smarter shrinking of slabs

write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock

buddyinfo.patch
Add /proc/buddyinfo - stats on the free pages pool

free_area.patch
Remove struct free_area_struct and free_area_t, use `struct free_area'

per-node-kswapd.patch
Per-node kswapd instance

topology-api.patch
NUMA topology API

radix_tree_gang_lookup.patch
radix tree gang lookup

truncate_inode_pages.patch
truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
Add kswapd_steal to /proc/vmstat

iowait.patch
I/O wait statistics

tcp-wakeups.patch
Use fast wakeups in TCP/IPV4

swapoff-deadlock.patch
Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
page state cleanup

shmem_rename.patch
shmem_rename() directory link count fix

dirent-size.patch
tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
tmpfs: small fixlets


2002-09-18 09:37:33

by Pavel Machek

Subject: Re: 2.5.35-mm1

Hi!

> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
>
> Significant rework of the new sleep/wakeup code - make it look totally
> different from the current APIs to avoid confusion, and to make it
> simpler to use.

Did you add any hooks to allow me to free memory for swsusp?
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2002-09-18 21:26:17

by Andrew Morton

Subject: Re: 2.5.35-mm1

Pavel Machek wrote:
>
> Hi!
>
> > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> >
> > Significant rework of the new sleep/wakeup code - make it look totally
> > different from the current APIs to avoid confusion, and to make it
> > simpler to use.
>
> Did you add any hooks to allow me to free memory for swsusp?

I just did then. You'll need to call

freed = shrink_all_memory(99);

to free up 99 pages. It returns the number which it actually
freed. If that's not 99 then it's time to give up. There is
no oom-killer in this code path.
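
So the caller would do something along these lines (just a sketch; the
wrapper name and the -ENOMEM convention are invented for illustration and
are not part of the patch or of swsusp):

        /*
         * Illustrative wrapper only: try to free `nr_pages' pages and
         * treat a short return as failure, because there is no
         * oom-killer on this path.  Needs <linux/swap.h> for the
         * shrink_all_memory() declaration.
         */
        static int suspend_free_some_memory(int nr_pages)
        {
                int freed = shrink_all_memory(nr_pages);

                if (freed < nr_pages)
                        return -ENOMEM;         /* time to give up */
                return 0;
        }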

I haven't tested it yet. And it's quite a long way back in the
queue I'm afraid - it has a dependency chain, and I prefer to
send stuff to Linus which has been tested for a couple of weeks, and
hasn't changed for one week.

Can you use the allocate-lots-then-free-it trick in the meanwhile?



 include/linux/swap.h |    1 +
 mm/vmscan.c          |   46 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 6 deletions(-)

--- 2.5.36/mm/vmscan.c~swsusp-feature Wed Sep 18 13:55:20 2002
+++ 2.5.36-akpm/mm/vmscan.c Wed Sep 18 14:29:13 2002
@@ -694,12 +694,19 @@ try_to_free_pages(struct zone *classzone
}

/*
- * kswapd will work across all this node's zones until they are all at
- * pages_high.
+ * For kswapd, balance_pgdat() will work across all this node's zones until
+ * they are all at pages_high.
+ *
+ * If `nr_pages' is non-zero then it is the number of pages which are to be
+ * reclaimed, regardless of the zone occupancies. This is a software suspend
+ * special.
+ *
+ * Returns the number of pages which were actually freed.
*/
-static void kswapd_balance_pgdat(pg_data_t *pgdat)
+static int balance_pgdat(pg_data_t *pgdat, int nr_pages)
{
- int priority = DEF_PRIORITY;
+ int to_free = nr_pages;
+ int priority;
int i;

for (priority = DEF_PRIORITY; priority; priority--) {
@@ -712,13 +719,15 @@ static void kswapd_balance_pgdat(pg_data
int to_reclaim;

to_reclaim = zone->pages_high - zone->free_pages;
+ if (nr_pages && to_free > 0)
+ to_reclaim = min(to_free, SWAP_CLUSTER_MAX*8);
if (to_reclaim <= 0)
continue;
success = 0;
max_scan = zone->nr_inactive >> priority;
if (max_scan < to_reclaim * 2)
max_scan = to_reclaim * 2;
- shrink_zone(zone, max_scan, GFP_KSWAPD,
+ to_free -= shrink_zone(zone, max_scan, GFP_KSWAPD,
to_reclaim, &nr_mapped);
shrink_slab(max_scan + nr_mapped, GFP_KSWAPD);
}
@@ -726,6 +735,7 @@ static void kswapd_balance_pgdat(pg_data
break; /* All zones are at pages_high */
blk_congestion_wait(WRITE, HZ/4);
}
+ return nr_pages - to_free;
}

/*
@@ -772,10 +782,34 @@ int kswapd(void *p)
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
schedule();
finish_wait(&pgdat->kswapd_wait, &wait);
- kswapd_balance_pgdat(pgdat);
+ balance_pgdat(pgdat, 0);
blk_run_queues();
}
}
+
+#ifdef CONFIG_SOFTWARE_SUSPEND
+/*
+ * Try to free `nr_pages' of memory, system-wide. Returns the number of freed
+ * pages.
+ */
+int shrink_all_memory(int nr_pages)
+{
+ pg_data_t *pgdat;
+ int nr_to_free = nr_pages;
+ int ret = 0;
+
+ for_each_pgdat(pgdat) {
+ int freed;
+
+ freed = balance_pgdat(pgdat, nr_to_free);
+ ret += freed;
+ nr_to_free -= freed;
+ if (nr_to_free <= 0)
+ break;
+ }
+ return ret;
+}
+#endif

static int __init kswapd_init(void)
{
--- 2.5.36/include/linux/swap.h~swsusp-feature Wed Sep 18 14:03:01 2002
+++ 2.5.36-akpm/include/linux/swap.h Wed Sep 18 14:16:29 2002
@@ -163,6 +163,7 @@ extern void swap_setup(void);

/* linux/mm/vmscan.c */
extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
+int shrink_all_memory(int nr_pages);

/* linux/mm/page_io.c */
int swap_readpage(struct file *file, struct page *page);

.

2002-09-18 21:49:22

by Pavel Machek

Subject: Re: 2.5.35-mm1

Hi!

> > > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> > >
> > > Significant rework of the new sleep/wakeup code - make it look totally
> > > different from the current APIs to avoid confusion, and to make it
> > > simpler to use.
> >
> > Did you add any hooks to allow me to free memory for swsusp?
>
> I just did then. You'll need to call
>
> freed = shrink_all_memory(99);

Thanx a lot.

> to free up 99 pages. It returns the number which it actually
> freed. If that's not 99 then it's time to give up. There is
> no oom-killer in this code path.

So... I'll do something like shrink_all_memory(1000000) and it will
free as much as possible, right?

> I haven't tested it yet. And it's quite a long way back in the
> queue I'm afraid - it has a dependency chain, and I prefer to

So if I apply this to my tree it will not work (that's what
"dependency chain" means, right?). Okay, thanx anyway.

> send stuff to Linus which has been tested for a couple of weeks, and
> hasn't changed for one week.
>
> Can you use the allocate-lots-then-free-it trick in the meanwhile?

In the meanwhile, swsusp only working when there's lots of RAM is
probably okay. As the IDE patch is not in, swsusp is dangerous anyway.

Pavel

> --- 2.5.36/mm/vmscan.c~swsusp-feature Wed Sep 18 13:55:20 2002
> +++ 2.5.36-akpm/mm/vmscan.c Wed Sep 18 14:29:13 2002
> @@ -694,12 +694,19 @@ try_to_free_pages(struct zone *classzone
> }
>
> /*
> - * kswapd will work across all this node's zones until they are all at
> - * pages_high.
> + * For kswapd, balance_pgdat() will work across all this node's zones until
> + * they are all at pages_high.
> + *
> + * If `nr_pages' is non-zero then it is the number of pages which are to be
> + * reclaimed, regardless of the zone occupancies. This is a software suspend
> + * special.
> + *
> + * Returns the number of pages which were actually freed.
> */
> -static void kswapd_balance_pgdat(pg_data_t *pgdat)
> +static int balance_pgdat(pg_data_t *pgdat, int nr_pages)
> {
> - int priority = DEF_PRIORITY;
> + int to_free = nr_pages;
> + int priority;
> int i;
>
> for (priority = DEF_PRIORITY; priority; priority--) {
> @@ -712,13 +719,15 @@ static void kswapd_balance_pgdat(pg_data
> int to_reclaim;
>
> to_reclaim = zone->pages_high - zone->free_pages;
> + if (nr_pages && to_free > 0)
> + to_reclaim = min(to_free, SWAP_CLUSTER_MAX*8);
> if (to_reclaim <= 0)
> continue;
> success = 0;
> max_scan = zone->nr_inactive >> priority;
> if (max_scan < to_reclaim * 2)
> max_scan = to_reclaim * 2;
> - shrink_zone(zone, max_scan, GFP_KSWAPD,
> + to_free -= shrink_zone(zone, max_scan, GFP_KSWAPD,
> to_reclaim, &nr_mapped);
> shrink_slab(max_scan + nr_mapped, GFP_KSWAPD);
> }
> @@ -726,6 +735,7 @@ static void kswapd_balance_pgdat(pg_data
> break; /* All zones are at pages_high */
> blk_congestion_wait(WRITE, HZ/4);
> }
> + return nr_pages - to_free;
> }
>
> /*
> @@ -772,10 +782,34 @@ int kswapd(void *p)
> prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> schedule();
> finish_wait(&pgdat->kswapd_wait, &wait);
> - kswapd_balance_pgdat(pgdat);
> + balance_pgdat(pgdat, 0);
> blk_run_queues();
> }
> }
> +
> +#ifdef CONFIG_SOFTWARE_SUSPEND
> +/*
> + * Try to free `nr_pages' of memory, system-wide. Returns the number of freed
> + * pages.
> + */
> +int shrink_all_memory(int nr_pages)
> +{
> + pg_data_t *pgdat;
> + int nr_to_free = nr_pages;
> + int ret = 0;
> +
> + for_each_pgdat(pgdat) {
> + int freed;
> +
> + freed = balance_pgdat(pgdat, nr_to_free);
> + ret += freed;
> + nr_to_free -= freed;
> + if (nr_to_free <= 0)
> + break;
> + }
> + return ret;
> +}
> +#endif
>
> static int __init kswapd_init(void)
> {
> --- 2.5.36/include/linux/swap.h~swsusp-feature Wed Sep 18 14:03:01 2002
> +++ 2.5.36-akpm/include/linux/swap.h Wed Sep 18 14:16:29 2002
> @@ -163,6 +163,7 @@ extern void swap_setup(void);
>
> /* linux/mm/vmscan.c */
> extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
> +int shrink_all_memory(int nr_pages);
>
> /* linux/mm/page_io.c */
> int swap_readpage(struct file *file, struct page *page);
>
> .

--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2002-09-19 07:45:50

by Daniel Phillips

Subject: Re: 2.5.35-mm1

On Monday 16 September 2002 09:15, Andrew Morton wrote:
> A 4x performance regression in heavy dbench testing has been fixed. The
> VM was accidentally being fair to the dbench instances in page reclaim.
> It's better to be unfair so just a few instances can get ahead and submit
> more contiguous IO. It's a silly thing, but it's what I meant to do anyway.

Curious... did the performance hit show anywhere other than dbench?

--
Daniel

2002-09-19 08:14:51

by Andrew Morton

Subject: Re: 2.5.35-mm1

Daniel Phillips wrote:
>
> On Monday 16 September 2002 09:15, Andrew Morton wrote:
> > A 4x performance regression in heavy dbench testing has been fixed. The
> > VM was accidentally being fair to the dbench instances in page reclaim.
> > It's better to be unfair so just a few instances can get ahead and submit
> > more contiguous IO. It's a silly thing, but it's what I meant to do anyway.
>
> Curious... did the performance hit show anywhere other than dbench?

Other benchmarky tests would have suffered, but I did not check.

I have logic in there which is designed to throttle heavy writers
within the page allocator, as well as within balance_dirty_pages.
Basically:

generic_file_write()
{
        current->backing_dev_info = mapping->backing_dev_info;
        alloc_page();
        current->backing_dev_info = 0;
}

shrink_list()
{
        if (PageDirty(page)) {
                if (page->mapping->backing_dev_info == current->backing_dev_info)
                        blocking_write(page->mapping);
                else
                        nonblocking_write(page->mapping);
        }
}


What this says is "if this task is prepared to block against this
page's queue, then write the dirty data, even if that would block".

This means that all the dbench instances will write each other's
dirty data as it comes off the tail of the LRU. Which provides
some additional throttling, and means that we don't just refile
the page.

But the logic was not correctly implemented. The dbench instances
were performing non-blocking writes. This meant that all 64 instances
were cheerfully running all the time, submitting IO all over the disk.
The /proc/meminfo:Writeback figure never even hit a megabyte. That
number tells us how much memory is currently in the request queue.
Clearly, it was very fragmented.

By forcing the dbench instance to block on the queue, particular instances
were able to submit decent amounts of IO. The `Writeback' figure went
back to around 4 megabytes, because the individual requests were
larger - more merging.