url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
Significant rework of the new sleep/wakeup code - make it look totally
different from the current APIs to avoid confusion, and to make it
simpler to use.
Also increase the number of places where this API is used in networking;
Alexey says that some of these may be negative improvements, but
performance testing will nevertheless be interesting. The relevant
patches are:
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/prepare_to_wait.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/tcp-wakeups.patch
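For readers unfamiliar with the new API, here is a minimal, hedged sketch of
the wait-loop idiom it provides (the wait queue and the condition below are
hypothetical placeholders, not names from the patch, and the signatures may
still change before this is merged):

	/*
	 * Sketch of the prepare_to_wait()/finish_wait() idiom.
	 * `my_waitqueue' and `condition_is_true()' stand in for the
	 * caller's own wait queue and wakeup condition.
	 */
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&my_waitqueue, &wait, TASK_INTERRUPTIBLE);
		if (condition_is_true())
			break;
		schedule();
	}
	finish_wait(&my_waitqueue, &wait);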
A 4x performance regression in heavy dbench testing has been fixed. The
VM was accidentally being fair to the dbench instances in page reclaim.
It's better to be unfair so just a few instances can get ahead and submit
more contiguous IO. It's a silly thing, but it's what I meant to do anyway.
Since 2.5.34-mm4:
-readv-writev.patch
-aio-sync-iocb.patch
-llzpr.patch
-buffermem.patch
-lpp.patch
-lpp-update.patch
-reversemaps-leak.patch
-sharedmem.patch
-ext3-sb.patch
-pagevec_lru_add.patch
-oom-fix.patch
-tlb-cleanup.patch
-dump-stack.patch
-wli-cleanup.patch
Merged
+release_pages-speedup.patch
Avoid a couple of lock-takings.
-wake-speedup.patch
+prepare_to_wait.patch
Renamed, reworked
+swapoff-deadlock.patch
+dirty-and-uptodate.patch
+shmem_rename.patch
+dirent-size.patch
+tmpfs-trivia.patch
Various fixes and cleanups from Hugh Dickins
linus.patch
cset-1.552-to-1.564.txt.gz
scsi_hack.patch
Fix block-highmem for scsi
ext3-htree.patch
Indexed directories for ext3
spin-lock-check.patch
spinlock/rwlock checking infrastructure
rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)
madvise-move.patch
Move madvise implementation into mm/madvise.c
split-vma.patch
VMA splitting patch
mmap-fixes.patch
mmap.c cleanup and lock ranking fixes
buffer-ops-move.patch
Move submit_bh() and ll_rw_block() into fs/buffer.c
slab-stats.patch
Display total slab memory in /proc/meminfo
writeback-control.patch
Cleanup and extension of the writeback paths
free_area_init-cleanup.patch
free_area_init() code cleanup
alloc_pages-cleanup.patch
alloc_pages cleanup and optimisation
statm_pgd_range-sucks.patch
Remove the pagetable walk from /proc/*/statm
remove-sync_thresh.patch
Remove /proc/sys/vm/dirty_sync_thresh
taka-writev.patch
Speed up writev
pf_nowarn.patch
Fix up the handling of PF_NOWARN
jeremy.patch
Spel Jermy's naim wright
release_pages-speedup.patch
Reduced locking in release_pages()
queue-congestion.patch
Infrastructure for communicating request queue congestion to the VM
nonblocking-ext2-preread.patch
avoid ext2 inode prereads if the queue is congested
nonblocking-pdflush.patch
non-blocking writeback infrastructure, use it for pdflush
nonblocking-vm.patch
Non-blocking page reclaim
prepare_to_wait.patch
New sleep/wakeup API
vm-wakeups.patch
Use the faster wakeups in the VM and block layers
sync-helper.patch
Speed up sys_sync() against multiple spindles
slabasap.patch
Early and smarter shrinking of slabs
write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock
buddyinfo.patch
Add /proc/buddyinfo - stats on the free pages pool
free_area.patch
Remove struct free_area_struct and free_area_t, use `struct free_area'
per-node-kswapd.patch
Per-node kswapd instance
topology-api.patch
NUMA topology API
radix_tree_gang_lookup.patch
radix tree gang lookup
truncate_inode_pages.patch
truncate/invalidate_inode_pages rewrite
proc_vmstat.patch
Move the vm accounting out of /proc/stat
kswapd-reclaim-stats.patch
Add kswapd_steal to /proc/vmstat
iowait.patch
I/O wait statistics
tcp-wakeups.patch
Use fast wakeups in TCP/IPv4
swapoff-deadlock.patch
Fix a tmpfs swapoff deadlock
dirty-and-uptodate.patch
page state cleanup
shmem_rename.patch
shmem_rename() directory link count fix
dirent-size.patch
tmpfs: show a non-zero size for directories
tmpfs-trivia.patch
tmpfs: small fixlets
Hi!
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
>
> Significant rework of the new sleep/wakeup code - make it look totally
> different from the current APIs to avoid confusion, and to make it
> simpler to use.
Did you add any hooks to allow me to free memory for swsusp?
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
Pavel Machek wrote:
>
> Hi!
>
> > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> >
> > Significant rework of the new sleep/wakeup code - make it look totally
> > different from the current APIs to avoid confusion, and to make it
> > simpler to use.
>
> Did you add any hooks to allow me to free memory for swsusp?
I just did then. You'll need to call
freed = shrink_all_memory(99);
to free up 99 pages. It returns the number which it actually
freed. If that's not 99 then it's time to give up. There is
no oom-killer in this code path.
I haven't tested it yet. And it's quite a long way back in the
queue I'm afraid - it has a dependency chain, and I prefer to
send stuff to Linus which has been tested for a couple of weeks, and
hasn't changed for one week.
Can you use the allocate-lots-then-free-it trick in the meanwhile?
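For illustration only, a hedged sketch of how a swsusp-side caller might use
this; the helper name, the example figure and the error handling are
assumptions for illustration, not part of swsusp or of the patch below:

	/*
	 * Hypothetical caller: try to free `pages_needed' pages before
	 * snapshotting, and bail out if reclaim falls short.  Uses
	 * shrink_all_memory() as declared in linux/swap.h by this patch.
	 */
	static int swsusp_free_some_memory(int pages_needed)
	{
		int freed;

		freed = shrink_all_memory(pages_needed);
		if (freed < pages_needed) {
			/* No oom-killer on this path - just give up. */
			printk(KERN_ERR "swsusp: freed only %d of %d pages\n",
					freed, pages_needed);
			return -ENOMEM;
		}
		return 0;
	}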
 include/linux/swap.h |    1 +
 mm/vmscan.c          |   46 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions, 6 deletions
--- 2.5.36/mm/vmscan.c~swsusp-feature Wed Sep 18 13:55:20 2002
+++ 2.5.36-akpm/mm/vmscan.c Wed Sep 18 14:29:13 2002
@@ -694,12 +694,19 @@ try_to_free_pages(struct zone *classzone
}
/*
- * kswapd will work across all this node's zones until they are all at
- * pages_high.
+ * For kswapd, balance_pgdat() will work across all this node's zones until
+ * they are all at pages_high.
+ *
+ * If `nr_pages' is non-zero then it is the number of pages which are to be
+ * reclaimed, regardless of the zone occupancies. This is a software suspend
+ * special.
+ *
+ * Returns the number of pages which were actually freed.
*/
-static void kswapd_balance_pgdat(pg_data_t *pgdat)
+static int balance_pgdat(pg_data_t *pgdat, int nr_pages)
{
- int priority = DEF_PRIORITY;
+ int to_free = nr_pages;
+ int priority;
int i;
for (priority = DEF_PRIORITY; priority; priority--) {
@@ -712,13 +719,15 @@ static void kswapd_balance_pgdat(pg_data
int to_reclaim;
to_reclaim = zone->pages_high - zone->free_pages;
+ if (nr_pages && to_free > 0)
+ to_reclaim = min(to_free, SWAP_CLUSTER_MAX*8);
if (to_reclaim <= 0)
continue;
success = 0;
max_scan = zone->nr_inactive >> priority;
if (max_scan < to_reclaim * 2)
max_scan = to_reclaim * 2;
- shrink_zone(zone, max_scan, GFP_KSWAPD,
+ to_free -= shrink_zone(zone, max_scan, GFP_KSWAPD,
to_reclaim, &nr_mapped);
shrink_slab(max_scan + nr_mapped, GFP_KSWAPD);
}
@@ -726,6 +735,7 @@ static void kswapd_balance_pgdat(pg_data
break; /* All zones are at pages_high */
blk_congestion_wait(WRITE, HZ/4);
}
+ return nr_pages - to_free;
}
/*
@@ -772,10 +782,34 @@ int kswapd(void *p)
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
schedule();
finish_wait(&pgdat->kswapd_wait, &wait);
- kswapd_balance_pgdat(pgdat);
+ balance_pgdat(pgdat, 0);
blk_run_queues();
}
}
+
+#ifdef CONFIG_SOFTWARE_SUSPEND
+/*
+ * Try to free `nr_pages' of memory, system-wide. Returns the number of freed
+ * pages.
+ */
+int shrink_all_memory(int nr_pages)
+{
+ pg_data_t *pgdat;
+ int nr_to_free = nr_pages;
+ int ret = 0;
+
+ for_each_pgdat(pgdat) {
+ int freed;
+
+ freed = balance_pgdat(pgdat, nr_to_free);
+ ret += freed;
+ nr_to_free -= freed;
+ if (nr_to_free <= 0)
+ break;
+ }
+ return ret;
+}
+#endif
static int __init kswapd_init(void)
{
--- 2.5.36/include/linux/swap.h~swsusp-feature Wed Sep 18 14:03:01 2002
+++ 2.5.36-akpm/include/linux/swap.h Wed Sep 18 14:16:29 2002
@@ -163,6 +163,7 @@ extern void swap_setup(void);
/* linux/mm/vmscan.c */
extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
+int shrink_all_memory(int nr_pages);
/* linux/mm/page_io.c */
int swap_readpage(struct file *file, struct page *page);
Hi!
> > > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/
> > >
> > > Significant rework of the new sleep/wakeup code - make it look totally
> > > different from the current APIs to avoid confusion, and to make it
> > > simpler to use.
> >
> > Did you add any hooks to allow me to free memory for swsusp?
>
> I just did then. You'll need to call
>
> freed = shrink_all_memory(99);
Thanx a lot.
> to free up 99 pages. It returns the number which it actually
> freed. If that's not 99 then it's time to give up. There is
> no oom-killer in this code path.
So... I'll do something like shrink_all_memory(1000000) and it will
free as much as possible, right?
> I haven't tested it yet. And it's quite a long way back in the
> queue I'm afraid - it has a dependency chain, and I prefer to
So if I apply this to my tree it will not work (that's what
"dependency chain" means, right?). Okay, thanx anyway.
> send stuff to Linus which has been tested for a couple of weeks, and
> hasn't changed for one week.
>
> Can you use the allocate-lots-then-free-it trick in the meanwhile?
In the meantime, swsusp only working when there's lots of RAM is
probably okay. As the IDE patch is not in, swsusp is dangerous anyway.
Pavel
--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
On Monday 16 September 2002 09:15, Andrew Morton wrote:
> A 4x performance regression in heavy dbench testing has been fixed. The
> VM was accidentally being fair to the dbench instances in page reclaim.
> It's better to be unfair so just a few instances can get ahead and submit
> more contiguous IO. It's a silly thing, but it's what I meant to do anyway.
Curious... did the performance hit show anywhere other than dbench?
--
Daniel
Daniel Phillips wrote:
>
> On Monday 16 September 2002 09:15, Andrew Morton wrote:
> > A 4x performance regression in heavy dbench testing has been fixed. The
> > VM was accidentally being fair to the dbench instances in page reclaim.
> > It's better to be unfair so just a few instances can get ahead and submit
> > more contiguous IO. It's a silly thing, but it's what I meant to do anyway.
>
> Curious... did the performance hit show anywhere other than dbench?
Other benchmarky tests would have suffered, but I did not check.
I have logic in there which is designed to throttle heavy writers
within the page allocator, as well as within balance_dirty_pages.
Basically:

	generic_file_write()
	{
		current->backing_dev_info = mapping->backing_dev_info;
		alloc_page();
		current->backing_dev_info = 0;
	}

	shrink_list()
	{
		if (PageDirty(page)) {
			if (page->mapping->backing_dev_info == current->backing_dev_info)
				blocking_write(page->mapping);
			else
				nonblocking_write(page->mapping);
		}
	}
What this says is "if this task is prepared to block against this
page's queue, then write the dirty data, even if that would block".
This means that all the dbench instances will write each other's
dirty data as it comes off the tail of the LRU, which provides
some additional throttling and means that we don't just refile
the page.
But the logic was not correctly implemented. The dbench instances
were performing non-blocking writes. This meant that all 64 instances
were cheerfully running all the time, submitting IO all over the disk.
The /proc/meminfo:Writeback figure never even hit a megabyte. That
number tells us how much memory is currently in the request queue.
Clearly, it was very fragmented.
By forcing the dbench instances to block on the queue, particular instances
were able to submit decent amounts of IO. The `Writeback' figure went
back to around 4 megabytes, because the individual requests were
larger - more merging.