2009-09-17 22:44:31

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 0/4] compcache: in-memory compressed swapping v3

Project home: http://compcache.googlecode.com/

* Changelog: v3 vs v2
- All cleanups as suggested by Pekka.
- Move to staging (drivers/block/ramzswap/ -> drivers/staging/ramzswap/).
- Remove swap discard hooks -- swap notify support makes these redundant.
- Unify duplicate code between init_device() fail path and reset_device().
- Fix zero-page accounting.
- Do not accept backing swap with bad pages.

* Changelog: v2 vs initial revision
- Use 'struct page' instead of 32-bit PFNs in ramzswap driver and xvmalloc.
This is to make these 64-bit safe.
- xvmalloc is no longer a separate module and does not export any symbols.
Its compiled directly with ramzswap block driver. This is to avoid any
last bit of confusion with any other allocator.
- set_swap_free_notify() now accepts block_device as parameter instead of
swp_entry_t (interface cleanup).
- Fix: Make sure ramzswap disksize matches usable pages in backing swap file.
This caused initialization error in case backing swap file had intra-page
fragmentation.

It creates RAM based block devices which can be used (only) as swap disks.
Pages swapped to these disks are compressed and stored in memory itself. This
is a big win over swapping to slow hard-disk which are typically used as swap
disk. For flash, these suffer from wear-leveling issues when used as swap disk
- so again its helpful. For swapless systems, it allows more apps to run for a
given amount of memory.

It can create multiple ramzswap devices (/dev/ramzswapX, X = 0, 1, 2, ...).
Each of these devices can have separate backing swap (file or disk partition)
which is used when incompressible page is found or memory limit for device is
reached.

A separate userspace utility called rzscontrol is used to manage individual
ramzswap devices.

* Testing notes

Tested on x86, x64, ARM
ARM:
- Cortex-A8 (Beagleboard)
- ARM11 (Android G1)
- OMAP2420 (Nokia N810)

* Performance

All performance numbers/plots can be found at:
http://code.google.com/p/compcache/wiki/Performance

Below is a summary of this data:

General:
- Swap R/W times are reduced from milliseconds (in case of hard disks)
down to microseconds.

Positive cases:
- Shows 33% improvement in 'scan' benchmark which allocates given amount
of memory and linearly reads/writes to this region. This benchmark also
exposes bottlenecks in ramzswap code (global mutex) due to which this gain
is so small.
- On Linux thin clients, it gives the effect of nearly doubling the amount of
memory.

Negative cases:
Any workload that has active working set w.r.t. filesystem cache that is
nearly equal to amount of RAM while has minimal anonymous memory requirement,
is expected to suffer maximum loss in performance with ramzswap enabled.

Iozone filesystem benchmark can simulate exactly this kind of workload.
As expected, this test shows performance loss of ~25% with ramzswap.

drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/ramzswap/Kconfig | 21 +
drivers/staging/ramzswap/Makefile | 3 +
drivers/staging/ramzswap/ramzswap.txt | 51 +
drivers/staging/ramzswap/ramzswap_drv.c | 1462 +++++++++++++++++++++++++++++
drivers/staging/ramzswap/ramzswap_drv.h | 173 ++++
drivers/staging/ramzswap/ramzswap_ioctl.h | 50 +
drivers/staging/ramzswap/xvmalloc.c | 533 +++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 +
drivers/staging/ramzswap/xvmalloc_int.h | 86 ++
include/linux/swap.h | 5 +
mm/swapfile.c | 34 +
13 files changed, 2451 insertions(+), 0 deletions(-)


2009-09-17 22:44:42

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 1/4] xvmalloc memory allocator

* Features:
- Low metadata overhead (just 4 bytes per object)
- O(1) Alloc/Free - except when we have to call system page allocator to
get additional memory.
- Very low fragmentation: In all tests, xvmalloc memory usage is within 12%
of "Ideal".
- Pool based allocator: Each pool can grow and shrink.
- It maps pages only when required. So, it does not hog vmalloc area which
is very small on 32-bit systems.

SLUB allocator could not be used due to fragmentation issues:
http://code.google.com/p/compcache/wiki/AllocatorsComparison
Data here shows kmalloc using ~43% more memory than TLSF and xvMalloc
is showed ~2% more space efficiency than TLSF (due to smaller metadata).
Creating various kmem_caches can reduce space efficiency gap but still
problem of being limited to low memory exists. Also, it depends on
allocating higher order pages to reduce fragmentation - this is not
acceptable for ramzswap as it is used under memory crunch (its a swap
device!).

SLOB allocator could not be used do to reasons mentioned here:
http://lkml.org/lkml/2009/3/18/210

* Implementation:
It uses two-level bitmap search to find free list containing block of
correct size. This idea is taken from TLSF (Two-Level Segregate Fit)
allocator and is well explained in its paper (see [Links] below).

* Limitations:
- Poor scalability: No per-cpu data structures (work in progress).

[Links]
1. Details and Performance data:
http://code.google.com/p/compcache/wiki/xvMalloc
http://code.google.com/p/compcache/wiki/xvMallocPerformance

2. TLSF memory allocator:
home: http://rtportal.upv.es/rtmalloc/
paper: http://rtportal.upv.es/rtmalloc/files/MRBC_2008.pdf

Signed-off-by: Nitin Gupta <[email protected]>

---
drivers/staging/ramzswap/xvmalloc.c | 533 +++++++++++++++++++++++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 ++
drivers/staging/ramzswap/xvmalloc_int.h | 86 +++++
3 files changed, 649 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/ramzswap/xvmalloc.c b/drivers/staging/ramzswap/xvmalloc.c
new file mode 100644
index 0000000..12ad1a3
--- /dev/null
+++ b/drivers/staging/ramzswap/xvmalloc.c
@@ -0,0 +1,533 @@
+/*
+ * xvmalloc memory allocator
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+
+#include "xvmalloc.h"
+#include "xvmalloc_int.h"
+
+static void stat_inc(u64 *value)
+{
+ *value = *value + 1;
+}
+
+static void stat_dec(u64 *value)
+{
+ *value = *value - 1;
+}
+
+static int test_flag(struct block_header *block, enum blockflags flag)
+{
+ return block->prev & BIT(flag);
+}
+
+static void set_flag(struct block_header *block, enum blockflags flag)
+{
+ block->prev |= BIT(flag);
+}
+
+static void clear_flag(struct block_header *block, enum blockflags flag)
+{
+ block->prev &= ~BIT(flag);
+}
+
+/*
+ * Given <page, offset> pair, provide a derefrencable pointer.
+ * This is called from xv_malloc/xv_free path, so it
+ * needs to be fast.
+ */
+static void *get_ptr_atomic(struct page *page, u16 offset, enum km_type type)
+{
+ unsigned char *base;
+
+ base = kmap_atomic(page, type);
+ return base + offset;
+}
+
+static void put_ptr_atomic(void *ptr, enum km_type type)
+{
+ kunmap_atomic(ptr, type);
+}
+
+static u32 get_blockprev(struct block_header *block)
+{
+ return block->prev & PREV_MASK;
+}
+
+static void set_blockprev(struct block_header *block, u16 new_offset)
+{
+ block->prev = new_offset | (block->prev & FLAGS_MASK);
+}
+
+static struct block_header *BLOCK_NEXT(struct block_header *block)
+{
+ return (struct block_header *)
+ ((char *)block + block->size + XV_ALIGN);
+}
+
+/*
+ * Get index of free list containing blocks of maximum size
+ * which is less than or equal to given size.
+ */
+static u32 get_index_for_insert(u32 size)
+{
+ if (unlikely(size > XV_MAX_ALLOC_SIZE))
+ size = XV_MAX_ALLOC_SIZE;
+ size &= ~FL_DELTA_MASK;
+ return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
+}
+
+/*
+ * Get index of free list having blocks of size greater than
+ * or equal to requested size.
+ */
+static u32 get_index(u32 size)
+{
+ if (unlikely(size < XV_MIN_ALLOC_SIZE))
+ size = XV_MIN_ALLOC_SIZE;
+ size = ALIGN(size, FL_DELTA);
+ return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
+}
+
+/*
+ * Allocate a memory page. Called when a pool needs to grow.
+ */
+static struct page *xv_alloc_page(gfp_t flags)
+{
+ struct page *page;
+
+ page = alloc_page(flags);
+ if (unlikely(!page))
+ return 0;
+
+ return page;
+}
+
+/*
+ * Called when all objects in a page are freed.
+ */
+static void xv_free_page(struct page *page)
+{
+ __free_page(page);
+}
+
+/**
+ * find_block - find block of at least given size
+ * @pool: memory pool to search from
+ * @size: size of block required
+ * @page: page containing required block
+ * @offset: offset within the page where block is located.
+ *
+ * Searches two level bitmap to locate block of at least
+ * the given size. If such a block is found, it provides
+ * <page, offset> to identify this block and returns index
+ * in freelist where we found this block.
+ * Otherwise, returns 0 and <page, offset> params are not touched.
+ */
+static u32 find_block(struct xv_pool *pool, u32 size,
+ struct page **page, u32 *offset)
+{
+ ulong flbitmap, slbitmap;
+ u32 flindex, slindex, slbitstart;
+
+ /* There are no free blocks in this pool */
+ if (!pool->flbitmap)
+ return 0;
+
+ /* Get freelist index correspoding to this size */
+ slindex = get_index(size);
+ slbitmap = pool->slbitmap[slindex / BITS_PER_LONG];
+ slbitstart = slindex % BITS_PER_LONG;
+
+ /*
+ * If freelist is not empty at this index, we found the
+ * block - head of this list. This is approximate best-fit match.
+ */
+ if (test_bit(slbitstart, &slbitmap)) {
+ *page = pool->freelist[slindex].page;
+ *offset = pool->freelist[slindex].offset;
+ return slindex;
+ }
+
+ /*
+ * No best-fit found. Search a bit further in bitmap for a free block.
+ * Second level bitmap consists of series of 32-bit chunks. Search
+ * further in the chunk where we expected a best-fit, starting from
+ * index location found above.
+ */
+ slbitstart++;
+ slbitmap >>= slbitstart;
+
+ /* Skip this search if we were already at end of this bitmap chunk */
+ if ((slbitstart != BITS_PER_LONG) && slbitmap) {
+ slindex += __ffs(slbitmap) + 1;
+ *page = pool->freelist[slindex].page;
+ *offset = pool->freelist[slindex].offset;
+ return slindex;
+ }
+
+ /* Now do a full two-level bitmap search to find next nearest fit */
+ flindex = slindex / BITS_PER_LONG;
+
+ flbitmap = (pool->flbitmap) >> (flindex + 1);
+ if (!flbitmap)
+ return 0;
+
+ flindex += __ffs(flbitmap) + 1;
+ slbitmap = pool->slbitmap[flindex];
+ slindex = (flindex * BITS_PER_LONG) + __ffs(slbitmap);
+ *page = pool->freelist[slindex].page;
+ *offset = pool->freelist[slindex].offset;
+
+ return slindex;
+}
+
+/*
+ * Insert block at <page, offset> in freelist of given pool.
+ * freelist used depends on block size.
+ */
+static void insert_block(struct xv_pool *pool, struct page *page, u32 offset,
+ struct block_header *block)
+{
+ u32 flindex, slindex;
+ struct block_header *nextblock;
+
+ slindex = get_index_for_insert(block->size);
+ flindex = slindex / BITS_PER_LONG;
+
+ block->link.prev_page = 0;
+ block->link.prev_offset = 0;
+ block->link.next_page = pool->freelist[slindex].page;
+ block->link.next_offset = pool->freelist[slindex].offset;
+ pool->freelist[slindex].page = page;
+ pool->freelist[slindex].offset = offset;
+
+ if (block->link.next_page) {
+ nextblock = get_ptr_atomic(block->link.next_page,
+ block->link.next_offset, KM_USER1);
+ nextblock->link.prev_page = page;
+ nextblock->link.prev_offset = offset;
+ put_ptr_atomic(nextblock, KM_USER1);
+ }
+
+ __set_bit(slindex % BITS_PER_LONG, &pool->slbitmap[flindex]);
+ __set_bit(flindex, &pool->flbitmap);
+}
+
+/*
+ * Remove block from head of freelist. Index 'slindex' identifies the freelist.
+ */
+static void remove_block_head(struct xv_pool *pool,
+ struct block_header *block, u32 slindex)
+{
+ struct block_header *tmpblock;
+ u32 flindex = slindex / BITS_PER_LONG;
+
+ pool->freelist[slindex].page = block->link.next_page;
+ pool->freelist[slindex].offset = block->link.next_offset;
+ block->link.prev_page = 0;
+ block->link.prev_offset = 0;
+
+ if (!pool->freelist[slindex].page) {
+ __clear_bit(slindex % BITS_PER_LONG, &pool->slbitmap[flindex]);
+ if (!pool->slbitmap[flindex])
+ __clear_bit(flindex, &pool->flbitmap);
+ } else {
+ /*
+ * DEBUG ONLY: We need not reinitialize freelist head previous
+ * pointer to 0 - we never depend on its value. But just for
+ * sanity, lets do it.
+ */
+ tmpblock = get_ptr_atomic(pool->freelist[slindex].page,
+ pool->freelist[slindex].offset, KM_USER1);
+ tmpblock->link.prev_page = 0;
+ tmpblock->link.prev_offset = 0;
+ put_ptr_atomic(tmpblock, KM_USER1);
+ }
+}
+
+/*
+ * Remove block from freelist. Index 'slindex' identifies the freelist.
+ */
+static void remove_block(struct xv_pool *pool, struct page *page, u32 offset,
+ struct block_header *block, u32 slindex)
+{
+ u32 flindex;
+ struct block_header *tmpblock;
+
+ if (pool->freelist[slindex].page == page
+ && pool->freelist[slindex].offset == offset) {
+ remove_block_head(pool, block, slindex);
+ return;
+ }
+
+ flindex = slindex / BITS_PER_LONG;
+
+ if (block->link.prev_page) {
+ tmpblock = get_ptr_atomic(block->link.prev_page,
+ block->link.prev_offset, KM_USER1);
+ tmpblock->link.next_page = block->link.next_page;
+ tmpblock->link.next_offset = block->link.next_offset;
+ put_ptr_atomic(tmpblock, KM_USER1);
+ }
+
+ if (block->link.next_page) {
+ tmpblock = get_ptr_atomic(block->link.next_page,
+ block->link.next_offset, KM_USER1);
+ tmpblock->link.prev_page = block->link.prev_page;
+ tmpblock->link.prev_offset = block->link.prev_offset;
+ put_ptr_atomic(tmpblock, KM_USER1);
+ }
+
+ return;
+}
+
+/*
+ * Allocate a page and add it freelist of given pool.
+ */
+static int grow_pool(struct xv_pool *pool, gfp_t flags)
+{
+ struct page *page;
+ struct block_header *block;
+
+ page = xv_alloc_page(flags);
+ if (unlikely(!page))
+ return -ENOMEM;
+
+ stat_inc(&pool->total_pages);
+
+ spin_lock(&pool->lock);
+ block = get_ptr_atomic(page, 0, KM_USER0);
+
+ block->size = PAGE_SIZE - XV_ALIGN;
+ set_flag(block, BLOCK_FREE);
+ clear_flag(block, PREV_FREE);
+ set_blockprev(block, 0);
+
+ insert_block(pool, page, 0, block);
+
+ put_ptr_atomic(block, KM_USER0);
+ spin_unlock(&pool->lock);
+
+ return 0;
+}
+
+/*
+ * Create a memory pool. Allocates freelist, bitmaps and other
+ * per-pool metadata.
+ */
+struct xv_pool *xv_create_pool(void)
+{
+ u32 ovhd_size;
+ struct xv_pool *pool;
+
+ ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);
+ pool = kzalloc(ovhd_size, GFP_KERNEL);
+ if (!pool)
+ return NULL;
+
+ spin_lock_init(&pool->lock);
+
+ return pool;
+}
+
+void xv_destroy_pool(struct xv_pool *pool)
+{
+ kfree(pool);
+}
+
+/**
+ * xv_malloc - Allocate block of given size from pool.
+ * @pool: pool to allocate from
+ * @size: size of block to allocate
+ * @page: page no. that holds the object
+ * @offset: location of object within page
+ *
+ * On success, <page, offset> identifies block allocated
+ * and 0 is returned. On failure, <page, offset> is set to
+ * 0 and -ENOMEM is returned.
+ *
+ * Allocation requests with size > XV_MAX_ALLOC_SIZE will fail.
+ */
+int xv_malloc(struct xv_pool *pool, u32 size, struct page **page,
+ u32 *offset, gfp_t flags)
+{
+ int error;
+ u32 index, tmpsize, origsize, tmpoffset;
+ struct block_header *block, *tmpblock;
+
+ *page = NULL;
+ *offset = 0;
+ origsize = size;
+
+ if (unlikely(!size || size > XV_MAX_ALLOC_SIZE))
+ return -ENOMEM;
+
+ size = ALIGN(size, XV_ALIGN);
+
+ spin_lock(&pool->lock);
+
+ index = find_block(pool, size, page, offset);
+
+ if (!*page) {
+ spin_unlock(&pool->lock);
+ if (flags & GFP_NOWAIT)
+ return -ENOMEM;
+ error = grow_pool(pool, flags);
+ if (unlikely(error))
+ return -ENOMEM;
+
+ spin_lock(&pool->lock);
+ index = find_block(pool, size, page, offset);
+ }
+
+ if (!*page) {
+ spin_unlock(&pool->lock);
+ return -ENOMEM;
+ }
+
+ block = get_ptr_atomic(*page, *offset, KM_USER0);
+
+ remove_block_head(pool, block, index);
+
+ /* Split the block if required */
+ tmpoffset = *offset + size + XV_ALIGN;
+ tmpsize = block->size - size;
+ tmpblock = (struct block_header *)((char *)block + size + XV_ALIGN);
+ if (tmpsize) {
+ tmpblock->size = tmpsize - XV_ALIGN;
+ set_flag(tmpblock, BLOCK_FREE);
+ clear_flag(tmpblock, PREV_FREE);
+
+ set_blockprev(tmpblock, *offset);
+ if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
+ insert_block(pool, *page, tmpoffset, tmpblock);
+
+ if (tmpoffset + XV_ALIGN + tmpblock->size != PAGE_SIZE) {
+ tmpblock = BLOCK_NEXT(tmpblock);
+ set_blockprev(tmpblock, tmpoffset);
+ }
+ } else {
+ /* This block is exact fit */
+ if (tmpoffset != PAGE_SIZE)
+ clear_flag(tmpblock, PREV_FREE);
+ }
+
+ block->size = origsize;
+ clear_flag(block, BLOCK_FREE);
+
+ put_ptr_atomic(block, KM_USER0);
+ spin_unlock(&pool->lock);
+
+ *offset += XV_ALIGN;
+
+ return 0;
+}
+
+/*
+ * Free block identified with <page, offset>
+ */
+void xv_free(struct xv_pool *pool, struct page *page, u32 offset)
+{
+ void *page_start;
+ struct block_header *block, *tmpblock;
+
+ offset -= XV_ALIGN;
+
+ spin_lock(&pool->lock);
+
+ page_start = get_ptr_atomic(page, 0, KM_USER0);
+ block = (struct block_header *)((char *)page_start + offset);
+
+ /* Catch double free bugs */
+ BUG_ON(test_flag(block, BLOCK_FREE));
+
+ block->size = ALIGN(block->size, XV_ALIGN);
+
+ tmpblock = BLOCK_NEXT(block);
+ if (offset + block->size + XV_ALIGN == PAGE_SIZE)
+ tmpblock = NULL;
+
+ /* Merge next block if its free */
+ if (tmpblock && test_flag(tmpblock, BLOCK_FREE)) {
+ /*
+ * Blocks smaller than XV_MIN_ALLOC_SIZE
+ * are not inserted in any free list.
+ */
+ if (tmpblock->size >= XV_MIN_ALLOC_SIZE) {
+ remove_block(pool, page,
+ offset + block->size + XV_ALIGN, tmpblock,
+ get_index_for_insert(tmpblock->size));
+ }
+ block->size += tmpblock->size + XV_ALIGN;
+ }
+
+ /* Merge previous block if its free */
+ if (test_flag(block, PREV_FREE)) {
+ tmpblock = (struct block_header *)((char *)(page_start) +
+ get_blockprev(block));
+ offset = offset - tmpblock->size - XV_ALIGN;
+
+ if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
+ remove_block(pool, page, offset, tmpblock,
+ get_index_for_insert(tmpblock->size));
+
+ tmpblock->size += block->size + XV_ALIGN;
+ block = tmpblock;
+ }
+
+ /* No used objects in this page. Free it. */
+ if (block->size == PAGE_SIZE - XV_ALIGN) {
+ put_ptr_atomic(page_start, KM_USER0);
+ spin_unlock(&pool->lock);
+
+ xv_free_page(page);
+ stat_dec(&pool->total_pages);
+ return;
+ }
+
+ set_flag(block, BLOCK_FREE);
+ if (block->size >= XV_MIN_ALLOC_SIZE)
+ insert_block(pool, page, offset, block);
+
+ if (offset + block->size + XV_ALIGN != PAGE_SIZE) {
+ tmpblock = BLOCK_NEXT(block);
+ set_flag(tmpblock, PREV_FREE);
+ set_blockprev(tmpblock, offset);
+ }
+
+ put_ptr_atomic(page_start, KM_USER0);
+ spin_unlock(&pool->lock);
+
+ return;
+}
+
+u32 xv_get_object_size(void *obj)
+{
+ struct block_header *blk;
+
+ blk = (struct block_header *)((char *)(obj) - XV_ALIGN);
+ return blk->size;
+}
+
+/*
+ * Returns total memory used by allocator (userdata + metadata)
+ */
+u64 xv_get_total_size_bytes(struct xv_pool *pool)
+{
+ return pool->total_pages << PAGE_SHIFT;
+}
diff --git a/drivers/staging/ramzswap/xvmalloc.h b/drivers/staging/ramzswap/xvmalloc.h
new file mode 100644
index 0000000..010c6fe
--- /dev/null
+++ b/drivers/staging/ramzswap/xvmalloc.h
@@ -0,0 +1,30 @@
+/*
+ * xvmalloc memory allocator
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+#ifndef _XV_MALLOC_H_
+#define _XV_MALLOC_H_
+
+#include <linux/types.h>
+
+struct xv_pool;
+
+struct xv_pool *xv_create_pool(void);
+void xv_destroy_pool(struct xv_pool *pool);
+
+int xv_malloc(struct xv_pool *pool, u32 size, struct page **page,
+ u32 *offset, gfp_t flags);
+void xv_free(struct xv_pool *pool, struct page *page, u32 offset);
+
+u32 xv_get_object_size(void *obj);
+u64 xv_get_total_size_bytes(struct xv_pool *pool);
+
+#endif
diff --git a/drivers/staging/ramzswap/xvmalloc_int.h b/drivers/staging/ramzswap/xvmalloc_int.h
new file mode 100644
index 0000000..03c1a65
--- /dev/null
+++ b/drivers/staging/ramzswap/xvmalloc_int.h
@@ -0,0 +1,86 @@
+/*
+ * xvmalloc memory allocator
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+#ifndef _XV_MALLOC_INT_H_
+#define _XV_MALLOC_INT_H_
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+/* User configurable params */
+
+/* Must be power of two */
+#define XV_ALIGN_SHIFT 2
+#define XV_ALIGN (1 << XV_ALIGN_SHIFT)
+#define XV_ALIGN_MASK (XV_ALIGN - 1)
+
+/* This must be greater than sizeof(link_free) */
+#define XV_MIN_ALLOC_SIZE 32
+#define XV_MAX_ALLOC_SIZE (PAGE_SIZE - XV_ALIGN)
+
+/* Free lists are separated by FL_DELTA bytes */
+#define FL_DELTA_SHIFT 3
+#define FL_DELTA (1 << FL_DELTA_SHIFT)
+#define FL_DELTA_MASK (FL_DELTA - 1)
+#define NUM_FREE_LISTS ((XV_MAX_ALLOC_SIZE - XV_MIN_ALLOC_SIZE) \
+ / FL_DELTA + 1)
+
+#define MAX_FLI DIV_ROUND_UP(NUM_FREE_LISTS, BITS_PER_LONG)
+
+/* End of user params */
+
+enum blockflags {
+ BLOCK_FREE,
+ PREV_FREE,
+ __NR_BLOCKFLAGS,
+};
+
+#define FLAGS_MASK XV_ALIGN_MASK
+#define PREV_MASK (~FLAGS_MASK)
+
+struct freelist_entry {
+ struct page *page;
+ u16 offset;
+ u16 pad;
+};
+
+struct link_free {
+ struct page *prev_page;
+ struct page *next_page;
+ u16 prev_offset;
+ u16 next_offset;
+};
+
+struct block_header {
+ union {
+ /* This common header must be ALIGN bytes */
+ u8 common[XV_ALIGN];
+ struct {
+ u16 size;
+ u16 prev;
+ };
+ };
+ struct link_free link;
+};
+
+struct xv_pool {
+ ulong flbitmap;
+ ulong slbitmap[MAX_FLI];
+ spinlock_t lock;
+
+ struct freelist_entry freelist[NUM_FREE_LISTS];
+
+ /* stats */
+ u64 total_pages;
+};
+
+#endif

2009-09-17 22:50:19

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 2/4] send callback when swap slot is freed

Currently, we have "swap discard" mechanism which sends a discard bio request
when we find a free cluster during scan_swap_map(). This callback can come a
long time after swap slots are actually freed.

This delay in callback is a great problem when (compressed) RAM [1] is used
as a swap device. So, this change adds a callback which is called as
soon as a swap slot becomes free. For above mentioned case of swapping
over compressed RAM device, this is very useful since we can immediately
free memory allocated for this swap page.

This callback does not replace swap discard support. It is called with
swap_lock held, so it is meant to trigger action that finishes quickly.
However, swap discard is an I/O request and can be used for taking longer
actions.

It is preferred to use this callback for ramzswap case even if discard
mechanism could be improved such that it can be called as often as required.
This is because, allocation of 'bio'(s) is undesirable since ramzswap always
operates under low memory conditions (its a swap device). Also, batching of
discard bio requests is not optimal since stale data can accumulate very
quickly in ramzswap devices, pushing system further into low memory state.

Links:
[1] http://compcache.googlecode.com/

Signed-off-by: Nitin Gupta <[email protected]>

---
include/linux/swap.h | 5 +++++
mm/swapfile.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7c15334..64796fc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -8,6 +8,7 @@
#include <linux/memcontrol.h>
#include <linux/sched.h>
#include <linux/node.h>
+#include <linux/blkdev.h>

#include <asm/atomic.h>
#include <asm/page.h>
@@ -20,6 +21,8 @@ struct bio;
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_PRIO_SHIFT 0

+typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
+
static inline int current_is_kswapd(void)
{
return current->flags & PF_KSWAPD;
@@ -155,6 +158,7 @@ struct swap_info_struct {
unsigned int max;
unsigned int inuse_pages;
unsigned int old_block_size;
+ swap_free_notify_fn *swap_free_notify_fn;
};

struct swap_list_t {
@@ -295,6 +299,7 @@ extern sector_t swapdev_block(int, pgoff_t);
extern struct swap_info_struct *get_swap_info_struct(unsigned);
extern int reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
+extern void set_swap_free_notify(struct block_device *, swap_free_notify_fn *);
struct backing_dev_info;

/* linux/mm/thrash.c */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 74f1102..b165db0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -554,6 +554,38 @@ out:
return NULL;
}

+/*
+ * Sets callback for event when swap_map[offset] == 0
+ * i.e. page at this swap offset is no longer used.
+ */
+void set_swap_free_notify(struct block_device *bdev,
+ swap_free_notify_fn *notify_fn)
+{
+ unsigned int i;
+ struct swap_info_struct *sis;
+
+ spin_lock(&swap_lock);
+ for (i = 0; i <= nr_swapfiles; i++) {
+ sis = &swap_info[i];
+ if (!(sis->flags & SWP_USED))
+ continue;
+ if (sis->bdev == bdev)
+ break;
+ }
+
+ /* swap device not found */
+ if (i > nr_swapfiles) {
+ spin_unlock(&swap_lock);
+ return;
+ }
+
+ BUG_ON(!sis || sis->swap_free_notify_fn);
+ sis->swap_free_notify_fn = notify_fn;
+ spin_unlock(&swap_lock);
+ return;
+}
+EXPORT_SYMBOL_GPL(set_swap_free_notify);
+
static int swap_entry_free(struct swap_info_struct *p,
swp_entry_t ent, int cache)
{
@@ -585,6 +617,8 @@ static int swap_entry_free(struct swap_info_struct *p,
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ if (p->swap_free_notify_fn)
+ p->swap_free_notify_fn(p->bdev, offset);
}
if (!swap_count(count))
mem_cgroup_uncharge_swap(ent);

2009-09-17 22:44:57

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 3/4] virtual block device driver (ramzswap)

Creates RAM based block devices (/dev/ramzswapX) which can be
used (only) as swap disks. Pages swapped to these are compressed
and stored in memory itself.

The module is called ramzswap.ko. It depends on:
- xvmalloc memory allocator (compiled with this driver)
- lzo_compress.ko
- lzo_decompress.ko

See ramzswap.txt for usage details.

Signed-off-by: Nitin Gupta <[email protected]>

---
drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/ramzswap/Kconfig | 21 +
drivers/staging/ramzswap/Makefile | 3 +
drivers/staging/ramzswap/ramzswap_drv.c | 1462 +++++++++++++++++++++++++++++
drivers/staging/ramzswap/ramzswap_drv.h | 173 ++++
drivers/staging/ramzswap/ramzswap_ioctl.h | 50 +
7 files changed, 1712 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig
index 10d3fcf..5cc55c4 100644
--- a/drivers/staging/Kconfig
+++ b/drivers/staging/Kconfig
@@ -131,5 +131,7 @@ source "drivers/staging/iio/Kconfig"

source "drivers/staging/cowloop/Kconfig"

+source "drivers/staging/ramzswap/Kconfig"
+
endif # !STAGING_EXCLUDE_BUILD
endif # STAGING
diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index c30093b..dca5027 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -47,3 +47,4 @@ obj-$(CONFIG_RAR_REGISTER) += rar/
obj-$(CONFIG_DX_SEP) += sep/
obj-$(CONFIG_IIO) += iio/
obj-$(CONFIG_COWLOOP) += cowloop/
+obj-$(CONFIG_RAMZSWAP) += ramzswap/
diff --git a/drivers/staging/ramzswap/Kconfig b/drivers/staging/ramzswap/Kconfig
new file mode 100644
index 0000000..24e2569
--- /dev/null
+++ b/drivers/staging/ramzswap/Kconfig
@@ -0,0 +1,21 @@
+config RAMZSWAP
+ tristate "Compressed in-memory swap device (ramzswap)"
+ depends on SWAP
+ select LZO_COMPRESS
+ select LZO_DECOMPRESS
+ default n
+ help
+ Creates virtual block devices which can be used (only) as a swap
+ disks. Pages swapped to these disks are compressed and stored in
+ memory itself.
+
+ See ramzswap.txt for more information.
+ Project home: http://compcache.googlecode.com/
+
+config RAMZSWAP_STATS
+ bool "Enable ramzswap stats"
+ depends on RAMZSWAP
+ default y
+ help
+ Enable statistics collection for ramzswap. This adds only a minimal
+ overhead. In unsure, say Y.
diff --git a/drivers/staging/ramzswap/Makefile b/drivers/staging/ramzswap/Makefile
new file mode 100644
index 0000000..507d7dc
--- /dev/null
+++ b/drivers/staging/ramzswap/Makefile
@@ -0,0 +1,3 @@
+ramzswap-objs := ramzswap_drv.o xvmalloc.o
+
+obj-$(CONFIG_RAMZSWAP) += ramzswap.o
diff --git a/drivers/staging/ramzswap/ramzswap_drv.c b/drivers/staging/ramzswap/ramzswap_drv.c
new file mode 100644
index 0000000..46c387a
--- /dev/null
+++ b/drivers/staging/ramzswap/ramzswap_drv.c
@@ -0,0 +1,1462 @@
+/*
+ * Compressed RAM based swap device
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ *
+ * Project home: http://compcache.googlecode.com
+ */
+
+#define KMSG_COMPONENT "ramzswap"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/lzo.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/vmalloc.h>
+#include <linux/version.h>
+
+#include "ramzswap_drv.h"
+
+/* Globals */
+static int ramzswap_major;
+static struct ramzswap *devices;
+
+/*
+ * Pages that compress to larger than this size are
+ * forwarded to backing swap, if present or stored
+ * uncompressed in memory otherwise.
+ */
+static unsigned int max_zpage_size;
+
+/* Module params (documentation at end) */
+static unsigned int num_devices;
+
+static int rzs_test_flag(struct ramzswap *rzs, u32 index,
+ enum rzs_pageflags flag)
+{
+ return rzs->table[index].flags & BIT(flag);
+}
+
+static void rzs_set_flag(struct ramzswap *rzs, u32 index,
+ enum rzs_pageflags flag)
+{
+ rzs->table[index].flags |= BIT(flag);
+}
+
+static void rzs_clear_flag(struct ramzswap *rzs, u32 index,
+ enum rzs_pageflags flag)
+{
+ rzs->table[index].flags &= ~BIT(flag);
+}
+
+static int page_zero_filled(void *ptr)
+{
+ u32 pos;
+ u64 *page;
+
+ page = (u64 *)ptr;
+
+ for (pos = 0; pos != PAGE_SIZE / sizeof(*page); pos++) {
+ if (page[pos])
+ return 0;
+ }
+
+ return 1;
+}
+
+/*
+ * memlimit cannot be greater than backing disk size.
+ */
+static void ramzswap_set_memlimit(struct ramzswap *rzs, size_t totalram_bytes)
+{
+ int memlimit_valid = 1;
+
+ if (!rzs->memlimit) {
+ pr_info("Memory limit not set.\n");
+ memlimit_valid = 0;
+ }
+
+ if (rzs->memlimit > rzs->disksize) {
+ pr_info("Memory limit cannot be greater than "
+ "disksize: limit=%zu, disksize=%zu\n",
+ rzs->memlimit, rzs->disksize);
+ memlimit_valid = 0;
+ }
+
+ if (!memlimit_valid) {
+ size_t mempart, disksize;
+ pr_info("Using default: smaller of (%u%% of RAM) and "
+ "(backing disk size).\n",
+ default_memlimit_perc_ram);
+ mempart = default_memlimit_perc_ram * (totalram_bytes / 100);
+ disksize = rzs->disksize;
+ rzs->memlimit = mempart > disksize ? disksize : mempart;
+ }
+
+ if (rzs->memlimit > totalram_bytes / 2) {
+ pr_info(
+ "Its not advisable setting limit more than half of "
+ "size of memory since we expect a 2:1 compression ratio. "
+ "Limit represents amount of *compressed* data we can keep "
+ "in memory!\n"
+ "\tMemory Size: %zu kB\n"
+ "\tLimit you selected: %zu kB\n"
+ "Continuing anyway ...\n",
+ totalram_bytes >> 10, rzs->memlimit >> 10
+ );
+ }
+
+ rzs->memlimit &= PAGE_MASK;
+ BUG_ON(!rzs->memlimit);
+}
+
+static void ramzswap_set_disksize(struct ramzswap *rzs, size_t totalram_bytes)
+{
+ if (!rzs->disksize) {
+ pr_info(
+ "disk size not provided. You can use disksize_kb module "
+ "param to specify size.\nUsing default: (%u%% of RAM).\n",
+ default_disksize_perc_ram
+ );
+ rzs->disksize = default_disksize_perc_ram *
+ (totalram_bytes / 100);
+ }
+
+ if (rzs->disksize > 2 * (totalram_bytes)) {
+ pr_info(
+ "There is little point creating a ramzswap of greater than "
+ "twice the size of memory since we expect a 2:1 compression "
+ "ratio. Note that ramzswap uses about 0.1%% of the size of "
+ "the swap device when not in use so a huge ramzswap is "
+ "wasteful.\n"
+ "\tMemory Size: %zu kB\n"
+ "\tSize you selected: %zu kB\n"
+ "Continuing anyway ...\n",
+ totalram_bytes >> 10, rzs->disksize
+ );
+ }
+
+ rzs->disksize &= PAGE_MASK;
+}
+
+/*
+ * Swap header (1st page of swap device) contains information
+ * to indentify it as a swap partition. Prepare such a header
+ * for ramzswap device (ramzswap0) so that swapon can identify
+ * it as swap partition. In case backing swap device is provided,
+ * copy its swap header.
+ */
+static int setup_swap_header(struct ramzswap *rzs, union swap_header *s)
+{
+ int ret = 0;
+ struct page *page;
+ struct address_space *mapping;
+ union swap_header *backing_swap_header;
+
+ /*
+ * There is no backing swap device. Create a swap header
+ * that is acceptable by swapon.
+ */
+ if (!rzs->backing_swap) {
+ s->info.version = 1;
+ s->info.last_page = (rzs->disksize >> PAGE_SHIFT) - 1;
+ s->info.nr_badpages = 0;
+ memcpy(s->magic.magic, "SWAPSPACE2", 10);
+ return 0;
+ }
+
+ /*
+ * We have a backing swap device. Copy its swap header
+ * to ramzswap device header. If this header contains
+ * invalid information (backing device not a swap
+ * partition, etc.), swapon will fail for ramzswap
+ * which is correct behavior - we don't want to swap
+ * over filesystem partition!
+ */
+
+ /* Read the backing swap header (code from sys_swapon) */
+ mapping = rzs->swap_file->f_mapping;
+ if (!mapping->a_ops->readpage) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ page = read_mapping_page(mapping, 0, rzs->swap_file);
+ if (IS_ERR(page)) {
+ ret = PTR_ERR(page);
+ goto out;
+ }
+
+ backing_swap_header = kmap(page);
+ memcpy(s, backing_swap_header, sizeof(*s));
+ if (s->info.nr_badpages) {
+ pr_info("Cannot use backing swap with bad pages (%u)\n",
+ s->info.nr_badpages);
+ ret = -EINVAL;
+ }
+ /*
+ * ramzswap disksize equals number of usable pages in backing
+ * swap. Set last_page in swap header to match this disksize
+ * ('last_page' means 0-based index of last usable swap page).
+ */
+ s->info.last_page = (rzs->disksize >> PAGE_SHIFT) - 1;
+ kunmap(page);
+
+out:
+ return ret;
+}
+
+static void ramzswap_flush_dcache_page(struct page *page)
+{
+#ifdef CONFIG_ARM
+ int flag = 0;
+ /*
+ * Ugly hack to get flush_dcache_page() work on ARM.
+ * page_mapping(page) == NULL after clearing this swap cache flag.
+ * Without clearing this flag, flush_dcache_page() will simply set
+ * "PG_dcache_dirty" bit and return.
+ */
+ if (PageSwapCache(page)) {
+ flag = 1;
+ ClearPageSwapCache(page);
+ }
+#endif
+ flush_dcache_page(page);
+#ifdef CONFIG_ARM
+ if (flag)
+ SetPageSwapCache(page);
+#endif
+}
+
+void ramzswap_ioctl_get_stats(struct ramzswap *rzs,
+ struct ramzswap_ioctl_stats *s)
+{
+ strncpy(s->backing_swap_name, rzs->backing_swap_name,
+ MAX_SWAP_NAME_LEN - 1);
+ s->backing_swap_name[MAX_SWAP_NAME_LEN - 1] = '\0';
+
+ s->disksize = rzs->disksize;
+ s->memlimit = rzs->memlimit;
+
+#if defined(CONFIG_RAMZSWAP_STATS)
+ {
+ struct ramzswap_stats *rs = &rzs->stats;
+ size_t succ_writes, mem_used;
+ unsigned int good_compress_perc = 0, no_compress_perc = 0;
+
+ mem_used = xv_get_total_size_bytes(rzs->mem_pool)
+ + (rs->pages_expand << PAGE_SHIFT);
+ succ_writes = rs->num_writes - rs->failed_writes;
+
+ if (succ_writes && rs->pages_stored) {
+ good_compress_perc = rs->good_compress * 100
+ / rs->pages_stored;
+ no_compress_perc = rs->pages_expand * 100
+ / rs->pages_stored;
+ }
+
+ s->num_reads = rs->num_reads;
+ s->num_writes = rs->num_writes;
+ s->failed_reads = rs->failed_reads;
+ s->failed_writes = rs->failed_writes;
+ s->invalid_io = rs->invalid_io;
+ s->notify_free = rs->notify_free;
+ s->pages_zero = rs->pages_zero;
+
+ s->good_compress_pct = good_compress_perc;
+ s->pages_expand_pct = no_compress_perc;
+
+ s->pages_stored = rs->pages_stored;
+ s->pages_used = mem_used >> PAGE_SHIFT;
+ s->orig_data_size = rs->pages_stored << PAGE_SHIFT;
+ s->compr_data_size = rs->compr_size;
+ s->mem_used_total = mem_used;
+
+ s->bdev_num_reads = rs->bdev_num_reads;
+ s->bdev_num_writes = rs->bdev_num_writes;
+ }
+#endif /* CONFIG_RAMZSWAP_STATS */
+}
+
+static int add_backing_swap_extent(struct ramzswap *rzs,
+ pgoff_t phy_pagenum,
+ pgoff_t num_pages)
+{
+ unsigned int idx;
+ struct list_head *head;
+ struct page *curr_page, *new_page;
+ unsigned int extents_per_page = PAGE_SIZE /
+ sizeof(struct ramzswap_backing_extent);
+
+ idx = rzs->num_extents % extents_per_page;
+ if (!idx) {
+ new_page = alloc_page(__GFP_ZERO);
+ if (!new_page)
+ return -ENOMEM;
+
+ if (rzs->num_extents) {
+ curr_page = virt_to_page(rzs->curr_extent);
+ head = &curr_page->lru;
+ } else {
+ head = &rzs->backing_swap_extent_list;
+ }
+
+ list_add(&new_page->lru, head);
+ rzs->curr_extent = page_address(new_page);
+ }
+
+ rzs->curr_extent->phy_pagenum = phy_pagenum;
+ rzs->curr_extent->num_pages = num_pages;
+
+ pr_debug("add_extent: idx=%u, phy_pgnum=%lu, num_pgs=%lu, "
+ "pg_last=%lu, curr_ext=%p\n", idx, phy_pagenum, num_pages,
+ phy_pagenum + num_pages - 1, rzs->curr_extent);
+
+ if (idx != extents_per_page - 1)
+ rzs->curr_extent++;
+
+ return 0;
+}
+
+static int setup_backing_swap_extents(struct ramzswap *rzs,
+ struct inode *inode, unsigned long *num_pages)
+{
+ int ret = 0;
+ unsigned blkbits;
+ unsigned blocks_per_page;
+ pgoff_t contig_pages = 0, total_pages = 0;
+ pgoff_t pagenum = 0, prev_pagenum = 0;
+ sector_t probe_block = 0;
+ sector_t last_block;
+
+ blkbits = inode->i_blkbits;
+ blocks_per_page = PAGE_SIZE >> blkbits;
+
+ last_block = i_size_read(inode) >> blkbits;
+ while (probe_block + blocks_per_page <= last_block) {
+ unsigned block_in_page;
+ sector_t first_block;
+
+ first_block = bmap(inode, probe_block);
+ if (first_block == 0)
+ goto bad_bmap;
+
+ /* It must be PAGE_SIZE aligned on-disk */
+ if (first_block & (blocks_per_page - 1)) {
+ probe_block++;
+ goto probe_next;
+ }
+
+ /* All blocks within this page must be contiguous on disk */
+ for (block_in_page = 1; block_in_page < blocks_per_page;
+ block_in_page++) {
+ sector_t block;
+
+ block = bmap(inode, probe_block + block_in_page);
+ if (block == 0)
+ goto bad_bmap;
+ if (block != first_block + block_in_page) {
+ /* Discontiguity */
+ probe_block++;
+ goto probe_next;
+ }
+ }
+
+ /*
+ * We found a PAGE_SIZE length, PAGE_SIZE aligned
+ * run of blocks.
+ */
+ pagenum = first_block >> (PAGE_SHIFT - blkbits);
+
+ if (total_pages && (pagenum != prev_pagenum + 1)) {
+ ret = add_backing_swap_extent(rzs, prev_pagenum -
+ (contig_pages - 1), contig_pages);
+ if (ret < 0)
+ goto out;
+ rzs->num_extents++;
+ contig_pages = 0;
+ }
+ total_pages++;
+ contig_pages++;
+ prev_pagenum = pagenum;
+ probe_block += blocks_per_page;
+
+probe_next:
+ continue;
+ }
+
+ if (contig_pages) {
+ pr_debug("adding last extent: pagenum=%lu, "
+ "contig_pages=%lu\n", pagenum, contig_pages);
+ ret = add_backing_swap_extent(rzs,
+ prev_pagenum - (contig_pages - 1), contig_pages);
+ if (ret < 0)
+ goto out;
+ rzs->num_extents++;
+ }
+ if (!rzs->num_extents) {
+ pr_err("No swap extents found!\n");
+ ret = -EINVAL;
+ }
+
+ if (!ret) {
+ *num_pages = total_pages;
+ pr_info("Found %lu extents containing %luk\n",
+ rzs->num_extents, *num_pages << (PAGE_SHIFT - 10));
+ }
+ goto out;
+
+bad_bmap:
+ pr_err("Backing swapfile has holes\n");
+ ret = -EINVAL;
+out:
+ while (ret && !list_empty(&rzs->backing_swap_extent_list)) {
+ struct page *page;
+ struct list_head *entry = rzs->backing_swap_extent_list.next;
+ page = list_entry(entry, struct page, lru);
+ list_del(entry);
+ __free_page(page);
+ }
+ return ret;
+}
+
+static void map_backing_swap_extents(struct ramzswap *rzs)
+{
+ struct ramzswap_backing_extent *se;
+ struct page *table_page, *se_page;
+ unsigned long num_pages, num_table_pages, entry;
+ unsigned long se_idx, span;
+ unsigned entries_per_page = PAGE_SIZE / sizeof(*rzs->table);
+ unsigned extents_per_page = PAGE_SIZE / sizeof(*se);
+
+ /* True for block device */
+ if (!rzs->num_extents)
+ return;
+
+ se_page = list_entry(rzs->backing_swap_extent_list.next,
+ struct page, lru);
+ se = page_address(se_page);
+ span = se->num_pages;
+ num_pages = rzs->disksize >> PAGE_SHIFT;
+ num_table_pages = DIV_ROUND_UP(num_pages * sizeof(*rzs->table),
+ PAGE_SIZE);
+
+ entry = 0;
+ se_idx = 0;
+ while (num_table_pages--) {
+ table_page = vmalloc_to_page(&rzs->table[entry]);
+ while (span <= entry) {
+ se_idx++;
+ if (se_idx == rzs->num_extents)
+ BUG();
+
+ if (!(se_idx % extents_per_page)) {
+ se_page = list_entry(se_page->lru.next,
+ struct page, lru);
+ se = page_address(se_page);
+ } else
+ se++;
+
+ span += se->num_pages;
+ }
+ table_page->mapping = (struct address_space *)se;
+ table_page->private = se->num_pages - (span - entry);
+ pr_debug("map_table: entry=%lu, span=%lu, map=%p, priv=%lu\n",
+ entry, span, table_page->mapping, table_page->private);
+ entry += entries_per_page;
+ }
+}
+
+/*
+ * Check if value of backing_swap module param is sane.
+ * Claim this device and set ramzswap size equal to
+ * size of this block device.
+ */
+static int setup_backing_swap(struct ramzswap *rzs)
+{
+ int ret = 0;
+ size_t disksize;
+ unsigned long num_pages = 0;
+ struct inode *inode;
+ struct file *swap_file;
+ struct address_space *mapping;
+ struct block_device *bdev = NULL;
+
+ if (!rzs->backing_swap_name[0]) {
+ pr_debug("backing_swap param not given\n");
+ goto out;
+ }
+
+ pr_info("Using backing swap device: %s\n", rzs->backing_swap_name);
+
+ swap_file = filp_open(rzs->backing_swap_name,
+ O_RDWR | O_LARGEFILE, 0);
+ if (IS_ERR(swap_file)) {
+ pr_err("Error opening backing device: %s\n",
+ rzs->backing_swap_name);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ mapping = swap_file->f_mapping;
+ inode = mapping->host;
+
+ if (S_ISBLK(inode->i_mode)) {
+ bdev = I_BDEV(inode);
+ ret = bd_claim(bdev, setup_backing_swap);
+ if (ret < 0) {
+ bdev = NULL;
+ goto bad_param;
+ }
+ disksize = i_size_read(inode);
+ } else if (S_ISREG(inode->i_mode)) {
+ bdev = inode->i_sb->s_bdev;
+ if (IS_SWAPFILE(inode)) {
+ ret = -EBUSY;
+ goto bad_param;
+ }
+ ret = setup_backing_swap_extents(rzs, inode, &num_pages);
+ if (ret < 0)
+ goto bad_param;
+ disksize = num_pages << PAGE_SHIFT;
+ } else {
+ goto bad_param;
+ }
+
+ rzs->swap_file = swap_file;
+ rzs->backing_swap = bdev;
+ rzs->disksize = disksize;
+ BUG_ON(!rzs->disksize);
+
+ return 0;
+
+bad_param:
+ if (bdev)
+ bd_release(bdev);
+ filp_close(swap_file, NULL);
+
+out:
+ rzs->backing_swap = NULL;
+ return ret;
+}
+
+/*
+ * Map logical page number 'pagenum' to physical page number
+ * on backing swap device. For block device, this is a nop.
+ */
+u32 map_backing_swap_page(struct ramzswap *rzs, u32 pagenum)
+{
+ u32 skip_pages, entries_per_page;
+ size_t delta, se_offset, skipped;
+ struct page *table_page, *se_page;
+ struct ramzswap_backing_extent *se;
+
+ if (!rzs->num_extents)
+ return pagenum;
+
+ entries_per_page = PAGE_SIZE / sizeof(*rzs->table);
+
+ table_page = vmalloc_to_page(&rzs->table[pagenum]);
+ se = (struct ramzswap_backing_extent *)table_page->mapping;
+ se_page = virt_to_page(se);
+
+ skip_pages = pagenum - (pagenum / entries_per_page * entries_per_page);
+ se_offset = table_page->private + skip_pages;
+
+ if (se_offset < se->num_pages)
+ return se->phy_pagenum + se_offset;
+
+ skipped = se->num_pages - table_page->private;
+ do {
+ struct ramzswap_backing_extent *se_base;
+ u32 se_entries_per_page = PAGE_SIZE / sizeof(*se);
+
+ /* Get next swap extent */
+ se_base = (struct ramzswap_backing_extent *)
+ page_address(se_page);
+ if (se - se_base == se_entries_per_page - 1) {
+ se_page = list_entry(se_page->lru.next,
+ struct page, lru);
+ se = page_address(se_page);
+ } else {
+ se++;
+ }
+
+ skipped += se->num_pages;
+ } while (skipped < skip_pages);
+
+ delta = skipped - skip_pages;
+ se_offset = se->num_pages - delta;
+
+ return se->phy_pagenum + se_offset;
+}
+
+static void ramzswap_free_page(struct ramzswap *rzs, size_t index)
+{
+ u32 clen;
+ void *obj;
+
+ struct page *page = rzs->table[index].page;
+ u32 offset = rzs->table[index].offset;
+
+ if (unlikely(!page)) {
+ if (rzs_test_flag(rzs, index, RZS_ZERO)) {
+ rzs_clear_flag(rzs, index, RZS_ZERO);
+ stat_dec(rzs->stats.pages_zero);
+ }
+ return;
+ }
+
+ if (unlikely(rzs_test_flag(rzs, index, RZS_UNCOMPRESSED))) {
+ clen = PAGE_SIZE;
+ __free_page(page);
+ rzs_clear_flag(rzs, index, RZS_UNCOMPRESSED);
+ stat_dec(rzs->stats.pages_expand);
+ goto out;
+ }
+
+ obj = kmap_atomic(page, KM_USER0) + offset;
+ clen = xv_get_object_size(obj) - sizeof(struct zobj_header);
+ kunmap_atomic(obj, KM_USER0);
+
+ xv_free(rzs->mem_pool, page, offset);
+ if (clen <= PAGE_SIZE / 2)
+ stat_dec(rzs->stats.good_compress);
+
+out:
+ rzs->stats.compr_size -= clen;
+ stat_dec(rzs->stats.pages_stored);
+
+ rzs->table[index].page = NULL;
+ rzs->table[index].offset = 0;
+}
+
+/*
+ * callback function called when swap_map[offset] == 0
+ * i.e page at this swap offset is no longer used
+ */
+static void ramzswap_free_notify(struct block_device *bdev,
+ unsigned long index)
+{
+ struct ramzswap *rzs = bdev->bd_disk->private_data;
+
+ ramzswap_free_page(rzs, index);
+ stat_inc(rzs->stats.notify_free);
+}
+
+static int handle_zero_page(struct bio *bio)
+{
+ void *user_mem;
+ struct page *page = bio->bi_io_vec[0].bv_page;
+
+ user_mem = kmap_atomic(page, KM_USER0);
+ memset(user_mem, 0, PAGE_SIZE);
+ kunmap_atomic(user_mem, KM_USER0);
+
+ ramzswap_flush_dcache_page(page);
+
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio_endio(bio, 0);
+ return 0;
+}
+
+static int handle_uncompressed_page(struct ramzswap *rzs, struct bio *bio)
+{
+ u32 index;
+ struct page *page;
+ unsigned char *user_mem, *cmem;
+
+ page = bio->bi_io_vec[0].bv_page;
+ index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+ user_mem = kmap_atomic(page, KM_USER0);
+ cmem = kmap_atomic(rzs->table[index].page, KM_USER1) +
+ rzs->table[index].offset;
+
+ memcpy(user_mem, cmem, PAGE_SIZE);
+ kunmap_atomic(user_mem, KM_USER0);
+ kunmap_atomic(cmem, KM_USER1);
+
+ ramzswap_flush_dcache_page(page);
+
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio_endio(bio, 0);
+ return 0;
+}
+
+
+/*
+ * Called when request page is not present in ramzswap.
+ * Its either in backing swap device (if present) or
+ * this is an attempt to read before any previous write
+ * to this location - this happens due to readahead when
+ * swap device is read from user-space (e.g. during swapon)
+ */
+static int handle_ramzswap_fault(struct ramzswap *rzs, struct bio *bio)
+{
+ /*
+ * Always forward such requests to backing swap
+ * device (if present)
+ */
+ if (rzs->backing_swap) {
+ u32 pagenum;
+ stat_dec(rzs->stats.num_reads);
+ stat_inc(rzs->stats.bdev_num_reads);
+ bio->bi_bdev = rzs->backing_swap;
+
+ /*
+ * In case backing swap is a file, find the right offset within
+ * the file corresponding to logical position 'index'. For block
+ * device, this is a nop.
+ */
+ pagenum = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;
+ bio->bi_sector = map_backing_swap_page(rzs, pagenum)
+ << SECTORS_PER_PAGE_SHIFT;
+ return 1;
+ }
+
+ /*
+ * Its unlikely event in case backing dev is
+ * not present
+ */
+ pr_debug("Read before write on swap device: "
+ "sector=%lu, size=%u, offset=%u\n",
+ (ulong)(bio->bi_sector), bio->bi_size,
+ bio->bi_io_vec[0].bv_offset);
+
+ /* Do nothing. Just return success */
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio_endio(bio, 0);
+ return 0;
+}
+
+static int ramzswap_read(struct ramzswap *rzs, struct bio *bio)
+{
+ int ret;
+ u32 index;
+ size_t clen;
+ struct page *page;
+ struct zobj_header *zheader;
+ unsigned char *user_mem, *cmem;
+
+ stat_inc(rzs->stats.num_reads);
+
+ page = bio->bi_io_vec[0].bv_page;
+ index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+ if (unlikely(!rzs->init_notify_callback) && PageSwapCache(page)) {
+ set_swap_free_notify(bio->bi_bdev, ramzswap_free_notify);
+ rzs->init_notify_callback = 1;
+ }
+
+ if (rzs_test_flag(rzs, index, RZS_ZERO))
+ return handle_zero_page(bio);
+
+ /* Requested page is not present in compressed area */
+ if (!rzs->table[index].page)
+ return handle_ramzswap_fault(rzs, bio);
+
+ /* Page is stored uncompressed since its incompressible */
+ if (unlikely(rzs_test_flag(rzs, index, RZS_UNCOMPRESSED)))
+ return handle_uncompressed_page(rzs, bio);
+
+ user_mem = kmap_atomic(page, KM_USER0);
+ clen = PAGE_SIZE;
+
+ cmem = kmap_atomic(rzs->table[index].page, KM_USER1) +
+ rzs->table[index].offset;
+
+ ret = lzo1x_decompress_safe(
+ cmem + sizeof(*zheader),
+ xv_get_object_size(cmem) - sizeof(*zheader),
+ user_mem, &clen);
+
+ kunmap_atomic(user_mem, KM_USER0);
+ kunmap_atomic(cmem, KM_USER1);
+
+ /* should NEVER happen */
+ if (unlikely(ret != LZO_E_OK)) {
+ pr_err("Decompression failed! err=%d, page=%u\n",
+ ret, index);
+ stat_inc(rzs->stats.failed_reads);
+ goto out;
+ }
+
+ ramzswap_flush_dcache_page(page);
+
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio_endio(bio, 0);
+ return 0;
+
+out:
+ bio_io_error(bio);
+ return 0;
+}
+
+static int ramzswap_write(struct ramzswap *rzs, struct bio *bio)
+{
+ int ret, fwd_write_request = 0;
+ u32 offset, index;
+ size_t clen;
+ struct zobj_header *zheader;
+ struct page *page, *page_store;
+ unsigned char *user_mem, *cmem, *src;
+
+ stat_inc(rzs->stats.num_writes);
+
+ page = bio->bi_io_vec[0].bv_page;
+ index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+ src = rzs->compress_buffer;
+
+ /*
+ * System swaps to same sector again when the stored page
+ * is no longer referenced by any process. So, its now safe
+ * to free the memory that was allocated for this page.
+ */
+ if (rzs->table[index].page)
+ ramzswap_free_page(rzs, index);
+
+ /*
+ * No memory ia allocated for zero filled pages.
+ * Simply clear zero page flag.
+ */
+ if (rzs_test_flag(rzs, index, RZS_ZERO)) {
+ stat_dec(rzs->stats.pages_zero);
+ rzs_clear_flag(rzs, index, RZS_ZERO);
+ }
+
+ mutex_lock(&rzs->lock);
+
+ user_mem = kmap_atomic(page, KM_USER0);
+ if (page_zero_filled(user_mem)) {
+ kunmap_atomic(user_mem, KM_USER0);
+ mutex_unlock(&rzs->lock);
+ stat_inc(rzs->stats.pages_zero);
+ rzs_set_flag(rzs, index, RZS_ZERO);
+
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio_endio(bio, 0);
+ return 0;
+ }
+
+ if (rzs->backing_swap &&
+ (rzs->stats.compr_size > rzs->memlimit - PAGE_SIZE)) {
+ kunmap_atomic(user_mem, KM_USER0);
+ mutex_unlock(&rzs->lock);
+ fwd_write_request = 1;
+ goto out;
+ }
+
+ ret = lzo1x_1_compress(user_mem, PAGE_SIZE, src, &clen,
+ rzs->compress_workmem);
+
+ kunmap_atomic(user_mem, KM_USER0);
+
+ if (unlikely(ret != LZO_E_OK)) {
+ mutex_unlock(&rzs->lock);
+ pr_err("Compression failed! err=%d\n", ret);
+ stat_inc(rzs->stats.failed_writes);
+ goto out;
+ }
+
+ /*
+ * Page is incompressible. Forward it to backing swap
+ * if present. Otherwise, store it as-is (uncompressed)
+ * since we do not want to return too many swap write
+ * errors which has side effect of hanging the system.
+ */
+ if (unlikely(clen > max_zpage_size)) {
+ if (rzs->backing_swap) {
+ mutex_unlock(&rzs->lock);
+ fwd_write_request = 1;
+ goto out;
+ }
+
+ clen = PAGE_SIZE;
+ page_store = alloc_page(GFP_NOIO | __GFP_HIGHMEM);
+ if (unlikely(!page_store)) {
+ mutex_unlock(&rzs->lock);
+ pr_info("Error allocating memory for incompressible "
+ "page: %u\n", index);
+ stat_inc(rzs->stats.failed_writes);
+ goto out;
+ }
+
+ offset = 0;
+ rzs_set_flag(rzs, index, RZS_UNCOMPRESSED);
+ stat_inc(rzs->stats.pages_expand);
+ rzs->table[index].page = page_store;
+ src = kmap_atomic(page, KM_USER0);
+ goto memstore;
+ }
+
+ if (xv_malloc(rzs->mem_pool, clen + sizeof(*zheader),
+ &rzs->table[index].page, &offset,
+ GFP_NOIO | __GFP_HIGHMEM)) {
+ mutex_unlock(&rzs->lock);
+ pr_info("Error allocating memory for compressed "
+ "page: %u, size=%zu\n", index, clen);
+ stat_inc(rzs->stats.failed_writes);
+ if (rzs->backing_swap)
+ fwd_write_request = 1;
+ goto out;
+ }
+
+memstore:
+ rzs->table[index].offset = offset;
+
+ cmem = kmap_atomic(rzs->table[index].page, KM_USER1) +
+ rzs->table[index].offset;
+
+#if 0
+ /* Back-reference needed for memory defragmentation */
+ if (!rzs_test_flag(rzs, index, RZS_UNCOMPRESSED)) {
+ zheader = (struct zobj_header *)cmem;
+ zheader->table_idx = index;
+ cmem += sizeof(*zheader);
+ }
+#endif
+
+ memcpy(cmem, src, clen);
+
+ kunmap_atomic(cmem, KM_USER1);
+ if (unlikely(rzs_test_flag(rzs, index, RZS_UNCOMPRESSED)))
+ kunmap_atomic(src, KM_USER0);
+
+ /* Update stats */
+ rzs->stats.compr_size += clen;
+ stat_inc(rzs->stats.pages_stored);
+ if (clen <= PAGE_SIZE / 2)
+ stat_inc(rzs->stats.good_compress);
+
+ mutex_unlock(&rzs->lock);
+
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio_endio(bio, 0);
+ return 0;
+
+out:
+ if (fwd_write_request) {
+ stat_inc(rzs->stats.bdev_num_writes);
+ bio->bi_bdev = rzs->backing_swap;
+#if 0
+ /*
+ * TODO: We currently have linear mapping of ramzswap and
+ * backing swap sectors. This is not desired since we want
+ * to optimize writes to backing swap to minimize disk seeks
+ * or have effective wear leveling (for SSDs). Also, a
+ * non-linear mapping is required to implement compressed
+ * on-disk swapping.
+ */
+ bio->bi_sector = get_backing_swap_page()
+ << SECTORS_PER_PAGE_SHIFT;
+#endif
+ /*
+ * In case backing swap is a file, find the right offset within
+ * the file corresponding to logical position 'index'. For block
+ * device, this is a nop.
+ */
+ bio->bi_sector = map_backing_swap_page(rzs, index)
+ << SECTORS_PER_PAGE_SHIFT;
+ return 1;
+ }
+
+ bio_io_error(bio);
+ return 0;
+}
+
+
+/*
+ * Check if request is within bounds and page aligned.
+ */
+static inline int valid_swap_request(struct ramzswap *rzs, struct bio *bio)
+{
+ if (unlikely(
+ (bio->bi_sector >= (rzs->disksize >> SECTOR_SHIFT)) ||
+ (bio->bi_sector & (SECTORS_PER_PAGE - 1)) ||
+ (bio->bi_vcnt != 1) ||
+ (bio->bi_size != PAGE_SIZE) ||
+ (bio->bi_io_vec[0].bv_offset != 0))) {
+
+ return 0;
+ }
+
+ /* swap request is valid */
+ return 1;
+}
+
+/*
+ * Handler function for all ramzswap I/O requests.
+ */
+static int ramzswap_make_request(struct request_queue *queue, struct bio *bio)
+{
+ int ret = 0;
+ struct ramzswap *rzs = queue->queuedata;
+
+ if (unlikely(!rzs->init_done)) {
+ bio_io_error(bio);
+ return 0;
+ }
+
+ if (!valid_swap_request(rzs, bio)) {
+ stat_inc(rzs->stats.invalid_io);
+ bio_io_error(bio);
+ return 0;
+ }
+
+ switch (bio_data_dir(bio)) {
+ case READ:
+ ret = ramzswap_read(rzs, bio);
+ break;
+
+ case WRITE:
+ ret = ramzswap_write(rzs, bio);
+ break;
+ }
+
+ return ret;
+}
+
+static void reset_device(struct ramzswap *rzs)
+{
+ int is_backing_blkdev = 0;
+ size_t index, num_pages;
+ unsigned entries_per_page;
+ unsigned long num_table_pages, entry = 0;
+
+ if (rzs->backing_swap && !rzs->num_extents)
+ is_backing_blkdev = 1;
+
+ num_pages = rzs->disksize >> PAGE_SHIFT;
+
+ /* Free various per-device buffers */
+ kfree(rzs->compress_workmem);
+ free_pages((unsigned long)rzs->compress_buffer, 1);
+
+ rzs->compress_workmem = NULL;
+ rzs->compress_buffer = NULL;
+
+ /* Free all pages that are still in this ramzswap device */
+ for (index = 0; index < num_pages; index++) {
+ struct page *page;
+ u16 offset;
+
+ page = rzs->table[index].page;
+ offset = rzs->table[index].offset;
+
+ if (!page)
+ continue;
+
+ if (unlikely(rzs_test_flag(rzs, index, RZS_UNCOMPRESSED)))
+ __free_page(page);
+ else
+ xv_free(rzs->mem_pool, page, offset);
+ }
+
+ entries_per_page = PAGE_SIZE / sizeof(*rzs->table);
+ num_table_pages = DIV_ROUND_UP(num_pages * sizeof(*rzs->table),
+ PAGE_SIZE);
+ /*
+ * Set page->mapping to NULL for every table page.
+ * Otherwise, we will hit bad_page() during free.
+ */
+ while (rzs->num_extents && num_table_pages--) {
+ struct page *page;
+ page = vmalloc_to_page(&rzs->table[entry]);
+ page->mapping = NULL;
+ entry += entries_per_page;
+ }
+ vfree(rzs->table);
+ rzs->table = NULL;
+
+ xv_destroy_pool(rzs->mem_pool);
+ rzs->mem_pool = NULL;
+
+ /* Free all swap extent pages */
+ while (!list_empty(&rzs->backing_swap_extent_list)) {
+ struct page *page;
+ struct list_head *entry;
+ entry = rzs->backing_swap_extent_list.next;
+ page = list_entry(entry, struct page, lru);
+ list_del(entry);
+ __free_page(page);
+ }
+ INIT_LIST_HEAD(&rzs->backing_swap_extent_list);
+ rzs->num_extents = 0;
+
+ /* Close backing swap device, if present */
+ if (rzs->backing_swap) {
+ if (is_backing_blkdev)
+ bd_release(rzs->backing_swap);
+ filp_close(rzs->swap_file, NULL);
+ rzs->backing_swap = NULL;
+ }
+
+ /* Reset stats */
+ memset(&rzs->stats, 0, sizeof(rzs->stats));
+
+ rzs->disksize = 0;
+ rzs->memlimit = 0;
+
+ /* Back to uninitialized state */
+ rzs->init_done = 0;
+}
+
+static int ramzswap_ioctl_init_device(struct ramzswap *rzs)
+{
+ int ret;
+ size_t num_pages;
+ struct page *page;
+ union swap_header *swap_header;
+
+ if (rzs->init_done) {
+ pr_info("Device already initialized!\n");
+ return -EBUSY;
+ }
+
+ ret = setup_backing_swap(rzs);
+ if (ret)
+ goto fail;
+
+ if (rzs->backing_swap)
+ ramzswap_set_memlimit(rzs, totalram_pages << PAGE_SHIFT);
+ else
+ ramzswap_set_disksize(rzs, totalram_pages << PAGE_SHIFT);
+
+ rzs->compress_workmem = kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
+ if (!rzs->compress_workmem) {
+ pr_err("Error allocating compressor working memory!\n");
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ rzs->compress_buffer = (void *)__get_free_pages(__GFP_ZERO, 1);
+ if (!rzs->compress_buffer) {
+ pr_err("Error allocating compressor buffer space\n");
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ num_pages = rzs->disksize >> PAGE_SHIFT;
+ rzs->table = vmalloc(num_pages * sizeof(*rzs->table));
+ if (!rzs->table) {
+ pr_err("Error allocating ramzswap address table\n");
+ /* To prevent accessing table entries during cleanup */
+ rzs->disksize = 0;
+ ret = -ENOMEM;
+ goto fail;
+ }
+ memset(rzs->table, 0, num_pages * sizeof(*rzs->table));
+
+ map_backing_swap_extents(rzs);
+
+ page = alloc_page(__GFP_ZERO);
+ if (!page) {
+ pr_err("Error allocating swap header page\n");
+ ret = -ENOMEM;
+ goto fail;
+ }
+ rzs->table[0].page = page;
+ rzs_set_flag(rzs, 0, RZS_UNCOMPRESSED);
+
+ swap_header = kmap(page);
+ ret = setup_swap_header(rzs, swap_header);
+ kunmap(page);
+ if (ret) {
+ pr_err("Error setting swap header\n");
+ goto fail;
+ }
+
+ set_capacity(rzs->disk, rzs->disksize >> SECTOR_SHIFT);
+
+ /*
+ * We have ident mapping of sectors for ramzswap and
+ * and the backing swap device. So, this queue flag
+ * should be according to backing dev.
+ */
+ if (!rzs->backing_swap ||
+ blk_queue_nonrot(rzs->backing_swap->bd_disk->queue))
+ queue_flag_set_unlocked(QUEUE_FLAG_NONROT, rzs->disk->queue);
+
+ rzs->mem_pool = xv_create_pool();
+ if (!rzs->mem_pool) {
+ pr_err("Error creating memory pool\n");
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ /*
+ * Pages that compress to size greater than this are forwarded
+ * to physical swap disk (if backing dev is provided)
+ * TODO: make this configurable
+ */
+ if (rzs->backing_swap)
+ max_zpage_size = max_zpage_size_bdev;
+ else
+ max_zpage_size = max_zpage_size_nobdev;
+ pr_debug("Max compressed page size: %u bytes\n", max_zpage_size);
+
+ rzs->init_done = 1;
+
+ pr_debug("Initialization done!\n");
+ return 0;
+
+fail:
+ reset_device(rzs);
+
+ pr_err("Initialization failed: err=%d\n", ret);
+ return ret;
+}
+
+static int ramzswap_ioctl_reset_device(struct ramzswap *rzs)
+{
+ if (rzs->init_done)
+ reset_device(rzs);
+
+ return 0;
+}
+
+static int ramzswap_ioctl(struct block_device *bdev, fmode_t mode,
+ unsigned int cmd, unsigned long arg)
+{
+ int ret = 0;
+ size_t disksize_kb, memlimit_kb;
+
+ struct ramzswap *rzs = bdev->bd_disk->private_data;
+
+ switch (cmd) {
+ case RZSIO_SET_DISKSIZE_KB:
+ if (rzs->init_done) {
+ ret = -EBUSY;
+ goto out;
+ }
+ if (copy_from_user(&disksize_kb, (void *)arg,
+ _IOC_SIZE(cmd))) {
+ ret = -EFAULT;
+ goto out;
+ }
+ rzs->disksize = disksize_kb << 10;
+ pr_info("Disk size set to %zu kB\n", disksize_kb);
+ break;
+
+ case RZSIO_SET_MEMLIMIT_KB:
+ if (rzs->init_done) {
+ /* TODO: allow changing memlimit */
+ ret = -EBUSY;
+ goto out;
+ }
+ if (copy_from_user(&memlimit_kb, (void *)arg,
+ _IOC_SIZE(cmd))) {
+ ret = -EFAULT;
+ goto out;
+ }
+ rzs->memlimit = memlimit_kb << 10;
+ pr_info("Memory limit set to %zu kB\n", memlimit_kb);
+ break;
+
+ case RZSIO_SET_BACKING_SWAP:
+ if (rzs->init_done) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (copy_from_user(&rzs->backing_swap_name, (void *)arg,
+ _IOC_SIZE(cmd))) {
+ ret = -EFAULT;
+ goto out;
+ }
+ rzs->backing_swap_name[MAX_SWAP_NAME_LEN - 1] = '\0';
+ pr_info("Backing swap set to %s\n", rzs->backing_swap_name);
+ break;
+
+ case RZSIO_GET_STATS:
+ {
+ struct ramzswap_ioctl_stats *stats;
+ if (!rzs->init_done) {
+ ret = -ENOTTY;
+ goto out;
+ }
+ stats = kzalloc(sizeof(*stats), GFP_KERNEL);
+ if (!stats) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ ramzswap_ioctl_get_stats(rzs, stats);
+ if (copy_to_user((void *)arg, stats, sizeof(*stats))) {
+ kfree(stats);
+ ret = -EFAULT;
+ goto out;
+ }
+ kfree(stats);
+ break;
+ }
+ case RZSIO_INIT:
+ ret = ramzswap_ioctl_init_device(rzs);
+ break;
+
+ case RZSIO_RESET:
+ /* Do not reset an active device! */
+ if (bdev->bd_holders) {
+ ret = -EBUSY;
+ goto out;
+ }
+ ret = ramzswap_ioctl_reset_device(rzs);
+ /*
+ * Racy! Device has already been swapoff'ed. Bad things
+ * can happen if another swapon is done before this reset.
+ * TODO: A callback from swapoff() will solve this problem.
+ */
+ set_swap_free_notify(bdev, NULL);
+ rzs->init_notify_callback = 0;
+ break;
+
+ default:
+ pr_info("Invalid ioctl %u\n", cmd);
+ ret = -ENOTTY;
+ }
+
+out:
+ return ret;
+}
+
+static struct block_device_operations ramzswap_devops = {
+ .ioctl = ramzswap_ioctl,
+ .owner = THIS_MODULE,
+};
+
+static void create_device(struct ramzswap *rzs, int device_id)
+{
+ mutex_init(&rzs->lock);
+ INIT_LIST_HEAD(&rzs->backing_swap_extent_list);
+
+ rzs->queue = blk_alloc_queue(GFP_KERNEL);
+ if (!rzs->queue) {
+ pr_err("Error allocating disk queue for device %d\n",
+ device_id);
+ return;
+ }
+
+ blk_queue_make_request(rzs->queue, ramzswap_make_request);
+ rzs->queue->queuedata = rzs;
+
+ /* gendisk structure */
+ rzs->disk = alloc_disk(1);
+ if (!rzs->disk) {
+ blk_cleanup_queue(rzs->queue);
+ pr_warning("Error allocating disk structure for device %d\n",
+ device_id);
+ return;
+ }
+
+ rzs->disk->major = ramzswap_major;
+ rzs->disk->first_minor = device_id;
+ rzs->disk->fops = &ramzswap_devops;
+ rzs->disk->queue = rzs->queue;
+ rzs->disk->private_data = rzs;
+ snprintf(rzs->disk->disk_name, 16, "ramzswap%d", device_id);
+
+ /*
+ * Actual capacity set using RZSIO_SET_DISKSIZE_KB ioctl
+ * or set equal to backing swap device (if provided)
+ */
+ set_capacity(rzs->disk, 0);
+ add_disk(rzs->disk);
+
+ rzs->init_done = 0;
+
+ return;
+}
+
+static void destroy_device(struct ramzswap *rzs)
+{
+ if (rzs->disk) {
+ del_gendisk(rzs->disk);
+ put_disk(rzs->disk);
+ }
+
+ if (rzs->queue)
+ blk_cleanup_queue(rzs->queue);
+}
+
+static int __init ramzswap_init(void)
+{
+ int i;
+
+ if (num_devices > max_num_devices) {
+ pr_warning("Invalid value for num_devices: %u\n",
+ num_devices);
+ return -EINVAL;
+ }
+
+ ramzswap_major = register_blkdev(0, "ramzswap");
+ if (ramzswap_major <= 0) {
+ pr_warning("Unable to get major number\n");
+ return -EBUSY;
+ }
+
+ if (!num_devices) {
+ pr_info("num_devices not specified. Using default: 1\n");
+ num_devices = 1;
+ }
+
+ /* Allocate the device array and initialize each one */
+ pr_info("Creating %u devices ...\n", num_devices);
+ devices = kzalloc(num_devices * sizeof(struct ramzswap), GFP_KERNEL);
+ if (!devices)
+ goto out;
+
+ for (i = 0; i < num_devices; i++)
+ create_device(&devices[i], i);
+
+ return 0;
+
+out:
+ unregister_blkdev(ramzswap_major, "ramzswap");
+ return -ENOMEM;
+}
+
+static void __exit ramzswap_exit(void)
+{
+ int i;
+ struct ramzswap *rzs;
+
+ for (i = 0; i < num_devices; i++) {
+ rzs = &devices[i];
+
+ destroy_device(rzs);
+ if (rzs->init_done)
+ reset_device(rzs);
+ }
+
+ unregister_blkdev(ramzswap_major, "ramzswap");
+
+ kfree(devices);
+ pr_debug("Cleanup done!\n");
+}
+
+module_param(num_devices, uint, 0);
+MODULE_PARM_DESC(num_devices, "Number of ramzswap devices");
+
+module_init(ramzswap_init);
+module_exit(ramzswap_exit);
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_AUTHOR("Nitin Gupta <[email protected]>");
+MODULE_DESCRIPTION("Compressed RAM Based Swap Device");
diff --git a/drivers/staging/ramzswap/ramzswap_drv.h b/drivers/staging/ramzswap/ramzswap_drv.h
new file mode 100644
index 0000000..f7f273f
--- /dev/null
+++ b/drivers/staging/ramzswap/ramzswap_drv.h
@@ -0,0 +1,173 @@
+/*
+ * Compressed RAM based swap device
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ *
+ * Project home: http://compcache.googlecode.com
+ */
+
+#ifndef _RAMZSWAP_DRV_H_
+#define _RAMZSWAP_DRV_H_
+
+#include "ramzswap_ioctl.h"
+#include "xvmalloc.h"
+
+/*
+ * Some arbitrary value. This is just to catch
+ * invalid value for num_devices module parameter.
+ */
+static const unsigned max_num_devices = 32;
+
+/*
+ * Stored at beginning of each compressed object.
+ *
+ * It stores back-reference to table entry which points to this
+ * object. This is required to support memory defragmentation or
+ * migrating compressed pages to backing swap disk.
+ */
+struct zobj_header {
+#if 0
+ u32 table_idx;
+#endif
+};
+
+/*-- Configurable parameters */
+
+/* Default ramzswap disk size: 25% of total RAM */
+static const unsigned default_disksize_perc_ram = 25;
+static const unsigned default_memlimit_perc_ram = 15;
+
+/*
+ * Max compressed page size when backing device is provided.
+ * Pages that compress to size greater than this are sent to
+ * physical swap disk.
+ */
+static const unsigned max_zpage_size_bdev = PAGE_SIZE / 2;
+
+/*
+ * Max compressed page size when there is no backing dev.
+ * Pages that compress to size greater than this are stored
+ * uncompressed in memory.
+ */
+static const unsigned max_zpage_size_nobdev = PAGE_SIZE / 4 * 3;
+
+/*
+ * NOTE: max_zpage_size_{bdev,nobdev} sizes must be
+ * less than or equal to:
+ * XV_MAX_ALLOC_SIZE - sizeof(struct zobj_header)
+ * since otherwise xv_malloc would always return failure.
+ */
+
+/*-- End of configurable params */
+
+#define SECTOR_SHIFT 9
+#define SECTOR_SIZE (1 << SECTOR_SHIFT)
+#define SECTORS_PER_PAGE_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
+#define SECTORS_PER_PAGE (1 << SECTORS_PER_PAGE_SHIFT)
+
+/* Debugging and Stats */
+#if defined(CONFIG_RAMZSWAP_STATS)
+#define stat_inc(stat) ((stat)++)
+#define stat_dec(stat) ((stat)--)
+#else
+#define stat_inc(x)
+#define stat_dec(x)
+#endif
+
+/* Flags for ramzswap pages (table[page_no].flags) */
+enum rzs_pageflags {
+ /* Page is stored uncompressed */
+ RZS_UNCOMPRESSED,
+
+ /* Page consists entirely of zeros */
+ RZS_ZERO,
+
+ __NR_RZS_PAGEFLAGS,
+};
+
+/*-- Data structures */
+
+/*
+ * Allocated for each swap slot, indexed by page no.
+ * These table entries must fit exactly in a page.
+ */
+struct table {
+ struct page *page;
+ u16 offset;
+ u8 count; /* object ref count (not yet used) */
+ u8 flags;
+} __attribute__((aligned(4)));;
+
+/*
+ * Swap extent information in case backing swap is a regular
+ * file. These extent entries must fit exactly in a page.
+ */
+struct ramzswap_backing_extent {
+ pgoff_t phy_pagenum;
+ pgoff_t num_pages;
+} __attribute__((aligned(4)));
+
+struct ramzswap_stats {
+ /* basic stats */
+ size_t compr_size; /* compressed size of pages stored -
+ * needed to enforce memlimit */
+ /* more stats */
+#if defined(CONFIG_RAMZSWAP_STATS)
+ u64 num_reads; /* failed + successful */
+ u64 num_writes; /* --do-- */
+ u64 failed_reads; /* can happen when memory is too low */
+ u64 failed_writes; /* should NEVER! happen */
+ u64 invalid_io; /* non-swap I/O requests */
+ u64 notify_free; /* no. of pages freed by swap free callback */
+ u32 pages_zero; /* no. of zero filled pages */
+ u32 pages_stored; /* no. of pages currently stored */
+ u32 good_compress; /* % of pages with compression ratio<=50% */
+ u32 pages_expand; /* % of incompressible pages */
+ u64 bdev_num_reads; /* no. of reads on backing dev */
+ u64 bdev_num_writes; /* no. of writes on backing dev */
+#endif
+};
+
+struct ramzswap {
+ struct xv_pool *mem_pool;
+ void *compress_workmem;
+ void *compress_buffer;
+ struct table *table;
+ struct mutex lock;
+ struct request_queue *queue;
+ struct gendisk *disk;
+ int init_done;
+ int init_notify_callback;
+ /*
+ * This is limit on compressed data size (stats.compr_size)
+ * Its applicable only when backing swap device is present.
+ */
+ size_t memlimit; /* bytes */
+ /*
+ * This is limit on amount of *uncompressed* worth of data
+ * we can hold. When backing swap device is provided, it is
+ * set equal to device size.
+ */
+ size_t disksize; /* bytes */
+
+ struct ramzswap_stats stats;
+
+ /* backing swap device info */
+ struct ramzswap_backing_extent *curr_extent;
+ struct list_head backing_swap_extent_list;
+ unsigned long num_extents;
+ char backing_swap_name[MAX_SWAP_NAME_LEN];
+ struct block_device *backing_swap;
+ struct file *swap_file;
+};
+
+/*-- */
+
+#endif
+
diff --git a/drivers/staging/ramzswap/ramzswap_ioctl.h b/drivers/staging/ramzswap/ramzswap_ioctl.h
new file mode 100644
index 0000000..a7ce039
--- /dev/null
+++ b/drivers/staging/ramzswap/ramzswap_ioctl.h
@@ -0,0 +1,50 @@
+/*
+ * Compressed RAM based swap device
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ *
+ * Project home: http://compcache.googlecode.com
+ */
+
+#ifndef _RAMZSWAP_IOCTL_H_
+#define _RAMZSWAP_IOCTL_H_
+
+#define MAX_SWAP_NAME_LEN 128
+
+struct ramzswap_ioctl_stats {
+ char backing_swap_name[MAX_SWAP_NAME_LEN];
+ u64 memlimit; /* only applicable if backing swap present */
+ u64 disksize; /* user specified or equal to backing swap
+ * size (if present) */
+ u64 num_reads; /* failed + successful */
+ u64 num_writes; /* --do-- */
+ u64 failed_reads; /* can happen when memory is too low */
+ u64 failed_writes; /* should NEVER! happen */
+ u64 invalid_io; /* non-swap I/O requests */
+ u64 notify_free; /* no. of pages freed by swap free callback */
+ u32 pages_zero; /* no. of zero filled pages */
+ u32 good_compress_pct; /* no. of pages with compression ratio<=50% */
+ u32 pages_expand_pct; /* no. of incompressible pages */
+ u32 pages_stored;
+ u32 pages_used;
+ u64 orig_data_size;
+ u64 compr_data_size;
+ u64 mem_used_total;
+ u64 bdev_num_reads; /* no. of reads on backing dev */
+ u64 bdev_num_writes; /* no. of writes on backing dev */
+} __attribute__ ((packed, aligned(4)));
+
+#define RZSIO_SET_DISKSIZE_KB _IOW('z', 0, size_t)
+#define RZSIO_SET_MEMLIMIT_KB _IOW('z', 1, size_t)
+#define RZSIO_SET_BACKING_SWAP _IOW('z', 2, unsigned char[MAX_SWAP_NAME_LEN])
+#define RZSIO_GET_STATS _IOR('z', 3, struct ramzswap_ioctl_stats)
+#define RZSIO_INIT _IO('z', 4)
+#define RZSIO_RESET _IO('z', 5)
+
+#endif

2009-09-17 22:50:36

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 4/4] documentation

Short guide on how to setup and use ramzswap.

Signed-off-by: Nitin Gupta <[email protected]>

---
drivers/staging/ramzswap/ramzswap.txt | 51 +++++++++++++++++++++++++++++++++
1 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/ramzswap/ramzswap.txt b/drivers/staging/ramzswap/ramzswap.txt
new file mode 100644
index 0000000..e9f1619
--- /dev/null
+++ b/drivers/staging/ramzswap/ramzswap.txt
@@ -0,0 +1,51 @@
+ramzswap: Compressed RAM based swap device
+-------------------------------------------
+
+Project home: http://compcache.googlecode.com/
+
+* Introduction
+
+It creates RAM based block devices which can be used (only) as swap disks.
+Pages swapped to these devices are compressed and stored in memory itself.
+See project home for use cases, performance numbers and a lot more.
+
+Individual ramzswap devices are configured and initialized using rzscontrol
+userspace utility as shown in examples below. See rzscontrol man page for more
+details.
+
+* Usage
+
+Following shows a typical sequence of steps for using ramzswap.
+
+1) Load Modules:
+ modprobe ramzswap num_devices=4
+ This creates 4 (uninitialized) devices: /dev/ramzswap{0,1,2,3}
+ (num_devices parameter is optional. Default: 1)
+
+2) Initialize:
+ Use rzscontrol utility to configure and initialize individual
+ ramzswap devices. Example:
+ rzscontrol /dev/ramzswap2 --init # uses default value of disksize_kb
+
+ *See rzscontrol man page for more details and examples*
+
+3) Activate:
+ swapon /dev/ramzswap2 # or any other initialized ramzswap device
+
+4) Stats:
+ rzscontrol /dev/ramzswap2 --stats
+
+5) Deactivate:
+ swapoff /dev/ramzswap2
+
+6) Reset:
+ rzscontrol /dev/ramzswap2 --reset
+ (This frees all the memory allocated for this device).
+
+
+Please report any problems at:
+ - Mailing list: linux-mm-cc at laptop dot org
+ - Issue tracker: http://code.google.com/p/compcache/issues/list
+
+Nitin Gupta
[email protected]

2009-09-18 06:53:25

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Fri, 2009-09-18 at 04:13 +0530, Nitin Gupta wrote:
> Currently, we have "swap discard" mechanism which sends a discard bio request
> when we find a free cluster during scan_swap_map(). This callback can come a
> long time after swap slots are actually freed.
>
> This delay in callback is a great problem when (compressed) RAM [1] is used
> as a swap device. So, this change adds a callback which is called as
> soon as a swap slot becomes free. For above mentioned case of swapping
> over compressed RAM device, this is very useful since we can immediately
> free memory allocated for this swap page.
>
> This callback does not replace swap discard support. It is called with
> swap_lock held, so it is meant to trigger action that finishes quickly.
> However, swap discard is an I/O request and can be used for taking longer
> actions.
>
> It is preferred to use this callback for ramzswap case even if discard
> mechanism could be improved such that it can be called as often as required.
> This is because, allocation of 'bio'(s) is undesirable since ramzswap always
> operates under low memory conditions (its a swap device). Also, batching of
> discard bio requests is not optimal since stale data can accumulate very
> quickly in ramzswap devices, pushing system further into low memory state.
>
> Links:
> [1] http://compcache.googlecode.com/
>
> Signed-off-by: Nitin Gupta <[email protected]>
>
> ---
> include/linux/swap.h | 5 +++++
> mm/swapfile.c | 34 ++++++++++++++++++++++++++++++++++
> 2 files changed, 39 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7c15334..64796fc 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -8,6 +8,7 @@
> #include <linux/memcontrol.h>
> #include <linux/sched.h>
> #include <linux/node.h>
> +#include <linux/blkdev.h>
>
> #include <asm/atomic.h>
> #include <asm/page.h>
> @@ -20,6 +21,8 @@ struct bio;
> #define SWAP_FLAG_PRIO_MASK 0x7fff
> #define SWAP_FLAG_PRIO_SHIFT 0
>
> +typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
> +
> static inline int current_is_kswapd(void)
> {
> return current->flags & PF_KSWAPD;
> @@ -155,6 +158,7 @@ struct swap_info_struct {
> unsigned int max;
> unsigned int inuse_pages;
> unsigned int old_block_size;
> + swap_free_notify_fn *swap_free_notify_fn;
> };
>
> struct swap_list_t {
> @@ -295,6 +299,7 @@ extern sector_t swapdev_block(int, pgoff_t);
> extern struct swap_info_struct *get_swap_info_struct(unsigned);
> extern int reuse_swap_page(struct page *);
> extern int try_to_free_swap(struct page *);
> +extern void set_swap_free_notify(struct block_device *, swap_free_notify_fn *);
> struct backing_dev_info;
>
> /* linux/mm/thrash.c */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 74f1102..b165db0 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -554,6 +554,38 @@ out:
> return NULL;
> }
>
> +/*
> + * Sets callback for event when swap_map[offset] == 0
> + * i.e. page at this swap offset is no longer used.
> + */
> +void set_swap_free_notify(struct block_device *bdev,
> + swap_free_notify_fn *notify_fn)
> +{
> + unsigned int i;
> + struct swap_info_struct *sis;
> +
> + spin_lock(&swap_lock);
> + for (i = 0; i <= nr_swapfiles; i++) {
> + sis = &swap_info[i];
> + if (!(sis->flags & SWP_USED))
> + continue;
> + if (sis->bdev == bdev)
> + break;
> + }
> +
> + /* swap device not found */
> + if (i > nr_swapfiles) {
> + spin_unlock(&swap_lock);
> + return;
> + }
> +
> + BUG_ON(!sis || sis->swap_free_notify_fn);
> + sis->swap_free_notify_fn = notify_fn;
> + spin_unlock(&swap_lock);
> + return;
> +}
> +EXPORT_SYMBOL_GPL(set_swap_free_notify);
> +
> static int swap_entry_free(struct swap_info_struct *p,
> swp_entry_t ent, int cache)
> {
> @@ -585,6 +617,8 @@ static int swap_entry_free(struct swap_info_struct *p,
> swap_list.next = p - swap_info;
> nr_swap_pages++;
> p->inuse_pages--;
> + if (p->swap_free_notify_fn)
> + p->swap_free_notify_fn(p->bdev, offset);
> }
> if (!swap_count(count))
> mem_cgroup_uncharge_swap(ent);

OK, this hits core kernel code so we need to CC some more mm/swapfile.c
people. The set_swap_free_notify() API looks strange to me. Hugh, I
think you mentioned that you're okay with an explicit hook. Any
suggestions how to do this cleanly?

Pekka

2009-09-18 07:18:01

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Fri, 18 Sep 2009, Pekka Enberg wrote:
> On Fri, 2009-09-18 at 04:13 +0530, Nitin Gupta wrote:
> > +EXPORT_SYMBOL_GPL(set_swap_free_notify);
> > +
> > static int swap_entry_free(struct swap_info_struct *p,
> > swp_entry_t ent, int cache)
> > {
> > @@ -585,6 +617,8 @@ static int swap_entry_free(struct swap_info_struct *p,
> > swap_list.next = p - swap_info;
> > nr_swap_pages++;
> > p->inuse_pages--;
> > + if (p->swap_free_notify_fn)
> > + p->swap_free_notify_fn(p->bdev, offset);
> > }
> > if (!swap_count(count))
> > mem_cgroup_uncharge_swap(ent);
>
> OK, this hits core kernel code so we need to CC some more mm/swapfile.c
> people. The set_swap_free_notify() API looks strange to me. Hugh, I
> think you mentioned that you're okay with an explicit hook. Any
> suggestions how to do this cleanly?

No, no better suggestion. I quite see Nitin's point that ramzswap
would benefit significantly from a callback here, though it's not a
place (holding swap_lock) where we'd like to offer a callback at all.

I think I would prefer the naming to make it absolutely clear that
it's a special for ramzswap or compcache, rather than dressing it
up in the grand generality of a swap_free_notify_fn: giving our
hacks fancy names doesn't really make them better.

(Does the bdev matching work out if there are any regular swapfiles
around? I've not checked, might or might not need refinement there.)

Hugh

2009-09-18 07:55:28

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

Hi Hugh,

On Fri, 2009-09-18 at 08:17 +0100, Hugh Dickins wrote:
> On Fri, 18 Sep 2009, Pekka Enberg wrote:
> > On Fri, 2009-09-18 at 04:13 +0530, Nitin Gupta wrote:
> > > +EXPORT_SYMBOL_GPL(set_swap_free_notify);
> > > +
> > > static int swap_entry_free(struct swap_info_struct *p,
> > > swp_entry_t ent, int cache)
> > > {
> > > @@ -585,6 +617,8 @@ static int swap_entry_free(struct swap_info_struct *p,
> > > swap_list.next = p - swap_info;
> > > nr_swap_pages++;
> > > p->inuse_pages--;
> > > + if (p->swap_free_notify_fn)
> > > + p->swap_free_notify_fn(p->bdev, offset);
> > > }
> > > if (!swap_count(count))
> > > mem_cgroup_uncharge_swap(ent);
> >
> > OK, this hits core kernel code so we need to CC some more mm/swapfile.c
> > people. The set_swap_free_notify() API looks strange to me. Hugh, I
> > think you mentioned that you're okay with an explicit hook. Any
> > suggestions how to do this cleanly?
>
> No, no better suggestion. I quite see Nitin's point that ramzswap
> would benefit significantly from a callback here, though it's not a
> place (holding swap_lock) where we'd like to offer a callback at all.
>
> I think I would prefer the naming to make it absolutely clear that
> it's a special for ramzswap or compcache, rather than dressing it
> up in the grand generality of a swap_free_notify_fn: giving our
> hacks fancy names doesn't really make them better.
>
> (Does the bdev matching work out if there are any regular swapfiles
> around? I've not checked, might or might not need refinement there.)

The *hook* looks OK to me but set_swap_free_notify() looks like an ugly
hack. I don't understand why we're setting up the hook lazily in
ramzswap_read() nor do I understand why we need to look up struct
swap_info_struct with a bdev. Surely there's a cleaner way to do all
this? Probably somewhere in sys_swapon()?

Pekka

2009-09-18 08:00:31

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Fri, 18 Sep 2009, Pekka Enberg wrote:
>
> The *hook* looks OK to me but set_swap_free_notify() looks like an ugly
> hack. I don't understand why we're setting up the hook lazily in
> ramzswap_read() nor do I understand why we need to look up struct
> swap_info_struct with a bdev. Surely there's a cleaner way to do all
> this? Probably somewhere in sys_swapon()?

Sounds like you have something in mind, may well be better,
but please show us a patch... (my mind is elsewhere!)

Hugh

2009-09-18 09:33:14

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

Hi Hugh,

On Fri, 2009-09-18 at 08:59 +0100, Hugh Dickins wrote:
> On Fri, 18 Sep 2009, Pekka Enberg wrote:
> >
> > The *hook* looks OK to me but set_swap_free_notify() looks like an ugly
> > hack. I don't understand why we're setting up the hook lazily in
> > ramzswap_read() nor do I understand why we need to look up struct
> > swap_info_struct with a bdev. Surely there's a cleaner way to do all
> > this? Probably somewhere in sys_swapon()?
>
> Sounds like you have something in mind, may well be better,
> but please show us a patch... (my mind is elsewhere!)

Hey, you tricked me into a world of pain! Here's a totally untested
patch that adds a "swapon" funciton to struct block_device_operations
that the ramzswap driver can implement to setup ->swap_free_notify_fn.

Pekka

>From 41827c57196f7eae66701a6ae6565c226c0b7674 Mon Sep 17 00:00:00 2001
From: Nitin Gupta <[email protected]>
Date: Fri, 18 Sep 2009 04:13:30 +0530
Subject: [PATCH] mm: swap slot free notifier

Currently, we have "swap discard" mechanism which sends a discard bio request
when we find a free cluster during scan_swap_map(). This callback can come a
long time after swap slots are actually freed.

This delay in callback is a great problem when (compressed) RAM [1] is used
as a swap device. So, this change adds a callback which is called as
soon as a swap slot becomes free. For above mentioned case of swapping
over compressed RAM device, this is very useful since we can immediately
free memory allocated for this swap page.

This callback does not replace swap discard support. It is called with
swap_lock held, so it is meant to trigger action that finishes quickly.
However, swap discard is an I/O request and can be used for taking longer
actions.

It is preferred to use this callback for ramzswap case even if discard
mechanism could be improved such that it can be called as often as required.
This is because, allocation of 'bio'(s) is undesirable since ramzswap always
operates under low memory conditions (its a swap device). Also, batching of
discard bio requests is not optimal since stale data can accumulate very
quickly in ramzswap devices, pushing system further into low memory state.

Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>
---
include/linux/blkdev.h | 2 ++
include/linux/swap.h | 16 ++++++++++++++++
mm/swapfile.c | 5 +++++
3 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e23a86c..688e024 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,6 +27,7 @@ struct scsi_ioctl_command;
struct request_queue;
struct elevator_queue;
struct request_pm_state;
+struct swap_info_struct;
struct blk_trace;
struct request;
struct sg_io_hdr;
@@ -1239,6 +1240,7 @@ struct block_device_operations {
unsigned long long);
int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
+ int (*swapon)(struct block_device *, struct swap_info_struct *);
struct module *owner;
};

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7c15334..6603009 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -8,6 +8,7 @@
#include <linux/memcontrol.h>
#include <linux/sched.h>
#include <linux/node.h>
+#include <linux/blkdev.h>

#include <asm/atomic.h>
#include <asm/page.h>
@@ -20,6 +21,8 @@ struct bio;
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_PRIO_SHIFT 0

+typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
+
static inline int current_is_kswapd(void)
{
return current->flags & PF_KSWAPD;
@@ -155,6 +158,7 @@ struct swap_info_struct {
unsigned int max;
unsigned int inuse_pages;
unsigned int old_block_size;
+ swap_free_notify_fn *swap_free_notify_fn;
};

struct swap_list_t {
@@ -297,6 +301,18 @@ extern int reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
struct backing_dev_info;

+static inline int
+blkdev_swapon(struct block_device *bdev, struct swap_info_struct *sis)
+{
+ struct gendisk *disk = bdev->bd_disk;
+ int err = 0;
+
+ if (disk->fops->swapon)
+ err = disk->fops->swapon(bdev, sis);
+
+ return err;
+}
+
/* linux/mm/thrash.c */
extern struct mm_struct *swap_token_mm;
extern void grab_swap_token(struct mm_struct *);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 74f1102..9d10f02 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -585,6 +585,8 @@ static int swap_entry_free(struct swap_info_struct *p,
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ if (p->swap_free_notify_fn)
+ p->swap_free_notify_fn(p->bdev, offset);
}
if (!swap_count(count))
mem_cgroup_uncharge_swap(ent);
@@ -1845,6 +1847,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error < 0)
goto bad_swap;
p->bdev = bdev;
+ error = blkdev_swapon(bdev, p);
+ if (error < 0)
+ goto bad_swap;
} else if (S_ISREG(inode->i_mode)) {
p->bdev = inode->i_sb->s_bdev;
mutex_lock(&inode->i_mutex);
--
1.5.6.3


2009-09-18 09:59:00

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Fri, Sep 18, 2009 at 12:47 PM, Hugh Dickins
<[email protected]> wrote:
> On Fri, 18 Sep 2009, Pekka Enberg wrote:
>> On Fri, 2009-09-18 at 04:13 +0530, Nitin Gupta wrote:
>> > +EXPORT_SYMBOL_GPL(set_swap_free_notify);
>> > +
>> > ?static int swap_entry_free(struct swap_info_struct *p,
>> > ? ? ? ? ? ? ? ? ? ? ? ?swp_entry_t ent, int cache)
>> > ?{
>> > @@ -585,6 +617,8 @@ static int swap_entry_free(struct swap_info_struct *p,
>> > ? ? ? ? ? ? ? ? ? ? swap_list.next = p - swap_info;
>> > ? ? ? ? ? ? nr_swap_pages++;
>> > ? ? ? ? ? ? p->inuse_pages--;
>> > + ? ? ? ? ? if (p->swap_free_notify_fn)
>> > + ? ? ? ? ? ? ? ? ? p->swap_free_notify_fn(p->bdev, offset);
>> > ? ? }
>> > ? ? if (!swap_count(count))
>> > ? ? ? ? ? ? mem_cgroup_uncharge_swap(ent);
>>
>> OK, this hits core kernel code so we need to CC some more mm/swapfile.c
>> people. The set_swap_free_notify() API looks strange to me. Hugh, I
>> think you mentioned that you're okay with an explicit hook. Any
>> suggestions how to do this cleanly?
>
> No, no better suggestion. ?I quite see Nitin's point that ramzswap
> would benefit significantly from a callback here, though it's not a
> place (holding swap_lock) where we'd like to offer a callback at all.
>
> I think I would prefer the naming to make it absolutely clear that
> it's a special for ramzswap or compcache, rather than dressing it
> up in the grand generality of a swap_free_notify_fn: giving our
> hacks fancy names doesn't really make them better.
>

Yes, makes sense... Since we cannot afford to have a chain of callbacks
within a spin lock, we have to keep it ramzswap specific (and rename
functions/variables to reflect this).

set_ramzswap_free_notify_fn() -> set_ramzswap_free_notify_fn()
and
swap_free_notify_fn -> ramzswap_free_notify_fn

Now, this renaming exposes ugliness of this hack in its true sense. Currently,
I don't have a cleaner solution but few points to consider:

- If we really have to do this within the lock then there cannot be
multiple callbacks.
It has to then remain ramzswap specific. In that case, current patch
looks looks like
the simplest solution.

- Do we really have to have a callback within a spin lock? Things become
very complex in ramzswap driver if we try to do this outside the lock
(I attempted
this but couldn't get it working). Still, we should think about it.

- If it can be done outside lock, we can afford to be chain of
callbacks attached
to this event. A nice generic solution. But if this means delaying
callback for too long,
then it may be unacceptable for ramzswap (we come back to problem with discard
approach).


> (Does the bdev matching work out if there are any regular swapfiles
> around? I've not checked, might or might not need refinement there.)
>

Yes.

Thanks,
Nitin

2009-09-18 15:05:16

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

Hi Pekka,

On 09/18/2009 03:03 PM, Pekka Enberg wrote:
>
> On Fri, 2009-09-18 at 08:59 +0100, Hugh Dickins wrote:
>> On Fri, 18 Sep 2009, Pekka Enberg wrote:
>>>
>>> The *hook* looks OK to me but set_swap_free_notify() looks like an ugly
>>> hack. I don't understand why we're setting up the hook lazily in
>>> ramzswap_read() nor do I understand why we need to look up struct
>>> swap_info_struct with a bdev. Surely there's a cleaner way to do all
>>> this? Probably somewhere in sys_swapon()?
>>
>> Sounds like you have something in mind, may well be better,
>> but please show us a patch... (my mind is elsewhere!)
>
> Hey, you tricked me into a world of pain! Here's a totally untested
> patch that adds a "swapon" funciton to struct block_device_operations
> that the ramzswap driver can implement to setup ->swap_free_notify_fn.
>


It is understood that this swap notify callback is bit of a hack. I think
we will not gain much trying to beautify this hack. However, I agree with
Hugh's suggestion to rename this notify callback related function/variables
to make it explicit that its completely ramzswap related. I will send a path
that affects these renames as reply to patch 0/4.

A solution that is generic while satisfying ramzswap demands may be a bit more
involved. It will be easier to incrementally develop these additions.

Thanks,
Nitin


>
>> From 41827c57196f7eae66701a6ae6565c226c0b7674 Mon Sep 17 00:00:00 2001
> From: Nitin Gupta<[email protected]>
> Date: Fri, 18 Sep 2009 04:13:30 +0530
> Subject: [PATCH] mm: swap slot free notifier
>
> Currently, we have "swap discard" mechanism which sends a discard bio request
> when we find a free cluster during scan_swap_map(). This callback can come a
> long time after swap slots are actually freed.
>
> This delay in callback is a great problem when (compressed) RAM [1] is used
> as a swap device. So, this change adds a callback which is called as
> soon as a swap slot becomes free. For above mentioned case of swapping
> over compressed RAM device, this is very useful since we can immediately
> free memory allocated for this swap page.
>
> This callback does not replace swap discard support. It is called with
> swap_lock held, so it is meant to trigger action that finishes quickly.
> However, swap discard is an I/O request and can be used for taking longer
> actions.
>
> It is preferred to use this callback for ramzswap case even if discard
> mechanism could be improved such that it can be called as often as required.
> This is because, allocation of 'bio'(s) is undesirable since ramzswap always
> operates under low memory conditions (its a swap device). Also, batching of
> discard bio requests is not optimal since stale data can accumulate very
> quickly in ramzswap devices, pushing system further into low memory state.
>
> Signed-off-by: Nitin Gupta<[email protected]>
> Signed-off-by: Pekka Enberg<[email protected]>
> ---
> include/linux/blkdev.h | 2 ++
> include/linux/swap.h | 16 ++++++++++++++++
> mm/swapfile.c | 5 +++++
> 3 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index e23a86c..688e024 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -27,6 +27,7 @@ struct scsi_ioctl_command;
> struct request_queue;
> struct elevator_queue;
> struct request_pm_state;
> +struct swap_info_struct;
> struct blk_trace;
> struct request;
> struct sg_io_hdr;
> @@ -1239,6 +1240,7 @@ struct block_device_operations {
> unsigned long long);
> int (*revalidate_disk) (struct gendisk *);
> int (*getgeo)(struct block_device *, struct hd_geometry *);
> + int (*swapon)(struct block_device *, struct swap_info_struct *);
> struct module *owner;
> };
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7c15334..6603009 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -8,6 +8,7 @@
> #include<linux/memcontrol.h>
> #include<linux/sched.h>
> #include<linux/node.h>
> +#include<linux/blkdev.h>
>
> #include<asm/atomic.h>
> #include<asm/page.h>
> @@ -20,6 +21,8 @@ struct bio;
> #define SWAP_FLAG_PRIO_MASK 0x7fff
> #define SWAP_FLAG_PRIO_SHIFT 0
>
> +typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
> +
> static inline int current_is_kswapd(void)
> {
> return current->flags& PF_KSWAPD;
> @@ -155,6 +158,7 @@ struct swap_info_struct {
> unsigned int max;
> unsigned int inuse_pages;
> unsigned int old_block_size;
> + swap_free_notify_fn *swap_free_notify_fn;
> };
>
> struct swap_list_t {
> @@ -297,6 +301,18 @@ extern int reuse_swap_page(struct page *);
> extern int try_to_free_swap(struct page *);
> struct backing_dev_info;
>
> +static inline int
> +blkdev_swapon(struct block_device *bdev, struct swap_info_struct *sis)
> +{
> + struct gendisk *disk = bdev->bd_disk;
> + int err = 0;
> +
> + if (disk->fops->swapon)
> + err = disk->fops->swapon(bdev, sis);
> +
> + return err;
> +}
> +
> /* linux/mm/thrash.c */
> extern struct mm_struct *swap_token_mm;
> extern void grab_swap_token(struct mm_struct *);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 74f1102..9d10f02 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -585,6 +585,8 @@ static int swap_entry_free(struct swap_info_struct *p,
> swap_list.next = p - swap_info;
> nr_swap_pages++;
> p->inuse_pages--;
> + if (p->swap_free_notify_fn)
> + p->swap_free_notify_fn(p->bdev, offset);
> }
> if (!swap_count(count))
> mem_cgroup_uncharge_swap(ent);
> @@ -1845,6 +1847,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> if (error< 0)
> goto bad_swap;
> p->bdev = bdev;
> + error = blkdev_swapon(bdev, p);
> + if (error< 0)
> + goto bad_swap;
> } else if (S_ISREG(inode->i_mode)) {
> p->bdev = inode->i_sb->s_bdev;
> mutex_lock(&inode->i_mutex);

2009-09-18 16:44:43

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH] ramzswap prefix for swap free callback

Swap free callback is used to inform the underlying block
device that a swap page has been freed. Currently, ramzswap
is the only user of this callback. So, rename swap notify
interface to reflect this.

Signed-off-by: Nitin Gupta <[email protected]>

---
drivers/staging/ramzswap/ramzswap_drv.c | 4 ++--
include/linux/swap.h | 7 ++++---
mm/swapfile.c | 14 +++++++-------
3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/staging/ramzswap/ramzswap_drv.c b/drivers/staging/ramzswap/ramzswap_drv.c
index 46c387a..1a7167f 100644
--- a/drivers/staging/ramzswap/ramzswap_drv.c
+++ b/drivers/staging/ramzswap/ramzswap_drv.c
@@ -761,7 +761,7 @@ static int ramzswap_read(struct ramzswap *rzs, struct bio *bio)
index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;

if (unlikely(!rzs->init_notify_callback) && PageSwapCache(page)) {
- set_swap_free_notify(bio->bi_bdev, ramzswap_free_notify);
+ set_ramzswap_free_notify(bio->bi_bdev, ramzswap_free_notify);
rzs->init_notify_callback = 1;
}

@@ -1323,7 +1323,7 @@ static int ramzswap_ioctl(struct block_device *bdev, fmode_t mode,
* can happen if another swapon is done before this reset.
* TODO: A callback from swapoff() will solve this problem.
*/
- set_swap_free_notify(bdev, NULL);
+ set_ramzswap_free_notify(bdev, NULL);
rzs->init_notify_callback = 0;
break;

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 64796fc..ace7900 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -21,7 +21,7 @@ struct bio;
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_PRIO_SHIFT 0

-typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
+typedef void (ramzswap_free_notify_fn) (struct block_device *, unsigned long);

static inline int current_is_kswapd(void)
{
@@ -158,7 +158,7 @@ struct swap_info_struct {
unsigned int max;
unsigned int inuse_pages;
unsigned int old_block_size;
- swap_free_notify_fn *swap_free_notify_fn;
+ ramzswap_free_notify_fn *ramzswap_free_notify_fn;
};

struct swap_list_t {
@@ -299,7 +299,8 @@ extern sector_t swapdev_block(int, pgoff_t);
extern struct swap_info_struct *get_swap_info_struct(unsigned);
extern int reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
-extern void set_swap_free_notify(struct block_device *, swap_free_notify_fn *);
+extern void set_ramzswap_free_notify(struct block_device *,
+ ramzswap_free_notify_fn *);
struct backing_dev_info;

/* linux/mm/thrash.c */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b165db0..0cc9c9c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -558,8 +558,8 @@ out:
* Sets callback for event when swap_map[offset] == 0
* i.e. page at this swap offset is no longer used.
*/
-void set_swap_free_notify(struct block_device *bdev,
- swap_free_notify_fn *notify_fn)
+void set_ramzswap_free_notify(struct block_device *bdev,
+ ramzswap_free_notify_fn *notify_fn)
{
unsigned int i;
struct swap_info_struct *sis;
@@ -579,12 +579,12 @@ void set_swap_free_notify(struct block_device *bdev,
return;
}

- BUG_ON(!sis || sis->swap_free_notify_fn);
- sis->swap_free_notify_fn = notify_fn;
+ BUG_ON(!sis || sis->ramzswap_free_notify_fn);
+ sis->ramzswap_free_notify_fn = notify_fn;
spin_unlock(&swap_lock);
return;
}
-EXPORT_SYMBOL_GPL(set_swap_free_notify);
+EXPORT_SYMBOL_GPL(set_ramzswap_free_notify);

static int swap_entry_free(struct swap_info_struct *p,
swp_entry_t ent, int cache)
@@ -617,8 +617,8 @@ static int swap_entry_free(struct swap_info_struct *p,
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
- if (p->swap_free_notify_fn)
- p->swap_free_notify_fn(p->bdev, offset);
+ if (p->ramzswap_free_notify_fn)
+ p->ramzswap_free_notify_fn(p->bdev, offset);
}
if (!swap_count(count))
mem_cgroup_uncharge_swap(ent);

2009-09-18 20:48:54

by Marcin Ślusarz

[permalink] [raw]
Subject: Re: [PATCH 3/4] virtual block device driver (ramzswap)

Nitin Gupta wrote:
> (...)
> +
> +static int page_zero_filled(void *ptr)
> +{
> + u32 pos;
> + u64 *page;
> +
> + page = (u64 *)ptr;
> +
> + for (pos = 0; pos != PAGE_SIZE / sizeof(*page); pos++) {
> + if (page[pos])
> + return 0;
> + }
> +
> + return 1;
> +}

Wouldn't unsigned long *page be better for both 32-bit and 64-bit machines?

(This function could return bool)

> (...)
> +static void create_device(struct ramzswap *rzs, int device_id)
> +{
> + mutex_init(&rzs->lock);
> + INIT_LIST_HEAD(&rzs->backing_swap_extent_list);
> +
> + rzs->queue = blk_alloc_queue(GFP_KERNEL);
> + if (!rzs->queue) {
> + pr_err("Error allocating disk queue for device %d\n",
> + device_id);
> + return;
> + }
> +
> + blk_queue_make_request(rzs->queue, ramzswap_make_request);
> + rzs->queue->queuedata = rzs;
> +
> + /* gendisk structure */
> + rzs->disk = alloc_disk(1);
> + if (!rzs->disk) {
> + blk_cleanup_queue(rzs->queue);
> + pr_warning("Error allocating disk structure for device %d\n",
> + device_id);
> + return;
> + }
> +
> + rzs->disk->major = ramzswap_major;
> + rzs->disk->first_minor = device_id;
> + rzs->disk->fops = &ramzswap_devops;
> + rzs->disk->queue = rzs->queue;
> + rzs->disk->private_data = rzs;
> + snprintf(rzs->disk->disk_name, 16, "ramzswap%d", device_id);
> +
> + /*
> + * Actual capacity set using RZSIO_SET_DISKSIZE_KB ioctl
> + * or set equal to backing swap device (if provided)
> + */
> + set_capacity(rzs->disk, 0);
> + add_disk(rzs->disk);
> +
> + rzs->init_done = 0;
> +
> + return;
> +}

needless return

Marcin

2009-09-18 21:10:29

by Marcin Ślusarz

[permalink] [raw]
Subject: Re: [PATCH 1/4] xvmalloc memory allocator

Nitin Gupta wrote:
> (...)
> +
> +/*
> + * Allocate a memory page. Called when a pool needs to grow.
> + */
> +static struct page *xv_alloc_page(gfp_t flags)
> +{
> + struct page *page;
> +
> + page = alloc_page(flags);
> + if (unlikely(!page))
> + return 0;
> +
> + return page;
> +}

When alloc_page returns 0 it returns 0, when not - it returns page.
Why not call alloc_page directly?

> (...)
> +/*
> + * Remove block from freelist. Index 'slindex' identifies the freelist.
> + */
> +static void remove_block(struct xv_pool *pool, struct page *page, u32 offset,
> + struct block_header *block, u32 slindex)
> +{
> + u32 flindex;
> + struct block_header *tmpblock;
> +
> + if (pool->freelist[slindex].page == page
> + && pool->freelist[slindex].offset == offset) {
> + remove_block_head(pool, block, slindex);
> + return;
> + }
> +
> + flindex = slindex / BITS_PER_LONG;
> +
> + if (block->link.prev_page) {
> + tmpblock = get_ptr_atomic(block->link.prev_page,
> + block->link.prev_offset, KM_USER1);
> + tmpblock->link.next_page = block->link.next_page;
> + tmpblock->link.next_offset = block->link.next_offset;
> + put_ptr_atomic(tmpblock, KM_USER1);
> + }
> +
> + if (block->link.next_page) {
> + tmpblock = get_ptr_atomic(block->link.next_page,
> + block->link.next_offset, KM_USER1);
> + tmpblock->link.prev_page = block->link.prev_page;
> + tmpblock->link.prev_offset = block->link.prev_offset;
> + put_ptr_atomic(tmpblock, KM_USER1);
> + }
> +
> + return;
> +}

needless return

> +
> +/*
> + * Allocate a page and add it freelist of given pool.
> + */
> +static int grow_pool(struct xv_pool *pool, gfp_t flags)
> +{
> + struct page *page;
> + struct block_header *block;
> +
> + page = xv_alloc_page(flags);
> + if (unlikely(!page))
> + return -ENOMEM;
> +
> + stat_inc(&pool->total_pages);
> +
> + spin_lock(&pool->lock);
> + block = get_ptr_atomic(page, 0, KM_USER0);
> +
> + block->size = PAGE_SIZE - XV_ALIGN;
> + set_flag(block, BLOCK_FREE);
> + clear_flag(block, PREV_FREE);
> + set_blockprev(block, 0);
> +
> + insert_block(pool, page, 0, block);
> +
> + put_ptr_atomic(block, KM_USER0);
> + spin_unlock(&pool->lock);
> +
> + return 0;
> +}
> +
> (...)
> +/**
> + * xv_malloc - Allocate block of given size from pool.
> + * @pool: pool to allocate from
> + * @size: size of block to allocate
> + * @page: page no. that holds the object
> + * @offset: location of object within page
> + *
> + * On success, <page, offset> identifies block allocated
> + * and 0 is returned. On failure, <page, offset> is set to
> + * 0 and -ENOMEM is returned.
> + *
> + * Allocation requests with size > XV_MAX_ALLOC_SIZE will fail.
> + */
> +int xv_malloc(struct xv_pool *pool, u32 size, struct page **page,
> + u32 *offset, gfp_t flags)
> +{
> + int error;
> + u32 index, tmpsize, origsize, tmpoffset;
> + struct block_header *block, *tmpblock;
> +
> + *page = NULL;
> + *offset = 0;
> + origsize = size;
> +
> + if (unlikely(!size || size > XV_MAX_ALLOC_SIZE))
> + return -ENOMEM;
> +
> + size = ALIGN(size, XV_ALIGN);
> +
> + spin_lock(&pool->lock);
> +
> + index = find_block(pool, size, page, offset);
> +
> + if (!*page) {
> + spin_unlock(&pool->lock);
> + if (flags & GFP_NOWAIT)
> + return -ENOMEM;
> + error = grow_pool(pool, flags);
> + if (unlikely(error))
> + return -ENOMEM;

shouldn't it return error? (grow_pool returns 0 or -ENOMEM for now but...)

> +
> + spin_lock(&pool->lock);
> + index = find_block(pool, size, page, offset);
> + }
> +
> + if (!*page) {
> + spin_unlock(&pool->lock);
> + return -ENOMEM;
> + }
> +
> + block = get_ptr_atomic(*page, *offset, KM_USER0);
> +
> + remove_block_head(pool, block, index);
> +
> + /* Split the block if required */
> + tmpoffset = *offset + size + XV_ALIGN;
> + tmpsize = block->size - size;
> + tmpblock = (struct block_header *)((char *)block + size + XV_ALIGN);
> + if (tmpsize) {
> + tmpblock->size = tmpsize - XV_ALIGN;
> + set_flag(tmpblock, BLOCK_FREE);
> + clear_flag(tmpblock, PREV_FREE);
> +
> + set_blockprev(tmpblock, *offset);
> + if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
> + insert_block(pool, *page, tmpoffset, tmpblock);
> +
> + if (tmpoffset + XV_ALIGN + tmpblock->size != PAGE_SIZE) {
> + tmpblock = BLOCK_NEXT(tmpblock);
> + set_blockprev(tmpblock, tmpoffset);
> + }
> + } else {
> + /* This block is exact fit */
> + if (tmpoffset != PAGE_SIZE)
> + clear_flag(tmpblock, PREV_FREE);
> + }
> +
> + block->size = origsize;
> + clear_flag(block, BLOCK_FREE);
> +
> + put_ptr_atomic(block, KM_USER0);
> + spin_unlock(&pool->lock);
> +
> + *offset += XV_ALIGN;
> +
> + return 0;
> +}
> +
> +/*
> + * Free block identified with <page, offset>
> + */
> +void xv_free(struct xv_pool *pool, struct page *page, u32 offset)
> +{
> + void *page_start;
> + struct block_header *block, *tmpblock;
> +
> + offset -= XV_ALIGN;
> +
> + spin_lock(&pool->lock);
> +
> + page_start = get_ptr_atomic(page, 0, KM_USER0);
> + block = (struct block_header *)((char *)page_start + offset);
> +
> + /* Catch double free bugs */
> + BUG_ON(test_flag(block, BLOCK_FREE));
> +
> + block->size = ALIGN(block->size, XV_ALIGN);
> +
> + tmpblock = BLOCK_NEXT(block);
> + if (offset + block->size + XV_ALIGN == PAGE_SIZE)
> + tmpblock = NULL;
> +
> + /* Merge next block if its free */
> + if (tmpblock && test_flag(tmpblock, BLOCK_FREE)) {
> + /*
> + * Blocks smaller than XV_MIN_ALLOC_SIZE
> + * are not inserted in any free list.
> + */
> + if (tmpblock->size >= XV_MIN_ALLOC_SIZE) {
> + remove_block(pool, page,
> + offset + block->size + XV_ALIGN, tmpblock,
> + get_index_for_insert(tmpblock->size));
> + }
> + block->size += tmpblock->size + XV_ALIGN;
> + }
> +
> + /* Merge previous block if its free */
> + if (test_flag(block, PREV_FREE)) {
> + tmpblock = (struct block_header *)((char *)(page_start) +
> + get_blockprev(block));
> + offset = offset - tmpblock->size - XV_ALIGN;
> +
> + if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
> + remove_block(pool, page, offset, tmpblock,
> + get_index_for_insert(tmpblock->size));
> +
> + tmpblock->size += block->size + XV_ALIGN;
> + block = tmpblock;
> + }
> +
> + /* No used objects in this page. Free it. */
> + if (block->size == PAGE_SIZE - XV_ALIGN) {
> + put_ptr_atomic(page_start, KM_USER0);
> + spin_unlock(&pool->lock);
> +
> + xv_free_page(page);
> + stat_dec(&pool->total_pages);
> + return;
> + }
> +
> + set_flag(block, BLOCK_FREE);
> + if (block->size >= XV_MIN_ALLOC_SIZE)
> + insert_block(pool, page, offset, block);
> +
> + if (offset + block->size + XV_ALIGN != PAGE_SIZE) {
> + tmpblock = BLOCK_NEXT(block);
> + set_flag(tmpblock, PREV_FREE);
> + set_blockprev(tmpblock, offset);
> + }
> +
> + put_ptr_atomic(page_start, KM_USER0);
> + spin_unlock(&pool->lock);
> +
> + return;
> +}

needless return


Marcin

2009-09-19 05:48:37

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On 09/18/2009 12:47 PM, Hugh Dickins wrote:
> On Fri, 18 Sep 2009, Pekka Enberg wrote:
>> On Fri, 2009-09-18 at 04:13 +0530, Nitin Gupta wrote:
>>> +EXPORT_SYMBOL_GPL(set_swap_free_notify);
>>> +
>>> static int swap_entry_free(struct swap_info_struct *p,
>>> swp_entry_t ent, int cache)
>>> {
>>> @@ -585,6 +617,8 @@ static int swap_entry_free(struct swap_info_struct *p,
>>> swap_list.next = p - swap_info;
>>> nr_swap_pages++;
>>> p->inuse_pages--;
>>> + if (p->swap_free_notify_fn)
>>> + p->swap_free_notify_fn(p->bdev, offset);
>>> }
>>> if (!swap_count(count))
>>> mem_cgroup_uncharge_swap(ent);
>>
>> OK, this hits core kernel code so we need to CC some more mm/swapfile.c
>> people. The set_swap_free_notify() API looks strange to me. Hugh, I
>> think you mentioned that you're okay with an explicit hook. Any
>> suggestions how to do this cleanly?
>
> No, no better suggestion. I quite see Nitin's point that ramzswap
> would benefit significantly from a callback here, though it's not a
> place (holding swap_lock) where we'd like to offer a callback at all.
>
> I think I would prefer the naming to make it absolutely clear that
> it's a special for ramzswap or compcache, rather than dressing it
> up in the grand generality of a swap_free_notify_fn: giving our
> hacks fancy names doesn't really make them better.
>

One more thing. If this renaming is done, then I think this notify
callback should no longer be unconditionally compiled. It should depend
on some ramzswap specific symbol.

Do you think you will be able to Ack this swap notify patch if above things
are done (rest of the driver is aimed at staging)? For your reference, below
is the patch to do this. I think I will have to send "v4" patches that will
include this one and related changes in ramzswap driver code.


(patch to make ramzswap notify callback conditional)
---
diff --git a/drivers/staging/ramzswap/Kconfig b/drivers/staging/ramzswap/Kconfig
index 24e2569..e9c0900 100644
--- a/drivers/staging/ramzswap/Kconfig
+++ b/drivers/staging/ramzswap/Kconfig
@@ -19,3 +19,8 @@ config RAMZSWAP_STATS
help
Enable statistics collection for ramzswap. This adds only a minimal
overhead. In unsure, say Y.
+
+config RAMZSWAP_NOTIFY
+ bool
+ depends on RAMZSWAP
+ default y
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ace7900..8755b1e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -299,8 +299,15 @@ extern sector_t swapdev_block(int, pgoff_t);
extern struct swap_info_struct *get_swap_info_struct(unsigned);
extern int reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
+#ifdef CONFIG_RAMZSWAP_NOTIFY
extern void set_ramzswap_free_notify(struct block_device *,
ramzswap_free_notify_fn *);
+#else
+void set_ramzswap_free_notify(struct block_device *bdev,
+ ramzswap_free_notify_fn *notify_fn)
+{
+}
+#endif
struct backing_dev_info;

/* linux/mm/thrash.c */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0cc9c9c..d4459e4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -554,6 +554,7 @@ out:
return NULL;
}

+#ifdef CONFIG_RAMZSWAP_NOTIFY
/*
* Sets callback for event when swap_map[offset] == 0
* i.e. page at this swap offset is no longer used.
@@ -585,6 +586,7 @@ void set_ramzswap_free_notify(struct block_device *bdev,
return;
}
EXPORT_SYMBOL_GPL(set_ramzswap_free_notify);
+#endif

static int swap_entry_free(struct swap_info_struct *p,
swp_entry_t ent, int cache)

2009-09-19 07:27:57

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

Hi Nitin,

Nitin Gupta wrote:
> It is understood that this swap notify callback is bit of a hack. I think
> we will not gain much trying to beautify this hack. However, I agree with
> Hugh's suggestion to rename this notify callback related function/variables
> to make it explicit that its completely ramzswap related. I will send a
> path that affects these renames as reply to patch 0/4.

I don't quite agree and do think that my approach is a better long-term
solution. That said, it's Hugh's call, not mine. Hugh?

Pekka

2009-09-20 15:03:53

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

Hi Pekka,

On 09/19/2009 12:57 PM, Pekka Enberg wrote:
>
> Nitin Gupta wrote:
>> It is understood that this swap notify callback is bit of a hack. I think
>> we will not gain much trying to beautify this hack. However, I agree with
>> Hugh's suggestion to rename this notify callback related
>> function/variables
>> to make it explicit that its completely ramzswap related. I will send
>> a path that affects these renames as reply to patch 0/4.
>
> I don't quite agree and do think that my approach is a better long-term
> solution. That said, it's Hugh's call, not mine. Hugh?
>

Ok, lets discard all this. I will soon start working on a generic notifier based
interface for various swap events: swapon, swapoff, swap slot free that I hope would
be more acceptable. I will now surely miss this merge window but I hope the end result
would be better.

The issue of swap_lock is still bugging me but I think atomic notifier list should
be acceptable for swap slot free event, at least for the initial revision. If this
particular event finds more users then we will have to work on reducing contention
on swap_lock (per-swap file lock?).

Thanks,
Nitin

2009-09-21 11:07:31

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Sat, 19 Sep 2009, Pekka Enberg wrote:
> Nitin Gupta wrote:
> > It is understood that this swap notify callback is bit of a hack. I think
> > we will not gain much trying to beautify this hack. However, I agree with
> > Hugh's suggestion to rename this notify callback related function/variables
> > to make it explicit that its completely ramzswap related. I will send a path
> > that affects these renames as reply to patch 0/4.
>
> I don't quite agree and do think that my approach is a better long-term
> solution. That said, it's Hugh's call, not mine. Hugh?

Sorry, Pekka, I do prefer Nitin's more explicit hackery.

Yours of course looks nicer: but again this method is actually just
for the one single use, and it is "exporting" the swap_info_struct to
the block device, whereas I'd prefer to move in the opposite direction,
making that struct internal to swapfile.c. (I'd have done so already,
but noticed TuxOnIce making use of it, and don't want to make life
awkward there.)

Is the main basis for your disgust at the way that Nitin installs the
callback, that loop down the swap_info_structs? I should point out
that it was I who imposed that on Nitin: before that he was passing a
swap entry (or was it a swap type extracted from a swap entry?
I forget), which was the sole reference to a swp_entry_t in his
driver - I advised a bdev interface.

Would a compromise be to extend the #ifdef CONFIG_HIBERNATION around
swap_type_of() to cover ramzswap too, then Nitin use swap_type_of()
on his bdev to get a swap type to use to install the notifier?

I'm not saying that would be better, haven't even thought through
if it works: I'm just looking for a compromise, whereby you and I
don't keep sending Nitin scurrying off in opposite directions.

Hugh

2009-09-21 11:12:28

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

Hi Hugh,

On Mon, 2009-09-21 at 12:07 +0100, Hugh Dickins wrote:
> Is the main basis for your disgust at the way that Nitin installs the
> callback, that loop down the swap_info_structs? I should point out
> that it was I who imposed that on Nitin: before that he was passing a
> swap entry (or was it a swap type extracted from a swap entry?
> I forget), which was the sole reference to a swp_entry_t in his
> driver - I advised a bdev interface.

The callback setup from ->read() just looks gross. However, it's your
call Hugh so I'll just shut up now. ;-)

Pekka

2009-09-21 11:17:53

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Sun, 20 Sep 2009, Nitin Gupta wrote:
>
> Ok, lets discard all this.

I think we don't have the right infrastructure even for discard yet ;)

> I will soon start working on a generic notifier based
> interface for various swap events: swapon, swapoff, swap slot free that I hope would
> be more acceptable. I will now surely miss this merge window but I hope the end result
> would be better.

Don't worry about missing the merge window, I think it was already missed.
But beware of overdesign: who will be using these notifiers than you?

That's a rhetorical question: I just don't want you expending your time
on something fancy, then see it rejected next time around. If you can
convince that what you come up with will be generally useful, great,
it'll go in; but I'm not interested in window dressing.

>
> The issue of swap_lock is still bugging me but I think atomic notifier list should
> be acceptable for swap slot free event, at least for the initial revision. If this
> particular event finds more users then we will have to work on reducing contention
> on swap_lock (per-swap file lock?).

The swap_lock in swap_free is troubling, and argues strongly against
pretending a general interface; but I quite understand your difficulty
in doing without it, it was awkward when I tried to discard near there.

Hugh

2009-09-21 11:55:26

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Mon, 21 Sep 2009, Pekka Enberg wrote:
> On Mon, 2009-09-21 at 12:07 +0100, Hugh Dickins wrote:
> > Is the main basis for your disgust at the way that Nitin installs the
> > callback, that loop down the swap_info_structs? I should point out
> > that it was I who imposed that on Nitin: before that he was passing a
> > swap entry (or was it a swap type extracted from a swap entry?
> > I forget), which was the sole reference to a swp_entry_t in his
> > driver - I advised a bdev interface.
>
> The callback setup from ->read() just looks gross. However, it's your
> call Hugh so I'll just shut up now. ;-)

Ah, no, please don't! So it's _that_ end of it that's upsetting you,
and rightly so, I hadn't grasped that.

I must have glanced at that end, setting the notifier in ramzswap_read(),
in a previous revision, to have spotted the swp_entry_t that was there;
but haven't refreshed my memory of it recently.

(Nitin, your patch division is quite wrong: you should present a patch
in which your driver works, albeit poorly, without the notifier; then
a patch in which the notifier is added at the swapfile.c end and your
driver end, so we can see how they fit together.)

I'd naively hoped when suggesting the bdev interface that it would
then be usable at opening time, but I guess we don't know enough then.

I don't think installing the notifier at ramzswap_read() time quite
covers all cases: though it's a exceptional, imagine writing some
stuff out to swap, then swapoff called while all those pages are
still in swapcache - no reads would be done, the swap would be
freed, but the notifier never even installed, let alone called.

Well, it remains the case that I don't have time to review this at
present. But when I get back here I ought to take another look at
your patch (if it's not superseded by something obviously better
from one or another).

Though exporting the swap_info_struct still bothers me, and it
seems convoluted that the block device should have a method, so
swapon can call the block device, so the block device can call
swapfile.c to install a callout, so that swapfile.c can call the
block device when freeing swap. I'm not saying there is a better
way, just that I'd be glad of a better way.

Hugh

2009-09-21 12:01:03

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Mon, Sep 21, 2009 at 2:55 PM, Hugh Dickins
<[email protected]> wrote:
>> The callback setup from ->read() just looks gross. However, it's your
>> call Hugh so I'll just shut up now. ;-)
>
> Ah, no, please don't! ?So it's _that_ end of it that's upsetting you,
> and rightly so, I hadn't grasped that.

Heh, right. :-)

On Mon, Sep 21, 2009 at 2:55 PM, Hugh Dickins
<[email protected]> wrote:
> (Nitin, your patch division is quite wrong: you should present a patch
> in which your driver works, albeit poorly, without the notifier; then
> a patch in which the notifier is added at the swapfile.c end and your
> driver end, so we can see how they fit together.)

Yes, please. Such standalone driver could go into drivers/staging for
2.6.32 probably.

2009-09-21 12:08:06

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

Hi Hugh,

On Mon, Sep 21, 2009 at 2:55 PM, Hugh Dickins
<[email protected]> wrote:
> Though exporting the swap_info_struct still bothers me, and it
> seems convoluted that the block device should have a method, so
> swapon can call the block device, so the block device can call
> swapfile.c to install a callout, so that swapfile.c can call the
> block device when freeing swap. ?I'm not saying there is a better
> way, just that I'd be glad of a better way.

I guess we can combine my ->swapon() hook in struct
block_device_operations with Nitin's set_swap_free_notify() function
to avoid exporting struct swap_info_struct. Alternatively, we could
add the ->swap_free() hook too struct block_device_operations. In any
case, I don't think we can do the setup at sys_open() if we keep
Nitin's hook as struct swap_info_struct is not set up until
sys_swapon().

Pekka

2009-09-21 12:36:11

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On 09/21/2009 05:38 PM, Pekka Enberg wrote:
> Hi Hugh,
>
> On Mon, Sep 21, 2009 at 2:55 PM, Hugh Dickins
> <[email protected]> wrote:
>> Though exporting the swap_info_struct still bothers me, and it
>> seems convoluted that the block device should have a method, so
>> swapon can call the block device, so the block device can call
>> swapfile.c to install a callout, so that swapfile.c can call the
>> block device when freeing swap. I'm not saying there is a better
>> way, just that I'd be glad of a better way.
>
> I guess we can combine my ->swapon() hook in struct
> block_device_operations with Nitin's set_swap_free_notify() function
> to avoid exporting struct swap_info_struct. Alternatively, we could
> add the ->swap_free() hook too struct block_device_operations. In any
> case, I don't think we can do the setup at sys_open() if we keep
> Nitin's hook as struct swap_info_struct is not set up until
> sys_swapon().
>
>


I just converted all this to use more generic notifier interface and tested
it by adding ramzswap notifier to list and it all works. I will shortly send
RFC patches for swapfile.c changes and corresponding ramzswap changes to show
its usage.

I'm still not sure if these notifiers will find more users, making all this
look like a mere decoration. But, at least it looks better than whatever
hack we currently have :)

(I will do better patch division when I post the next revision).

Thanks,
Nitin

2009-09-22 03:10:59

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On 09/21/2009 05:31 PM, Pekka Enberg wrote:
> On Mon, Sep 21, 2009 at 2:55 PM, Hugh Dickins
>
> On Mon, Sep 21, 2009 at 2:55 PM, Hugh Dickins
> <[email protected]> wrote:
>> (Nitin, your patch division is quite wrong: you should present a patch
>> in which your driver works, albeit poorly, without the notifier; then
>> a patch in which the notifier is added at the swapfile.c end and your
>> driver end, so we can see how they fit together.)
>
> Yes, please. Such standalone driver could go into drivers/staging for
> 2.6.32 probably.
>


Ok, for now all I will remove all swap notifier bits from the patches so
they reside completely in drivers/staging/. That should make it easier for
GregKH to merge it. The problematic swap notifier part will be added later.

Thanks to you and Hugh for looking into these patches.

Nitin

2009-09-22 03:51:40

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 1/4] xvmalloc memory allocator

Sorry for late reply. I nearly missed this mail. My comments inline.

On 09/19/2009 02:35 AM, Marcin Slusarz wrote:
> Nitin Gupta wrote:
>> (...)
>> +
>> +/*
>> + * Allocate a memory page. Called when a pool needs to grow.
>> + */
>> +static struct page *xv_alloc_page(gfp_t flags)
>> +{
>> + struct page *page;
>> +
>> + page = alloc_page(flags);
>> + if (unlikely(!page))
>> + return 0;
>> +
>> + return page;
>> +}
>
> When alloc_page returns 0 it returns 0, when not - it returns page.
> Why not call alloc_page directly?
>

We now call alloc_page() and __free_page directly. Removed these wrappers.

>> (...)
>> +/*
>> + * Remove block from freelist. Index 'slindex' identifies the freelist.
>> + */
>> +static void remove_block(struct xv_pool *pool, struct page *page, u32 offset,
>> + struct block_header *block, u32 slindex)
>> +{
>> + u32 flindex;
>> + struct block_header *tmpblock;
<snip>
>> +
>> + return;
>> +}
>
> needless return
>

Removed.


>> +int xv_malloc(struct xv_pool *pool, u32 size, struct page **page,
>> + u32 *offset, gfp_t flags)
>> +{
>> + int error;
>> +
<snip>
>> + if (!*page) {
>> + spin_unlock(&pool->lock);
>> + if (flags & GFP_NOWAIT)
>> + return -ENOMEM;
>> + error = grow_pool(pool, flags);
>> + if (unlikely(error))
>> + return -ENOMEM;
>
> shouldn't it return error? (grow_pool returns 0 or -ENOMEM for now but...)
>

Yes, it should return error. Corrected.


>> +
>> + spin_lock(&pool->lock);
>> + index = find_block(pool, size, page, offset);
>> + }
>> +
>> + if (!*page) {
>> + spin_unlock(&pool->lock);
>> + return -ENOMEM;
>> + }
>> +
>> + block = get_ptr_atomic(*page, *offset, KM_USER0);
>> +
>> + remove_block_head(pool, block, index);
>> +
>> + /* Split the block if required */
>> + tmpoffset = *offset + size + XV_ALIGN;
>> + tmpsize = block->size - size;
>> + tmpblock = (struct block_header *)((char *)block + size + XV_ALIGN);
>> + if (tmpsize) {
>> + tmpblock->size = tmpsize - XV_ALIGN;
>> + set_flag(tmpblock, BLOCK_FREE);
>> + clear_flag(tmpblock, PREV_FREE);
>> +
>> + set_blockprev(tmpblock, *offset);
>> + if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
>> + insert_block(pool, *page, tmpoffset, tmpblock);
>> +
>> + if (tmpoffset + XV_ALIGN + tmpblock->size != PAGE_SIZE) {
>> + tmpblock = BLOCK_NEXT(tmpblock);
>> + set_blockprev(tmpblock, tmpoffset);
>> + }
>> + } else {
>> + /* This block is exact fit */
>> + if (tmpoffset != PAGE_SIZE)
>> + clear_flag(tmpblock, PREV_FREE);
>> + }
>> +
>> + block->size = origsize;
>> + clear_flag(block, BLOCK_FREE);
>> +
>> + put_ptr_atomic(block, KM_USER0);
>> + spin_unlock(&pool->lock);
>> +
>> + *offset += XV_ALIGN;
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * Free block identified with <page, offset>
>> + */
>> +void xv_free(struct xv_pool *pool, struct page *page, u32 offset)
>> +{
<snip>
>> + return;
>> +}
>
> needless return
>
>

Removed.


Regarding your comments on page_zero_filled: I'm not sure if using unsigned
long is better or just u64 irrespective of arch. I just changed it to ulong
-- some bechmarks can help decide which one is optimal. Maybe we need arch
specific optimized versions which means moving it to lib/ or something.


Thanks for your feedback.

Nitin

2009-09-24 01:41:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/4] send callback when swap slot is freed

On Fri, 18 Sep 2009 04:13:30 +0530
Nitin Gupta <[email protected]> wrote:

> @@ -585,6 +617,8 @@ static int swap_entry_free(struct swap_info_struct *p,
> swap_list.next = p - swap_info;
> nr_swap_pages++;
> p->inuse_pages--;
> + if (p->swap_free_notify_fn)
> + p->swap_free_notify_fn(p->bdev, offset);
> }
> if (!swap_count(count))
> mem_cgroup_uncharge_swap(ent);

A nitpick but I feel I have to explain why mem_cgroup_ucharge_swap() is called
here. (difference with p->swap_free_notify_fn)

if (!swap_count(count))

means "It seems no users for this swap entry but we're not sure there are
SwapCache for this entry or not."

In mem_cgroup_uncharge_swap(), swap_cgroup is checked and if there is
a record (which means the SwapCache is not mapped anywhere), swap usage
is uncharged.

This is for race window at freeing swap entry via
zap_pte_range() => free_swap_and_cache().
(swap entry is not freed if the page is locked.)

I'll add some explanation in next series of memcg-cleanup patches.

Thanks,
-Kame