2005-04-12 12:44:49

by Nick Piggin

Subject: [patch 0/9] various (mainly mempool fixes and block layer improvements)

Andrew, please consider.

Ken, you'll probably have something similar to this if you
were following various random threads closely and picking
out my various random patches ;)

--
SUSE Labs, Novell Inc.


2005-04-12 12:54:06

by Nick Piggin

Subject: [patch 1/9] GFP_ZERO fix

__GFP_ZERO really shouldn't tempt fate.
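
The flag was missing from GFP_LEVEL_MASK, and - per the comment in the hunk
below - forgetting to add a flag to that mask makes the kernel crash. As a
minimal userspace model of the check being fixed (not part of the posted
patch; the mask is abbreviated to two bits for illustration):

#include <assert.h>

#define __GFP_WAIT      0x10u
#define __GFP_ZERO      0x8000u

/* abbreviated masks - the real ones carry many more bits */
#define GFP_LEVEL_MASK_OLD      (__GFP_WAIT)
#define GFP_LEVEL_MASK_NEW      (__GFP_WAIT | __GFP_ZERO)

int main(void)
{
        unsigned int flags = __GFP_WAIT | __GFP_ZERO;

        /* old mask: __GFP_ZERO looks like an unknown bit */
        assert((flags & ~GFP_LEVEL_MASK_OLD) != 0);

        /* new mask: the flag is recognised and passed through */
        assert((flags & ~GFP_LEVEL_MASK_NEW) == 0);
        return 0;
}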

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h 2005-04-12 22:05:49.000000000 +1000
+++ linux-2.6/include/linux/gfp.h 2005-04-12 22:26:10.000000000 +1000
@@ -44,8 +44,8 @@ struct vm_area_struct;

/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
- __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
- __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
+ __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL| \
+ __GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP|__GFP_ZERO)

#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)


Attachments:
GFP_ZERO-fix.patch (806.00 B)

2005-04-12 13:02:29

by Nick Piggin

Subject: [patch 7/9] blk: efficiency improvements

In the case where the request cannot be merged by the elevator, don't
retake the lock and retry the merge mechanism after allocating a
new request.

Instead, assume that the chance of a merge remains slim, and now that
we've done most of the work of allocating a request we may as well just
go with it.

Also get rid of the GFP_ATOMIC allocation: we've got working mempools
for the block layer now, so let's save atomic memory for things like
networking.

Lastly, in get_request_wait, do an initial get_request call before
going into the waitqueue. This is reported to help efficiency.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/drivers/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/ll_rw_blk.c 2005-04-12 22:26:14.000000000 +1000
+++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-12 22:26:14.000000000 +1000
@@ -1952,10 +1952,11 @@ out:
*/
static struct request *get_request_wait(request_queue_t *q, int rw)
{
- DEFINE_WAIT(wait);
struct request *rq;

- do {
+ rq = get_request(q, rw, GFP_NOIO);
+ while (!rq) {
+ DEFINE_WAIT(wait);
struct request_list *rl = &q->rq;

prepare_to_wait_exclusive(&rl->wait[rw], &wait,
@@ -1980,7 +1981,7 @@ static struct request *get_request_wait(
put_io_context(ioc);
}
finish_wait(&rl->wait[rw], &wait);
- } while (!rq);
+ }

return rq;
}
@@ -2557,7 +2558,7 @@ EXPORT_SYMBOL(__blk_attempt_remerge);

static int __make_request(request_queue_t *q, struct bio *bio)
{
- struct request *req, *freereq = NULL;
+ struct request *req;
int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, err;
sector_t sector;

@@ -2582,14 +2583,9 @@ static int __make_request(request_queue_
goto end_io;
}

-again:
spin_lock_irq(q->queue_lock);

- if (elv_queue_empty(q)) {
- blk_plug_device(q);
- goto get_rq;
- }
- if (barrier)
+ if (unlikely(barrier) || elv_queue_empty(q))
goto get_rq;

el_ret = elv_merge(q, &req, bio);
@@ -2632,40 +2628,23 @@ again:
elv_merged_request(q, req);
goto out;

- /*
- * elevator says don't/can't merge. get new request
- */
- case ELEVATOR_NO_MERGE:
- break;
-
+ /* ELV_NO_MERGE: elevator says don't/can't merge. */
default:
- printk("elevator returned crap (%d)\n", el_ret);
- BUG();
+ ;
}

+get_rq:
/*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
+ * Grab a free request. This is might sleep but can not fail.
+ */
+ spin_unlock_irq(q->queue_lock);
+ req = get_request_wait(q, rw);
+ /*
+ * After dropping the lock and possibly sleeping here, our request
+ * may now be mergeable after it had proven unmergeable (above).
+ * We don't worry about that case for efficiency. It won't happen
+ * often, and the elevators are able to handle it.
*/
-get_rq:
- if (freereq) {
- req = freereq;
- freereq = NULL;
- } else {
- spin_unlock_irq(q->queue_lock);
- if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
- /*
- * READA bit set
- */
- err = -EWOULDBLOCK;
- if (bio_rw_ahead(bio))
- goto end_io;
-
- freereq = get_request_wait(q, rw);
- }
- goto again;
- }

req->flags |= REQ_CMD;

@@ -2693,10 +2672,11 @@ get_rq:
req->rq_disk = bio->bi_bdev->bd_disk;
req->start_time = jiffies;

+ spin_lock_irq(q->queue_lock);
+ if (elv_queue_empty(q))
+ blk_plug_device(q);
add_request(q, req);
out:
- if (freereq)
- __blk_put_request(q, freereq);
if (bio_sync(bio))
__generic_unplug_device(q);


Attachments:
blk-efficient.patch (3.50 kB)

2005-04-12 13:02:28

by Nick Piggin

Subject: [patch 6/9] blk: unplug later

get_request_wait needn't unplug the device immediately.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/drivers/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/ll_rw_blk.c 2005-04-12 22:26:13.000000000 +1000
+++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-12 22:26:14.000000000 +1000
@@ -1955,7 +1955,6 @@ static struct request *get_request_wait(
DEFINE_WAIT(wait);
struct request *rq;

- generic_unplug_device(q);
do {
struct request_list *rl = &q->rq;

@@ -1967,6 +1966,7 @@ static struct request *get_request_wait(
if (!rq) {
struct io_context *ioc;

+ generic_unplug_device(q);
io_schedule();

/*


Attachments:
blk-unplug-later.patch (726.00 B)

2005-04-12 13:06:58

by Nick Piggin

Subject: [patch doh/9] mempool simplify alloc

Mempool is pretty clever. Looks too clever for its own good.
It shouldn't really know so much about MM internals.

- don't guess about what effective page reclaim might involve.

- don't randomly flush out all dirty data if some unlikely thing
happens (alloc returns NULL). page reclaim can (sort of :P) handle
it.

I think the main motivation is trying to avoid pool->lock at all
costs. However the first allocation is attempted with __GFP_WAIT
cleared, so it will be 'can_try_harder' if it hits the page allocator.
So if allocation still fails, then we can probably afford to hit the
pool->lock - and what's the alternative? Try page reclaim and hit
zone->lru_lock?
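
As a rough userspace model of the resulting flow - first attempt without
__GFP_WAIT, then the reserved pool under pool->lock, then sleep and retry
with waiting allowed - the following is only a sketch of the hunk below,
with simplified names, not kernel code:

#include <pthread.h>
#include <sched.h>
#include <stddef.h>

struct pool_model {
        pthread_mutex_t lock;
        int curr_nr;
        void *elements[8];
        void *(*alloc)(int may_wait);   /* stand-in for pool->alloc */
};

void *mempool_alloc_model(struct pool_model *pool, int may_wait)
{
        void *element;
        int wait_now = 0;       /* first pass: never wait in the allocator */

        for (;;) {
                element = pool->alloc(wait_now);
                if (element)
                        return element;

                /* fall back to the reserved pool; pool->lock is cheap
                 * compared with going into page reclaim */
                pthread_mutex_lock(&pool->lock);
                if (pool->curr_nr) {
                        element = pool->elements[--pool->curr_nr];
                        pthread_mutex_unlock(&pool->lock);
                        return element;
                }
                pthread_mutex_unlock(&pool->lock);

                if (!may_wait)
                        return NULL;

                /* the real code does prepare_to_wait() on pool->wait and
                 * io_schedule() here, then retries with __GFP_WAIT restored */
                wait_now = 1;
                sched_yield();
        }
}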

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c 2005-04-12 22:47:02.000000000 +1000
+++ linux-2.6/mm/mempool.c 2005-04-12 22:47:02.000000000 +1000
@@ -198,36 +198,22 @@ void * mempool_alloc(mempool_t *pool, un
void *element;
unsigned long flags;
DEFINE_WAIT(wait);
- int gfp_nowait;
+ int gfp_temp;

+ might_sleep_if(gfp_mask & __GFP_WAIT);
+
gfp_mask |= __GFP_MEMPOOL;
gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
gfp_mask |= __GFP_NOWARN; /* failures are OK */
- gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

- might_sleep_if(gfp_mask & __GFP_WAIT);
+ gfp_temp = gfp_mask & ~__GFP_WAIT;
+
repeat_alloc:
- element = pool->alloc(gfp_nowait, pool->pool_data);
+
+ element = pool->alloc(gfp_temp, pool->pool_data);
if (likely(element != NULL))
return element;

- /*
- * If the pool is less than 50% full and we can perform effective
- * page reclaim then try harder to allocate an element.
- */
- mb();
- if ((gfp_mask & __GFP_FS) && (gfp_mask != gfp_nowait) &&
- (pool->curr_nr <= pool->min_nr/2)) {
- element = pool->alloc(gfp_mask, pool->pool_data);
- if (likely(element != NULL))
- return element;
- }
-
- /*
- * Kick the VM at this point.
- */
- wakeup_bdflush(0);
-
spin_lock_irqsave(&pool->lock, flags);
if (likely(pool->curr_nr)) {
element = remove_element(pool);
@@ -240,6 +226,8 @@ repeat_alloc:
if (!(gfp_mask & __GFP_WAIT))
return NULL;

+ /* Now start performing page reclaim */
+ gfp_temp = gfp_mask;
prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
mb();
if (!pool->curr_nr)


Attachments:
mempool-simplify-alloc.patch (2.33 kB)

2005-04-12 13:12:23

by Nick Piggin

Subject: [patch 2/9] mempool gfp flag

Mempools have two problems.

The first is that mempool_alloc can get stuck in __alloc_pages when it
should instead fail and take an element from its reserved pool.

The second is that it will happily eat emergency PF_MEMALLOC reserves
instead of going to its reserved pool.

Fix the first by passing __GFP_NORETRY in the allocation calls in
mempool_alloc. Fix the second by introducing a __GFP_MEMPOOL flag
which directs the page allocator not to allocate from the reserve
pool.
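
On the page-allocator side the second fix boils down to one extra test: a
PF_MEMALLOC task may normally dip into the emergency reserve, but not for
an allocation marked as mempool-backed. A simplified sketch of that decision
(hypothetical helper, not the real __alloc_pages(); the TIF_MEMDIE case from
the hunk below is omitted):

#include <stdbool.h>

#define PF_MEMALLOC     0x00000800u     /* illustrative value */
#define __GFP_MEMPOOL   0x10000u

bool may_dip_into_reserve(unsigned int task_flags,
                          unsigned int gfp_mask, bool in_irq)
{
        if (!(task_flags & PF_MEMALLOC) || in_irq)
                return false;
        /* new: mempool-backed allocations should fail here and fall back
         * to their own reserved elements instead */
        if (gfp_mask & __GFP_MEMPOOL)
                return false;
        return true;
}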


Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h 2005-04-12 22:26:10.000000000 +1000
+++ linux-2.6/include/linux/gfp.h 2005-04-12 22:26:11.000000000 +1000
@@ -38,14 +38,16 @@ struct vm_area_struct;
#define __GFP_NO_GROW 0x2000u /* Slab internal usage */
#define __GFP_COMP 0x4000u /* Add compound page metadata */
#define __GFP_ZERO 0x8000u /* Return zeroed page on success */
+#define __GFP_MEMPOOL 0x10000u/* Mempool allocation */

-#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 17 /* Room for 17 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)

/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL| \
- __GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP|__GFP_ZERO)
+ __GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP|__GFP_ZERO| \
+ __GFP_MEMPOOL)

#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NOIO (__GFP_WAIT)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2005-04-12 22:05:44.000000000 +1000
+++ linux-2.6/mm/page_alloc.c 2005-04-12 22:26:11.000000000 +1000
@@ -799,14 +799,18 @@ __alloc_pages(unsigned int __nocast gfp_
}

/* This allocation should allow future memory freeing. */
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) && !in_interrupt()) {
- /* go through the zonelist yet again, ignoring mins */
- for (i = 0; (z = zones[i]) != NULL; i++) {
- if (!cpuset_zone_allowed(z))
- continue;
- page = buffered_rmqueue(z, order, gfp_mask);
- if (page)
- goto got_pg;
+
+ if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt()) {
+ if (!(gfp_mask & __GFP_MEMPOOL)) {
+ /* go through the zonelist yet again, ignoring mins */
+ for (i = 0; (z = zones[i]) != NULL; i++) {
+ if (!cpuset_zone_allowed(z))
+ continue;
+ page = buffered_rmqueue(z, order, gfp_mask);
+ if (page)
+ goto got_pg;
+ }
}
goto nopage;
}
Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c 2005-04-12 22:05:44.000000000 +1000
+++ linux-2.6/mm/mempool.c 2005-04-12 22:26:11.000000000 +1000
@@ -198,11 +198,16 @@ void * mempool_alloc(mempool_t *pool, un
void *element;
unsigned long flags;
DEFINE_WAIT(wait);
- int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+ int gfp_nowait;
+
+ gfp_mask |= __GFP_MEMPOOL;
+ gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
+ gfp_mask |= __GFP_NOWARN; /* failures are OK */
+ gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

might_sleep_if(gfp_mask & __GFP_WAIT);
repeat_alloc:
- element = pool->alloc(gfp_nowait|__GFP_NOWARN, pool->pool_data);
+ element = pool->alloc(gfp_nowait, pool->pool_data);
if (likely(element != NULL))
return element;


Attachments:
mempool-gfp-flag.patch (3.48 kB)

2005-04-12 13:19:38

by Nick Piggin

Subject: [patch 4/9] blk: no memory barrier

This memory barrier is not needed because the waitqueue will only get
waiters on it in the following two situations:

1. rl->count has exceeded the threshold - however, all manipulations of
->count are performed under the queue lock, and so we will correctly pick
up any waiter.

2. Memory allocation for the request fails. In this case, there is no
additional help provided by the memory barrier. We are guaranteed to
eventually wake up waiters because the request allocation mempool
guarantees that if the mem allocation for a request fails, there must be
some requests in flight. They will wake up waiters when they are retired.
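
Very loosely, this is the classic sleep/wakeup situation where both sides
serialise on the same lock. A userspace analogue (pthread names, purely
illustrative; it glosses over the waitqueue_active() test and the barriers
that prepare_to_wait() itself already provides):

#include <pthread.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  rq_wait    = PTHREAD_COND_INITIALIZER;
static int count, nr_requests = 128;

/* waker: a request completes and frees a slot */
void freed_request_model(void)
{
        pthread_mutex_lock(&queue_lock);
        count--;
        /* no explicit barrier needed: a waiter either already sees the
         * new count, or is registered and receives the signal */
        pthread_cond_signal(&rq_wait);
        pthread_mutex_unlock(&queue_lock);
}

/* waiter: blocks until a slot is available */
void get_request_slot_model(void)
{
        pthread_mutex_lock(&queue_lock);
        while (count >= nr_requests)
                pthread_cond_wait(&rq_wait, &queue_lock);
        count++;
        pthread_mutex_unlock(&queue_lock);
}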

Signed-off-by: Nick Piggin <[email protected]>


Index: linux-2.6/drivers/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/ll_rw_blk.c 2005-04-12 22:05:44.000000000 +1000
+++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-12 22:26:13.000000000 +1000
@@ -1828,7 +1828,6 @@ static void __freed_request(request_queu
clear_queue_congested(q, rw);

if (rl->count[rw] + 1 <= q->nr_requests) {
- smp_mb();
if (waitqueue_active(&rl->wait[rw]))
wake_up(&rl->wait[rw]);


Attachments:
blk-no-mb.patch (1.15 kB)

2005-04-12 13:31:24

by Nick Piggin

Subject: [patch 5/9] blk: branch hints

Sprinkle around a few branch hints in the block layer.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/drivers/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/ll_rw_blk.c 2005-04-12 22:26:13.000000000 +1000
+++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-12 22:26:13.000000000 +1000
@@ -1450,7 +1450,7 @@ EXPORT_SYMBOL(blk_remove_plug);
*/
void __generic_unplug_device(request_queue_t *q)
{
- if (test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags))
+ if (unlikely(test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags)))
return;

if (!blk_remove_plug(q))
@@ -1749,7 +1749,7 @@ EXPORT_SYMBOL(blk_init_queue);

int blk_get_queue(request_queue_t *q)
{
- if (!test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)) {
+ if (likely(!test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
atomic_inc(&q->refcnt);
return 0;
}
@@ -2577,7 +2577,7 @@ static int __make_request(request_queue_
spin_lock_prefetch(q->queue_lock);

barrier = bio_barrier(bio);
- if (barrier && (q->ordered == QUEUE_ORDERED_NONE)) {
+ if (unlikely(barrier) && (q->ordered == QUEUE_ORDERED_NONE)) {
err = -EOPNOTSUPP;
goto end_io;
}
@@ -2678,7 +2678,7 @@ get_rq:
/*
* REQ_BARRIER implies no merging, but lets make it explicit
*/
- if (barrier)
+ if (unlikely(barrier))
req->flags |= (REQ_HARDBARRIER | REQ_NOMERGE);

req->errors = 0;
@@ -2802,7 +2802,7 @@ static inline void block_wait_queue_runn
{
DEFINE_WAIT(wait);

- while (test_bit(QUEUE_FLAG_DRAIN, &q->queue_flags)) {
+ while (unlikely(test_bit(QUEUE_FLAG_DRAIN, &q->queue_flags))) {
struct request_list *rl = &q->rq;

prepare_to_wait_exclusive(&rl->drain, &wait,
@@ -2911,7 +2911,7 @@ end_io:
goto end_io;
}

- if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))
+ if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
goto end_io;

block_wait_queue_running(q);


Attachments:
blk-branch-hint.patch (1.88 kB)

2005-04-12 13:43:30

by Nick Piggin

Subject: [patch 0/9] blk: reduce locking

Change the locking around a bit, for a net saving of one to two spin
lock/unlock pairs in the request submission paths.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/drivers/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/ll_rw_blk.c 2005-04-12 22:26:14.000000000 +1000
+++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-12 22:26:15.000000000 +1000
@@ -1859,18 +1859,20 @@ static void freed_request(request_queue_

#define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
/*
- * Get a free request, queue_lock must not be held
+ * Get a free request, queue_lock must be held.
+ * Returns NULL on failure, with queue_lock held.
+ * Returns !NULL on success, with queue_lock *not held*.
*/
static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
{
+ int batching;
struct request *rq = NULL;
struct request_list *rl = &q->rq;
- struct io_context *ioc = get_io_context(gfp_mask);
+ struct io_context *ioc = get_io_context(GFP_ATOMIC);

if (unlikely(test_bit(QUEUE_FLAG_DRAIN, &q->queue_flags)))
goto out;

- spin_lock_irq(q->queue_lock);
if (rl->count[rw]+1 >= q->nr_requests) {
/*
* The queue will fill after this allocation, so set it as
@@ -1883,6 +1885,8 @@ static struct request *get_request(reque
blk_set_queue_full(q, rw);
}
}
+
+ batching = ioc_batching(q, ioc);

switch (elv_may_queue(q, rw)) {
case ELV_MQUEUE_NO:
@@ -1893,12 +1897,11 @@ static struct request *get_request(reque
goto get_rq;
}

- if (blk_queue_full(q, rw) && !ioc_batching(q, ioc)) {
+ if (blk_queue_full(q, rw) && !batching) {
/*
* The queue is full and the allocating process is not a
* "batcher", and not exempted by the IO scheduler
*/
- spin_unlock_irq(q->queue_lock);
goto out;
}

@@ -1932,11 +1935,10 @@ rq_starved:
if (unlikely(rl->count[rw] == 0))
rl->starved[rw] = 1;

- spin_unlock_irq(q->queue_lock);
goto out;
}

- if (ioc_batching(q, ioc))
+ if (batching)
ioc->nr_batch_requests--;

rq_init(q, rq);
@@ -1949,6 +1951,8 @@ out:
/*
* No available requests for this queue, unplug the device and wait for some
* requests to become available.
+ *
+ * Called with q->queue_lock held, and returns with it unlocked.
*/
static struct request *get_request_wait(request_queue_t *q, int rw)
{
@@ -1967,7 +1971,8 @@ static struct request *get_request_wait(
if (!rq) {
struct io_context *ioc;

- generic_unplug_device(q);
+ __generic_unplug_device(q);
+ spin_unlock_irq(q->queue_lock);
io_schedule();

/*
@@ -1979,6 +1984,8 @@ static struct request *get_request_wait(
ioc = get_io_context(GFP_NOIO);
ioc_set_batching(q, ioc);
put_io_context(ioc);
+
+ spin_lock_irq(q->queue_lock);
}
finish_wait(&rl->wait[rw], &wait);
}
@@ -1992,10 +1999,15 @@ struct request *blk_get_request(request_

BUG_ON(rw != READ && rw != WRITE);

+ spin_lock_irq(q->queue_lock);
if (gfp_mask & __GFP_WAIT)
rq = get_request_wait(q, rw);
- else
+ else {
rq = get_request(q, rw, gfp_mask);
+ if (!rq)
+ spin_unlock_irq(q->queue_lock);
+ }
+ /* q->queue_lock is unlocked at this point */

return rq;
}
@@ -2636,9 +2648,10 @@ static int __make_request(request_queue_
get_rq:
/*
* Grab a free request. This is might sleep but can not fail.
+ * Returns with the queue unlocked.
*/
- spin_unlock_irq(q->queue_lock);
req = get_request_wait(q, rw);
+
/*
* After dropping the lock and possibly sleeping here, our request
* may now be mergeable after it had proven unmergeable (above).


Attachments:
blk-reduce-locking.patch (3.55 kB)

2005-04-12 13:50:50

by Nick Piggin

Subject: [patch 3/9] no PF_MEMALLOC tinkering

PF_MEMALLOC is really not a tool for tinkering. It is pretty specifically
used to prevent recursion into page reclaim, and to prevent low memory
deadlocks.

The mm/swap_state.c code was the only legitimate tinkerer. Its concern
was addressed by the previous patch.

Signed-off-by: Nick Piggin <[email protected]>


Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c 2005-04-12 22:05:44.000000000 +1000
+++ linux-2.6/mm/swap_state.c 2005-04-12 22:26:12.000000000 +1000
@@ -143,7 +143,6 @@ void __delete_from_swap_cache(struct pag
int add_to_swap(struct page * page)
{
swp_entry_t entry;
- int pf_flags;
int err;

if (!PageLocked(page))
@@ -154,30 +153,11 @@ int add_to_swap(struct page * page)
if (!entry.val)
return 0;

- /* Radix-tree node allocations are performing
- * GFP_ATOMIC allocations under PF_MEMALLOC.
- * They can completely exhaust the page allocator.
- *
- * So PF_MEMALLOC is dropped here. This causes the slab
- * allocations to fail earlier, so radix-tree nodes will
- * then be allocated from the mempool reserves.
- *
- * We're still using __GFP_HIGH for radix-tree node
- * allocations, so some of the emergency pools are available,
- * just not all of them.
- */
-
- pf_flags = current->flags;
- current->flags &= ~PF_MEMALLOC;
-
/*
* Add it to the swap cache and mark it dirty
*/
err = __add_to_swap_cache(page, entry, GFP_ATOMIC|__GFP_NOWARN);

- if (pf_flags & PF_MEMALLOC)
- current->flags |= PF_MEMALLOC;
-
switch (err) {
case 0: /* Success */
SetPageUptodate(page);
Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c 2005-04-12 22:05:44.000000000 +1000
+++ linux-2.6/drivers/md/dm-crypt.c 2005-04-12 22:26:12.000000000 +1000
@@ -331,25 +331,14 @@ crypt_alloc_buffer(struct crypt_config *
struct bio *bio;
unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
int gfp_mask = GFP_NOIO | __GFP_HIGHMEM;
- unsigned long flags = current->flags;
unsigned int i;

- /*
- * Tell VM to act less aggressively and fail earlier.
- * This is not necessary but increases throughput.
- * FIXME: Is this really intelligent?
- */
- current->flags &= ~PF_MEMALLOC;
-
if (base_bio)
bio = bio_clone(base_bio, GFP_NOIO);
else
bio = bio_alloc(GFP_NOIO, nr_iovecs);
- if (!bio) {
- if (flags & PF_MEMALLOC)
- current->flags |= PF_MEMALLOC;
+ if (!bio)
return NULL;
- }

/* if the last bio was not complete, continue where that one ended */
bio->bi_idx = *bio_vec_idx;
@@ -386,9 +375,6 @@ crypt_alloc_buffer(struct crypt_config *
size -= bv->bv_len;
}

- if (flags & PF_MEMALLOC)
- current->flags |= PF_MEMALLOC;
-
if (!bio->bi_size) {
bio_put(bio);
return NULL;
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c 2005-04-12 22:05:44.000000000 +1000
+++ linux-2.6/fs/mpage.c 2005-04-12 22:26:12.000000000 +1000
@@ -105,11 +105,6 @@ mpage_alloc(struct block_device *bdev,

bio = bio_alloc(gfp_flags, nr_vecs);

- if (bio == NULL && (current->flags & PF_MEMALLOC)) {
- while (!bio && (nr_vecs /= 2))
- bio = bio_alloc(gfp_flags, nr_vecs);
- }
-
if (bio) {
bio->bi_bdev = bdev;
bio->bi_sector = first_sector;


Attachments:
no-PF_MEMALLOC-tinkering.patch (3.35 kB)

2005-04-12 20:54:01

by Andrew Morton

Subject: Re: [patch 6/9] blk: unplug later

Nick Piggin <[email protected]> wrote:
>
> get_request_wait needn't unplug the device immediately.

Probably. But what if the get_request(q, rw, GFP_NOIO); did
some sleeping?

2005-04-12 21:02:11

by Andrew Morton

Subject: Re: [patch 2/9] mempool gfp flag

Nick Piggin <[email protected]> wrote:
>
> The first is that mempool_alloc can possibly get stuck in __alloc_pages
> when they should opt to fail, and take an element from their reserved pool.
>
> The second is that it will happily eat emergency PF_MEMALLOC reserves
> instead of going to their reserved pools.
>
> Fix the first by passing __GFP_NORETRY in the allocation calls in
> mempool_alloc. Fix the second by introducing a __GFP_MEMPOOL flag
> which directs the page allocator not to allocate from the reserve
> pool.
>
>
> Index: linux-2.6/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/gfp.h 2005-04-12 22:26:10.000000000 +1000
> +++ linux-2.6/include/linux/gfp.h 2005-04-12 22:26:11.000000000 +1000
> @@ -38,14 +38,16 @@ struct vm_area_struct;
> #define __GFP_NO_GROW 0x2000u /* Slab internal usage */
> #define __GFP_COMP 0x4000u /* Add compound page metadata */
> #define __GFP_ZERO 0x8000u /* Return zeroed page on success */
> +#define __GFP_MEMPOOL 0x10000u/* Mempool allocation */

I think I'll rename this to "__GFP_NOMEMALLOC". Things other than mempool
might want to use this.

2005-04-13 01:05:26

by Nick Piggin

Subject: Re: [patch 1/9] GFP_ZERO fix

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>> #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
>> - __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
>> - __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
>> + __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL| \
>> + __GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP|__GFP_ZERO)
>
>
> Passing GFP_ZERO into kmem_cache_alloc() is such a bizarre thing to do,
> perhaps a BUG is the correct response.
>

I just thought it was a bit cheeky given the comment right above it ;)


--
SUSE Labs, Novell Inc.

2005-04-13 01:06:16

by Nick Piggin

Subject: Re: [patch 2/9] mempool gfp flag

Andrew Morton wrote:

>> Index: linux-2.6/include/linux/gfp.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/gfp.h 2005-04-12 22:26:10.000000000 +1000
>> +++ linux-2.6/include/linux/gfp.h 2005-04-12 22:26:11.000000000 +1000
>> @@ -38,14 +38,16 @@ struct vm_area_struct;
>> #define __GFP_NO_GROW 0x2000u /* Slab internal usage */
>> #define __GFP_COMP 0x4000u /* Add compound page metadata */
>> #define __GFP_ZERO 0x8000u /* Return zeroed page on success */
>> +#define __GFP_MEMPOOL 0x10000u/* Mempool allocation */
>
>
> I think I'll rename this to "__GFP_NOMEMALLOC". Things other than mempool
> might want to use this.
>

Sure, I wasn't set on GFP_MEMPOOL.


--
SUSE Labs, Novell Inc.

2005-04-13 01:14:51

by Nick Piggin

Subject: Re: [patch 3/9] no PF_MEMALLOC tinkering

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>PF_MEMALLOC is really not a tool for tinkering. It is pretty specifically
>> used to prevent recursion into page reclaim, and to prevent low memory
>> deadlocks.
>>
>> The mm/swap_state.c code was the only legitimate tinkerer. Its concern
>> was addressed by the previous patch.
>
>
> What previous patch? radix tree allocation doesn't use mempools, so this
> patch will cause add_to_swap() to oom the machine with radix tree node
> allocations.
>

Sorry, I just assumed they did from that comment.

> Now if we were to add __GFP_NOMEMALLOC in add_to_swap() then things would
> work as we want them to.
>

That would be good.

>
> The dm_crypt change looks OK.
>
>
> The code in mpage.c is saying "if we failed to allocate a correctly-sized
> bvec and if we're doing pageout then try to allocate a smaller-sized bvec
> instead".
>
> It's probably fairly useless, but afaict there's nothing in any of the
> other patches here which makes it redundant.
>

Well, I didn't like it because it uses mempools anyway, so it
might as well just wait for its allocation.

My motivation is to remove all external users of PF_MEMALLOC,
really.

But it looks like in this case, the code is dead anyway, because
mempool_alloc will never return NULL if it is passed __GFP_WAIT.

--
SUSE Labs, Novell Inc.

2005-04-13 01:37:38

by Nick Piggin

Subject: Re: [patch 6/9] blk: unplug later

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>get_request_wait needn't unplug the device immediately.
>
>
> Probably. But what if the get_request(q, rw, GFP_NOIO); did
> some sleeping?
>

It can't sleep unless it returns the request, because it
is using mempool allocs. So any time it returns NULL, it
hasn't slept.

But Jens would have a better idea of the correct behaviour.
Jens, what do you think?

--
SUSE Labs, Novell Inc.

2005-04-13 01:30:16

by Andrew Morton

Subject: Re: [patch 3/9] no PF_MEMALLOC tinkering

Nick Piggin <[email protected]> wrote:
>
> PF_MEMALLOC is really not a tool for tinkering. It is pretty specifically
> used to prevent recursion into page reclaim, and to prevent low memory
> deadlocks.
>
> The mm/swap_state.c code was the only legitimate tinkerer. Its concern
> was addressed by the previous patch.

What previous patch? radix tree allocation doesn't use mempools, so this
patch will cause add_to_swap() to oom the machine with radix tree node
allocations.

Now if we were to add __GFP_NOMEMALLOC in add_to_swap() then things would
work as we want them to.


The dm_crypt change looks OK.


The code in mpage.c is saying "if we failed to allocate a correctly-sized
bvec and if we're doing pageout then try to allocate a smaller-sized bvec
instead".

It's probably fairly useless, but afaict there's nothing in any of the
other patches here which makes it redundant.


2005-04-13 02:28:49

by Andrew Morton

Subject: Re: [patch 1/9] GFP_ZERO fix

Nick Piggin <[email protected]> wrote:
>
> #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
> - __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
> - __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
> + __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL| \
> + __GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP|__GFP_ZERO)

Passing GFP_ZERO into kmem_cache_alloc() is such a bizarre thing to do,
perhaps a BUG is the correct response.

I guess it could be argued that the kmem_cache_alloc() callers "knows" that
the ctor will be zeroing out all the objects, but it would seem cleaner to
me to pass the "you should use GFP_ZERO" hint into kmem_cache_create()
rather than kmem_cache_alloc().

2005-04-13 10:21:08

by Jens Axboe

Subject: Re: [patch 6/9] blk: unplug later

On Wed, Apr 13 2005, Nick Piggin wrote:
> Andrew Morton wrote:
> >Nick Piggin <[email protected]> wrote:
> >
> >>get_request_wait needn't unplug the device immediately.
> >
> >
> >Probably. But what if the get_request(q, rw, GFP_NOIO); did
> >some sleeping?
> >
>
> It can't sleep unless it returns the request, because it
> is using mempool allocs. So any time it returns NULL, it
> hasn't slept.
>
> But Jens would have a better idea of the correct behaviour.
> Jens, what do you think?

I think the patch makes sense. Additionally, it looks safer to unplug
inside the loop as well - not just as an optimization for the first pass,
but because further iterations of the loop may need to trigger an unplug
of the queue.

--
Jens Axboe

2005-04-14 11:54:16

by Manfred Spraul

Subject: Re: [patch 1/9] GFP_ZERO fix

Andrew Morton wrote:

>Nick Piggin <[email protected]> wrote:
>
>
>> #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
>> - __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
>> - __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
>> + __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL| \
>> + __GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP|__GFP_ZERO)
>>
>>
>
>Passing GFP_ZERO into kmem_cache_alloc() is such a bizarre thing to do,
>perhaps a BUG is the correct response.
>
>I guess it could be argued that the kmem_cache_alloc() callers "knows" that
>the ctor will be zeroing out all the objects, but it would seem cleaner to
>me to pass the "you should use GFP_ZERO" hint into kmem_cache_create()
>rather than kmem_cache_alloc().
>
>
Right now, slab is not really suitable for GFP_ZERO:
- if debug is enabled, then objects are definitely not 0-initialized.
- if a ctor is used for zero initialization, then objects would have to
be zeroed before kmem_cache_free: the ctor is only called at object
creation, not before object reuse. But a memset(,0,) just before free
would be a bit silly.

Probably a BUG_ON or WARN_ON should be added into kmem_flagcheck() and
into kmem_cache_create().

Should I write a patch?
--
Manfred
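
A rough sketch of what such a check might look like (not a posted patch; it
assumes kmem_flagcheck() keeps its role as the per-allocation flag sanity
check in mm/slab.c, and its existing body is elided):

static void kmem_flagcheck(kmem_cache_t *cachep, int flags)
{
        /* ... existing checks unchanged ... */

        /* slab recycles objects without re-zeroing them, so a caller
         * relying on __GFP_ZERO here is almost certainly buggy */
        WARN_ON(flags & __GFP_ZERO);
}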