2002-01-04 08:44:18

by Jens Axboe

Subject: [PATCH][RFT] simple deadline I/O scheduler

Hi,

I've played around with implementing an I/O scheduler that _tries_ to
start requests within a given time limit. Note that it makes no
guarantees of any sort, it's simply a "how does this work in real life"
sort of thing. Its main use is actually to properly extend the i/o
scheduler / elevator api to be able to implement more advanced
schedulers (eg cello).

The construction of this new i/o scheduler is similar to how cello is
built -- you have several "low level" schedulers and a class independent
one on top of those that decides which one to start processing.
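
As a purely illustrative sketch of that layering (none of these names
exist in the patch below; io_class, the fixed priority order, and the
helpers are all invented), the class independent step boils down to
something like:

/*
 * cello-style two-level dispatch: a class independent scheduler
 * decides which low level scheduler gets to issue the next request.
 * needs <linux/list.h>; struct request is the normal block request.
 */
struct io_class {
	struct list_head queue;		/* pending requests of this class */
	struct request *(*next_request)(struct io_class *);
};

static struct request *class_independent_next(struct io_class *classes, int nr)
{
	int i;

	/* simplest possible policy: serve classes in fixed priority order */
	for (i = 0; i < nr; i++)
		if (!list_empty(&classes[i].queue))
			return classes[i].next_request(&classes[i]);

	return NULL;
}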

Each request is sorted into two lists -- one is purely sector sorted,
the other is sorted [1] by expire time. We always serve requests from the
sector sorted list, until one of the front requests on the expire list
has its deadline violated. Then we start following the sorted list from
the point of the deadline-violated request. This is always done in
batches of eg 8 or 16 requests, to avoid seeking like mad if we can't
keep up with the deadlines. It has three configurable parameters:

READ expire time
WRITE expire time
Request batch
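
To make the batching concrete, here is a minimal sketch of the selection
step. The helpers front_of_fifo() and next_in_sorted_list(), the deadline
field, and the slimmed-down deadline_data are stand-ins for the real
elv_jens_* structures and list walking in the patch below:

struct deadline_data {
	int batch;		/* configured batch size */
	int batch_left;		/* requests left in the current batch */
};

static struct request *deadline_pick_next(struct deadline_data *dd)
{
	struct request *rq;

	/* mid-batch: keep following the sector sorted list */
	if (dd->batch_left) {
		dd->batch_left--;
		return next_in_sorted_list(dd);
	}

	/* front of each fifo is the oldest request of that type;
	 * an expired read takes preference over an expired write */
	rq = front_of_fifo(dd, READ);
	if (!rq || !time_after(jiffies, rq->deadline))
		rq = front_of_fifo(dd, WRITE);

	if (rq && time_after(jiffies, rq->deadline)) {
		/* deadline violated: restart a batch from this request's
		 * position in the sector sorted list */
		dd->batch_left = dd->batch - 1;
		return rq;
	}

	/* no deadline pressure, stay purely sector sorted */
	return next_in_sorted_list(dd);
}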

The meaning should be clear from the above explanation, I hope :-). The
only test I've done so far (besides verifying that it doesn't destroy
data) is to start a dd bomb a la

dd if=/dev/zero of=somefile bs=64k

and see if I can get some work done while that is running. Running find
or opening files in an editor actually works to some degree with this
i/o scheduler, at least better than with the current stock elevator.

So give it a whirl if you want -- i/o scheduler settings are set during
kernel config, where you also have the opportunity to select which i/o
scheduler should be the kernel default. I didn't bother with configure
help entries yet, since I literally just tossed in the configure stuff
at the last moment... A few clues -- elevator_linus is the one we
currently have in the kernel, elevator_noop is really not recommended to
be user settable (mainly a driver thing), and elevator_jens is the one
I've just described here.

Against 2.5.2-pre7

[1] not really sorted. Currently it does not support priorities (once it
does, we'll probably want to change the expire sort to something faster
than a dumb list sort), so it's just two lists (read and write) in fifo
order.

diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/Config.in linux/drivers/block/Config.in
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/Config.in Fri Sep 14 17:04:06 2001
+++ linux/drivers/block/Config.in Fri Jan 4 03:10:42 2002
@@ -4,6 +4,25 @@
mainmenu_option next_comment
comment 'Block devices'

+mainmenu_option next_comment
+comment 'I/O schedulers'
+
+comment 'Linus I/O scheduler'
+int ' READ passover seed (kb)' CONFIG_BLK_IOSCHED_LINUS_RL 4096
+int ' WRITE passover seed (kb)' CONFIG_BLK_IOSCHED_LINUS_WL 8192
+
+comment 'Jens I/O scheduler'
+int ' READ expire time (seconds)' CONFIG_BLK_IOSCHED_JENS_RE 1
+int ' WRITE expire time (seconds)' CONFIG_BLK_IOSCHED_JENS_WE 5
+int ' Request batch' CONFIG_BLK_IOSCHED_JENS_RB 16
+
+choice 'default scheduler' \
+ "Linus-iosched CONFIG_BLK_IOSCHED_DEF_L \
+ Jens-iosched CONFIG_BLK_IOSCHED_DEF_J \
+ Noop-iosched CONFIG_BLK_IOSCHED_DEF_N" Jens-iosched
+
+endmenu
+
tristate 'Normal PC floppy disk support' CONFIG_BLK_DEV_FD
if [ "$CONFIG_AMIGA" = "y" ]; then
tristate 'Amiga floppy support' CONFIG_AMIGA_FLOPPY
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/elevator.c linux/drivers/block/elevator.c
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/elevator.c Fri Jan 4 02:31:26 2002
+++ linux/drivers/block/elevator.c Fri Jan 4 03:24:10 2002
@@ -58,7 +58,9 @@

next_rq = list_entry(next, struct request, queuelist);

+#if 0
BUG_ON(next_rq->flags & REQ_STARTED);
+#endif

/*
* not a sector based request
@@ -147,9 +149,10 @@
*/
if (q->last_merge) {
struct request *__rq = list_entry_rq(q->last_merge);
- BUG_ON(__rq->flags & REQ_STARTED);

- if ((ret = elv_try_merge(__rq, bio)))
+ if (!rq_mergeable(__rq))
+ q->last_merge = NULL;
+ else if ((ret = elv_try_merge(__rq, bio)))
*req = __rq;
}

@@ -255,9 +258,10 @@
if (!latency)
return -ENOMEM;

- latency[READ] = 8192;
- latency[WRITE] = 16384;
+ latency[READ] = CONFIG_BLK_IOSCHED_LINUS_RL << 1;
+ latency[WRITE] = CONFIG_BLK_IOSCHED_LINUS_WL << 1;

+ printk("elv_linus: r/w sequence %d/%d\n", latency[READ], latency[WRITE]);
e->elevator_data = latency;
return 0;
}
@@ -324,6 +328,256 @@
}

/*
+ * multi-list I/O scheduler
+ */
+static kmem_cache_t *elv_jens_entries;
+extern int queue_nr_requests;
+
+int elevator_jens_init(request_queue_t *q, elevator_t *e)
+{
+ struct elv_jens_data *edat;
+ struct elv_jens_entry *ee;
+ int creat = 0, i, re, we, rb;
+
+ if (!elv_jens_entries) {
+ elv_jens_entries = kmem_cache_create("elv_jens", sizeof(struct elv_jens_entry), 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ if (!elv_jens_entries)
+ return -ENOMEM;
+ creat = 1;
+ }
+
+ edat = kmalloc(sizeof(struct elv_jens_data), GFP_KERNEL);
+ if (!edat) {
+ if (creat)
+ kmem_cache_destroy(elv_jens_entries);
+ return -ENOMEM;
+ }
+
+ memset(edat, 0, sizeof(*edat));
+
+ INIT_LIST_HEAD(&edat->fifo_r);
+ INIT_LIST_HEAD(&edat->fifo_w);
+ INIT_LIST_HEAD(&edat->free_entry);
+
+ for (i = 0; i < queue_nr_requests; i++) {
+ ee = kmem_cache_alloc(elv_jens_entries, SLAB_KERNEL);
+ memset(ee, 0, sizeof(*ee));
+ list_add(&ee->fifo_list, &edat->free_entry);
+ }
+
+ /*
+ * configured values
+ */
+ re = CONFIG_BLK_IOSCHED_JENS_RE;
+ we = CONFIG_BLK_IOSCHED_JENS_WE;
+ rb = CONFIG_BLK_IOSCHED_JENS_RB;
+
+ edat->expires[READ] = re * HZ;
+ edat->expires[WRITE] = we * HZ;
+ edat->batch = rb;
+
+ e->elevator_data = edat;
+ return 0;
+}
+
+#define list_entry_fifo(entry) \
+ list_entry((entry), struct elv_jens_entry, fifo_list)
+
+void elevator_jens_exit(request_queue_t *q, elevator_t *e)
+{
+ struct elv_jens_data *edat = e->elevator_data;
+ struct list_head *head = &edat->free_entry;
+ struct elv_jens_entry *ee;
+
+ BUG_ON(!list_empty(&edat->fifo_r));
+ BUG_ON(!list_empty(&edat->fifo_w));
+
+ while (!list_empty(head)) {
+ ee = list_entry(head->next, struct elv_jens_entry, fifo_list);
+ list_del(&ee->fifo_list);
+ kmem_cache_free(elv_jens_entries, ee);
+ }
+
+ kfree(edat);
+ e->elevator_data = NULL;
+}
+
+int elevator_jens_merge(request_queue_t *q, struct request **req,
+ struct bio *bio)
+{
+ struct list_head *entry = &q->queue_head;
+ struct request *__rq;
+ int ret;
+
+ if ((ret = elv_try_last_merge(q, req, bio)))
+ return ret;
+
+ while ((entry = entry->prev) != &q->queue_head) {
+ __rq = list_entry_rq(entry);
+
+ if (__rq->flags & REQ_BARRIER)
+ break;
+
+ /*
+ * we can have started requests in the middle of the queue,
+ * not a problem
+ */
+ if (__rq->flags & REQ_STARTED)
+ continue;
+
+ if (!(__rq->flags & REQ_CMD))
+ continue;
+
+ /*
+ * for multi-list scheduler...
+ */
+ if (!*req && bio_rq_in_between(bio, __rq, &q->queue_head))
+ *req = __rq;
+
+ if ((ret = elv_try_merge(__rq, bio))) {
+ *req = __rq;
+ q->last_merge = &__rq->queuelist;
+ return ret;
+ }
+ }
+
+ return ELEVATOR_NO_MERGE;
+}
+
+void elevator_jens_add_request(request_queue_t *q, struct request *rq,
+ struct list_head *insert_here)
+{
+ elevator_t *e = &q->elevator;
+ struct elv_jens_data *edat = e->elevator_data;
+ struct elv_jens_entry *ee;
+ int rw = rq_data_dir(rq);
+
+ /*
+ * insert into sector sorted list
+ */
+ list_add(&rq->queuelist, insert_here);
+
+ if (unlikely(!(rq->flags & REQ_CMD)))
+ return;
+
+ /*
+ * insert into fifo list and assign deadline
+ */
+ BUG_ON(list_empty(&edat->free_entry));
+ ee = list_entry_fifo(edat->free_entry.next);
+ list_del(&ee->fifo_list);
+ ee->rq = rq;
+ ee->expire_time = jiffies + edat->expires[rw];
+ rq->elevator_private = ee;
+
+ if (rw == READ)
+ list_add_tail(&ee->fifo_list, &edat->fifo_r);
+ else
+ list_add_tail(&ee->fifo_list, &edat->fifo_w);
+}
+
+void elevator_jens_remove_request(request_queue_t *q, struct request *rq)
+{
+ struct elv_jens_entry *ee = rq->elevator_private;
+ struct elv_jens_data *edat = q->elevator.elevator_data;
+ struct list_head *next = rq->queuelist.next;
+
+ if (next == &q->queue_head)
+ next = NULL;
+
+ /*
+ * if we are currently following a link of the fifo list, just
+ * adjust next_entry. if not, set to NULL and elv_next_request
+ * will reposition it
+ */
+ if (edat->fifo_count) {
+ edat->next_entry = next;
+ if (!--edat->fifo_count || !next)
+ edat->restart = 1;
+ } else
+ edat->next_entry = NULL;
+
+ if (ee) {
+ rq->elevator_private = NULL;
+ list_del(&ee->fifo_list);
+ list_add(&ee->fifo_list, &edat->free_entry);
+ ee->rq = NULL;
+ }
+}
+
+inline struct list_head *elevator_jens_find_request(request_queue_t *q,
+ struct elv_jens_data *edat)
+{
+ struct list_head *entry = q->queue_head.next;
+ struct elv_jens_entry *ee = NULL, *tmp;
+
+ /*
+ * check if fifo read queue has expired entry
+ */
+ if (!list_empty(&edat->fifo_r)) {
+ tmp = list_entry_fifo(edat->fifo_r.next);
+ if (time_after(jiffies, tmp->expire_time))
+ ee = tmp;
+ }
+
+ /*
+ * check if fifo write queue has expired entry, and if so should it
+ * be served before possible read entry
+ */
+ if (!list_empty(&edat->fifo_w)) {
+ tmp = list_entry_fifo(edat->fifo_w.next);
+ if (time_after(jiffies, tmp->expire_time)) {
+ /*
+ * if they are equal, give preference to read
+ */
+ if (ee && time_after(ee->expire_time, tmp->expire_time))
+ ee = tmp;
+ }
+ }
+
+ if (ee) {
+ edat->restart = 1;
+ entry = &ee->rq->queuelist;
+ }
+
+ return entry;
+}
+
+struct request *elevator_jens_next_request(request_queue_t *q)
+{
+ elevator_t *e = &q->elevator;
+ struct elv_jens_data *edat = e->elevator_data;
+ struct list_head *next;
+
+ if (edat->next_entry)
+ return list_entry_rq(edat->next_entry);
+
+ if (unlikely(list_empty(&q->queue_head)))
+ return NULL;
+
+ next = elevator_jens_find_request(q, edat);
+
+ if (edat->restart) {
+ edat->restart = 0;
+ edat->fifo_count = edat->batch;
+ }
+
+ edat->next_entry = next;
+ return list_entry_rq(edat->next_entry);
+}
+
+void elevator_jens_merge_req(struct request *req, struct request *next)
+{
+ struct elv_jens_entry *ereq = req->elevator_private;
+ struct elv_jens_entry *enext = next->elevator_private;
+
+ if (ereq && enext) {
+ if (time_before(enext->expire_time, ereq->expire_time))
+ ereq->expire_time = enext->expire_time;
+ }
+}
+
+/*
* general block -> elevator interface starts here
*/
int elevator_init(request_queue_t *q, elevator_t *e, elevator_t type)
@@ -416,10 +670,21 @@
elevator_add_req_fn: elevator_noop_add_request,
};

+elevator_t elevator_jens = {
+ elevator_merge_fn: elevator_jens_merge,
+ elevator_merge_req_fn: elevator_jens_merge_req,
+ elevator_next_req_fn: elevator_jens_next_request,
+ elevator_add_req_fn: elevator_jens_add_request,
+ elevator_remove_req_fn: elevator_jens_remove_request,
+ elevator_init_fn: elevator_jens_init,
+ elevator_exit_fn: elevator_jens_exit,
+};
+
module_init(elevator_global_init);

EXPORT_SYMBOL(elevator_linus);
EXPORT_SYMBOL(elevator_noop);
+EXPORT_SYMBOL(elevator_jens);

EXPORT_SYMBOL(__elv_add_request);
EXPORT_SYMBOL(__elv_next_request);
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/ll_rw_blk.c Fri Jan 4 02:31:26 2002
+++ linux/drivers/block/ll_rw_blk.c Thu Jan 3 13:42:14 2002
@@ -798,12 +798,23 @@
**/
int blk_init_queue(request_queue_t *q, request_fn_proc *rfn, spinlock_t *lock)
{
+ elevator_t e;
int ret;

if (blk_init_free_list(q))
return -ENOMEM;

- if ((ret = elevator_init(q, &q->elevator, elevator_linus))) {
+#if defined(CONFIG_BLK_IOSCHED_DEF_L)
+ e = elevator_linus;
+#elif defined(CONFIG_BLK_IOSCHED_DEF_J)
+ e = elevator_jens;
+#elif defined(CONFIG_BLK_IOSCHED_DEF_N)
+ e = elevator_noop;
+#else
+#error No I/O scheduler defined
+#endif
+
+ if ((ret = elevator_init(q, &q->elevator, e))) {
blk_cleanup_queue(q);
return ret;
}
@@ -818,7 +829,6 @@
q->plug_tq.data = q;
q->queue_flags = (1 << QUEUE_FLAG_CLUSTER);
q->queue_lock = lock;
- q->last_merge = NULL;

blk_queue_segment_boundary(q, 0xffffffff);

@@ -966,8 +976,10 @@
/*
* it's a bug to let this rq preempt an already started request
*/
+#if 0
if (insert_here->next != &q->queue_head)
BUG_ON(list_entry_rq(insert_here->next)->flags & REQ_STARTED);
+#endif

/*
* elevator indicated where it wants this request to be
@@ -1701,7 +1713,7 @@
*/
queue_nr_requests = 64;
if (total_ram > MB(32))
- queue_nr_requests = 256;
+ queue_nr_requests = 512;

/*
* Batch frees according to queue length
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/include/linux/elevator.h linux/include/linux/elevator.h
--- /opt/kernel/linux-2.5.2-pre7/include/linux/elevator.h Fri Jan 4 02:31:31 2002
+++ linux/include/linux/elevator.h Fri Jan 4 03:11:13 2002
@@ -60,6 +60,49 @@
#define elv_linus_sequence(rq) ((long)(rq)->elevator_private)

/*
+ * multi-list I/O scheduler
+ */
+extern elevator_t elevator_jens;
+
+struct elv_jens_entry {
+ struct request *rq;
+ unsigned long expire_time;
+ struct list_head fifo_list;
+};
+
+struct elv_jens_data {
+ /*
+ * "next" request to be queued
+ */
+ struct list_head *next_entry;
+
+ /*
+ * currently restarted from fifo head due to starvation
+ */
+ int fifo_count;
+
+ /*
+ * time based fifo queue, for read and write. could be one list
+ * or a better data structure, since it should be just sorted
+ * by expire time. later :-)
+ */
+ struct list_head fifo_r;
+ struct list_head fifo_w;
+
+ /*
+ * available entries
+ */
+ struct list_head free_entry;
+
+ /*
+ * I/O scheduler settings
+ */
+ unsigned long expires[2];
+ int batch;
+ int restart : 1;
+};
+
+/*
* use the /proc/iosched interface, all the below is history ->
*/
typedef struct blkelv_ioctl_arg_s {

--
Jens Axboe


2002-01-04 09:21:41

by Jens Axboe

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

On Fri, Jan 04 2002, Jens Axboe wrote:
> Against 2.5.2-pre7

There was a stupid bug in the last version (last minute change...) which
caused a non-expired write to preempt an expired read, not optimal :-).
Here's a fixed version.

diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/Config.in linux/drivers/block/Config.in
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/Config.in Fri Sep 14 17:04:06 2001
+++ linux/drivers/block/Config.in Fri Jan 4 03:10:42 2002
@@ -4,6 +4,25 @@
mainmenu_option next_comment
comment 'Block devices'

+mainmenu_option next_comment
+comment 'I/O schedulers'
+
+comment 'Linus I/O scheduler'
+int ' READ passover seed (kb)' CONFIG_BLK_IOSCHED_LINUS_RL 4096
+int ' WRITE passover seed (kb)' CONFIG_BLK_IOSCHED_LINUS_WL 8192
+
+comment 'Jens I/O scheduler'
+int ' READ expire time (seconds)' CONFIG_BLK_IOSCHED_JENS_RE 1
+int ' WRITE expire time (seconds)' CONFIG_BLK_IOSCHED_JENS_WE 5
+int ' Request batch' CONFIG_BLK_IOSCHED_JENS_RB 16
+
+choice 'default scheduler' \
+ "Linus-iosched CONFIG_BLK_IOSCHED_DEF_L \
+ Jens-iosched CONFIG_BLK_IOSCHED_DEF_J \
+ Noop-iosched CONFIG_BLK_IOSCHED_DEF_N" Jens-iosched
+
+endmenu
+
tristate 'Normal PC floppy disk support' CONFIG_BLK_DEV_FD
if [ "$CONFIG_AMIGA" = "y" ]; then
tristate 'Amiga floppy support' CONFIG_AMIGA_FLOPPY
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/elevator.c linux/drivers/block/elevator.c
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/elevator.c Fri Jan 4 02:31:26 2002
+++ linux/drivers/block/elevator.c Fri Jan 4 04:07:59 2002
@@ -58,8 +58,6 @@

next_rq = list_entry(next, struct request, queuelist);

- BUG_ON(next_rq->flags & REQ_STARTED);
-
/*
* not a sector based request
*/
@@ -147,9 +145,10 @@
*/
if (q->last_merge) {
struct request *__rq = list_entry_rq(q->last_merge);
- BUG_ON(__rq->flags & REQ_STARTED);

- if ((ret = elv_try_merge(__rq, bio)))
+ if (!rq_mergeable(__rq))
+ q->last_merge = NULL;
+ else if ((ret = elv_try_merge(__rq, bio)))
*req = __rq;
}

@@ -231,6 +230,12 @@
elevator_t *e = &q->elevator;
int lat = 0, *latency = e->elevator_data;

+ /*
+ * it's a bug to let this rq preempt an already started request
+ */
+ if (insert_here->next != &q->queue_head)
+ BUG_ON(list_entry_rq(insert_here->next)->flags & REQ_STARTED);
+
if (!(rq->flags & REQ_BARRIER))
lat = latency[rq_data_dir(rq)];

@@ -255,9 +260,10 @@
if (!latency)
return -ENOMEM;

- latency[READ] = 8192;
- latency[WRITE] = 16384;
+ latency[READ] = CONFIG_BLK_IOSCHED_LINUS_RL << 1;
+ latency[WRITE] = CONFIG_BLK_IOSCHED_LINUS_WL << 1;

+ printk("elv_linus: r/w sequence %d/%d\n", latency[READ], latency[WRITE]);
e->elevator_data = latency;
return 0;
}
@@ -324,6 +330,254 @@
}

/*
+ * multi-list I/O scheduler
+ */
+static kmem_cache_t *elv_jens_entries;
+extern int queue_nr_requests;
+
+int elevator_jens_init(request_queue_t *q, elevator_t *e)
+{
+ struct elv_jens_data *edat;
+ struct elv_jens_entry *ee;
+ int creat = 0, i, re, we, rb;
+
+ if (!elv_jens_entries) {
+ elv_jens_entries = kmem_cache_create("elv_jens", sizeof(struct elv_jens_entry), 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ if (!elv_jens_entries)
+ return -ENOMEM;
+ creat = 1;
+ }
+
+ edat = kmalloc(sizeof(struct elv_jens_data), GFP_KERNEL);
+ if (!edat) {
+ if (creat)
+ kmem_cache_destroy(elv_jens_entries);
+ return -ENOMEM;
+ }
+
+ memset(edat, 0, sizeof(*edat));
+
+ INIT_LIST_HEAD(&edat->fifo_r);
+ INIT_LIST_HEAD(&edat->fifo_w);
+ INIT_LIST_HEAD(&edat->free_entry);
+
+ for (i = 0; i < queue_nr_requests; i++) {
+ ee = kmem_cache_alloc(elv_jens_entries, SLAB_KERNEL);
+ memset(ee, 0, sizeof(*ee));
+ list_add(&ee->fifo_list, &edat->free_entry);
+ }
+
+ /*
+ * configured values
+ */
+ re = CONFIG_BLK_IOSCHED_JENS_RE;
+ we = CONFIG_BLK_IOSCHED_JENS_WE;
+ rb = CONFIG_BLK_IOSCHED_JENS_RB;
+
+ edat->expires[READ] = re * HZ;
+ edat->expires[WRITE] = we * HZ;
+ edat->batch = rb;
+
+ e->elevator_data = edat;
+ return 0;
+}
+
+#define list_entry_fifo(entry) \
+ list_entry((entry), struct elv_jens_entry, fifo_list)
+
+void elevator_jens_exit(request_queue_t *q, elevator_t *e)
+{
+ struct elv_jens_data *edat = e->elevator_data;
+ struct list_head *head = &edat->free_entry;
+ struct elv_jens_entry *ee;
+
+ BUG_ON(!list_empty(&edat->fifo_r));
+ BUG_ON(!list_empty(&edat->fifo_w));
+
+ while (!list_empty(head)) {
+ ee = list_entry(head->next, struct elv_jens_entry, fifo_list);
+ list_del(&ee->fifo_list);
+ kmem_cache_free(elv_jens_entries, ee);
+ }
+
+ kfree(edat);
+ e->elevator_data = NULL;
+}
+
+int elevator_jens_merge(request_queue_t *q, struct request **req,
+ struct bio *bio)
+{
+ struct list_head *entry = &q->queue_head;
+ struct request *__rq;
+ int ret;
+
+ if ((ret = elv_try_last_merge(q, req, bio)))
+ return ret;
+
+ while ((entry = entry->prev) != &q->queue_head) {
+ __rq = list_entry_rq(entry);
+
+ if (__rq->flags & REQ_BARRIER)
+ break;
+
+ /*
+ * we can have started requests in the middle of the queue,
+ * not a problem
+ */
+ if (__rq->flags & REQ_STARTED)
+ continue;
+
+ if (!(__rq->flags & REQ_CMD))
+ continue;
+
+ if (!*req && bio_rq_in_between(bio, __rq, &q->queue_head))
+ *req = __rq;
+
+ if ((ret = elv_try_merge(__rq, bio))) {
+ *req = __rq;
+ q->last_merge = &__rq->queuelist;
+ return ret;
+ }
+ }
+
+ return ELEVATOR_NO_MERGE;
+}
+
+void elevator_jens_add_request(request_queue_t *q, struct request *rq,
+ struct list_head *insert_here)
+{
+ elevator_t *e = &q->elevator;
+ struct elv_jens_data *edat = e->elevator_data;
+ struct elv_jens_entry *ee;
+
+ /*
+ * insert into sector sorted list
+ */
+ list_add(&rq->queuelist, insert_here);
+
+ if (unlikely(!(rq->flags & REQ_CMD)))
+ return;
+
+ /*
+ * insert into fifo list and assign deadline
+ */
+ BUG_ON(list_empty(&edat->free_entry));
+ ee = list_entry_fifo(edat->free_entry.next);
+ list_del(&ee->fifo_list);
+ ee->rq = rq;
+ rq->elevator_private = ee;
+
+ if (rq_data_dir(rq) == READ) {
+ ee->expire_time = jiffies + edat->expires[READ];
+ list_add_tail(&ee->fifo_list, &edat->fifo_r);
+ } else {
+ ee->expire_time = jiffies + edat->expires[WRITE];
+ list_add_tail(&ee->fifo_list, &edat->fifo_w);
+ }
+}
+
+void elevator_jens_remove_request(request_queue_t *q, struct request *rq)
+{
+ struct elv_jens_entry *ee = rq->elevator_private;
+ struct elv_jens_data *edat = q->elevator.elevator_data;
+ struct list_head *next = rq->queuelist.next;
+
+ if (next == &q->queue_head)
+ next = NULL;
+
+ /*
+ * if we are currently following a link of the fifo list, just
+ * adjust next_entry. if not, set to NULL and elv_next_request
+ * will reposition it
+ */
+ if (edat->fifo_count) {
+ edat->next_entry = next;
+ if (!--edat->fifo_count || !next)
+ edat->restart = 1;
+ } else
+ edat->next_entry = NULL;
+
+ if (ee) {
+ rq->elevator_private = NULL;
+ list_del(&ee->fifo_list);
+ list_add(&ee->fifo_list, &edat->free_entry);
+ ee->rq = NULL;
+ }
+}
+
+inline struct list_head *elevator_jens_find_request(request_queue_t *q,
+ struct elv_jens_data *edat)
+{
+ struct list_head *entry = q->queue_head.next;
+ struct elv_jens_entry *ee = NULL, *tmp;
+
+ /*
+ * check if fifo read queue has expired entry
+ */
+ if (!list_empty(&edat->fifo_r)) {
+ tmp = list_entry_fifo(edat->fifo_r.next);
+ if (time_after(jiffies, tmp->expire_time))
+ ee = tmp;
+ }
+
+ /*
+ * check if fifo write queue has expired entry, and if so should it
+ * be served before possible read entry
+ */
+ if (!list_empty(&edat->fifo_w)) {
+ tmp = list_entry_fifo(edat->fifo_w.next);
+ if (time_after(jiffies, tmp->expire_time)) {
+ /*
+ * if they are equal, give preference to read
+ */
+ if (ee && time_after(tmp->expire_time, ee->expire_time))
+ ee = tmp;
+ }
+ }
+
+ if (ee) {
+ edat->restart = 1;
+ entry = &ee->rq->queuelist;
+ }
+
+ return entry;
+}
+
+struct request *elevator_jens_next_request(request_queue_t *q)
+{
+ elevator_t *e = &q->elevator;
+ struct elv_jens_data *edat = e->elevator_data;
+ struct list_head *next;
+
+ if (edat->next_entry)
+ return list_entry_rq(edat->next_entry);
+
+ if (unlikely(list_empty(&q->queue_head)))
+ return NULL;
+
+ next = elevator_jens_find_request(q, edat);
+
+ if (edat->restart) {
+ edat->restart = 0;
+ edat->fifo_count = edat->batch;
+ }
+
+ edat->next_entry = next;
+ return list_entry_rq(edat->next_entry);
+}
+
+void elevator_jens_merge_req(struct request *req, struct request *next)
+{
+ struct elv_jens_entry *ereq = req->elevator_private;
+ struct elv_jens_entry *enext = next->elevator_private;
+
+ if (ereq && enext) {
+ if (time_before(enext->expire_time, ereq->expire_time))
+ ereq->expire_time = enext->expire_time;
+ }
+}
+
+/*
* general block -> elevator interface starts here
*/
int elevator_init(request_queue_t *q, elevator_t *e, elevator_t type)
@@ -416,10 +670,21 @@
elevator_add_req_fn: elevator_noop_add_request,
};

+elevator_t elevator_jens = {
+ elevator_merge_fn: elevator_jens_merge,
+ elevator_merge_req_fn: elevator_jens_merge_req,
+ elevator_next_req_fn: elevator_jens_next_request,
+ elevator_add_req_fn: elevator_jens_add_request,
+ elevator_remove_req_fn: elevator_jens_remove_request,
+ elevator_init_fn: elevator_jens_init,
+ elevator_exit_fn: elevator_jens_exit,
+};
+
module_init(elevator_global_init);

EXPORT_SYMBOL(elevator_linus);
EXPORT_SYMBOL(elevator_noop);
+EXPORT_SYMBOL(elevator_jens);

EXPORT_SYMBOL(__elv_add_request);
EXPORT_SYMBOL(__elv_next_request);
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/ll_rw_blk.c Fri Jan 4 02:31:26 2002
+++ linux/drivers/block/ll_rw_blk.c Fri Jan 4 03:49:28 2002
@@ -798,12 +798,23 @@
**/
int blk_init_queue(request_queue_t *q, request_fn_proc *rfn, spinlock_t *lock)
{
+ elevator_t e;
int ret;

if (blk_init_free_list(q))
return -ENOMEM;

- if ((ret = elevator_init(q, &q->elevator, elevator_linus))) {
+#if defined(CONFIG_BLK_IOSCHED_DEF_L)
+ e = elevator_linus;
+#elif defined(CONFIG_BLK_IOSCHED_DEF_J)
+ e = elevator_jens;
+#elif defined(CONFIG_BLK_IOSCHED_DEF_N)
+ e = elevator_noop;
+#else
+#error No I/O scheduler defined
+#endif
+
+ if ((ret = elevator_init(q, &q->elevator, e))) {
blk_cleanup_queue(q);
return ret;
}
@@ -818,7 +829,6 @@
q->plug_tq.data = q;
q->queue_flags = (1 << QUEUE_FLAG_CLUSTER);
q->queue_lock = lock;
- q->last_merge = NULL;

blk_queue_segment_boundary(q, 0xffffffff);

@@ -964,12 +974,6 @@
drive_stat_acct(req, req->nr_sectors, 1);

/*
- * it's a bug to let this rq preempt an already started request
- */
- if (insert_here->next != &q->queue_head)
- BUG_ON(list_entry_rq(insert_here->next)->flags & REQ_STARTED);
-
- /*
* elevator indicated where it wants this request to be
* inserted at elevator_merge time
*/
@@ -1701,7 +1705,7 @@
*/
queue_nr_requests = 64;
if (total_ram > MB(32))
- queue_nr_requests = 256;
+ queue_nr_requests = 512;

/*
* Batch frees according to queue length
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/include/linux/elevator.h linux/include/linux/elevator.h
--- /opt/kernel/linux-2.5.2-pre7/include/linux/elevator.h Fri Jan 4 02:31:31 2002
+++ linux/include/linux/elevator.h Fri Jan 4 03:11:13 2002
@@ -60,6 +60,49 @@
#define elv_linus_sequence(rq) ((long)(rq)->elevator_private)

/*
+ * multi-list I/O scheduler
+ */
+extern elevator_t elevator_jens;
+
+struct elv_jens_entry {
+ struct request *rq;
+ unsigned long expire_time;
+ struct list_head fifo_list;
+};
+
+struct elv_jens_data {
+ /*
+ * "next" request to be queued
+ */
+ struct list_head *next_entry;
+
+ /*
+ * currently restarted from fifo head due to starvation
+ */
+ int fifo_count;
+
+ /*
+ * time based fifo queue, for read and write. could be one list
+ * or a better data structure, since it should be just sorted
+ * by expire time. later :-)
+ */
+ struct list_head fifo_r;
+ struct list_head fifo_w;
+
+ /*
+ * available entries
+ */
+ struct list_head free_entry;
+
+ /*
+ * I/O scheduler settings
+ */
+ unsigned long expires[2];
+ int batch;
+ int restart : 1;
+};
+
+/*
* use the /proc/iosched interface, all the below is history ->
*/
typedef struct blkelv_ioctl_arg_s {

--
Jens Axboe

2002-01-04 10:52:07

by Jens Axboe

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

On Fri, Jan 04 2002, Jens Axboe wrote:
> On Fri, Jan 04 2002, Jens Axboe wrote:
> > Against 2.5.2-pre7
>
> There was a stupid bug in the last version (last minute change...) which
> caused a non-expired write to preempt an expired read, not optimal :-).
> Here's a fixed version.

Last version for today, honest :-). Changes:

- 'grab expired read over write' logic changed a bit
- changed expire time configuration to be in milliseconds (see the
conversion note below)
- removed annoying banner prints
- added better comments
- remember to reposition in the fifo queue when merging two requests
and the expire time changes
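
A note on the millisecond conversion: turning a millisecond value into
jiffies has to scale by HZ before dividing, otherwise anything below a
second truncates to zero. As a standalone helper (the name is mine, not
the patch's) that is:

/* convert milliseconds to jiffies; multiply by HZ before dividing
 * by 1000 so sub-second values don't truncate to 0. assumes
 * ms * HZ fits in an unsigned long. */
static inline unsigned long ms_to_jiffies(unsigned long ms)
{
	return ms * HZ / 1000;
}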

diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/Config.in linux/drivers/block/Config.in
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/Config.in Fri Sep 14 17:04:06 2001
+++ linux/drivers/block/Config.in Fri Jan 4 04:41:28 2002
@@ -4,6 +4,25 @@
mainmenu_option next_comment
comment 'Block devices'

+mainmenu_option next_comment
+comment 'I/O schedulers'
+
+comment 'Linus I/O scheduler'
+int ' READ passover seed (kb)' CONFIG_BLK_IOSCHED_LINUS_RL 4096
+int ' WRITE passover seed (kb)' CONFIG_BLK_IOSCHED_LINUS_WL 8192
+
+comment 'Jens I/O scheduler'
+int ' READ expire time (ms)' CONFIG_BLK_IOSCHED_JENS_RE 1000
+int ' WRITE expire time (ms)' CONFIG_BLK_IOSCHED_JENS_WE 5000
+int ' Request batch' CONFIG_BLK_IOSCHED_JENS_RB 16
+
+choice 'default scheduler' \
+ "Linus-iosched CONFIG_BLK_IOSCHED_DEF_L \
+ Jens-iosched CONFIG_BLK_IOSCHED_DEF_J \
+ Noop-iosched CONFIG_BLK_IOSCHED_DEF_N" Jens-iosched
+
+endmenu
+
tristate 'Normal PC floppy disk support' CONFIG_BLK_DEV_FD
if [ "$CONFIG_AMIGA" = "y" ]; then
tristate 'Amiga floppy support' CONFIG_AMIGA_FLOPPY
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/elevator.c linux/drivers/block/elevator.c
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/elevator.c Fri Jan 4 02:31:26 2002
+++ linux/drivers/block/elevator.c Fri Jan 4 05:45:58 2002
@@ -58,8 +58,6 @@

next_rq = list_entry(next, struct request, queuelist);

- BUG_ON(next_rq->flags & REQ_STARTED);
-
/*
* not a sector based request
*/
@@ -147,9 +145,10 @@
*/
if (q->last_merge) {
struct request *__rq = list_entry_rq(q->last_merge);
- BUG_ON(__rq->flags & REQ_STARTED);

- if ((ret = elv_try_merge(__rq, bio)))
+ if (!rq_mergeable(__rq))
+ q->last_merge = NULL;
+ else if ((ret = elv_try_merge(__rq, bio)))
*req = __rq;
}

@@ -231,6 +230,12 @@
elevator_t *e = &q->elevator;
int lat = 0, *latency = e->elevator_data;

+ /*
+ * it's a bug to let this rq preempt an already started request
+ */
+ if (insert_here->next != &q->queue_head)
+ BUG_ON(list_entry_rq(insert_here->next)->flags & REQ_STARTED);
+
if (!(rq->flags & REQ_BARRIER))
lat = latency[rq_data_dir(rq)];

@@ -255,8 +260,8 @@
if (!latency)
return -ENOMEM;

- latency[READ] = 8192;
- latency[WRITE] = 16384;
+ latency[READ] = CONFIG_BLK_IOSCHED_LINUS_RL << 1;
+ latency[WRITE] = CONFIG_BLK_IOSCHED_LINUS_WL << 1;

e->elevator_data = latency;
return 0;
@@ -324,6 +329,268 @@
}

/*
+ * multi-list I/O scheduler
+ */
+static kmem_cache_t *elv_jens_entries;
+extern int queue_nr_requests;
+
+int elevator_jens_init(request_queue_t *q, elevator_t *e)
+{
+ struct elv_jens_data *edat;
+ struct elv_jens_entry *ee;
+ int creat = 0, i, re, we, rb;
+
+ if (!elv_jens_entries) {
+ elv_jens_entries = kmem_cache_create("elv_jens", sizeof(struct elv_jens_entry), 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ if (!elv_jens_entries)
+ return -ENOMEM;
+ creat = 1;
+ }
+
+ edat = kmalloc(sizeof(struct elv_jens_data), GFP_KERNEL);
+ if (!edat) {
+ if (creat)
+ kmem_cache_destroy(elv_jens_entries);
+ return -ENOMEM;
+ }
+
+ memset(edat, 0, sizeof(*edat));
+
+ INIT_LIST_HEAD(&edat->fifo_r);
+ INIT_LIST_HEAD(&edat->fifo_w);
+ INIT_LIST_HEAD(&edat->free_entry);
+
+ for (i = 0; i < queue_nr_requests; i++) {
+ ee = kmem_cache_alloc(elv_jens_entries, SLAB_KERNEL);
+ memset(ee, 0, sizeof(*ee));
+ list_add(&ee->fifo_list, &edat->free_entry);
+ }
+
+ /*
+ * configured values
+ */
+ re = CONFIG_BLK_IOSCHED_JENS_RE;
+ we = CONFIG_BLK_IOSCHED_JENS_WE;
+ rb = CONFIG_BLK_IOSCHED_JENS_RB;
+
+ edat->expires[READ] = re * HZ / 1000;
+ edat->expires[WRITE] = we * HZ / 1000;
+ edat->batch = rb;
+
+ e->elevator_data = edat;
+ return 0;
+}
+
+#define list_entry_fifo(entry) \
+ list_entry((entry), struct elv_jens_entry, fifo_list)
+
+void elevator_jens_exit(request_queue_t *q, elevator_t *e)
+{
+ struct elv_jens_data *edat = e->elevator_data;
+ struct list_head *head = &edat->free_entry;
+ struct elv_jens_entry *ee;
+
+ BUG_ON(!list_empty(&edat->fifo_r));
+ BUG_ON(!list_empty(&edat->fifo_w));
+
+ while (!list_empty(head)) {
+ ee = list_entry(head->next, struct elv_jens_entry, fifo_list);
+ list_del(&ee->fifo_list);
+ kmem_cache_free(elv_jens_entries, ee);
+ }
+
+ kfree(edat);
+ e->elevator_data = NULL;
+}
+
+int elevator_jens_merge(request_queue_t *q, struct request **req,
+ struct bio *bio)
+{
+ struct list_head *entry = &q->queue_head;
+ struct request *__rq;
+ int ret;
+
+ if ((ret = elv_try_last_merge(q, req, bio)))
+ return ret;
+
+ while ((entry = entry->prev) != &q->queue_head) {
+ __rq = list_entry_rq(entry);
+
+ if (__rq->flags & REQ_BARRIER)
+ break;
+
+ /*
+ * we can have started requests in the middle of the queue,
+ * not a problem
+ */
+ if (__rq->flags & REQ_STARTED)
+ continue;
+
+ if (!(__rq->flags & REQ_CMD))
+ continue;
+
+ if (!*req && bio_rq_in_between(bio, __rq, &q->queue_head))
+ *req = __rq;
+
+ if ((ret = elv_try_merge(__rq, bio))) {
+ *req = __rq;
+ q->last_merge = &__rq->queuelist;
+ return ret;
+ }
+ }
+
+ return ELEVATOR_NO_MERGE;
+}
+
+void elevator_jens_add_request(request_queue_t *q, struct request *rq,
+ struct list_head *insert_here)
+{
+ elevator_t *e = &q->elevator;
+ struct elv_jens_data *edat = e->elevator_data;
+ struct elv_jens_entry *ee;
+
+ /*
+ * insert into sector sorted list
+ */
+ list_add(&rq->queuelist, insert_here);
+
+ if (unlikely(!(rq->flags & REQ_CMD)))
+ return;
+
+ /*
+ * insert into fifo list and assign deadline
+ */
+ BUG_ON(list_empty(&edat->free_entry));
+ ee = list_entry_fifo(edat->free_entry.next);
+ list_del(&ee->fifo_list);
+ ee->rq = rq;
+ rq->elevator_private = ee;
+
+ if (rq_data_dir(rq) == READ) {
+ ee->expire_time = jiffies + edat->expires[READ];
+ list_add_tail(&ee->fifo_list, &edat->fifo_r);
+ } else {
+ ee->expire_time = jiffies + edat->expires[WRITE];
+ list_add_tail(&ee->fifo_list, &edat->fifo_w);
+ }
+}
+
+void elevator_jens_remove_request(request_queue_t *q, struct request *rq)
+{
+ struct elv_jens_entry *ee = rq->elevator_private;
+ struct elv_jens_data *edat = q->elevator.elevator_data;
+ struct list_head *next = rq->queuelist.next;
+
+ if (next == &q->queue_head)
+ next = NULL;
+
+ /*
+ * if we are currently following a link off the fifo list, just
+ * adjust next_entry. if not, set to NULL and elv_next_request
+ * will reposition it
+ */
+ if (edat->fifo_count) {
+ edat->next_entry = next;
+ if (!--edat->fifo_count || !next)
+ edat->restart = 1;
+ } else
+ edat->next_entry = NULL;
+
+ if (ee) {
+ rq->elevator_private = NULL;
+ list_del(&ee->fifo_list);
+ list_add(&ee->fifo_list, &edat->free_entry);
+ ee->rq = NULL;
+ }
+}
+
+inline struct list_head *elevator_jens_find_request(request_queue_t *q,
+ struct elv_jens_data *edat)
+{
+ struct list_head *entry = q->queue_head.next;
+ struct elv_jens_entry *ee = NULL, *tmp;
+
+ /*
+ * check if fifo write queue has expired entry
+ */
+ if (!list_empty(&edat->fifo_w)) {
+ tmp = list_entry_fifo(edat->fifo_w.next);
+ if (time_after(jiffies, tmp->expire_time))
+ ee = tmp;
+ }
+
+
+ /*
+ * always let expired read preempt expired write
+ */
+ if (!list_empty(&edat->fifo_r)) {
+ tmp = list_entry_fifo(edat->fifo_r.next);
+ if (time_after(jiffies, tmp->expire_time))
+ ee = tmp;
+ }
+
+ if (ee) {
+ edat->restart = 1;
+ entry = &ee->rq->queuelist;
+ }
+
+ return entry;
+}
+
+struct request *elevator_jens_next_request(request_queue_t *q)
+{
+ elevator_t *e = &q->elevator;
+ struct elv_jens_data *edat = e->elevator_data;
+ struct list_head *next;
+
+ /*
+ * we are currently batching, so just follow next_entry
+ */
+ if (edat->next_entry)
+ return list_entry_rq(edat->next_entry);
+
+ /*
+ * if queue_head is empty, none of the fifo queues must contain
+ * entries
+ */
+ if (unlikely(list_empty(&q->queue_head)))
+ return NULL;
+
+ /*
+ * check to see where to start serving from
+ */
+ next = elevator_jens_find_request(q, edat);
+
+ /*
+ * start of new batch
+ */
+ if (edat->restart) {
+ edat->restart = 0;
+ edat->fifo_count = edat->batch;
+ }
+
+ edat->next_entry = next;
+ return list_entry_rq(edat->next_entry);
+}
+
+void elevator_jens_merge_req(struct request *req, struct request *next)
+{
+ struct elv_jens_entry *ereq = req->elevator_private;
+ struct elv_jens_entry *enext = next->elevator_private;
+
+ if (ereq && enext) {
+ /*
+ * reposition req if expire time changes
+ */
+ if (time_before(enext->expire_time, ereq->expire_time)) {
+ list_del(&ereq->fifo_list);
+ list_add(&ereq->fifo_list, &enext->fifo_list);
+ ereq->expire_time = enext->expire_time;
+ }
+ }
+}
+
+/*
* general block -> elevator interface starts here
*/
int elevator_init(request_queue_t *q, elevator_t *e, elevator_t type)
@@ -416,10 +683,21 @@
elevator_add_req_fn: elevator_noop_add_request,
};

+elevator_t elevator_jens = {
+ elevator_merge_fn: elevator_jens_merge,
+ elevator_merge_req_fn: elevator_jens_merge_req,
+ elevator_next_req_fn: elevator_jens_next_request,
+ elevator_add_req_fn: elevator_jens_add_request,
+ elevator_remove_req_fn: elevator_jens_remove_request,
+ elevator_init_fn: elevator_jens_init,
+ elevator_exit_fn: elevator_jens_exit,
+};
+
module_init(elevator_global_init);

EXPORT_SYMBOL(elevator_linus);
EXPORT_SYMBOL(elevator_noop);
+EXPORT_SYMBOL(elevator_jens);

EXPORT_SYMBOL(__elv_add_request);
EXPORT_SYMBOL(__elv_next_request);
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- /opt/kernel/linux-2.5.2-pre7/drivers/block/ll_rw_blk.c Fri Jan 4 02:31:26 2002
+++ linux/drivers/block/ll_rw_blk.c Fri Jan 4 03:49:28 2002
@@ -798,12 +798,23 @@
**/
int blk_init_queue(request_queue_t *q, request_fn_proc *rfn, spinlock_t *lock)
{
+ elevator_t e;
int ret;

if (blk_init_free_list(q))
return -ENOMEM;

- if ((ret = elevator_init(q, &q->elevator, elevator_linus))) {
+#if defined(CONFIG_BLK_IOSCHED_DEF_L)
+ e = elevator_linus;
+#elif defined(CONFIG_BLK_IOSCHED_DEF_J)
+ e = elevator_jens;
+#elif defined(CONFIG_BLK_IOSCHED_DEF_N)
+ e = elevator_noop;
+#else
+#error No I/O scheduler defined
+#endif
+
+ if ((ret = elevator_init(q, &q->elevator, e))) {
blk_cleanup_queue(q);
return ret;
}
@@ -818,7 +829,6 @@
q->plug_tq.data = q;
q->queue_flags = (1 << QUEUE_FLAG_CLUSTER);
q->queue_lock = lock;
- q->last_merge = NULL;

blk_queue_segment_boundary(q, 0xffffffff);

@@ -964,12 +974,6 @@
drive_stat_acct(req, req->nr_sectors, 1);

/*
- * it's a bug to let this rq preempt an already started request
- */
- if (insert_here->next != &q->queue_head)
- BUG_ON(list_entry_rq(insert_here->next)->flags & REQ_STARTED);
-
- /*
* elevator indicated where it wants this request to be
* inserted at elevator_merge time
*/
@@ -1701,7 +1705,7 @@
*/
queue_nr_requests = 64;
if (total_ram > MB(32))
- queue_nr_requests = 256;
+ queue_nr_requests = 512;

/*
* Batch frees according to queue length
diff -ur -X exclude /opt/kernel/linux-2.5.2-pre7/include/linux/elevator.h linux/include/linux/elevator.h
--- /opt/kernel/linux-2.5.2-pre7/include/linux/elevator.h Fri Jan 4 02:31:31 2002
+++ linux/include/linux/elevator.h Fri Jan 4 03:11:13 2002
@@ -60,6 +60,49 @@
#define elv_linus_sequence(rq) ((long)(rq)->elevator_private)

/*
+ * multi-list I/O scheduler
+ */
+extern elevator_t elevator_jens;
+
+struct elv_jens_entry {
+ struct request *rq;
+ unsigned long expire_time;
+ struct list_head fifo_list;
+};
+
+struct elv_jens_data {
+ /*
+ * "next" request to be queued
+ */
+ struct list_head *next_entry;
+
+ /*
+ * currently restarted from fifo head due to starvation
+ */
+ int fifo_count;
+
+ /*
+ * time based fifo queue, for read and write. could be one list
+ * or a better data structure, since it should be just sorted
+ * by expire time. later :-)
+ */
+ struct list_head fifo_r;
+ struct list_head fifo_w;
+
+ /*
+ * available entries
+ */
+ struct list_head free_entry;
+
+ /*
+ * I/O scheduler settings
+ */
+ unsigned long expires[2];
+ int batch;
+ int restart : 1;
+};
+
+/*
* use the /proc/iosched interface, all the below is history ->
*/
typedef struct blkelv_ioctl_arg_s {

--
Jens Axboe

2002-01-07 19:17:30

by Pavel Machek

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

Hi!

> I've played around with implementing an I/O scheduler that _tries_ to
> start requests within a given time limit. Note that it makes no
> guarantees of any sort, it's simply a "how does this work in real life"
> sort of thing. Its main use is actually to properly extend the i/o
> scheduler / elevator api to be able to implement more advanced
> schedulers (eg cello).

Would it be possible to introduce the concept of I/O priority? I.e., I want
updatedb not to load the disk when I need it for something else?

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2002-01-07 19:31:39

by Mark Hahn

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

> Would it be possible to introduce the concept of I/O priority? I.e., I want
> updatedb not to load the disk when I need it for something else?

makes sense to me. actually, VM is another place where priority
could be quite useful - for instance, how hard the VM scavenges
a proc's pages. oops, there I go advocating a tunable...

VM_SWAP_ME_HARDER anyone?

2002-01-07 19:58:19

by Rik van Riel

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

On Mon, 7 Jan 2002, Mark Hahn wrote:

> > Would it be possible to introduce the concept of I/O priority? I.e., I want
> > updatedb not to load the disk when I need it for something else?
>
> makes sense to me. actually, VM is another place where priority
> could be quite useful - for instance, how hard the VM scavenges
> a proc's pages. oops, there I go advocating a tunable...
>
> VM_SWAP_ME_HARDER anyone?

This seems to work very badly: making one process swap more
means it pagefaults more and sucks up more IO bandwidth ;)

regards,

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-08 06:44:31

by Jens Axboe

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

On Sat, Jan 05 2002, Pavel Machek wrote:
> Hi!
>
> > I've played around with implementing an I/O scheduler that _tries_ to
> > start requests within a given time limit. Note that it makes no
> > guarantees of any sort, it's simply a "how does this work in real life"
> > sort of thing. Its main use is actually to properly extend the i/o
> > scheduler / elevator api to be able to implement more advanced
> > schedulers (eg cello).
>
> Would it be possible to introduce the concept of I/O priority? I.e., I want
> updatedb not to load the disk when I need it for something else?

I've been toying with equal i/o distribution between the processes in
the system, but it isn't done yet. I know Arjan is working on a priority
scheduler, too. So something is bound to materialize sooner or later :)

--
Jens Axboe

2002-03-13 13:37:52

by Daniel Phillips

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

On January 4, 2002 09:43 am, Jens Axboe wrote:
> I've played around with implementing an I/O scheduler that _tries_ to
> start requests within a given time limit. Note that it makes no
> guarantees of any sort, it's simply a "how does this work in real life"
> sort of thing. Its main use is actually to properly extend the i/o
> scheduler / elevator api to be able to implement more advanced
> schedulers (eg cello).
>
> The construction of this new i/o scheduler is similar to how cello is
> built -- you have several "low level" schedulers and a class independent
> one on top of those that decides which one to start processing.
>
> Each request is sorted into two lists -- one is purely sector sorted,
> the other is sorted [1] by expire time. We always serve requests from the
> sector sorted list, until one of the front requests on the expire list
> has its deadline violated. Then we start following the sorted list from
> the point of the deadline-violated request. This is always done in
> batches of eg 8 or 16 requests, to avoid seeking like mad if we can't
> keep up with the deadlines.

This post never seemed to get the attention it deserves. Do you have
performance measurements now?

As part of my experimental hack to get rid of buffer_heads I was casting
around for a structure to replace the dirty buffer list. I find myself
heading towards the conclusion I want a structure that's remarkably similar
to what you've cooked up here, but that lives at a higher level in the
system. The idea is that a page goes into the queue as soon as it's dirtied
and the elevator takes care of scheduling and merging from there. Admittedly
these ideas are half-formed at the moment, but what I see developing is a
situation where we have attempts at IO scheduling going on at two levels in
the system, the VM and bio, and the more I think about it the less sense it
makes. It's trying to be one subsystem.
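
Purely as a hypothetical sketch of that idea (every name here is
invented; nothing like it exists in any patch in this thread): a page
would get deadline-stamped the moment it is dirtied and handed straight
to the elevator, so one queue owns both flushing order and seek order:

/* hypothetical: stamp a page with a writeback deadline at dirty time */
struct dirty_entry {
	struct page *page;
	unsigned long expire_time;	/* latest time it should hit disk */
	struct list_head fifo_list;
};

static void queue_dirty_page(struct page *page, struct writeback_queue *wq)
{
	/* alloc_dirty_entry() and struct writeback_queue are assumed */
	struct dirty_entry *de = alloc_dirty_entry(wq);

	de->page = page;
	de->expire_time = jiffies + wq->dirty_expire;
	/* tail add keeps the fifo in deadline order, as in the i/o
	 * scheduler above; the elevator merges and sorts from here */
	list_add_tail(&de->fifo_list, &wq->fifo);
}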

Andrew Morton is also working in here, with a collection of ideas that I hope
are complementary if looked at the right way. See his '[patch] delayed disk
block allocation' post.

--
Daniel

2002-03-14 07:32:55

by Jens Axboe

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

On Wed, Mar 13 2002, Daniel Phillips wrote:
> On January 4, 2002 09:43 am, Jens Axboe wrote:
> > I've played around with implementing an I/O scheduler that _tries_ to
> > start requests within a given time limit. Note that it makes no
> > guarantees of any sort, it's simply a "how does this work in real life"
> > sort of thing. Its main use is actually to properly extend the i/o
> > scheduler / elevator api to be able to implement more advanced
> > schedulers (eg cello).
> >
> > The construction of this new i/o scheduler is similar to how cello is
> > built -- you have several "low level" schedulers and a class independent
> > one on top of those that decides which one to start processing.
> >
> > Each request is sorted into two lists -- one is purely sector sorted,
> > the other is sorted [1] by expire time. We always serve requests from the
> > sector sorted list, until one of the front requests on the expire list
> > has its deadline violated. Then we start following the sorted list from
> > the point of the deadline-violated request. This is always done in
> > batches of eg 8 or 16 requests, to avoid seeking like mad if we can't
> > keep up with the deadlines.
>
> This post never seemed to get the attention it deserves. Do you have
> performance measurements now?

I have some old ones that are not too interesting. Someone else
expressed an interest in the deadline scheduler a few days back, so I
resurrected it for 2.5.7-pre1 and started testing. Right now I'm just
writing a few tools to properly stress it and get the performance
measurements that I want. So I'll be back with something soonish, I
hope.

> As part of my experimental hack to get rid of buffer_heads I was casting
> around for a structure to replace the dirty buffer list. I find myself
> heading towards the conclusion I want a structure that's remarkably similar
> to what you've cooked up here, but that lives at a higher level in the
> system. The idea is that a page goes into the queue as soon as it's dirtied
> and the elevator takes care of scheduling and merging from there. Admittedly
> these ideas are half-formed at the moment, but what I see developing is a
> situation where we have attempts at IO scheduling going on at two levels in
> the system, the VM and bio, and the more I think about it the less sense it
> makes. It's trying to be one subsystem.

Very close to some ideas I have as well. My first plan was to do 'pull
flushing' instead of the current (often horrible) pushes from bdflush
etc. Having the vm give up the dirty pages as soon as possible would
make it easier to do merging at the vm level as well.

> Andrew Morton is also working in here, with a collection of ideas that I hope
> are complementary if looked at the right way. See his '[patch] delayed disk
> block allocation' post.

Right, I noticed. We could get good write clustering in general with vm
merging, though.

--
Jens Axboe

2002-03-14 15:56:25

by Daniel Phillips

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

On March 14, 2002 08:32 am, Jens Axboe wrote:
> On Wed, Mar 13 2002, Daniel Phillips wrote:
> > Andrew Morton is also working in here, with a collection of ideas that I hope
> > are complementary if looked at the right way. See his '[patch] delayed disk
> > block allocation' post.
>
> Right, I noticed. We could get good write clustering in general with vm
> merging, though.

Sorry, I can't parse that, do you mean 'if the elevator merging is merged with
vm's' or 'if vm does merging' or... ?

--
Daniel

2002-03-15 10:59:54

by Jens Axboe

Subject: Re: [PATCH][RFT] simple deadline I/O scheduler

On Thu, Mar 14 2002, Daniel Phillips wrote:
> On March 14, 2002 08:32 am, Jens Axboe wrote:
> > On Wed, Mar 13 2002, Daniel Phillips wrote:
> > > Andrew Morton is also working in here, with a collection of ideas that I hope
> > > are complementary if looked at the right way. See his '[patch] delayed disk
> > > block allocation' post.
> >
> > Right, I noticed. We could get good write clustering in general with vm
> > merging, though.
>
> Sorry, I can't parse that, do you mean 'if the elevator merging is merged with
> vm's' or 'if vm does merging' or... ?

Heh, guess it wasn't too clear. It was supposed to say 'if vm does the
merging' or at least partially does it.

--
Jens Axboe