2009-04-06 12:48:42

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 0/8][RFC] IO latency/throughput fixes

Hi,

This is a set of patches that I worked on today in the hopes
of furthering the latency goals and at least fixing some of
the write regression with fwrite + fsync that current -git
is suffering from.

I haven't done any latency tests yet, I'm just tossing this
out there so we can collaborate on improving things. What I
did test was the silly fwrite() + fsync() loop test, which
is a LOT slower in current -git that it used to be. The test
is basically:

while (nr--) {
f = fopen();
fprintf(f, "Some data here\n");
fsync(fileno(f));
fclose(f);
}

which (for nr == 2000) takes 16 seconds in -git, completes
in 0.9s with the patches.

--
Jens Axboe


2009-04-06 12:48:29

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 2/8] Add WRITE_SYNC_PLUG and SWRITE_SYNC_PLUG

(S)WRITE_SYNC always unplugs the device right after IO submission.
Sometimes we want to build up a queue before doing so, so add
variants that explicitly DON'T unplug the queue. The caller must
then do that after submitting all the IO.

Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/fs.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index a09e17c..ea05109 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -96,7 +96,10 @@ struct inodes_stat_t {
#define READ_SYNC (READ | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define READ_META (READ | (1 << BIO_RW_META))
#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
+#define WRITE_SYNC_PLUG (WRITE | (1 << BIO_RW_SYNCIO))
#define SWRITE_SYNC (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
+#define SWRITE_SYNC_PLUG \
+ (SWRITE | (1 << BIO_RW_SYNCIO))
#define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))
#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
#define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))
--
1.6.2.1.423.g442d

2009-04-06 12:49:17

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 4/8] jbd: use WRITE_SYNC_PLUG instead of WRITE_SYNC

When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe <[email protected]>
---
fs/jbd/commit.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index f8077b9..a8e8513 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -351,8 +351,13 @@ void journal_commit_transaction(journal_t *journal)
spin_lock(&journal->j_state_lock);
commit_transaction->t_state = T_LOCKED;

+ /*
+ * Use plugged writes here, since we want to submit several before
+ * we unplug the device. We don't do explicit unplugging in here,
+ * instead we rely on sync_buffer() doing the unplug for us.
+ */
if (commit_transaction->t_synchronous_commit)
- write_op = WRITE_SYNC;
+ write_op = WRITE_SYNC_PLUG;
spin_lock(&commit_transaction->t_handle_lock);
while (commit_transaction->t_updates) {
DEFINE_WAIT(wait);
--
1.6.2.1.423.g442d

2009-04-06 12:49:49

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 5/8] jbd2: use WRITE_SYNC_PLUG instead of WRITE_SYNC

When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe <[email protected]>
---
fs/jbd2/commit.c | 13 +++++++++----
1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 4ea7237..073c8c3 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -138,7 +138,7 @@ static int journal_submit_commit_record(journal_t *journal,
set_buffer_ordered(bh);
barrier_done = 1;
}
- ret = submit_bh(WRITE_SYNC, bh);
+ ret = submit_bh(WRITE_SYNC_PLUG, bh);
if (barrier_done)
clear_buffer_ordered(bh);

@@ -159,7 +159,7 @@ static int journal_submit_commit_record(journal_t *journal,
lock_buffer(bh);
set_buffer_uptodate(bh);
clear_buffer_dirty(bh);
- ret = submit_bh(WRITE_SYNC, bh);
+ ret = submit_bh(WRITE_SYNC_PLUG, bh);
}
*cbh = bh;
return ret;
@@ -190,7 +190,7 @@ retry:
set_buffer_uptodate(bh);
bh->b_end_io = journal_end_buffer_io_sync;

- ret = submit_bh(WRITE_SYNC, bh);
+ ret = submit_bh(WRITE_SYNC_PLUG, bh);
if (ret) {
unlock_buffer(bh);
return ret;
@@ -402,8 +402,13 @@ void jbd2_journal_commit_transaction(journal_t *journal)
spin_lock(&journal->j_state_lock);
commit_transaction->t_state = T_LOCKED;

+ /*
+ * Use plugged writes here, since we want to submit several before
+ * we unplug the device. We don't do explicit unplugging in here,
+ * instead we rely on sync_buffer() doing the unplug for us.
+ */
if (commit_transaction->t_synchronous_commit)
- write_op = WRITE_SYNC;
+ write_op = WRITE_SYNC_PLUG;
stats.u.run.rs_wait = commit_transaction->t_max_wait;
stats.u.run.rs_locked = jiffies;
stats.u.run.rs_running = jbd2_time_diff(commit_transaction->t_start,
--
1.6.2.1.423.g442d

2009-04-06 12:49:34

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 1/8] block: change the request allocation/congestion logic to be sync/async based

This makes sure that we never wait on async IO for sync requests, instead
of doing the split on writes vs reads.

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-core.c | 70 +++++++++++++++++++++---------------------
block/blk-sysfs.c | 40 ++++++++++++------------
block/elevator.c | 2 +-
include/linux/backing-dev.h | 12 ++++----
include/linux/blkdev.h | 52 +++++++++++++++++++++----------
mm/backing-dev.c | 10 +++---
6 files changed, 102 insertions(+), 84 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 996ed90..a32b571 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -484,11 +484,11 @@ static int blk_init_free_list(struct request_queue *q)
{
struct request_list *rl = &q->rq;

- rl->count[READ] = rl->count[WRITE] = 0;
- rl->starved[READ] = rl->starved[WRITE] = 0;
+ rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
+ rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
rl->elvpriv = 0;
- init_waitqueue_head(&rl->wait[READ]);
- init_waitqueue_head(&rl->wait[WRITE]);
+ init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
+ init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);

rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
mempool_free_slab, request_cachep, q->node);
@@ -699,18 +699,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
ioc->last_waited = jiffies;
}

-static void __freed_request(struct request_queue *q, int rw)
+static void __freed_request(struct request_queue *q, int sync)
{
struct request_list *rl = &q->rq;

- if (rl->count[rw] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, rw);
+ if (rl->count[sync] < queue_congestion_off_threshold(q))
+ blk_clear_queue_congested(q, sync);

- if (rl->count[rw] + 1 <= q->nr_requests) {
- if (waitqueue_active(&rl->wait[rw]))
- wake_up(&rl->wait[rw]);
+ if (rl->count[sync] + 1 <= q->nr_requests) {
+ if (waitqueue_active(&rl->wait[sync]))
+ wake_up(&rl->wait[sync]);

- blk_clear_queue_full(q, rw);
+ blk_clear_queue_full(q, sync);
}
}

@@ -718,18 +718,18 @@ static void __freed_request(struct request_queue *q, int rw)
* A request has just been released. Account for it, update the full and
* congestion status, wake up any waiters. Called under q->queue_lock.
*/
-static void freed_request(struct request_queue *q, int rw, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv)
{
struct request_list *rl = &q->rq;

- rl->count[rw]--;
+ rl->count[sync]--;
if (priv)
rl->elvpriv--;

- __freed_request(q, rw);
+ __freed_request(q, sync);

- if (unlikely(rl->starved[rw ^ 1]))
- __freed_request(q, rw ^ 1);
+ if (unlikely(rl->starved[sync ^ 1]))
+ __freed_request(q, sync ^ 1);
}

/*
@@ -743,15 +743,15 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
struct request *rq = NULL;
struct request_list *rl = &q->rq;
struct io_context *ioc = NULL;
- const int rw = rw_flags & 0x01;
+ const bool is_sync = rw_is_sync(rw_flags) != 0;
int may_queue, priv;

may_queue = elv_may_queue(q, rw_flags);
if (may_queue == ELV_MQUEUE_NO)
goto rq_starved;

- if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) {
- if (rl->count[rw]+1 >= q->nr_requests) {
+ if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
+ if (rl->count[is_sync]+1 >= q->nr_requests) {
ioc = current_io_context(GFP_ATOMIC, q->node);
/*
* The queue will fill after this allocation, so set
@@ -759,9 +759,9 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* This process will be allowed to complete a batch of
* requests, others will be blocked.
*/
- if (!blk_queue_full(q, rw)) {
+ if (!blk_queue_full(q, is_sync)) {
ioc_set_batching(q, ioc);
- blk_set_queue_full(q, rw);
+ blk_set_queue_full(q, is_sync);
} else {
if (may_queue != ELV_MQUEUE_MUST
&& !ioc_batching(q, ioc)) {
@@ -774,7 +774,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
}
}
}
- blk_set_queue_congested(q, rw);
+ blk_set_queue_congested(q, is_sync);
}

/*
@@ -782,11 +782,11 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* limit of requests, otherwise we could have thousands of requests
* allocated with any setting of ->nr_requests
*/
- if (rl->count[rw] >= (3 * q->nr_requests / 2))
+ if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
goto out;

- rl->count[rw]++;
- rl->starved[rw] = 0;
+ rl->count[is_sync]++;
+ rl->starved[is_sync] = 0;

priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
if (priv)
@@ -804,7 +804,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* wait queue, but this is pretty rare.
*/
spin_lock_irq(q->queue_lock);
- freed_request(q, rw, priv);
+ freed_request(q, is_sync, priv);

/*
* in the very unlikely event that allocation failed and no
@@ -814,8 +814,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* rq mempool into READ and WRITE
*/
rq_starved:
- if (unlikely(rl->count[rw] == 0))
- rl->starved[rw] = 1;
+ if (unlikely(rl->count[is_sync] == 0))
+ rl->starved[is_sync] = 1;

goto out;
}
@@ -829,7 +829,7 @@ rq_starved:
if (ioc_batching(q, ioc))
ioc->nr_batch_requests--;

- trace_block_getrq(q, bio, rw);
+ trace_block_getrq(q, bio, rw_flags & 1);
out:
return rq;
}
@@ -843,7 +843,7 @@ out:
static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct bio *bio)
{
- const int rw = rw_flags & 0x01;
+ const bool is_sync = rw_is_sync(rw_flags) != 0;
struct request *rq;

rq = get_request(q, rw_flags, bio, GFP_NOIO);
@@ -852,10 +852,10 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct io_context *ioc;
struct request_list *rl = &q->rq;

- prepare_to_wait_exclusive(&rl->wait[rw], &wait,
+ prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
TASK_UNINTERRUPTIBLE);

- trace_block_sleeprq(q, bio, rw);
+ trace_block_sleeprq(q, bio, rw_flags & 1);

__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);
@@ -871,7 +871,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
ioc_set_batching(q, ioc);

spin_lock_irq(q->queue_lock);
- finish_wait(&rl->wait[rw], &wait);
+ finish_wait(&rl->wait[is_sync], &wait);

rq = get_request(q, rw_flags, bio, GFP_NOIO);
};
@@ -1070,14 +1070,14 @@ void __blk_put_request(struct request_queue *q, struct request *req)
* it didn't come out of our reserved rq pools
*/
if (req->cmd_flags & REQ_ALLOCED) {
- int rw = rq_data_dir(req);
+ int is_sync = rq_is_sync(req) != 0;
int priv = req->cmd_flags & REQ_ELVPRIV;

BUG_ON(!list_empty(&req->queuelist));
BUG_ON(!hlist_unhashed(&req->hash));

blk_free_request(q, req);
- freed_request(q, rw, priv);
+ freed_request(q, is_sync, priv);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e29ddfc..3ff9bba 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -48,28 +48,28 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
q->nr_requests = nr;
blk_queue_congestion_threshold(q);

- if (rl->count[READ] >= queue_congestion_on_threshold(q))
- blk_set_queue_congested(q, READ);
- else if (rl->count[READ] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, READ);
-
- if (rl->count[WRITE] >= queue_congestion_on_threshold(q))
- blk_set_queue_congested(q, WRITE);
- else if (rl->count[WRITE] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, WRITE);
-
- if (rl->count[READ] >= q->nr_requests) {
- blk_set_queue_full(q, READ);
- } else if (rl->count[READ]+1 <= q->nr_requests) {
- blk_clear_queue_full(q, READ);
- wake_up(&rl->wait[READ]);
+ if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+ blk_set_queue_congested(q, BLK_RW_SYNC);
+ else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+ blk_clear_queue_congested(q, BLK_RW_SYNC);
+
+ if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+ blk_set_queue_congested(q, BLK_RW_ASYNC);
+ else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+ blk_clear_queue_congested(q, BLK_RW_ASYNC);
+
+ if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+ blk_set_queue_full(q, BLK_RW_SYNC);
+ } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+ blk_clear_queue_full(q, BLK_RW_SYNC);
+ wake_up(&rl->wait[BLK_RW_SYNC]);
}

- if (rl->count[WRITE] >= q->nr_requests) {
- blk_set_queue_full(q, WRITE);
- } else if (rl->count[WRITE]+1 <= q->nr_requests) {
- blk_clear_queue_full(q, WRITE);
- wake_up(&rl->wait[WRITE]);
+ if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+ blk_set_queue_full(q, BLK_RW_ASYNC);
+ } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+ blk_clear_queue_full(q, BLK_RW_ASYNC);
+ wake_up(&rl->wait[BLK_RW_ASYNC]);
}
spin_unlock_irq(q->queue_lock);
return ret;
diff --git a/block/elevator.c b/block/elevator.c
index 98259ed..ca6788a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -677,7 +677,7 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
}

if (unplug_it && blk_queue_plugged(q)) {
- int nrq = q->rq.count[READ] + q->rq.count[WRITE]
+ int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
- q->in_flight;

if (nrq >= q->unplug_thresh)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bee52ab..0ec2c59 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -24,8 +24,8 @@ struct dentry;
*/
enum bdi_state {
BDI_pdflush, /* A pdflush thread is working this device */
- BDI_write_congested, /* The write queue is getting full */
- BDI_read_congested, /* The read queue is getting full */
+ BDI_async_congested, /* The async (write) queue is getting full */
+ BDI_sync_congested, /* The sync queue is getting full */
BDI_unused, /* Available bits start here */
};

@@ -215,18 +215,18 @@ static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)

static inline int bdi_read_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, 1 << BDI_read_congested);
+ return bdi_congested(bdi, 1 << BDI_sync_congested);
}

static inline int bdi_write_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, 1 << BDI_write_congested);
+ return bdi_congested(bdi, 1 << BDI_async_congested);
}

static inline int bdi_rw_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, (1 << BDI_read_congested)|
- (1 << BDI_write_congested));
+ return bdi_congested(bdi, (1 << BDI_sync_congested) |
+ (1 << BDI_async_congested));
}

void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 465d6ba..67dae3b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -38,6 +38,10 @@ struct request;
typedef void (rq_end_io_fn)(struct request *, int);

struct request_list {
+ /*
+ * count[], starved[], and wait[] are indexed by
+ * BLK_RW_SYNC/BLK_RW_ASYNC
+ */
int count[2];
int starved[2];
int elvpriv;
@@ -66,6 +70,11 @@ enum rq_cmd_type_bits {
REQ_TYPE_ATA_PC,
};

+enum {
+ BLK_RW_ASYNC = 0,
+ BLK_RW_SYNC = 1,
+};
+
/*
* For request of type REQ_TYPE_LINUX_BLOCK, rq->cmd[0] is the opcode being
* sent down (similar to how REQ_TYPE_BLOCK_PC means that ->cmd[] holds a
@@ -103,7 +112,7 @@ enum rq_flag_bits {
__REQ_QUIET, /* don't worry about errors */
__REQ_PREEMPT, /* set for "ide_preempt" requests */
__REQ_ORDERED_COLOR, /* is before or after barrier */
- __REQ_RW_SYNC, /* request is sync (O_DIRECT) */
+ __REQ_RW_SYNC, /* request is sync (sync write or read) */
__REQ_ALLOCED, /* request came from our alloc pool */
__REQ_RW_META, /* metadata io request */
__REQ_COPY_USER, /* contains copies of user pages */
@@ -438,8 +447,8 @@ struct request_queue
#define QUEUE_FLAG_CLUSTER 0 /* cluster several segments into 1 */
#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
#define QUEUE_FLAG_STOPPED 2 /* queue is stopped */
-#define QUEUE_FLAG_READFULL 3 /* read queue has been filled */
-#define QUEUE_FLAG_WRITEFULL 4 /* write queue has been filled */
+#define QUEUE_FLAG_SYNCFULL 3 /* read queue has been filled */
+#define QUEUE_FLAG_ASYNCFULL 4 /* write queue has been filled */
#define QUEUE_FLAG_DEAD 5 /* queue being torn down */
#define QUEUE_FLAG_REENTER 6 /* Re-entrancy avoidance */
#define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */
@@ -611,32 +620,41 @@ enum {
#define rq_data_dir(rq) ((rq)->cmd_flags & 1)

/*
- * We regard a request as sync, if it's a READ or a SYNC write.
+ * We regard a request as sync, if either a read or a sync write
*/
-#define rq_is_sync(rq) (rq_data_dir((rq)) == READ || (rq)->cmd_flags & REQ_RW_SYNC)
+static inline bool rw_is_sync(unsigned int rw_flags)
+{
+ return !(rw_flags & REQ_RW) || (rw_flags & REQ_RW_SYNC);
+}
+
+static inline bool rq_is_sync(struct request *rq)
+{
+ return rw_is_sync(rq->cmd_flags);
+}
+
#define rq_is_meta(rq) ((rq)->cmd_flags & REQ_RW_META)

-static inline int blk_queue_full(struct request_queue *q, int rw)
+static inline int blk_queue_full(struct request_queue *q, int sync)
{
- if (rw == READ)
- return test_bit(QUEUE_FLAG_READFULL, &q->queue_flags);
- return test_bit(QUEUE_FLAG_WRITEFULL, &q->queue_flags);
+ if (sync)
+ return test_bit(QUEUE_FLAG_SYNCFULL, &q->queue_flags);
+ return test_bit(QUEUE_FLAG_ASYNCFULL, &q->queue_flags);
}

-static inline void blk_set_queue_full(struct request_queue *q, int rw)
+static inline void blk_set_queue_full(struct request_queue *q, int sync)
{
- if (rw == READ)
- queue_flag_set(QUEUE_FLAG_READFULL, q);
+ if (sync)
+ queue_flag_set(QUEUE_FLAG_SYNCFULL, q);
else
- queue_flag_set(QUEUE_FLAG_WRITEFULL, q);
+ queue_flag_set(QUEUE_FLAG_ASYNCFULL, q);
}

-static inline void blk_clear_queue_full(struct request_queue *q, int rw)
+static inline void blk_clear_queue_full(struct request_queue *q, int sync)
{
- if (rw == READ)
- queue_flag_clear(QUEUE_FLAG_READFULL, q);
+ if (sync)
+ queue_flag_clear(QUEUE_FLAG_SYNCFULL, q);
else
- queue_flag_clear(QUEUE_FLAG_WRITEFULL, q);
+ queue_flag_clear(QUEUE_FLAG_ASYNCFULL, q);
}


diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index be68c95..493b468 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -284,12 +284,12 @@ static wait_queue_head_t congestion_wqh[2] = {
};


-void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
+void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
- wait_queue_head_t *wqh = &congestion_wqh[rw];
+ wait_queue_head_t *wqh = &congestion_wqh[sync];

- bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
+ bit = sync ? BDI_sync_congested : BDI_async_congested;
clear_bit(bit, &bdi->state);
smp_mb__after_clear_bit();
if (waitqueue_active(wqh))
@@ -297,11 +297,11 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
}
EXPORT_SYMBOL(clear_bdi_congested);

-void set_bdi_congested(struct backing_dev_info *bdi, int rw)
+void set_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;

- bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
+ bit = sync ? BDI_sync_congested : BDI_async_congested;
set_bit(bit, &bdi->state);
}
EXPORT_SYMBOL(set_bdi_congested);
--
1.6.2.1.423.g442d

2009-04-06 12:48:56

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 3/8] block: fsync_buffers_list() should use SWRITE_SYNC_PLUG

Then it can submit all the buffers without unplugging for each one.
We will kick off the pending IO if we come across a new address space.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/buffer.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 5d55a89..43afaa5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -737,7 +737,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
{
struct buffer_head *bh;
struct list_head tmp;
- struct address_space *mapping;
+ struct address_space *mapping, *prev_mapping = NULL;
int err = 0, err2;

INIT_LIST_HEAD(&tmp);
@@ -762,7 +762,18 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
* contents - it is a noop if I/O is still in
* flight on potentially older contents.
*/
- ll_rw_block(SWRITE_SYNC, 1, &bh);
+ ll_rw_block(SWRITE_SYNC_PLUG, 1, &bh);
+
+ /*
+ * Kick off IO for the previous mapping. Note
+ * that we will not run the very last mapping,
+ * wait_on_buffer() will do that for us
+ * through sync_buffer().
+ */
+ if (prev_mapping && prev_mapping != mapping)
+ blk_run_address_space(prev_mapping);
+ prev_mapping = mapping;
+
brelse(bh);
spin_lock(lock);
}
@@ -2957,12 +2968,13 @@ void ll_rw_block(int rw, int nr, struct buffer_head *bhs[])
for (i = 0; i < nr; i++) {
struct buffer_head *bh = bhs[i];

- if (rw == SWRITE || rw == SWRITE_SYNC)
+ if (rw == SWRITE || rw == SWRITE_SYNC || rw == SWRITE_SYNC_PLUG)
lock_buffer(bh);
else if (!trylock_buffer(bh))
continue;

- if (rw == WRITE || rw == SWRITE || rw == SWRITE_SYNC) {
+ if (rw == WRITE || rw == SWRITE || rw == SWRITE_SYNC ||
+ rw == SWRITE_SYNC_PLUG) {
if (test_clear_buffer_dirty(bh)) {
bh->b_end_io = end_buffer_write_sync;
get_bh(bh);
--
1.6.2.1.423.g442d

2009-04-06 12:50:09

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 6/8] block: enabling plugging on SSD devices that don't do queuing

For the older SSD devices that don't do command queuing, we do want to
enable plugging to get better merging.

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-core.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a32b571..c4198f0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1136,6 +1136,15 @@ void init_request_from_bio(struct request *req, struct bio *bio)
blk_rq_bio_prep(req->q, req, bio);
}

+/*
+ * Only disabling plugging for non-rotational devices if it does tagging
+ * as well, otherwise we do need the proper merging
+ */
+static inline bool queue_should_plug(struct request_queue *q)
+{
+ return !(blk_queue_nonrot(q) && blk_queue_tagged(q));
+}
+
static int __make_request(struct request_queue *q, struct bio *bio)
{
struct request *req;
@@ -1242,11 +1251,11 @@ get_rq:
if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
bio_flagged(bio, BIO_CPU_AFFINE))
req->cpu = blk_cpu_to_group(smp_processor_id());
- if (!blk_queue_nonrot(q) && elv_queue_empty(q))
+ if (queue_should_plug(q) && elv_queue_empty(q))
blk_plug_device(q);
add_request(q, req);
out:
- if (unplug || blk_queue_nonrot(q))
+ if (unplug || !queue_should_plug(q))
__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);
return 0;
--
1.6.2.1.423.g442d

2009-04-06 12:50:36

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 8/8] block: switch sync_dirty_buffer() over to WRITE_SYNC

We should now have the logic in place to handle this properly
without regressing on the write performance, so re-enable
the sync writes.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/buffer.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 43afaa5..6e35762 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3010,7 +3010,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
if (test_clear_buffer_dirty(bh)) {
get_bh(bh);
bh->b_end_io = end_buffer_write_sync;
- ret = submit_bh(WRITE, bh);
+ ret = submit_bh(WRITE_SYNC, bh);
wait_on_buffer(bh);
if (buffer_eopnotsupp(bh)) {
clear_buffer_eopnotsupp(bh);
--
1.6.2.1.423.g442d

2009-04-06 12:50:53

by Jens Axboe

[permalink] [raw]
Subject: [PATCH 7/8] block: Add flag for telling the IO schedulers NOT to anticipate more IO

By default, CFQ will anticipate more IO from a given io context if the
previously completed IO was sync. This used to be fine, since the only
sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
being used now, we don't want to anticipate for those.

Add a bio/request flag that informs the IO scheduler that this is a sync
request that we should not idle for. Introduce WRITE_ODIRECT specifically
for O_DIRECT writes, and make sure that the other sync writes set this
flag.

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-core.c | 2 ++
block/cfq-iosched.c | 4 +++-
fs/direct-io.c | 2 +-
include/linux/bio.h | 19 +++++++++++--------
include/linux/blkdev.h | 3 +++
include/linux/fs.h | 9 +++++----
6 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c4198f0..2557280 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1128,6 +1128,8 @@ void init_request_from_bio(struct request *req, struct bio *bio)
req->cmd_flags |= REQ_UNPLUG;
if (bio_rw_meta(bio))
req->cmd_flags |= REQ_RW_META;
+ if (bio_noidle(bio))
+ req->cmd_flags |= REQ_NOIDLE;

req->errors = 0;
req->hard_sector = req->sector = bio->bi_sector;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 664ebfd..9e80934 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1992,8 +1992,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
}
if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
cfq_slice_expired(cfqd, 1);
- else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
+ else if (sync && !rq_noidle(rq) &&
+ RB_EMPTY_ROOT(&cfqq->sort_list)) {
cfq_arm_slice_timer(cfqd);
+ }
}

if (!cfqd->rq_in_driver)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b6d4390..da258e7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1126,7 +1126,7 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
int acquire_i_mutex = 0;

if (rw & WRITE)
- rw = WRITE_SYNC;
+ rw = WRITE_ODIRECT;

if (bdev)
bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b05b1d4..b900d2c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -145,20 +145,21 @@ struct bio {
* bit 2 -- barrier
* Insert a serialization point in the IO queue, forcing previously
* submitted IO to be completed before this one is issued.
- * bit 3 -- synchronous I/O hint: the block layer will unplug immediately
- * Note that this does NOT indicate that the IO itself is sync, just
- * that the block layer will not postpone issue of this IO by plugging.
- * bit 4 -- metadata request
+ * bit 3 -- synchronous I/O hint.
+ * bit 4 -- Unplug the device immediately after submitting this bio.
+ * bit 5 -- metadata request
* Used for tracing to differentiate metadata and data IO. May also
* get some preferential treatment in the IO scheduler
- * bit 5 -- discard sectors
+ * bit 6 -- discard sectors
* Informs the lower level device that this range of sectors is no longer
* used by the file system and may thus be freed by the device. Used
* for flash based storage.
- * bit 6 -- fail fast device errors
- * bit 7 -- fail fast transport errors
- * bit 8 -- fail fast driver errors
+ * bit 7 -- fail fast device errors
+ * bit 8 -- fail fast transport errors
+ * bit 9 -- fail fast driver errors
* Don't want driver retries for any fast fail whatever the reason.
+ * bit 10 -- Tell the IO scheduler not to wait for more requests after this
+ one has been submitted, even if it is a SYNC request.
*/
#define BIO_RW 0 /* Must match RW in req flags (blkdev.h) */
#define BIO_RW_AHEAD 1 /* Must match FAILFAST in req flags */
@@ -170,6 +171,7 @@ struct bio {
#define BIO_RW_FAILFAST_DEV 7
#define BIO_RW_FAILFAST_TRANSPORT 8
#define BIO_RW_FAILFAST_DRIVER 9
+#define BIO_RW_NOIDLE 10

#define bio_rw_flagged(bio, flag) ((bio)->bi_rw & (1 << (flag)))

@@ -188,6 +190,7 @@ struct bio {
#define bio_rw_ahead(bio) bio_rw_flagged(bio, BIO_RW_AHEAD)
#define bio_rw_meta(bio) bio_rw_flagged(bio, BIO_RW_META)
#define bio_discard(bio) bio_rw_flagged(bio, BIO_RW_DISCARD)
+#define bio_noidle(bio) bio_rw_flagged(bio, BIO_RW_NOIDLE)

/*
* upper 16 bits of bi_rw define the io priority of this bio
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 67dae3b..e036609 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -118,6 +118,7 @@ enum rq_flag_bits {
__REQ_COPY_USER, /* contains copies of user pages */
__REQ_INTEGRITY, /* integrity metadata has been remapped */
__REQ_UNPLUG, /* unplug queue on submission */
+ __REQ_NOIDLE, /* Don't anticipate more IO after this one */
__REQ_NR_BITS, /* stops here */
};

@@ -145,6 +146,7 @@ enum rq_flag_bits {
#define REQ_COPY_USER (1 << __REQ_COPY_USER)
#define REQ_INTEGRITY (1 << __REQ_INTEGRITY)
#define REQ_UNPLUG (1 << __REQ_UNPLUG)
+#define REQ_NOIDLE (1 << __REQ_NOIDLE)

#define BLK_MAX_CDB 16

@@ -633,6 +635,7 @@ static inline bool rq_is_sync(struct request *rq)
}

#define rq_is_meta(rq) ((rq)->cmd_flags & REQ_RW_META)
+#define rq_noidle(rq) ((rq)->cmd_flags & REQ_NOIDLE)

static inline int blk_queue_full(struct request_queue *q, int sync)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ea05109..cae5720 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -95,11 +95,12 @@ struct inodes_stat_t {
#define SWRITE 3 /* for ll_rw_block() - wait for buffer lock */
#define READ_SYNC (READ | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define READ_META (READ | (1 << BIO_RW_META))
-#define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
-#define WRITE_SYNC_PLUG (WRITE | (1 << BIO_RW_SYNCIO))
-#define SWRITE_SYNC (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
+#define WRITE_SYNC_PLUG (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
+#define WRITE_SYNC (WRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
+#define WRITE_ODIRECT (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
#define SWRITE_SYNC_PLUG \
- (SWRITE | (1 << BIO_RW_SYNCIO))
+ (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
+#define SWRITE_SYNC (SWRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
#define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))
#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
#define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))
--
1.6.2.1.423.g442d

2009-04-06 13:04:26

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 06 2009, Jens Axboe wrote:
> Hi,
>
> This is a set of patches that I worked on today in the hopes
> of furthering the latency goals and at least fixing some of
> the write regression with fwrite + fsync that current -git
> is suffering from.
>
> I haven't done any latency tests yet, I'm just tossing this
> out there so we can collaborate on improving things. What I
> did test was the silly fwrite() + fsync() loop test, which
> is a LOT slower in current -git that it used to be. The test
> is basically:
>
> while (nr--) {
> f = fopen();
> fprintf(f, "Some data here\n");
> fsync(fileno(f));
> fclose(f);
> }
>
> which (for nr == 2000) takes 16 seconds in -git, completes
> in 0.9s with the patches.

Ran the fsync-tester [1]. Drive is a 3-4 years old SATA drive, fs is
ext3/writeback. IO scheduler is CFQ.

fsync time: 0.0402s
fsync time: 0.6572s
fsync time: 0.3187s
fsync time: 0.2901s
fsync time: 0.1478s
fsync time: 0.4158s
fsync time: 0.2815s
fsync time: 0.3216s
fsync time: 0.1604s
fsync time: 0.1929s
fsync time: 0.2413s
fsync time: 0.2138s
fsync time: 0.2441s
fsync time: 0.2785s
fsync time: 0.2640s

And with Linus torture dd running in the background:

fsync time: 0.0109s
fsync time: 0.5236s
fsync time: 1.2108s
fsync time: 0.2999s
fsync time: 1.5286s
fsync time: 0.2549s
fsync time: 0.4164s
fsync time: 1.1586s
fsync time: 1.6630s
fsync time: 0.6949s
fsync time: 1.0102s
fsync time: 0.3715s
fsync time: 0.6553s

[1]
http://www.gossamer-threads.com/lists/engine?do=post_attachment;postatt_id=33847;list=linux

--
Jens Axboe

2009-04-06 13:13:48

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 06 2009, Jens Axboe wrote:
> On Mon, Apr 06 2009, Jens Axboe wrote:
> > Hi,
> >
> > This is a set of patches that I worked on today in the hopes
> > of furthering the latency goals and at least fixing some of
> > the write regression with fwrite + fsync that current -git
> > is suffering from.
> >
> > I haven't done any latency tests yet, I'm just tossing this
> > out there so we can collaborate on improving things. What I
> > did test was the silly fwrite() + fsync() loop test, which
> > is a LOT slower in current -git that it used to be. The test
> > is basically:
> >
> > while (nr--) {
> > f = fopen();
> > fprintf(f, "Some data here\n");
> > fsync(fileno(f));
> > fclose(f);
> > }
> >
> > which (for nr == 2000) takes 16 seconds in -git, completes
> > in 0.9s with the patches.
>
> Ran the fsync-tester [1]. Drive is a 3-4 years old SATA drive, fs is
> ext3/writeback. IO scheduler is CFQ.
>
> fsync time: 0.0402s
> fsync time: 0.6572s
> fsync time: 0.3187s
> fsync time: 0.2901s
> fsync time: 0.1478s
> fsync time: 0.4158s
> fsync time: 0.2815s
> fsync time: 0.3216s
> fsync time: 0.1604s
> fsync time: 0.1929s
> fsync time: 0.2413s
> fsync time: 0.2138s
> fsync time: 0.2441s
> fsync time: 0.2785s
> fsync time: 0.2640s
>
> And with Linus torture dd running in the background:
>
> fsync time: 0.0109s
> fsync time: 0.5236s
> fsync time: 1.2108s
> fsync time: 0.2999s
> fsync time: 1.5286s
> fsync time: 0.2549s
> fsync time: 0.4164s
> fsync time: 1.1586s
> fsync time: 1.6630s
> fsync time: 0.6949s
> fsync time: 1.0102s
> fsync time: 0.3715s
> fsync time: 0.6553s

And here is that latter test again with NCQ disabled on the drive:

fsync time: 0.0092s
fsync time: 0.7815s
fsync time: 0.6490s
fsync time: 0.7894s
fsync time: 0.5385s
fsync time: 0.6598s
fsync time: 0.7701s
fsync time: 0.2155s
fsync time: 0.0901s
fsync time: 0.2765s
fsync time: 0.2879s
fsync time: 0.2882s
fsync time: 0.2457s
fsync time: 0.4319s
fsync time: 0.3752s

It stays nicely below 1s.

--
Jens Axboe

2009-04-06 15:06:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Jens Axboe wrote:
>
> This is a set of patches that I worked on today in the hopes
> of furthering the latency goals and at least fixing some of
> the write regression with fwrite + fsync that current -git
> is suffering from.

Goodie. Let's just do this. After all, right now we would otherwise have
to revert the other changes as being a regression, and I absolutely _love_
the fact that we're actually finally getting somewhere on this fsync
latency issue that has been with us for so long.

Jens - I can just apply this queue (I want to test it out anyway), or if
you prefer I can pull from you. Just tell me.

Linus

2009-04-06 15:11:14

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 06 2009, Linus Torvalds wrote:
>
>
> On Mon, 6 Apr 2009, Jens Axboe wrote:
> >
> > This is a set of patches that I worked on today in the hopes
> > of furthering the latency goals and at least fixing some of
> > the write regression with fwrite + fsync that current -git
> > is suffering from.
>
> Goodie. Let's just do this. After all, right now we would otherwise have
> to revert the other changes as being a regression, and I absolutely _love_
> the fact that we're actually finally getting somewhere on this fsync
> latency issue that has been with us for so long.

Agree, it's been nagging for some time...

> Jens - I can just apply this queue (I want to test it out anyway), or if
> you prefer I can pull from you. Just tell me.

Whatever you want, if you want to pull instead of manually applying,
it's:

git://git.kernel.dk/linux-2.6-block.git blk-latency

--
Jens Axboe

2009-04-06 15:44:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Jens Axboe wrote:
>
> Ran the fsync-tester [1]. Drive is a 3-4 years old SATA drive, fs is
> ext3/writeback. IO scheduler is CFQ.
>
> fsync time: 0.2785s
> fsync time: 0.2640s
>
> And with Linus torture dd running in the background:
>
> fsync time: 0.0109s
> fsync time: 0.5236s
> fsync time: 1.2108s

Ok, it's definitely better for me too. CFQ used to be the problem case
(with the previous patches), now I've been trying with CFQ for a while,
and it seems ok.

Not wonderful, by any means, but I haven't seen a 5+ second delay yet.
I've come close (I have a few 2+s hickups in my trace), but it's
clearly more responsive, even if I'd wish it to be better still.

One thing that I find intriguing is how the fsync time seems so
_consistent_ across a wild variety of drives. It's interesting how you see
delays that are roughly the same order of magnitude, even though you are
using an old SATA drive, and I'm using the Intel SSD. And when you turn
off TCQ, your numbers go down even more.

That just makes me suspect that there is something else than pure IO going
on. There shouldn't be any idling by the IO scheduler in my setup
("rotational" is zero for me), and quite frankly, I should not see
latencies in the seconds even _with_ TCQ, since it should be limited to
just 32 tags. Of course, maybe some of those requests just grow humongous.

So maybe one reason the "sync()" workload is so horrible is that we get
insanely big single requests. I see

[root@nehalem queue]# cat max_sectors_kb
512

so we should be limited to half a meg per request, but I guess 32 of those
will take some time even on the Intel SSD. In fact, I guess the SSD is not
really any faster than your 2-3 year old SATA disk when it comes to pure
linear throughput

Hmm. Doing a "echo 64 > max_sectors_kb" does seem to make my experience
nicer. At no really noticeable downside in throughput that I can see: the
"dd+sync" still tends to fluctuate 30-40s. But maybe I'm fooling myself.
But my 'strace' seems to agree: I'm having a hard time triggering anything
even close to a second latency now.

I wonder if we could limit the tag usage by request _size_, ie not let big
requests fill up all the tags (by all means allow writes to fill them up
if they are small - it's with many small requests that you get the biggest
advantage, after all, and with many _big_ requests that the downside is
the biggest too).

Linus

2009-04-06 15:47:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Jens Axboe wrote:
>
> > Jens - I can just apply this queue (I want to test it out anyway), or if
> > you prefer I can pull from you. Just tell me.
>
> Whatever you want, if you want to pull instead of manually applying,
> it's:
>
> git://git.kernel.dk/linux-2.6-block.git blk-latency

Ok, I applied them for testing anyway, so I think I'll just keep the
series I have in my tree. I'm keeping my nasty "dd+sync" going in the
background, just to verify that things really feel better, but I'm already
convinced this is a winner. Especially with the request size limiter, I
can really read email _almost_ as if nothing else was going on on the
machine.

With that request size limiting, my fsync pauses tend to be in the 250ms
range, with a one 0.75s outlier in the last few minutes. I can definitely
feel that quarter-second thing, and the .75s pause was a real stutter, but
boy what a difference. I don't get the uncontrollable urge to kill that
nasty background writer any more.

Linus

2009-04-06 16:58:10

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 06 2009, Linus Torvalds wrote:
>
>
> On Mon, 6 Apr 2009, Jens Axboe wrote:
> >
> > Ran the fsync-tester [1]. Drive is a 3-4 years old SATA drive, fs is
> > ext3/writeback. IO scheduler is CFQ.
> >
> > fsync time: 0.2785s
> > fsync time: 0.2640s
> >
> > And with Linus torture dd running in the background:
> >
> > fsync time: 0.0109s
> > fsync time: 0.5236s
> > fsync time: 1.2108s
>
> Ok, it's definitely better for me too. CFQ used to be the problem case
> (with the previous patches), now I've been trying with CFQ for a while,
> and it seems ok.
>
> Not wonderful, by any means, but I haven't seen a 5+ second delay yet.
> I've come close (I have a few 2+s hickups in my trace), but it's
> clearly more responsive, even if I'd wish it to be better still.

OK that's good. I'll run some testing with this as well and perhaps we
can even do better still.

> One thing that I find intriguing is how the fsync time seems so
> _consistent_ across a wild variety of drives. It's interesting how you see
> delays that are roughly the same order of magnitude, even though you are
> using an old SATA drive, and I'm using the Intel SSD. And when you turn
> off TCQ, your numbers go down even more.
>
> That just makes me suspect that there is something else than pure IO going
> on. There shouldn't be any idling by the IO scheduler in my setup
> ("rotational" is zero for me), and quite frankly, I should not see
> latencies in the seconds even _with_ TCQ, since it should be limited to
> just 32 tags. Of course, maybe some of those requests just grow humongous.
>
> So maybe one reason the "sync()" workload is so horrible is that we get
> insanely big single requests. I see
>
> [root@nehalem queue]# cat max_sectors_kb
> 512
>
> so we should be limited to half a meg per request, but I guess 32 of those
> will take some time even on the Intel SSD. In fact, I guess the SSD is not
> really any faster than your 2-3 year old SATA disk when it comes to pure
> linear throughput

It's probably close, for your MLC version. This drive does around
60Mb/sec sequential writes, which is in the ball park with the ~70 yours
should be doing.

> Hmm. Doing a "echo 64 > max_sectors_kb" does seem to make my experience
> nicer. At no really noticeable downside in throughput that I can see: the
> "dd+sync" still tends to fluctuate 30-40s. But maybe I'm fooling myself.
> But my 'strace' seems to agree: I'm having a hard time triggering anything
> even close to a second latency now.
>
> I wonder if we could limit the tag usage by request _size_, ie not let big
> requests fill up all the tags (by all means allow writes to fill them up
> if they are small - it's with many small requests that you get the biggest
> advantage, after all, and with many _big_ requests that the downside is
> the biggest too).

I think we are doing OK with the tag async vs sync allocation, there
should be plenty of room for sync IO still. The problem is likely
earlier, before we even assign a tag. The IO schedulers will move IO to
the dispatch list and the driver will grab it from there, assign a tag,
start, etc. Perhaps we end up moving too much to the dispatch list. That
would increase latencies, even if a sync queue preempts the current
async queue and dispatches a request (which goes to the dispatch list,
ordered, and thus may have to wait for others to be serviced first).

Just speculating, I'll test and probe and see what comes up.

--
Jens Axboe

2009-04-06 17:01:28

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 06 2009, Linus Torvalds wrote:
>
>
> On Mon, 6 Apr 2009, Jens Axboe wrote:
> >
> > > Jens - I can just apply this queue (I want to test it out anyway), or if
> > > you prefer I can pull from you. Just tell me.
> >
> > Whatever you want, if you want to pull instead of manually applying,
> > it's:
> >
> > git://git.kernel.dk/linux-2.6-block.git blk-latency
>
> Ok, I applied them for testing anyway, so I think I'll just keep the
> series I have in my tree. I'm keeping my nasty "dd+sync" going in the

No problem, the end result should be identical :-)

> background, just to verify that things really feel better, but I'm already
> convinced this is a winner. Especially with the request size limiter, I
> can really read email _almost_ as if nothing else was going on on the
> machine.
>
> With that request size limiting, my fsync pauses tend to be in the 250ms
> range, with a one 0.75s outlier in the last few minutes. I can definitely
> feel that quarter-second thing, and the .75s pause was a real stutter, but
> boy what a difference. I don't get the uncontrollable urge to kill that
> nasty background writer any more.

If we can make the dd writer almost unnoticable, then I think we've come
a long way already. That really is a nasty workload.

--
Jens Axboe

2009-04-06 18:32:21

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 06, 2009 at 08:45:09AM -0700, Linus Torvalds wrote:
>
> With that request size limiting, my fsync pauses tend to be in the 250ms
> range, with a one 0.75s outlier in the last few minutes. I can definitely
> feel that quarter-second thing, and the .75s pause was a real stutter, but
> boy what a difference. I don't get the uncontrollable urge to kill that
> nasty background writer any more.

Cool! Maybe we can make things even better, but given where we are
at, this brings up an question: given Jens block latency patches
(thanks, Jens!), and the patches which add the flush on
replace-via-rename/replace-via-truncate for ext3 data=writeback,
should we make ext3 data=writeback the default for 2.6.30?

- Ted

2009-04-06 19:59:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Theodore Tso wrote:
>
> Cool! Maybe we can make things even better, but given where we are
> at, this brings up an question: given Jens block latency patches
> (thanks, Jens!), and the patches which add the flush on
> replace-via-rename/replace-via-truncate for ext3 data=writeback,
> should we make ext3 data=writeback the default for 2.6.30?

Yes. I want to go back and see how much this all helps for the
data=ordered case (if at all), but I think we should at least consider
trying 'data=writeback' as a default.

Does the 'synchronize on rename / close-after-ftrucate' patches leave any
other intersting cases open?

Linus

2009-04-06 20:12:50

by Hua Zhong

[permalink] [raw]
Subject: RE: [PATCH 0/8][RFC] IO latency/throughput fixes

Grr..making writeback as default would break people's setup that relies on
the ordered semantics, especially in the embedded world. It's a big no-no.

> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Linus Torvalds
> Sent: Monday, April 06, 2009 12:57 PM
> To: Theodore Tso
> Cc: Jens Axboe; Linux Kernel Mailing List
> Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes
>
>
>
> On Mon, 6 Apr 2009, Theodore Tso wrote:
> >
> > Cool! Maybe we can make things even better, but given where we are
> > at, this brings up an question: given Jens block latency patches
> > (thanks, Jens!), and the patches which add the flush on
> > replace-via-rename/replace-via-truncate for ext3 data=writeback,
> > should we make ext3 data=writeback the default for 2.6.30?
>
> Yes. I want to go back and see how much this all helps for the
> data=ordered case (if at all), but I think we should at least consider
> trying 'data=writeback' as a default.
>
> Does the 'synchronize on rename / close-after-ftrucate' patches leave
> any
> other intersting cases open?
>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-04-06 20:13:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Linus Torvalds wrote:
>
> Yes. I want to go back and see how much this all helps for the
> data=ordered case (if at all), but I think we should at least consider
> trying 'data=writeback' as a default.

Hmm.

So I'm back to "data=ordered", and quite frankly, so far it's a horrible
experience.

The difference is rather startlingly huge, which it was _not_ before.

Sure, data=writeback was better before too, but in the end, the difference
between the occasional 5+ second pause and the occasional 10+ second pause
wasn't really all that interesting. They were both unusuable, and both
made me kill the background writer almost immediately.

This time I really forced myself to keep it around, just to get the whole
horridness of the experience.

Here's wahat I just got from data=writeback and all the latest patches
(which includes one extra experimental one from Jens to further improve on
things):

fsync(6) = 0 <0.839397>
fsync(6) = 0 <0.879706>
fsync(6) = 0 <0.240856>
fsync(6) = 0 <0.006497>
fsync(6) = 0 <0.207197>
fsync(4) = 0 <0.318396>
fsync(4) = 0 <0.044499>
fsync(4) = 0 <0.088381>
fsync(6) = 0 <0.001057>
fsync(9) = 0 <0.245389>
fsync(7) = 0 <0.042728>
fsync(7) = 0 <1.148262>
fsync(6) = 0 <0.200707>
fsync(6) = 0 <0.045259>
fsync(6) = 0 <0.111846>

that's all quite usable. Stuttering, yes, but never anything feeling rally
bad, or like something died.

This is with data=ordered on the same kernel:

fsync(5) = 0 <0.000949>
fsync(8) = 0 <19.674681>
fsync(8) = 0 <1.169357>
fsync(8) = 0 <17.994631>
fsync(8) = 0 <5.711143>
fsync(8) = 0 <4.759515>
fsync(4) = 0 <17.283171>
fsync(4) = 0 <4.371930>
fsync(4) = 0 <0.008327>
fsync(8) = 0 <0.000636>
fsync(11) = 0 <9.802027>
fsync(9) = 0 <14.109262>
fsync(9) = 0 <4.081919>
fsync(8) = 0 <0.000490>
fsync(8) = 0 <0.001588>
fsync(11) = 0 <18.822826>
fsync(9) = 0 <3.917717>
fsync(9) = 0 <0.001035>
fsync(8) = 0 <0.000138>
fsync(9) = 0 <17.466156>

where basically each 'sync' on the big file triggers that 15-20s total
failure. I do wonder if it's actually gotten _worse_ for the data=ordered
case, but at the same time it does make a pretty compelling argument for
switching the defaults around.

Linus

2009-04-06 20:22:49

by Linus Torvalds

[permalink] [raw]
Subject: RE: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Hua Zhong wrote:
>
> Grr..making writeback as default would break people's setup that relies on
> the ordered semantics, especially in the embedded world. It's a big no-no.

We could make it a config option, perhaps.

Linus

2009-04-06 21:19:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 06, 2009 at 01:12:08PM -0700, Hua Zhong wrote:
> Grr..making writeback as default would break people's setup that relies on
> the ordered semantics, especially in the embedded world. It's a big no-no.

I've added workarounds for 2.6.30 that provide the replace-via-rename
and replace-via-truncate workarounds for ext3 data=writeback cases.
See commits e7c8f507 and f7ab34ea.

There won't be an implied fsync for newly created files, yes, but you
could have crashed 5 seconds earlier, at which point you would have
lost the newly created file anyway. Replace-via-rename and
replace-via-truncate solves the problem for applications which are
editing pre-existing files, which was most of people's complaints
about depending on data=ordered semantics.

- Ted

2009-04-06 21:26:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 06, 2009 at 01:10:50PM -0700, Linus Torvalds wrote:
> where basically each 'sync' on the big file triggers that 15-20s total
> failure. I do wonder if it's actually gotten _worse_ for the data=ordered
> case, but at the same time it does make a pretty compelling argument for
> switching the defaults around.

My first set of patches that I pushed to you were designed to fix
things for data=ordered, and did result in improvements for 2.6.29
versus 2.6.29+ext3-latency-fixes in the data=ordered case. (It was
only in the last day or two that I switched to focusing on improving
things for data=writeback mode.)

I haven't benchmarked 2.6.29 versus 2.6.29 plus *all* of the recent
latency fixes with data=ordered, but I would expect that timings are
the same or better.

- Ted

2009-04-06 21:36:39

by Hua Zhong

[permalink] [raw]
Subject: RE: [PATCH 0/8][RFC] IO latency/throughput fixes

> I've added workarounds for 2.6.30 that provide the replace-via-rename
> and replace-via-truncate workarounds for ext3 data=writeback cases.
> See commits e7c8f507 and f7ab34ea.
>
> There won't be an implied fsync for newly created files, yes, but you
> could have crashed 5 seconds earlier, at which point you would have
> lost the newly created file anyway. Replace-via-rename and
> replace-via-truncate solves the problem for applications which are
> editing pre-existing files, which was most of people's complaints
> about depending on data=ordered semantics.

I am not talking about "most" people's complaints. There are use cases for
ext3 far beyond the desktop.

I worked on a user-space library on top of ext3 before on embedded systems.
It may not have been the case for me but I could well imagine where it could
get too clever and depend upon "ordered". You really don't want to silently
corrupt people's data by changing the default. How about security too?
Wasn't that the original reason for "ordered" mode?

Ext3 is supposed to be a stable filesystem, while ext4 is the shiny new
filesystem that gets to do all the exciting experiments. I thought it was
the idea. People do not expect such a change at this stage, for better or
worse. It's great that you can improve its performance, but the performance
problem wasn't introduced yesterday, so if people care they could always
mount it as writeback, but there is no urgency in doing it for them. A
default semantic change is just crazy talk.

Hua

2009-04-06 22:04:34

by Ray Lee

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 6, 2009 at 2:35 PM, Hua Zhong <[email protected]> wrote:
>> I've added workarounds for 2.6.30 that provide the replace-via-rename
>> and replace-via-truncate workarounds for ext3 data=writeback cases.
>> See commits e7c8f507 and f7ab34ea.
>>
>> There won't be an implied fsync for newly created files, yes, but you
>> could have crashed 5 seconds earlier, at which point you would have
>> lost the newly created file anyway.  Replace-via-rename and
>> replace-via-truncate solves the problem for applications which are
>> editing pre-existing files, which was most of people's complaints
>> about depending on data=ordered semantics.
>
> I am not talking about "most" people's complaints. There are use cases for
> ext3 far beyond the desktop.
>
> I worked on a user-space library on top of ext3 before on embedded systems.
> It may not have been the case for me but I could well imagine where it could
> get too clever and depend upon "ordered".

Speaking as another embedded Linux guy, I don't update kernels on my
embedded platforms willy-nilly, nor do I design a library that relies
upon some default behavior without specifying it explicitly. That's
just one of the prices of doing embedded development.

Your argument seems to be that someone may be relying upon default
kernel behavior and, at the same time, is willing to continually
upgrade their kernel. I'd argue that person is, y'know, nuts. If
they're willing to upgrade their kernel on something that has that
stringent of requirements, then they should be willing to force a
mount option at the same time.

If they're willing to upgrade their kernel blindly, then they
shouldn't be doing embedded development.

2009-04-06 22:21:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Ray Lee wrote:
>
> Your argument seems to be that someone may be relying upon default
> kernel behavior and, at the same time, is willing to continually
> upgrade their kernel.

No, I do think that the argument is valid per se: the crash behavior of
"data=ordered" wrt "data=writeback" _is_ very different. And it's true
that "data=ordered" has nicer behavior in the case of a crash.

But it is also true that with all the recent patches, "data=writeback" has
become a _lot_ more attractive. It is slightly more ordered than it used
to be, in a very particular way that makes one of the biggest downsides go
away. And the latency improvements are really quite stunning (*).

So I think Hua's argument is real, but at the same time we should balance
it against the fact that a lot of people will just use whatever is the
default - and as a result we should strive to make the default be the
thing that we think people would be happiest with.

I think "ordered" was a reasonable default, but that was at least partly
because _both_ ordered and writeback sucked (partly in different ways).

I do think we could make it a config option.

Linus

(*) Admittedly with a very specific workload, but I suspect it's not that
uncommon, and there will be people who never realized what their problem
was and blamed jerky behavior on scheduler or something.

2009-04-06 22:26:18

by Hua Zhong

[permalink] [raw]
Subject: RE: [PATCH 0/8][RFC] IO latency/throughput fixes

> Speaking as another embedded Linux guy, I don't update kernels on my
> embedded platforms willy-nilly, nor do I design a library that relies
> upon some default behavior without specifying it explicitly. That's
> just one of the prices of doing embedded development.
>
> Your argument seems to be that someone may be relying upon default
> kernel behavior and, at the same time, is willing to continually
> upgrade their kernel. I'd argue that person is, y'know, nuts. If
> they're willing to upgrade their kernel on something that has that
> stringent of requirements, then they should be willing to force a
> mount option at the same time.

You are implying that if someone upgrades their kernel, then he must
1) upgrade it continuously and 2) without any scrutiny. Both are untrue.

There are certain things people expect from the kernel, such as
no ABI changes. To me ext3 default option falls into this category.

And even if they are nuts, you can't prove they don't exist, especially
given how many software already depends on the ordered mode.

You also conveniently forgot to quote my question about security. Both
are valid questions, not something I just totally made up.

> If they're willing to upgrade their kernel blindly, then they
> shouldn't be doing embedded development.

Linus once said that if you don't understand "not breaking user space" then
you should not be doing kernel development. Or something to that extent.

2009-04-06 22:48:39

by Ray Lee

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 6, 2009 at 3:25 PM, Hua Zhong <[email protected]> wrote:
> You are implying that if someone upgrades their kernel, then he must
> 1) upgrade it continuously and 2) without any scrutiny. Both are untrue.

Great, then they'll notice the default ext3 data mode changed, and
adjust their scripts accordingly. Problem solved.

> There are certain things people expect from the kernel, such as
> no ABI changes. To me ext3 default option falls into this category.

As the past several hundred messages on this topic has just shown,
while people may expected metadata vs data ordering to be a guarantee,
apparently it never was unless you specifically mounted your ext3
filesystem with data=ordered. The default was only that, a default,
and one that can still be specified explicitly.

> And even if they are nuts, you can't prove they don't exist, especially
> given how many software already depends on the ordered mode.

There are an unbounded number of possible incorrect setups in the
world. Should we live in fear of their existence without any proof?

> You also conveniently forgot to quote my question about security. Both
> are valid questions, not something I just totally made up.

Security on an embedded device starts with controlling physical
access. If they have access to the storage media all bets are off,
whether it's data=ordered or not. (Access to the storage media is
really what we're talking about here -- medical data, for example,
hitting the platter before the metadata updates that then make that
data unaccessible to other userspace processes.)

Because *if* they have access to the media, they can replace any
running code on that box, and your security is worthless.

So no, I don't see how that's a valid argument.

>> If they're willing to upgrade their kernel blindly, then they
>> shouldn't be doing embedded development.
>
> Linus once said that if you don't understand "not breaking user space" then
> you should not be doing kernel development. Or something to that extent.

Luckily, I do understand, and have argued on user-space's behalf for
years now. But your argument was hypothetical embedded developers who
upgrade their kernels without seeing what defaults changed, and then
getting burned by the different semantics. I can't muster up enough
give-a-damn to care about a minute percentage of the population (us
embedded developers) who aren't doing their job of reading
kernelnewbies to see what changed.

I accept Linus' argument that the change in crash behavior for random
folk is going to be very different, and that perhaps that alone should
drive keeping the default as it is today, but now we're talking about
the desktop space.

2009-04-06 22:53:30

by Hua Zhong

[permalink] [raw]
Subject: RE: [PATCH 0/8][RFC] IO latency/throughput fixes

> Security on an embedded device starts with controlling physical
> access. If they have access to the storage media all bets are off,
> whether it's data=ordered or not. (Access to the storage media is
> really what we're talking about here -- medical data, for example,
> hitting the platter before the metadata updates that then make that
> data unaccessible to other userspace processes.)
>
> Because *if* they have access to the media, they can replace any
> running code on that box, and your security is worthless.
>
> So no, I don't see how that's a valid argument.

The problem with security has nothing to do with embedded. It's
that when you commit metadata first and crash before you write
the data, then you get to see random blocks which might contain
sensitive information from other users.

Hua

2009-04-06 23:13:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Linus Torvalds wrote:
> thing that we think people would be happiest with.
>
> I think "ordered" was a reasonable default, but that was at least partly
> because _both_ ordered and writeback sucked (partly in different ways).
>
> I do think we could make it a config option.

A patch _something_ like this.

A few notes:

- This is UNTESTED (of course)

- If I did this right, this _only_ overrides the data mode if it's not
explicitly specified on disk in the superblock mount options.

IOW, if you have done a

tune2fs -o journal_data_ordered

then this will _not_ override that. Only in the absense of any explicit
flags should this trigger and then make the choice be 'writeback'.

And just to be _extra_ backwards compatible, if you really want the old
behavior, and don't want to set the ordering flag explicitly, just answer
'y' to the EXT3_DEFAULTS_TO_ORDERED Kconfig question.

What do people think? Anybody want to test?

Linus

---
fs/ext3/Kconfig | 19 +++++++++++++++++++
fs/ext3/super.c | 8 +++++++-
2 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/Kconfig b/fs/ext3/Kconfig
index 8e0cfe4..fb3c1a2 100644
--- a/fs/ext3/Kconfig
+++ b/fs/ext3/Kconfig
@@ -28,6 +28,25 @@ config EXT3_FS
To compile this file system support as a module, choose M here: the
module will be called ext3.

+config EXT3_DEFAULTS_TO_ORDERED
+ bool "Default to 'data=ordered' in ext3 (legacy option)"
+ depends on EXT3_FS
+ help
+ If a filesystem does not explicitly specify a data ordering
+ mode, and the journal capability allowed it, ext3 used to
+ historically default to 'data=ordered'.
+
+ That was a rather unfortunate choice, because it leads to all
+ kinds of latency problems, and the 'data=writeback' mode is more
+ appropriate these days.
+
+ You should probably always answer 'n' here, and if you really
+ want to use 'data=ordered' mode, set it in the filesystem itself
+ with 'tune2fs -o journal_data_ordered'.
+
+ But if you really want to enable the legacy default, you can do
+ so by answering 'y' to this question.
+
config EXT3_FS_XATTR
bool "Ext3 extended attributes"
depends on EXT3_FS
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 9e5b8e3..599dbfe 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -44,6 +44,12 @@
#include "acl.h"
#include "namei.h"

+#ifdef CONFIG_EXT3_DEFAULTS_TO_ORDERED
+ #define EXT3_MOUNT_DEFAULT_DATA_MODE EXT3_MOUNT_ORDERED_DATA
+#else
+ #define EXT3_MOUNT_DEFAULT_DATA_MODE EXT3_MOUNT_WRITEBACK_DATA
+#endif
+
static int ext3_load_journal(struct super_block *, struct ext3_super_block *,
unsigned long journal_devnum);
static int ext3_create_journal(struct super_block *, struct ext3_super_block *,
@@ -1919,7 +1925,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
cope, else JOURNAL_DATA */
if (journal_check_available_features
(sbi->s_journal, 0, 0, JFS_FEATURE_INCOMPAT_REVOKE))
- set_opt(sbi->s_mount_opt, ORDERED_DATA);
+ set_opt(sbi->s_mount_opt, DEFAULT_DATA_MODE);
else
set_opt(sbi->s_mount_opt, JOURNAL_DATA);
break;

2009-04-06 23:19:48

by Alan

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

> apparently it never was unless you specifically mounted your ext3
> filesystem with data=ordered. The default was only that, a default,
> and one that can still be specified explicitly.

Look up "principle of least suprise"

> There are an unbounded number of possible incorrect setups in the
> world. Should we live in fear of their existence without any proof?

No but creating new ones for users is not good.

> I accept Linus' argument that the change in crash behavior for random
> folk is going to be very different, and that perhaps that alone should
> drive keeping the default as it is today, but now we're talking about
> the desktop space.

Which regularly handles data of great value -- like PhD theses.

Ext4 provides a good natural migration point, ext3 mid series does not.

2009-04-07 03:29:22

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, 2009-04-06 at 08:37 -0700, Linus Torvalds wrote:
>
>
> One thing that I find intriguing is how the fsync time seems so
> _consistent_ across a wild variety of drives. It's interesting how you see
> delays that are roughly the same order of magnitude, even though you are
> using an old SATA drive, and I'm using the Intel SSD. And when you turn
> off TCQ, your numbers go down even more.
>

It may help if we rule out the ext3/jbd transaction growing code. I'd
be pretty surprised if it made for long pauses, but its worth checking:

-chris

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index e6a1174..a7c4f79 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1417,7 +1417,7 @@ int journal_stop(handle_t *handle)
* for joiners in that case.
*/
pid = current->pid;
- if (handle->h_sync && journal->j_last_sync_writer != pid) {
+ if (0 && handle->h_sync && journal->j_last_sync_writer != pid) {
u64 commit_time, trans_time;

journal->j_last_sync_writer = pid;

2009-04-07 03:54:34

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, 2009-04-06 at 17:19 -0400, Theodore Tso wrote:
> On Mon, Apr 06, 2009 at 01:12:08PM -0700, Hua Zhong wrote:
> > Grr..making writeback as default would break people's setup that relies on
> > the ordered semantics, especially in the embedded world. It's a big no-no.
>
> I've added workarounds for 2.6.30 that provide the replace-via-rename
> and replace-via-truncate workarounds for ext3 data=writeback cases.
> See commits e7c8f507 and f7ab34ea.
>
> There won't be an implied fsync for newly created files, yes, but you
> could have crashed 5 seconds earlier, at which point you would have
> lost the newly created file anyway. Replace-via-rename and
> replace-via-truncate solves the problem for applications which are
> editing pre-existing files, which was most of people's complaints
> about depending on data=ordered semantics.

XFS got a reputation for losing data largely based on null bytes in
files after a crash. In the XFS case the files always had zeros, and
not garbage that used to be on disk.

Years later xfs still gets beaten up for null bytes in files even though
they added code to prevent a long time ago.

Please leave data=ordered as the default for ext3. If we really need to
fix ext3, it makes sense to switch to the xfs style i_size update at IO
end instead of data=writeback.

-chris

2009-04-07 04:14:16

by Trenton D. Adams

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

Okay, so a config option is a benefit in what way, for these
particular circumstances? If someone wants to change the behaviour of
the kernel, for this particular case, they can just use tune2fs, no?

I'm glad I'm not making the decision. To break one person's computer
or another person's computer, hmm, not very good options. hehe.

2009-04-07 04:31:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Trenton D. Adams wrote:
>
> Okay, so a config option is a benefit in what way, for these
> particular circumstances? If someone wants to change the behaviour of
> the kernel, for this particular case, they can just use tune2fs, no?

Basically, I have a very simple policy in my kernel life:

- I don't touch distribution settings. I update the kernel, and NOTHING
else.

(Ok, I lie. I do my own git version too, but I'm really unhappy when I
feel like I need to maintain anything else)

And I have that policy because quite frankly, if I start tuning distro
settings, I'll

(a) forget about them and not do it on my next machine and
(b) do some magic that _I_ may know how to do but nobody else does, so
now what I'm doing is no longer relevant for anybody else
(c) I'm no longer helping anybody else.

So I have a few bits and pieces that I develop (mainly the kernel but also
git etc), and I don't make site-specific changes to them even if I could.

In other words: if I make a change that helps me, I want to make sure
that I make that option available to everybody else, because quite
frankly that's a big portion of the whole point of open source.

The whole "scratch your itch" isn't just about scratching _your_ itch,
it's about getting it (almost by mistake) fixed for a lot of other people
too.

Linus

2009-04-07 04:49:09

by Trenton D. Adams

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 6, 2009 at 10:27 PM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Mon, 6 Apr 2009, Trenton D. Adams wrote:
>>
>> Okay, so a config option is a benefit in what way, for these
>> particular circumstances? ?If someone wants to change the behaviour of
>> the kernel, for this particular case, they can just use tune2fs, no?
>
> Basically, I have a very simple policy in my kernel life:
>
> ?- I don't touch distribution settings. I update the kernel, and NOTHING
> ? else.
>
> (Ok, I lie. I do my own git version too, but I'm really unhappy when I
> feel like I need to maintain anything else)
>
> And I have that policy because quite frankly, if I start tuning distro
> settings, I'll
>
> ?(a) forget about them and not do it on my next machine and
> ?(b) do some magic that _I_ may know how to do but nobody else does, so
> ? ? now what I'm doing is no longer relevant for anybody else
> ?(c) I'm no longer helping anybody else.
>
> So I have a few bits and pieces that I develop (mainly the kernel but also
> git etc), and I don't make site-specific changes to them even if I could.
>
> In other words: if I make a change that helps me, I want to make sure
> that I make that option available to everybody else, because quite
> frankly that's a big portion of the whole point of open source.
>
> The whole "scratch your itch" isn't just about scratching _your_ itch,
> it's about getting it (almost by mistake) fixed for a lot of other people
> too.
>
> ? ? ? ? ? ? ? ? ? ? ? ?Linus
>

Interesting. I suppose there is also the fact that every drive you
hook to the system would then have the kernel configured default. So
that makes sense. Otherwise it's 2, 5, 10, 20 tune2fs commands,
instead of one kernel config.

What about a procfs setting instead? Is there a policy about why
something should be in procfs or /sys, or as a kernel config option?
That's basically as small as the patch you just made, right?

I'm just thinking that something like this, where people want one
thing or the other, but may not know it when they install Linux, might
like to change it realtime. Especially if they are a Linux newbie,
and don't know how to compile their own kernel. Or don't have time to
maintain their own kernel installs.

2009-04-07 05:06:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Mon, 6 Apr 2009, Trenton D. Adams wrote:
>
> What about a procfs setting instead? Is there a policy about why
> something should be in procfs or /sys, or as a kernel config option?
> That's basically as small as the patch you just made, right?

I'm never really against making things dynamically tunable, but this
already was, and that wasn't really the issue.

Sure, you can just re-mount your filesystem with different options. That's
what I did while testing - my /home is on a drive of its own, so I would
just log out and as root unmount and re-mount with data=ordered/writeback,
and log in and test again.

So dynamic tuning is good. But at the same time, having a tuning option is
_never_ an excuse for not getting the default right in the first place.
It's just a cop-out to say "hey, the default may be wrong for you, but you
can always tune it locally with XYZ".

The thing is, almost nobody does that. Partly because it needs effort and
knowledge, partly because after a few years the number of tuning knobs are
in the hundreds for every little thing.

So instead, leave the tuning for the _really_ odd cases (if you use your
machine as an IP router, you hopefully know enough to tune it if you
really care). Not for random general-purpose "use for whatever" kind of
thing.

> I'm just thinking that something like this, where people want one
> thing or the other, but may not know it when they install Linux, might
> like to change it realtime. Especially if they are a Linux newbie,
> and don't know how to compile their own kernel. Or don't have time to
> maintain their own kernel installs.

Oh absolutely. I'm not expecting people to compile their own kernels. I'm
expecting that within a few months, most modern distributions will have
(almost by mistake) gotten a new set of saner defaults, and anybody who
keeps their machine up-to-date will see a smoother experience without ever
even realizing _why_.

Linus

2009-04-07 05:24:20

by Hua Zhong

[permalink] [raw]
Subject: RE: [PATCH 0/8][RFC] IO latency/throughput fixes

The (small) set of people that rely on "ordered" understand
the problem, as long as they are aware of the change (no, I
don't think reading through all changelogs from their old
kernel to the new one is a realistic option).

So a config option should be good enough to get them to notice
the change (I assume a missing default will force them to
choose an option), and therefore explicitly add the -o ordered
option to their scripts.

On the other hand a run-time tunable has no real point.

> -----Original Message-----
> From: Linus Torvalds [mailto:[email protected]]
> Sent: Monday, April 06, 2009 10:02 PM
> To: Trenton D. Adams
> Cc: Chris Mason; Theodore Tso; Hua Zhong; Jens Axboe; Linux Kernel
> Mailing List
> Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes
>
>
>
> On Mon, 6 Apr 2009, Trenton D. Adams wrote:
> >
> > What about a procfs setting instead? Is there a policy about why
> > something should be in procfs or /sys, or as a kernel config option?
> > That's basically as small as the patch you just made, right?
>
> I'm never really against making things dynamically tunable, but this
> already was, and that wasn't really the issue.
>
> Sure, you can just re-mount your filesystem with different options.
> That's
> what I did while testing - my /home is on a drive of its own, so I
> would
> just log out and as root unmount and re-mount with
> data=ordered/writeback,
> and log in and test again.
>
> So dynamic tuning is good. But at the same time, having a tuning option
> is
> _never_ an excuse for not getting the default right in the first place.
> It's just a cop-out to say "hey, the default may be wrong for you, but
> you
> can always tune it locally with XYZ".
>
> The thing is, almost nobody does that. Partly because it needs effort
> and
> knowledge, partly because after a few years the number of tuning knobs
> are
> in the hundreds for every little thing.
>
> So instead, leave the tuning for the _really_ odd cases (if you use
> your
> machine as an IP router, you hopefully know enough to tune it if you
> really care). Not for random general-purpose "use for whatever" kind of
> thing.
>
> > I'm just thinking that something like this, where people want one
> > thing or the other, but may not know it when they install Linux,
> might
> > like to change it realtime. Especially if they are a Linux newbie,
> > and don't know how to compile their own kernel. Or don't have time
> to
> > maintain their own kernel installs.
>
> Oh absolutely. I'm not expecting people to compile their own kernels.
> I'm
> expecting that within a few months, most modern distributions will have
> (almost by mistake) gotten a new set of saner defaults, and anybody who
> keeps their machine up-to-date will see a smoother experience without
> ever
> even realizing _why_.
>
> Linus

2009-04-07 06:27:23

by Trenton D. Adams

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Mon, Apr 6, 2009 at 11:02 PM, Linus Torvalds
<[email protected]> wrote:
>
> So dynamic tuning is good. But at the same time, having a tuning option is
> _never_ an excuse for not getting the default right in the first place.
> It's just a cop-out to say "hey, the default may be wrong for you, but you
> can always tune it locally with XYZ".

Oh, right, you had mentioned that a few days ago about not liking to
have to tune stuff. I agree, the kernel should just work, except in
rare cases. But, with procfs, you could still have the default to
data=writeback, and realtime tuneable. That is what I meant.

Anyhow, I don't want to take up too much of your time. I wouldn't
want to know how busy you are. :P No need to reply. :D

2009-04-07 07:51:44

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Tue, Apr 7, 2009 at 01:10, Linus Torvalds
<[email protected]> wrote:
> On Mon, 6 Apr 2009, Linus Torvalds wrote:
>> thing that we think people would be happiest with.
>>
>> I think "ordered" was a reasonable default, but that was at least partly
>> because _both_ ordered and writeback sucked (partly in different ways).
>>
>> I do think we could make it a config option.
>
> A patch _something_ like this.
>
> A few notes:
>
>  - This is UNTESTED (of course)
>
>  - If I did this right, this _only_ overrides the data mode if it's not
>   explicitly specified on disk in the superblock mount options.
>
> IOW, if you have done a
>
>        tune2fs -o journal_data_ordered
>
> then this will _not_ override that. Only in the absense of any explicit
> flags should this trigger and then make the choice be 'writeback'.
>
> And just to be _extra_ backwards compatible, if you really want the old
> behavior, and don't want to set the ordering flag explicitly, just answer
> 'y' to the EXT3_DEFAULTS_TO_ORDERED Kconfig question.
>
> What do people think? Anybody want to test?
>
>                Linus
>
> ---
>  fs/ext3/Kconfig |   19 +++++++++++++++++++
>  fs/ext3/super.c |    8 +++++++-
>  2 files changed, 26 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext3/Kconfig b/fs/ext3/Kconfig
> index 8e0cfe4..fb3c1a2 100644
> --- a/fs/ext3/Kconfig
> +++ b/fs/ext3/Kconfig
> @@ -28,6 +28,25 @@ config EXT3_FS
>          To compile this file system support as a module, choose M here: the
>          module will be called ext3.
>
> +config EXT3_DEFAULTS_TO_ORDERED
> +       bool "Default to 'data=ordered' in ext3 (legacy option)"
> +       depends on EXT3_FS
> +       help
> +         If a filesystem does not explicitly specify a data ordering
> +         mode, and the journal capability allowed it, ext3 used to
> +         historically default to 'data=ordered'.
> +
> +         That was a rather unfortunate choice, because it leads to all
> +         kinds of latency problems, and the 'data=writeback' mode is more
> +         appropriate these days.
> +
> +         You should probably always answer 'n' here, and if you really
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +         want to use 'data=ordered' mode, set it in the filesystem itself
> +         with 'tune2fs -o journal_data_ordered'.
> +
> +         But if you really want to enable the legacy default, you can do
> +         so by answering 'y' to this question.
> +

So `allmodconfig' will enable it? Is that the right thing to do, or
should it be inverted?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2009-04-07 10:37:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes


* Geert Uytterhoeven <[email protected]> wrote:

> On Tue, Apr 7, 2009 at 01:10, Linus Torvalds
> <[email protected]> wrote:
> > On Mon, 6 Apr 2009, Linus Torvalds wrote:
> >> thing that we think people would be happiest with.
> >>
> >> I think "ordered" was a reasonable default, but that was at least partly
> >> because _both_ ordered and writeback sucked (partly in different ways).
> >>
> >> I do think we could make it a config option.
> >
> > A patch _something_ like this.
> >
> > A few notes:
> >
> > ?- This is UNTESTED (of course)
> >
> > ?- If I did this right, this _only_ overrides the data mode if it's not
> > ? explicitly specified on disk in the superblock mount options.
> >
> > IOW, if you have done a
> >
> > ? ? ? ?tune2fs -o journal_data_ordered
> >
> > then this will _not_ override that. Only in the absense of any explicit
> > flags should this trigger and then make the choice be 'writeback'.
> >
> > And just to be _extra_ backwards compatible, if you really want the old
> > behavior, and don't want to set the ordering flag explicitly, just answer
> > 'y' to the EXT3_DEFAULTS_TO_ORDERED Kconfig question.
> >
> > What do people think? Anybody want to test?
> >
> > ? ? ? ? ? ? ? ?Linus
> >
> > ---
> > ?fs/ext3/Kconfig | ? 19 +++++++++++++++++++
> > ?fs/ext3/super.c | ? ?8 +++++++-
> > ?2 files changed, 26 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/ext3/Kconfig b/fs/ext3/Kconfig
> > index 8e0cfe4..fb3c1a2 100644
> > --- a/fs/ext3/Kconfig
> > +++ b/fs/ext3/Kconfig
> > @@ -28,6 +28,25 @@ config EXT3_FS
> > ? ? ? ? ?To compile this file system support as a module, choose M here: the
> > ? ? ? ? ?module will be called ext3.
> >
> > +config EXT3_DEFAULTS_TO_ORDERED
> > + ? ? ? bool "Default to 'data=ordered' in ext3 (legacy option)"
> > + ? ? ? depends on EXT3_FS
> > + ? ? ? help
> > + ? ? ? ? If a filesystem does not explicitly specify a data ordering
> > + ? ? ? ? mode, and the journal capability allowed it, ext3 used to
> > + ? ? ? ? historically default to 'data=ordered'.
> > +
> > + ? ? ? ? That was a rather unfortunate choice, because it leads to all
> > + ? ? ? ? kinds of latency problems, and the 'data=writeback' mode is more
> > + ? ? ? ? appropriate these days.
> > +
> > + ? ? ? ? You should probably always answer 'n' here, and if you really
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > + ? ? ? ? want to use 'data=ordered' mode, set it in the filesystem itself
> > + ? ? ? ? with 'tune2fs -o journal_data_ordered'.
> > +
> > + ? ? ? ? But if you really want to enable the legacy default, you can do
> > + ? ? ? ? so by answering 'y' to this question.
> > +
>
> So `allmodconfig' will enable it? Is that the right thing to do,
> or should it be inverted?
>
> Gr{oetje,eeting}s,

allmod/allyes will enable all sorts of legacy options.

Since besides myself i'm not aware of any other person on this
planet actually _booting_ allyes/allmod Linux kernels, i guess this
is not a big issue anyway :-)

One small detail only: i'd suggest to name it
CONFIG_COMPAT_EXT3_DEFAULTS_TO_ORDERED, to move it more in line with
all the CONFIG_COMPAT_* legacy options.

Ingo

2009-04-07 13:36:06

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

Linus Torvalds wrote:
..
> - If I did this right, this _only_ overrides the data mode if it's not
> explicitly specified on disk in the superblock mount options.
>
> IOW, if you have done a
>
> tune2fs -o journal_data_ordered
>
> then this will _not_ override that.
..

Thanks for saving me time to look up that command! :)

With the stuff I work on here, kernel oops and sudden reboots
are not at all uncommon during development, and I really prefer
to keep "ordered" as the default on these machines for a tiny
extra margin of safety.

What happens with ext3 "writeback", and ext4 "whatever",
when one does the quickie reboot method:

ALT-SYSRQ-S ALT-SYSRQ-U ALT-SYSRQ-S ALT-SYSRQ-B

???

Cheers

2009-04-07 14:07:30

by Diego Calleja

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Martes 07 Abril 2009 12:36:11 Ingo Molnar escribi?:
> Since besides myself i'm not aware of any other person on this
> planet actually _booting_ allyes/allmod Linux kernels, i guess this
> is not a big issue anyway :-)

IIRC once I tried to boot an allyes kernel and it oopsed, because (i think)
the probing routines of some drivers have never been tested in machines
that do not have the corresponding hardware. So I doubt someoneone is
using it ;)

2009-04-07 14:40:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes



On Tue, 7 Apr 2009, Mark Lord wrote:
>
> What happens with ext3 "writeback", and ext4 "whatever",
> when one does the quickie reboot method:
>
> ALT-SYSRQ-S ALT-SYSRQ-U ALT-SYSRQ-S ALT-SYSRQ-B
>
> ???

Since 's' syncs (I think 'u' does too, as part of making things
read-only), the data blocks will be on disk after the boot regardless of
any other ordering.

Of course, it will leave all your lock-files files alone, and I can almost
guarantee that some daemons (read: "NetworkManager") will then fail on the
next boot because they think they are already running.

Linus

2009-04-07 19:24:59

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

Linus Torvalds wrote:
>
> On Tue, 7 Apr 2009, Mark Lord wrote:
>> What happens with ext3 "writeback", and ext4 "whatever",
>> when one does the quickie reboot method:
>>
>> ALT-SYSRQ-S ALT-SYSRQ-U ALT-SYSRQ-S ALT-SYSRQ-B
>>
>> ???
>
> Since 's' syncs (I think 'u' does too, as part of making things
> read-only), the data blocks will be on disk after the boot regardless of
> any other ordering.
..

I was thinking more about delayed allocation in ext4, though.
If it hasn't allocated the blocks, then sync() has nothing to write out.
Or do they have hooks into the block layer to force alloc/commit when
somebody does a sync() ??

> Of course, it will leave all your lock-files files alone, and I can almost
> guarantee that some daemons (read: "NetworkManager") will then fail on the
> next boot because they think they are already running.
..

No, it behaves fine on reboot here. But actually, NM is one big reason
why I end up having to use the ALT-SYSRQ-S/U/S/B.

The Ubunutu reboot scripts seem broken at times w.r.t. NM --
it hangs the reboot sequence on some of my machines here
for a very long time (during shutdown), because (I think)
the scripts disable the interface before disabling NM.. Doh!

Seems happy enough after rebooting though.

Cheers

2009-04-07 19:46:25

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

Mark Lord wrote:
> Linus Torvalds wrote:
>>
>> On Tue, 7 Apr 2009, Mark Lord wrote:
>>> What happens with ext3 "writeback", and ext4 "whatever",
>>> when one does the quickie reboot method:
>>>
>>> ALT-SYSRQ-S ALT-SYSRQ-U ALT-SYSRQ-S ALT-SYSRQ-B
>>>
>>> ???
>>
>> Since 's' syncs (I think 'u' does too, as part of making things
>> read-only), the data blocks will be on disk after the boot regardless
>> of any other ordering.
> ..
>
> I was thinking more about delayed allocation in ext4, though.
> If it hasn't allocated the blocks, then sync() has nothing to write out.
> Or do they have hooks into the block layer to force alloc/commit when
> somebody does a sync() ??

sync(2) doesn't just sync dirty buffers... it sync's inodes, which
pokes the filesystem to do something intelligent, perhaps triggering (a)
write-out of data, (b) write-out of zeroed blocks, or (c) annotation in
filesystem metadata that certain blocks are allocated, but not initialized.

Jeff


2009-04-07 20:54:16

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On Tue, 2009-04-07 at 07:33 -0700, Linus Torvalds wrote:
>
> On Tue, 7 Apr 2009, Mark Lord wrote:
> >
> > What happens with ext3 "writeback", and ext4 "whatever",
> > when one does the quickie reboot method:
> >
> > ALT-SYSRQ-S ALT-SYSRQ-U ALT-SYSRQ-S ALT-SYSRQ-B
> >
> > ???
>
> Since 's' syncs (I think 'u' does too, as part of making things
> read-only), the data blocks will be on disk after the boot regardless of
> any other ordering.
>
> Of course, it will leave all your lock-files files alone, and I can almost
> guarantee that some daemons (read: "NetworkManager") will then fail on the
> next boot because they think they are already running.

I don't do that _regularly_, but have had cause to do so many times over
the last couple years since switching to data=writeback. Box boots up
fine here, at least with opensuse 10.3 and 11.0 it does.

I've had to reconfigure evolution, and some desktop settings a few times
after pulling the _eject_ handle (plain sysrq-b or BRB if kernel is
having a bad hair day), but that's about it.

-Mike

2009-04-08 12:14:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes


* Diego Calleja <[email protected]> wrote:

> On Martes 07 Abril 2009 12:36:11 Ingo Molnar escribi?:
> > Since besides myself i'm not aware of any other person on this
> > planet actually _booting_ allyes/allmod Linux kernels, i guess this
> > is not a big issue anyway :-)
>
> IIRC once I tried to boot an allyes kernel and it oopsed, because
> (i think) the probing routines of some drivers have never been
> tested in machines that do not have the corresponding hardware. So
> I doubt someoneone is using it ;)

Yes, i mapped those out gradually and disabled them. I've been
test-booting such allyesconfig 32-bit and 64-bit kernels ever since
then - for the past 2 years or so.

Ingo

2009-04-08 12:57:22

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

Hi Ingo,

On Tue, Apr 7, 2009 at 12:36 PM, Ingo Molnar <[email protected]> wrote:
> allmod/allyes will enable all sorts of legacy options.
>
> Since besides myself i'm not aware of any other person on this
> planet actually _booting_ allyes/allmod Linux kernels,

Does allyesconfig boot? Any funny effects?
--
vda

2009-04-08 13:28:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes


* Denys Vlasenko <[email protected]> wrote:

> Hi Ingo,
>
> On Tue, Apr 7, 2009 at 12:36 PM, Ingo Molnar <[email protected]> wrote:
> > allmod/allyes will enable all sorts of legacy options.
> >
> > Since besides myself i'm not aware of any other person on this
> > planet actually _booting_ allyes/allmod Linux kernels,
>
> Does allyesconfig boot? Any funny effects?

Besides a long bootup time and a somewhat slow kernel (everything
built in, all debug facilities, everything ...) and the need to
filter out bad drivers first, none.

Any funny effects get fixed or reported by me ;-)

Ingo

2009-04-09 02:40:56

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

Linus Torvalds wrote:
>
> On Mon, 6 Apr 2009, Linus Torvalds wrote:
>> thing that we think people would be happiest with.
>>
>> I think "ordered" was a reasonable default, but that was at least partly
>> because _both_ ordered and writeback sucked (partly in different ways).
>>
>> I do think we could make it a config option.
>
> A patch _something_ like this.
>
> A few notes:
>
> - This is UNTESTED (of course)
>
> - If I did this right, this _only_ overrides the data mode if it's not
> explicitly specified on disk in the superblock mount options.
>
> IOW, if you have done a
>
> tune2fs -o journal_data_ordered
>
> then this will _not_ override that. Only in the absense of any explicit
> flags should this trigger and then make the choice be 'writeback'.
>
> And just to be _extra_ backwards compatible, if you really want the old
> behavior, and don't want to set the ordering flag explicitly, just answer
> 'y' to the EXT3_DEFAULTS_TO_ORDERED Kconfig question.
>
> What do people think? Anybody want to test?

I think this is a terrible idea. I ran the following test with
data=writeback on 2.6.29.1 (which doesn't have the rename & truncate
hacks, but they would not help in this case, either):

tar xvjf linux-2.6.29.1.tar.bz2; echo b > /proc/sysrq-trigger

This simulates a crash on a busy system. I got back 8000+ files
containing other people's data.

data=ordered isn't just "nicer" behavior than writeback on a crash, it's
necessary today for security. Making data=writeback default is a
security flaw.

Are we really considering (wait, not considering; it's checked in
already!) - blowing a huge security hole in the filesystem used on the
vast majority of installations in the name of speed?

Chris suggested earlier in this thread that we should use the XFS trick
of not extending the i_size until io completion, and I agree that it
makes sense. Chris even offered to take a stab at it and I hope I can
work with him on this. It's a -much- better answer than this
reactionary change.

-Eric

2009-04-09 14:04:15

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 0/8][RFC] IO latency/throughput fixes

On 04/08/2009 10:40 PM, Eric Sandeen wrote:
> Linus Torvalds wrote:
>
>> On Mon, 6 Apr 2009, Linus Torvalds wrote:
>>
>>> thing that we think people would be happiest with.
>>>
>>> I think "ordered" was a reasonable default, but that was at least partly
>>> because _both_ ordered and writeback sucked (partly in different ways).
>>>
>>> I do think we could make it a config option.
>>>
>> A patch _something_ like this.
>>
>> A few notes:
>>
>> - This is UNTESTED (of course)
>>
>> - If I did this right, this _only_ overrides the data mode if it's not
>> explicitly specified on disk in the superblock mount options.
>>
>> IOW, if you have done a
>>
>> tune2fs -o journal_data_ordered
>>
>> then this will _not_ override that. Only in the absense of any explicit
>> flags should this trigger and then make the choice be 'writeback'.
>>
>> And just to be _extra_ backwards compatible, if you really want the old
>> behavior, and don't want to set the ordering flag explicitly, just answer
>> 'y' to the EXT3_DEFAULTS_TO_ORDERED Kconfig question.
>>
>> What do people think? Anybody want to test?
>>
>
> I think this is a terrible idea. I ran the following test with
> data=writeback on 2.6.29.1 (which doesn't have the rename& truncate
> hacks, but they would not help in this case, either):
>
> tar xvjf linux-2.6.29.1.tar.bz2; echo b> /proc/sysrq-trigger
>
> This simulates a crash on a busy system. I got back 8000+ files
> containing other people's data.
>
> data=ordered isn't just "nicer" behavior than writeback on a crash, it's
> necessary today for security. Making data=writeback default is a
> security flaw.
>
> Are we really considering (wait, not considering; it's checked in
> already!) - blowing a huge security hole in the filesystem used on the
> vast majority of installations in the name of speed?
>
> Chris suggested earlier in this thread that we should use the XFS trick
> of not extending the i_size until io completion, and I agree that it
> makes sense. Chris even offered to take a stab at it and I hope I can
> work with him on this. It's a -much- better answer than this
> reactionary change.
>
> -Eric
>

I agree - we definitely should work to fix the fsync latency problems,
but this seems to jump back in time to the early 80's for UNIX file systems.

Writeback mode is just not a safe default for naive users or even more
sophisticated users who don't understand the risks here. Definitely not
a journal mode that any distribution would be able to ship as a default.

Wouldn't it make much more sense to leave the default at the safe data
ordered mode and let the few people who understand the tradeoff remount
file systems in writeback mode?

Regards,

Ric