2013-03-19 17:17:08

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 00/18] RFC: Non blocking submit for activity log misses

The Issues

Since the beginning DRBD was written with the assumption that the write
pattern has spacial locality. (This assumption was driven from the fact,
that rotating media performs better if you do not send the head too far too
often)

Backed by this assumption a caller that submits a request that is outside of
the current active set, was blocked until the active set was changed.
(Changing the active set is a synchronous write operation to the meta-data
area on the backing storage = "an AL-update" in DRBD-speak)

A second effect was that DRBD's meta-data was located in a very narrow
area. When DRBD is used on top of a RAID0 stripe set, this causes all
AL-updates to got to the same disk.


The Proposed Solution

This patch series improves DRBD's behavior. A submitter is no longer blocked
in the case of a AL-miss. For this a dedicated submitter worker is introduced
(patch 13).

In order to better distribute the AL-updates to more disks in a stripe set
this patch series also introduces an optional striped layout of the part
of the meta-data that holds the AL-updates (patch 4).


The Results

This of course drastically improves DRBD's performance if the write pattern
does not have any spacial locality. E.g. random writes spread out over the
whole device.

In the test systems we have SSDs with are able to do up to 50000 writes per
second. The test does random distributed writes over a work set size of
128GiB with IO depths from 1 to 1024.

At an IO depth of 64:
without this patch we observed ~100 IOPs.
With this patches we observed about 20000 IOPs.

Please find charts of the results here:
http://blogs.linbit.com/p/469/843-random-writes-faster/


Lars Ellenberg (18):
drbd: cleanup bogus assert message
drbd: cleanup ondisk meta data layout calculations and defines
drbd: prepare for new striped layout of activity log
drbd: use the cached meta_dev_idx
drbd: mechanically rename la_size to la_size_sect
drbd: read meta data early, base on-disk offsets on super block
drbd: Clarify when activity log I/O is delegated to the worker thread
drbd: drbd_al_being_io: short circuit to reduce latency
drbd: split __drbd_make_request in before and after drbd_al_begin_io
drbd: prepare to queue write requests on a submit worker
drbd: split drbd_al_begin_io into fastpath, prepare, and commit
drbd: split out some helper functions to drbd_al_begin_io
drbd: queue writes on submitter thread, unless they pass the activity
log fastpath
lru_cache: introduce lc_get_cumulative()
drbd: consolidate as many updates as possible into one AL transaction
drbd: move start io accounting before activity log transaction
drbd: try hard to max out the updates per AL transaction
drbd: adjust upper limit for activity log extents

drivers/block/drbd/drbd_actlog.c | 246 +++++++++++++++++++++++++++---------
drivers/block/drbd/drbd_bitmap.c | 13 +-
drivers/block/drbd/drbd_int.h | 179 +++++++++++++-------------
drivers/block/drbd/drbd_main.c | 243 +++++++++++++++++++++++++++++------
drivers/block/drbd/drbd_nl.c | 129 ++++++++++++-------
drivers/block/drbd/drbd_receiver.c | 4 +-
drivers/block/drbd/drbd_req.c | 166 +++++++++++++++++++++---
drivers/block/drbd/drbd_worker.c | 5 +-
include/linux/drbd_limits.h | 11 +-
include/linux/lru_cache.h | 1 +
lib/lru_cache.c | 55 ++++++--
11 files changed, 782 insertions(+), 270 deletions(-)

--
1.7.9.5


2013-03-19 17:17:09

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 04/18] drbd: use the cached meta_dev_idx

From: Lars Ellenberg <[email protected]>

Now we have the cached meta_dev_idx member,
we can get rid of a few rcu_read_lock() sections and rcu_dereference().

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_int.h | 32 +++++---------------------------
drivers/block/drbd/drbd_nl.c | 7 +------
2 files changed, 6 insertions(+), 33 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index ee19ba2..6eecdec 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1777,9 +1777,9 @@ static inline void drbd_chk_io_error_(struct drbd_conf *mdev,
* BTW, for internal meta data, this happens to be the maximum capacity
* we could agree upon with our peer node.
*/
-static inline sector_t _drbd_md_first_sector(int meta_dev_idx, struct drbd_backing_dev *bdev)
+static inline sector_t drbd_md_first_sector(struct drbd_backing_dev *bdev)
{
- switch (meta_dev_idx) {
+ switch (bdev->md.meta_dev_idx) {
case DRBD_MD_INDEX_INTERNAL:
case DRBD_MD_INDEX_FLEX_INT:
return bdev->md.md_offset + bdev->md.bm_offset;
@@ -1789,30 +1789,13 @@ static inline sector_t _drbd_md_first_sector(int meta_dev_idx, struct drbd_backi
}
}

-static inline sector_t drbd_md_first_sector(struct drbd_backing_dev *bdev)
-{
- int meta_dev_idx;
-
- rcu_read_lock();
- meta_dev_idx = rcu_dereference(bdev->disk_conf)->meta_dev_idx;
- rcu_read_unlock();
-
- return _drbd_md_first_sector(meta_dev_idx, bdev);
-}
-
/**
* drbd_md_last_sector() - Return the last sector number of the meta data area
* @bdev: Meta data block device.
*/
static inline sector_t drbd_md_last_sector(struct drbd_backing_dev *bdev)
{
- int meta_dev_idx;
-
- rcu_read_lock();
- meta_dev_idx = rcu_dereference(bdev->disk_conf)->meta_dev_idx;
- rcu_read_unlock();
-
- switch (meta_dev_idx) {
+ switch (bdev->md.meta_dev_idx) {
case DRBD_MD_INDEX_INTERNAL:
case DRBD_MD_INDEX_FLEX_INT:
return bdev->md.md_offset + MD_4kB_SECT -1;
@@ -1840,18 +1823,13 @@ static inline sector_t drbd_get_capacity(struct block_device *bdev)
static inline sector_t drbd_get_max_capacity(struct drbd_backing_dev *bdev)
{
sector_t s;
- int meta_dev_idx;
-
- rcu_read_lock();
- meta_dev_idx = rcu_dereference(bdev->disk_conf)->meta_dev_idx;
- rcu_read_unlock();

- switch (meta_dev_idx) {
+ switch (bdev->md.meta_dev_idx) {
case DRBD_MD_INDEX_INTERNAL:
case DRBD_MD_INDEX_FLEX_INT:
s = drbd_get_capacity(bdev->backing_bdev)
? min_t(sector_t, DRBD_MAX_SECTORS_FLEX,
- _drbd_md_first_sector(meta_dev_idx, bdev))
+ drbd_md_first_sector(bdev))
: 0;
break;
case DRBD_MD_INDEX_FLEX_EXT:
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 104b7ce..5621df8 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -722,14 +722,10 @@ static void drbd_md_set_sector_offsets(struct drbd_conf *mdev,
{
sector_t md_size_sect = 0;
unsigned int al_size_sect = MD_32kB_SECT;
- int meta_dev_idx;
-
- rcu_read_lock();
- meta_dev_idx = rcu_dereference(bdev->disk_conf)->meta_dev_idx;

bdev->md.md_offset = drbd_md_ss(bdev);

- switch (meta_dev_idx) {
+ switch (bdev->md.meta_dev_idx) {
default:
/* v07 style fixed size indexed meta data */
bdev->md.md_size_sect = MD_128MB_SECT;
@@ -761,7 +757,6 @@ static void drbd_md_set_sector_offsets(struct drbd_conf *mdev,
bdev->md.bm_offset = -md_size_sect + MD_4kB_SECT;
break;
}
- rcu_read_unlock();
}

/* input size is expected to be in KB */
--
1.7.9.5

2013-03-19 17:17:54

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 18/18] drbd: adjust upper limit for activity log extents

From: Lars Ellenberg <[email protected]>

Now that the on-disk activity-log ring buffer size is adjustable,
the maximum active set can become larger, and is now limited by
the use of 16bit "labels".

This increases the maximum working set from 6433 to 65534 extents,
each of which covers an area of 4MiB.
Which means that if you use the maximum, you'd have to resync
more than 250 GiB after an unclean Primary shutdown.
With capable backend storage and replication links,
this is entirely feasible.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_nl.c | 47 ++++++++++++++++++++++++++++++++++--------
include/linux/drbd_limits.h | 11 +++++-----
2 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index bcf900b..42fda4a 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1141,15 +1141,32 @@ static bool should_set_defaults(struct genl_info *info)
return 0 != (flags & DRBD_GENL_F_SET_DEFAULTS);
}

-static void enforce_disk_conf_limits(struct disk_conf *dc)
+static unsigned int drbd_al_extents_max(struct drbd_backing_dev *bdev)
{
- if (dc->al_extents < DRBD_AL_EXTENTS_MIN)
- dc->al_extents = DRBD_AL_EXTENTS_MIN;
- if (dc->al_extents > DRBD_AL_EXTENTS_MAX)
- dc->al_extents = DRBD_AL_EXTENTS_MAX;
+ /* This is limited by 16 bit "slot" numbers,
+ * and by available on-disk context storage.
+ *
+ * Also (u16)~0 is special (denotes a "free" extent).
+ *
+ * One transaction occupies one 4kB on-disk block,
+ * we have n such blocks in the on disk ring buffer,
+ * the "current" transaction may fail (n-1),
+ * and there is 919 slot numbers context information per transaction.
+ *
+ * 72 transaction blocks amounts to more than 2**16 context slots,
+ * so cap there first.
+ */
+ const unsigned int max_al_nr = DRBD_AL_EXTENTS_MAX;
+ const unsigned int sufficient_on_disk =
+ (max_al_nr + AL_CONTEXT_PER_TRANSACTION -1)
+ /AL_CONTEXT_PER_TRANSACTION;

- if (dc->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
- dc->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;
+ unsigned int al_size_4k = bdev->md.al_size_4k;
+
+ if (al_size_4k > sufficient_on_disk)
+ return max_al_nr;
+
+ return (al_size_4k - 1) * AL_CONTEXT_PER_TRANSACTION;
}

int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
@@ -1196,7 +1213,13 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
if (!expect(new_disk_conf->resync_rate >= 1))
new_disk_conf->resync_rate = 1;

- enforce_disk_conf_limits(new_disk_conf);
+ if (new_disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
+ new_disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
+ if (new_disk_conf->al_extents > drbd_al_extents_max(mdev->ldev))
+ new_disk_conf->al_extents = drbd_al_extents_max(mdev->ldev);
+
+ if (new_disk_conf->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
+ new_disk_conf->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;

fifo_size = (new_disk_conf->c_plan_ahead * 10 * SLEEP_TIME) / HZ;
if (fifo_size != mdev->rs_plan_s->size) {
@@ -1344,7 +1367,8 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
goto fail;
}

- enforce_disk_conf_limits(new_disk_conf);
+ if (new_disk_conf->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
+ new_disk_conf->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;

new_plan = fifo_alloc((new_disk_conf->c_plan_ahead * 10 * SLEEP_TIME) / HZ);
if (!new_plan) {
@@ -1419,6 +1443,11 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
if (retcode != NO_ERROR)
goto fail;

+ if (new_disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
+ new_disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
+ if (new_disk_conf->al_extents > drbd_al_extents_max(nbc))
+ new_disk_conf->al_extents = drbd_al_extents_max(nbc);
+
if (drbd_get_max_capacity(nbc) < new_disk_conf->disk_size) {
dev_err(DEV, "max capacity %llu smaller than disk size %llu\n",
(unsigned long long) drbd_get_max_capacity(nbc),
diff --git a/include/linux/drbd_limits.h b/include/linux/drbd_limits.h
index 1fa19c5..1fedf2b 100644
--- a/include/linux/drbd_limits.h
+++ b/include/linux/drbd_limits.h
@@ -126,13 +126,12 @@
#define DRBD_RESYNC_RATE_DEF 250
#define DRBD_RESYNC_RATE_SCALE 'k' /* kilobytes */

- /* less than 7 would hit performance unnecessarily.
- * 919 slots context information per transaction,
- * 32k activity log, 4k transaction size,
- * one transaction in flight:
- * 919 * 7 = 6433 */
+ /* less than 7 would hit performance unnecessarily. */
#define DRBD_AL_EXTENTS_MIN 7
-#define DRBD_AL_EXTENTS_MAX 6433
+ /* we use u16 as "slot number", (u16)~0 is "FREE".
+ * If you use >= 292 kB on-disk ring buffer,
+ * this is the maximum you can use: */
+#define DRBD_AL_EXTENTS_MAX 0xfffe
#define DRBD_AL_EXTENTS_DEF 1237
#define DRBD_AL_EXTENTS_SCALE '1'

--
1.7.9.5

2013-03-19 17:17:53

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 17/18] drbd: try hard to max out the updates per AL transaction

From: Lars Ellenberg <[email protected]>

There may have been more incoming requests while we where preparing
the current transaction. Try to consolidate more updates into this
transaction until we make no more progres.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_req.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index d72f2fe..9f7ff1c 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1226,6 +1226,37 @@ void do_submit(struct work_struct *ws)
break;

wait_event(mdev->al_wait, prepare_al_transaction_nonblock(mdev, &incoming, &pending));
+ /* Maybe more was queued, while we prepared the transaction?
+ * Try to stuff them into this transaction as well.
+ * Be strictly non-blocking here, no wait_event, we already
+ * have something to commit.
+ * Stop if we don't make any more progres.
+ */
+ for (;;) {
+ LIST_HEAD(more_pending);
+ LIST_HEAD(more_incoming);
+ bool made_progress;
+
+ /* It is ok to look outside the lock,
+ * it's only an optimization anyways */
+ if (list_empty(&mdev->submit.writes))
+ break;
+
+ spin_lock(&mdev->submit.lock);
+ list_splice_tail_init(&mdev->submit.writes, &more_incoming);
+ spin_unlock(&mdev->submit.lock);
+
+ if (list_empty(&more_incoming))
+ break;
+
+ made_progress = prepare_al_transaction_nonblock(mdev, &more_incoming, &more_pending);
+
+ list_splice_tail_init(&more_pending, &pending);
+ list_splice_tail_init(&more_incoming, &incoming);
+
+ if (!made_progress)
+ break;
+ }
drbd_al_begin_io_commit(mdev, false);

list_for_each_entry_safe(req, tmp, &pending, tl_requests) {
--
1.7.9.5

2013-03-19 17:18:32

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 12/18] drbd: split out some helper functions to drbd_al_begin_io

From: Lars Ellenberg <[email protected]>

To make the code easier to follow,
use an explicit find_active_resync_extent(),
and add a "nonblock" parameter to _al_get().

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 49 ++++++++++++++++++++++----------------
1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index e4f1231..ff03f90 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -222,25 +222,37 @@ int drbd_md_sync_page_io(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,
return err;
}

-static struct lc_element *_al_get(struct drbd_conf *mdev, unsigned int enr)
+static struct bm_extent *find_active_resync_extent(struct drbd_conf *mdev, unsigned int enr)
{
- struct lc_element *al_ext;
struct lc_element *tmp;
- int wake;
-
- spin_lock_irq(&mdev->al_lock);
tmp = lc_find(mdev->resync, enr/AL_EXT_PER_BM_SECT);
if (unlikely(tmp != NULL)) {
struct bm_extent *bm_ext = lc_entry(tmp, struct bm_extent, lce);
- if (test_bit(BME_NO_WRITES, &bm_ext->flags)) {
- wake = !test_and_set_bit(BME_PRIORITY, &bm_ext->flags);
- spin_unlock_irq(&mdev->al_lock);
- if (wake)
- wake_up(&mdev->al_wait);
- return NULL;
- }
+ if (test_bit(BME_NO_WRITES, &bm_ext->flags))
+ return bm_ext;
}
- al_ext = lc_get(mdev->act_log, enr);
+ return NULL;
+}
+
+static struct lc_element *_al_get(struct drbd_conf *mdev, unsigned int enr, bool nonblock)
+{
+ struct lc_element *al_ext;
+ struct bm_extent *bm_ext;
+ int wake;
+
+ spin_lock_irq(&mdev->al_lock);
+ bm_ext = find_active_resync_extent(mdev, enr);
+ if (bm_ext) {
+ wake = !test_and_set_bit(BME_PRIORITY, &bm_ext->flags);
+ spin_unlock_irq(&mdev->al_lock);
+ if (wake)
+ wake_up(&mdev->al_wait);
+ return NULL;
+ }
+ if (nonblock)
+ al_ext = lc_try_get(mdev->act_log, enr);
+ else
+ al_ext = lc_get(mdev->act_log, enr);
spin_unlock_irq(&mdev->al_lock);
return al_ext;
}
@@ -251,7 +263,6 @@ bool drbd_al_begin_io_fastpath(struct drbd_conf *mdev, struct drbd_interval *i)
* we may need to activate two extents in one go */
unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) >> (AL_EXTENT_SHIFT-9);
- bool fastpath_ok = true;

D_ASSERT((unsigned)(last - first) <= 1);
D_ASSERT(atomic_read(&mdev->local_cnt) > 0);
@@ -260,12 +271,7 @@ bool drbd_al_begin_io_fastpath(struct drbd_conf *mdev, struct drbd_interval *i)
if (first != last)
return false;

- spin_lock_irq(&mdev->al_lock);
- fastpath_ok =
- lc_find(mdev->resync, first/AL_EXT_PER_BM_SECT) == NULL &&
- lc_try_get(mdev->act_log, first) != NULL;
- spin_unlock_irq(&mdev->al_lock);
- return fastpath_ok;
+ return _al_get(mdev, first, true);
}

bool drbd_al_begin_io_prepare(struct drbd_conf *mdev, struct drbd_interval *i)
@@ -282,7 +288,8 @@ bool drbd_al_begin_io_prepare(struct drbd_conf *mdev, struct drbd_interval *i)

for (enr = first; enr <= last; enr++) {
struct lc_element *al_ext;
- wait_event(mdev->al_wait, (al_ext = _al_get(mdev, enr)) != NULL);
+ wait_event(mdev->al_wait,
+ (al_ext = _al_get(mdev, enr, false)) != NULL);
if (al_ext->lc_number != enr)
need_transaction = true;
}
--
1.7.9.5

2013-03-19 17:18:31

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 11/18] drbd: split drbd_al_begin_io into fastpath, prepare, and commit

From: Lars Ellenberg <[email protected]>

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 104 ++++++++++++++++++++++++++------------
drivers/block/drbd/drbd_int.h | 1 +
2 files changed, 72 insertions(+), 33 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 1d7244d..e4f1231 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -104,7 +104,6 @@ struct update_al_work {
int err;
};

-static int al_write_transaction(struct drbd_conf *mdev, bool delegate);

void *drbd_md_get_buffer(struct drbd_conf *mdev)
{
@@ -246,30 +245,37 @@ static struct lc_element *_al_get(struct drbd_conf *mdev, unsigned int enr)
return al_ext;
}

-/*
- * @delegate: delegate activity log I/O to the worker thread
- */
-void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool delegate)
+bool drbd_al_begin_io_fastpath(struct drbd_conf *mdev, struct drbd_interval *i)
{
/* for bios crossing activity log extent boundaries,
* we may need to activate two extents in one go */
unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) >> (AL_EXTENT_SHIFT-9);
- unsigned enr;
- bool need_transaction = false;
- bool locked = false;
+ bool fastpath_ok = true;

- /* When called through generic_make_request(), we must delegate
- * activity log I/O to the worker thread: a further request
- * submitted via generic_make_request() within the same task
- * would be queued on current->bio_list, and would only start
- * after this function returns (see generic_make_request()).
- *
- * However, if we *are* the worker, we must not delegate to ourselves.
- */
+ D_ASSERT((unsigned)(last - first) <= 1);
+ D_ASSERT(atomic_read(&mdev->local_cnt) > 0);
+
+ /* FIXME figure out a fast path for bios crossing AL extent boundaries */
+ if (first != last)
+ return false;
+
+ spin_lock_irq(&mdev->al_lock);
+ fastpath_ok =
+ lc_find(mdev->resync, first/AL_EXT_PER_BM_SECT) == NULL &&
+ lc_try_get(mdev->act_log, first) != NULL;
+ spin_unlock_irq(&mdev->al_lock);
+ return fastpath_ok;
+}

- if (delegate)
- BUG_ON(current == mdev->tconn->worker.task);
+bool drbd_al_begin_io_prepare(struct drbd_conf *mdev, struct drbd_interval *i)
+{
+ /* for bios crossing activity log extent boundaries,
+ * we may need to activate two extents in one go */
+ unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
+ unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) >> (AL_EXTENT_SHIFT-9);
+ unsigned enr;
+ bool need_transaction = false;

D_ASSERT(first <= last);
D_ASSERT(atomic_read(&mdev->local_cnt) > 0);
@@ -280,11 +286,28 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool dele
if (al_ext->lc_number != enr)
need_transaction = true;
}
+ return need_transaction;
+}

- /* If *this* request was to an already active extent,
- * we're done, even if there are pending changes. */
- if (!need_transaction)
- return;
+static int al_write_transaction(struct drbd_conf *mdev, bool delegate);
+
+/* When called through generic_make_request(), we must delegate
+ * activity log I/O to the worker thread: a further request
+ * submitted via generic_make_request() within the same task
+ * would be queued on current->bio_list, and would only start
+ * after this function returns (see generic_make_request()).
+ *
+ * However, if we *are* the worker, we must not delegate to ourselves.
+ */
+
+/*
+ * @delegate: delegate activity log I/O to the worker thread
+ */
+void drbd_al_begin_io_commit(struct drbd_conf *mdev, bool delegate)
+{
+ bool locked = false;
+
+ BUG_ON(delegate && current == mdev->tconn->worker.task);

/* Serialize multiple transactions.
* This uses test_and_set_bit, memory barrier is implicit.
@@ -303,11 +326,8 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool dele
write_al_updates = rcu_dereference(mdev->ldev->disk_conf)->al_updates;
rcu_read_unlock();

- if (write_al_updates) {
+ if (write_al_updates)
al_write_transaction(mdev, delegate);
- mdev->al_writ_cnt++;
- }
-
spin_lock_irq(&mdev->al_lock);
/* FIXME
if (err)
@@ -321,6 +341,17 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool dele
}
}

+/*
+ * @delegate: delegate activity log I/O to the worker thread
+ */
+void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool delegate)
+{
+ BUG_ON(delegate && current == mdev->tconn->worker.task);
+
+ if (drbd_al_begin_io_prepare(mdev, i))
+ drbd_al_begin_io_commit(mdev, delegate);
+}
+
void drbd_al_complete_io(struct drbd_conf *mdev, struct drbd_interval *i)
{
/* for bios crossing activity log extent boundaries,
@@ -478,15 +509,22 @@ _al_write_transaction(struct drbd_conf *mdev)
crc = crc32c(0, buffer, 4096);
buffer->crc32c = cpu_to_be32(crc);

- /* normal execution path goes through all three branches */
if (drbd_bm_write_hinted(mdev))
err = -EIO;
- /* drbd_chk_io_error done already */
- else if (drbd_md_sync_page_io(mdev, mdev->ldev, sector, WRITE)) {
- err = -EIO;
- drbd_chk_io_error(mdev, 1, DRBD_META_IO_ERROR);
- } else {
- mdev->al_tr_number++;
+ else {
+ bool write_al_updates;
+ rcu_read_lock();
+ write_al_updates = rcu_dereference(mdev->ldev->disk_conf)->al_updates;
+ rcu_read_unlock();
+ if (write_al_updates) {
+ if (drbd_md_sync_page_io(mdev, mdev->ldev, sector, WRITE)) {
+ err = -EIO;
+ drbd_chk_io_error(mdev, 1, DRBD_META_IO_ERROR);
+ } else {
+ mdev->al_tr_number++;
+ mdev->al_writ_cnt++;
+ }
+ }
}

drbd_md_put_buffer(mdev);
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index a6b71b6..b7b52dd 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1611,6 +1611,7 @@ extern const char *drbd_conn_str(enum drbd_conns s);
extern const char *drbd_role_str(enum drbd_role s);

/* drbd_actlog.c */
+extern bool drbd_al_begin_io_fastpath(struct drbd_conf *mdev, struct drbd_interval *i);
extern void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool delegate);
extern void drbd_al_complete_io(struct drbd_conf *mdev, struct drbd_interval *i);
extern void drbd_rs_complete_io(struct drbd_conf *mdev, sector_t sector);
--
1.7.9.5

2013-03-19 17:18:30

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 15/18] drbd: consolidate as many updates as possible into one AL transaction

From: Lars Ellenberg <[email protected]>

Depending on current IO depth, try to consolidate as many updates
as possible into one activity log transaction.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 49 ++++++++++++++++++++++++++
drivers/block/drbd/drbd_int.h | 2 ++
drivers/block/drbd/drbd_req.c | 70 ++++++++++++++++++++++++++++++--------
3 files changed, 107 insertions(+), 14 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index ff03f90..6afe173 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -359,6 +359,55 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool dele
drbd_al_begin_io_commit(mdev, delegate);
}

+int drbd_al_begin_io_nonblock(struct drbd_conf *mdev, struct drbd_interval *i)
+{
+ struct lru_cache *al = mdev->act_log;
+ /* for bios crossing activity log extent boundaries,
+ * we may need to activate two extents in one go */
+ unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
+ unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) >> (AL_EXTENT_SHIFT-9);
+ unsigned nr_al_extents;
+ unsigned available_update_slots;
+ unsigned enr;
+
+ D_ASSERT(first <= last);
+
+ nr_al_extents = 1 + last - first; /* worst case: all touched extends are cold. */
+ available_update_slots = min(al->nr_elements - al->used,
+ al->max_pending_changes - al->pending_changes);
+
+ /* We want all necessary updates for a given request within the same transaction
+ * We could first check how many updates are *actually* needed,
+ * and use that instead of the worst-case nr_al_extents */
+ if (available_update_slots < nr_al_extents)
+ return -EWOULDBLOCK;
+
+ /* Is resync active in this area? */
+ for (enr = first; enr <= last; enr++) {
+ struct lc_element *tmp;
+ tmp = lc_find(mdev->resync, enr/AL_EXT_PER_BM_SECT);
+ if (unlikely(tmp != NULL)) {
+ struct bm_extent *bm_ext = lc_entry(tmp, struct bm_extent, lce);
+ if (test_bit(BME_NO_WRITES, &bm_ext->flags)) {
+ if (!test_and_set_bit(BME_PRIORITY, &bm_ext->flags));
+ return -EBUSY;
+ return -EWOULDBLOCK;
+ }
+ }
+ }
+
+ /* Checkout the refcounts.
+ * Given that we checked for available elements and update slots above,
+ * this has to be successful. */
+ for (enr = first; enr <= last; enr++) {
+ struct lc_element *al_ext;
+ al_ext = lc_get_cumulative(mdev->act_log, enr);
+ if (!al_ext)
+ dev_info(DEV, "LOGIC BUG for enr=%u\n", enr);
+ }
+ return 0;
+}
+
void drbd_al_complete_io(struct drbd_conf *mdev, struct drbd_interval *i)
{
/* for bios crossing activity log extent boundaries,
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index b7b52dd..f943aac 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1611,6 +1611,8 @@ extern const char *drbd_conn_str(enum drbd_conns s);
extern const char *drbd_role_str(enum drbd_role s);

/* drbd_actlog.c */
+extern int drbd_al_begin_io_nonblock(struct drbd_conf *mdev, struct drbd_interval *i);
+extern void drbd_al_begin_io_commit(struct drbd_conf *mdev, bool delegate);
extern bool drbd_al_begin_io_fastpath(struct drbd_conf *mdev, struct drbd_interval *i);
extern void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool delegate);
extern void drbd_al_complete_io(struct drbd_conf *mdev, struct drbd_interval *i);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 43bc1d0..b923d41 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1164,32 +1164,74 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
drbd_send_and_submit(mdev, req);
}

-void __drbd_make_request_from_worker(struct drbd_conf *mdev, struct drbd_request *req)
+static void submit_fast_path(struct drbd_conf *mdev, struct list_head *incoming)
{
- const int rw = bio_rw(req->master_bio);
+ struct drbd_request *req, *tmp;
+ list_for_each_entry_safe(req, tmp, incoming, tl_requests) {
+ const int rw = bio_data_dir(req->master_bio);

- if (rw == WRITE && req->private_bio && req->i.size
- && !test_bit(AL_SUSPENDED, &mdev->flags)) {
- drbd_al_begin_io(mdev, &req->i, false);
- req->rq_state |= RQ_IN_ACT_LOG;
+ if (rw == WRITE /* rw != WRITE should not even end up here! */
+ && req->private_bio && req->i.size
+ && !test_bit(AL_SUSPENDED, &mdev->flags)) {
+ if (!drbd_al_begin_io_fastpath(mdev, &req->i))
+ continue;
+
+ req->rq_state |= RQ_IN_ACT_LOG;
+ }
+
+ list_del_init(&req->tl_requests);
+ drbd_send_and_submit(mdev, req);
}
- drbd_send_and_submit(mdev, req);
}

+static bool prepare_al_transaction_nonblock(struct drbd_conf *mdev,
+ struct list_head *incoming,
+ struct list_head *pending)
+{
+ struct drbd_request *req, *tmp;
+ int wake = 0;
+ int err;
+
+ spin_lock_irq(&mdev->al_lock);
+ list_for_each_entry_safe(req, tmp, incoming, tl_requests) {
+ err = drbd_al_begin_io_nonblock(mdev, &req->i);
+ if (err == -EBUSY)
+ wake = 1;
+ if (err)
+ continue;
+ req->rq_state |= RQ_IN_ACT_LOG;
+ list_move_tail(&req->tl_requests, pending);
+ }
+ spin_unlock_irq(&mdev->al_lock);
+ if (wake)
+ wake_up(&mdev->al_wait);
+
+ return !list_empty(pending);
+}

void do_submit(struct work_struct *ws)
{
struct drbd_conf *mdev = container_of(ws, struct drbd_conf, submit.worker);
- LIST_HEAD(writes);
+ LIST_HEAD(incoming);
+ LIST_HEAD(pending);
struct drbd_request *req, *tmp;

- spin_lock(&mdev->submit.lock);
- list_splice_init(&mdev->submit.writes, &writes);
- spin_unlock(&mdev->submit.lock);
+ for (;;) {
+ spin_lock(&mdev->submit.lock);
+ list_splice_tail_init(&mdev->submit.writes, &incoming);
+ spin_unlock(&mdev->submit.lock);

- list_for_each_entry_safe(req, tmp, &writes, tl_requests) {
- list_del_init(&req->tl_requests);
- __drbd_make_request_from_worker(mdev, req);
+ submit_fast_path(mdev, &incoming);
+ if (list_empty(&incoming))
+ break;
+
+ wait_event(mdev->al_wait, prepare_al_transaction_nonblock(mdev, &incoming, &pending));
+ drbd_al_begin_io_commit(mdev, false);
+
+ list_for_each_entry_safe(req, tmp, &pending, tl_requests) {
+ list_del_init(&req->tl_requests);
+ drbd_send_and_submit(mdev, req);
+ }
}
}

--
1.7.9.5

2013-03-19 17:18:29

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 13/18] drbd: queue writes on submitter thread, unless they pass the activity log fastpath

From: Lars Ellenberg <[email protected]>

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_req.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 4af709e..43bc1d0 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1020,6 +1020,14 @@ drbd_submit_req_private_bio(struct drbd_request *req)
bio_endio(bio, -EIO);
}

+static void drbd_queue_write(struct drbd_conf *mdev, struct drbd_request *req)
+{
+ spin_lock(&mdev->submit.lock);
+ list_add_tail(&req->tl_requests, &mdev->submit.writes);
+ spin_unlock(&mdev->submit.lock);
+ queue_work(mdev->submit.wq, &mdev->submit.worker);
+}
+
/* returns the new drbd_request pointer, if the caller is expected to
* drbd_send_and_submit() it (to save latency), or NULL if we queued the
* request on the submitter thread.
@@ -1048,17 +1056,13 @@ drbd_request_prepare(struct drbd_conf *mdev, struct bio *bio, unsigned long star
req->private_bio = NULL;
}

- /* For WRITES going to the local disk, grab a reference on the target
- * extent. This waits for any resync activity in the corresponding
- * resync extent to finish, and, if necessary, pulls in the target
- * extent into the activity log, which involves further disk io because
- * of transactional on-disk meta data updates.
- * Empty flushes don't need to go into the activity log, they can only
- * flush data for pending writes which are already in there. */
if (rw == WRITE && req->private_bio && req->i.size
&& !test_bit(AL_SUSPENDED, &mdev->flags)) {
+ if (!drbd_al_begin_io_fastpath(mdev, &req->i)) {
+ drbd_queue_write(mdev, req);
+ return NULL;
+ }
req->rq_state |= RQ_IN_ACT_LOG;
- drbd_al_begin_io(mdev, &req->i, true);
}

return req;
--
1.7.9.5

2013-03-19 17:18:27

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 14/18] lru_cache: introduce lc_get_cumulative()

From: Lars Ellenberg <[email protected]>

New helper to be able to consolidate more updates
into a single transaction.
Without this, we can only grab a single refcount
on an updated element while preparing a transaction.

lc_get_cumulative - like lc_get; also finds to-be-changed elements
@lc: the lru cache to operate on
@enr: the label to look up

Unlike lc_get this also returns the element for @enr, if it is belonging to
a pending transaction, so the return values are like for lc_get(),
plus:

pointer to an element already on the "to_be_changed" list.
In this case, the cache was already marked %LC_DIRTY.

Caller needs to make sure that the pending transaction is completed,
before proceeding to actually use this element.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
include/linux/lru_cache.h | 1 +
lib/lru_cache.c | 55 ++++++++++++++++++++++++++++++++++++---------
2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/linux/lru_cache.h b/include/linux/lru_cache.h
index 4019013..4626228 100644
--- a/include/linux/lru_cache.h
+++ b/include/linux/lru_cache.h
@@ -256,6 +256,7 @@ extern void lc_destroy(struct lru_cache *lc);
extern void lc_set(struct lru_cache *lc, unsigned int enr, int index);
extern void lc_del(struct lru_cache *lc, struct lc_element *element);

+extern struct lc_element *lc_get_cumulative(struct lru_cache *lc, unsigned int enr);
extern struct lc_element *lc_try_get(struct lru_cache *lc, unsigned int enr);
extern struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr);
extern struct lc_element *lc_get(struct lru_cache *lc, unsigned int enr);
diff --git a/lib/lru_cache.c b/lib/lru_cache.c
index d71d894..e8d5003 100644
--- a/lib/lru_cache.c
+++ b/lib/lru_cache.c
@@ -366,7 +366,13 @@ static int lc_unused_element_available(struct lru_cache *lc)
return 0;
}

-static struct lc_element *__lc_get(struct lru_cache *lc, unsigned int enr, bool may_change)
+/* used as internal flags to __lc_get */
+enum {
+ LC_GET_MAY_CHANGE = 1,
+ LC_GET_MAY_USE_UNCOMMITTED = 2,
+};
+
+static struct lc_element *__lc_get(struct lru_cache *lc, unsigned int enr, unsigned int flags)
{
struct lc_element *e;

@@ -381,22 +387,31 @@ static struct lc_element *__lc_get(struct lru_cache *lc, unsigned int enr, bool
* this enr is currently being pulled in already,
* and will be available once the pending transaction
* has been committed. */
- if (e && e->lc_new_number == e->lc_number) {
+ if (e) {
+ if (e->lc_new_number != e->lc_number) {
+ /* It has been found above, but on the "to_be_changed"
+ * list, not yet committed. Don't pull it in twice,
+ * wait for the transaction, then try again...
+ */
+ if (!(flags & LC_GET_MAY_USE_UNCOMMITTED))
+ RETURN(NULL);
+ /* ... unless the caller is aware of the implications,
+ * probably preparing a cumulative transaction. */
+ ++e->refcnt;
+ ++lc->hits;
+ RETURN(e);
+ }
+ /* else: lc_new_number == lc_number; a real hit. */
++lc->hits;
if (e->refcnt++ == 0)
lc->used++;
list_move(&e->list, &lc->in_use); /* Not evictable... */
RETURN(e);
}
+ /* e == NULL */

++lc->misses;
- if (!may_change)
- RETURN(NULL);
-
- /* It has been found above, but on the "to_be_changed" list, not yet
- * committed. Don't pull it in twice, wait for the transaction, then
- * try again */
- if (e)
+ if (!(flags & LC_GET_MAY_CHANGE))
RETURN(NULL);

/* To avoid races with lc_try_lock(), first, mark us dirty
@@ -478,7 +493,27 @@ static struct lc_element *__lc_get(struct lru_cache *lc, unsigned int enr, bool
*/
struct lc_element *lc_get(struct lru_cache *lc, unsigned int enr)
{
- return __lc_get(lc, enr, 1);
+ return __lc_get(lc, enr, LC_GET_MAY_CHANGE);
+}
+
+/**
+ * lc_get_cumulative - like lc_get; also finds to-be-changed elements
+ * @lc: the lru cache to operate on
+ * @enr: the label to look up
+ *
+ * Unlike lc_get this also returns the element for @enr, if it is belonging to
+ * a pending transaction, so the return values are like for lc_get(),
+ * plus:
+ *
+ * pointer to an element already on the "to_be_changed" list.
+ * In this case, the cache was already marked %LC_DIRTY.
+ *
+ * Caller needs to make sure that the pending transaction is completed,
+ * before proceeding to actually use this element.
+ */
+struct lc_element *lc_get_cumulative(struct lru_cache *lc, unsigned int enr)
+{
+ return __lc_get(lc, enr, LC_GET_MAY_CHANGE|LC_GET_MAY_USE_UNCOMMITTED);
}

/**
--
1.7.9.5

2013-03-19 17:18:25

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 16/18] drbd: move start io accounting before activity log transaction

From: Lars Ellenberg <[email protected]>

The IO accounting of the drbd "queue depth" was misleading.
We only started IO accounting once we already wrote the activity log.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_req.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index b923d41..d72f2fe 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1056,6 +1056,9 @@ drbd_request_prepare(struct drbd_conf *mdev, struct bio *bio, unsigned long star
req->private_bio = NULL;
}

+ /* Update disk stats */
+ _drbd_start_io_acct(mdev, req);
+
if (rw == WRITE && req->private_bio && req->i.size
&& !test_bit(AL_SUSPENDED, &mdev->flags)) {
if (!drbd_al_begin_io_fastpath(mdev, &req->i)) {
@@ -1095,9 +1098,6 @@ static void drbd_send_and_submit(struct drbd_conf *mdev, struct drbd_request *re
goto out;
}

- /* Update disk stats */
- _drbd_start_io_acct(mdev, req);
-
/* We fail READ/READA early, if we can not serve it.
* We must do this before req is registered on any lists.
* Otherwise, drbd_req_complete() will queue failed READ for retry. */
--
1.7.9.5

2013-03-19 17:17:06

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 03/18] drbd: prepare for new striped layout of activity log

From: Lars Ellenberg <[email protected]>

Introduce two new on-disk meta data fields: al_stripes and al_stripe_size_4k
The intended use case is activity log on RAID 0 or similar.
Logically consecutive transactions will advance their on-disk position
by al_stripe_size_4k 4kB (transaction sized) blocks.

Right now, these are still asserted to be the backward compatible
values al_stripes = 1, al_stripe_size_4k = 8 (which amounts to 32kB).

Also introduce a caching member for meta_dev_idx in the in-core
structure: even though it is initially passed in in the rcu-protected
disk_conf structure, it cannot change without a detach/attach cycle.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 6 +--
drivers/block/drbd/drbd_int.h | 46 ++++++++++-------------
drivers/block/drbd/drbd_main.c | 77 ++++++++++++++++++++++++++++++++++----
drivers/block/drbd/drbd_nl.c | 5 +--
4 files changed, 94 insertions(+), 40 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index b230d91..7e7680e 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -353,11 +353,11 @@ static unsigned int rs_extent_to_bm_page(unsigned int rs_enr)

static sector_t al_tr_number_to_on_disk_sector(struct drbd_conf *mdev)
{
- const unsigned int stripes = 1;
- const unsigned int stripe_size_4kB = MD_32kB_SECT/MD_4kB_SECT;
+ const unsigned int stripes = mdev->ldev->md.al_stripes;
+ const unsigned int stripe_size_4kB = mdev->ldev->md.al_stripe_size_4k;

/* transaction number, modulo on-disk ring buffer wrap around */
- unsigned int t = mdev->al_tr_number % (stripe_size_4kB * stripes);
+ unsigned int t = mdev->al_tr_number % (mdev->ldev->md.al_size_4k);

/* ... to aligned 4k on disk block */
t = ((t % stripes) * stripe_size_4kB) + t/stripes;
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 60c89e5..ee19ba2 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -755,6 +755,14 @@ struct drbd_md {

s32 al_offset; /* signed relative sector offset to activity log */
s32 bm_offset; /* signed relative sector offset to bitmap */
+
+ /* cached value of bdev->disk_conf->meta_dev_idx (see below) */
+ s32 meta_dev_idx;
+
+ /* see al_tr_number_to_on_disk_sector() */
+ u32 al_stripes;
+ u32 al_stripe_size_4k;
+ u32 al_size_4k; /* cached product of the above */
};

struct drbd_backing_dev {
@@ -1862,38 +1870,24 @@ static inline sector_t drbd_get_max_capacity(struct drbd_backing_dev *bdev)
}

/**
- * drbd_md_ss__() - Return the sector number of our meta data super block
- * @mdev: DRBD device.
+ * drbd_md_ss() - Return the sector number of our meta data super block
* @bdev: Meta data block device.
*/
-static inline sector_t drbd_md_ss__(struct drbd_conf *mdev,
- struct drbd_backing_dev *bdev)
+static inline sector_t drbd_md_ss(struct drbd_backing_dev *bdev)
{
- int meta_dev_idx;
+ const int meta_dev_idx = bdev->md.meta_dev_idx;

- rcu_read_lock();
- meta_dev_idx = rcu_dereference(bdev->disk_conf)->meta_dev_idx;
- rcu_read_unlock();
+ if (meta_dev_idx == DRBD_MD_INDEX_FLEX_EXT)
+ return 0;

- switch (meta_dev_idx) {
- default: /* external, some index; this is the old fixed size layout */
- return MD_128MB_SECT * meta_dev_idx;
- case DRBD_MD_INDEX_INTERNAL:
- /* with drbd08, internal meta data is always "flexible" */
- case DRBD_MD_INDEX_FLEX_INT:
- if (!bdev->backing_bdev) {
- if (__ratelimit(&drbd_ratelimit_state)) {
- dev_err(DEV, "bdev->backing_bdev==NULL\n");
- dump_stack();
- }
- return 0;
- }
- /* sizeof(struct md_on_disk_07) == 4k
- * position: last 4k aligned block of 4k size */
+ /* Since drbd08, internal meta data is always "flexible".
+ * position: last 4k aligned block of 4k size */
+ if (meta_dev_idx == DRBD_MD_INDEX_INTERNAL ||
+ meta_dev_idx == DRBD_MD_INDEX_FLEX_INT)
return (drbd_get_capacity(bdev->backing_bdev) & ~7ULL) - 8;
- case DRBD_MD_INDEX_FLEX_EXT:
- return 0;
- }
+
+ /* external, some index; this is the old fixed size layout */
+ return MD_128MB_SECT * bdev->md.meta_dev_idx;
}

static inline void
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 6c4f0ff..b9bfb10 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2853,7 +2853,11 @@ struct meta_data_on_disk {
u32 bm_bytes_per_bit; /* BM_BLOCK_SIZE */
u32 la_peer_max_bio_size; /* last peer max_bio_size */

- u8 reserved_u8[4096 - (7*8 + 8*4)];
+ /* see al_tr_number_to_on_disk_sector() */
+ u32 al_stripes;
+ u32 al_stripe_size_4k;
+
+ u8 reserved_u8[4096 - (7*8 + 10*4)];
} __packed;

/**
@@ -2901,7 +2905,10 @@ void drbd_md_sync(struct drbd_conf *mdev)
buffer->bm_offset = cpu_to_be32(mdev->ldev->md.bm_offset);
buffer->la_peer_max_bio_size = cpu_to_be32(mdev->peer_max_bio_size);

- D_ASSERT(drbd_md_ss__(mdev, mdev->ldev) == mdev->ldev->md.md_offset);
+ buffer->al_stripes = cpu_to_be32(mdev->ldev->md.al_stripes);
+ buffer->al_stripe_size_4k = cpu_to_be32(mdev->ldev->md.al_stripe_size_4k);
+
+ D_ASSERT(drbd_md_ss(mdev->ldev) == mdev->ldev->md.md_offset);
sector = mdev->ldev->md.md_offset;

if (drbd_md_sync_page_io(mdev, mdev->ldev, sector, WRITE)) {
@@ -2919,13 +2926,60 @@ out:
put_ldev(mdev);
}

+static int check_activity_log_stripe_size(struct drbd_conf *mdev,
+ struct meta_data_on_disk *on_disk,
+ struct drbd_md *in_core)
+{
+ u32 al_stripes = be32_to_cpu(on_disk->al_stripes);
+ u32 al_stripe_size_4k = be32_to_cpu(on_disk->al_stripe_size_4k);
+ u64 al_size_4k;
+
+ /* both not set: default to old fixed size activity log */
+ if (al_stripes == 0 && al_stripe_size_4k == 0) {
+ al_stripes = 1;
+ al_stripe_size_4k = MD_32kB_SECT/8;
+ }
+
+ /* some paranoia plausibility checks */
+
+ /* we need both values to be set */
+ if (al_stripes == 0 || al_stripe_size_4k == 0)
+ goto err;
+
+ al_size_4k = (u64)al_stripes * al_stripe_size_4k;
+
+ /* Upper limit of activity log area, to avoid potential overflow
+ * problems in al_tr_number_to_on_disk_sector(). As right now, more
+ * than 72 * 4k blocks total only increases the amount of history,
+ * limiting this arbitrarily to 16 GB is not a real limitation ;-) */
+ if (al_size_4k > (16 * 1024 * 1024/4))
+ goto err;
+
+ /* Lower limit: we need at least 8 transaction slots (32kB)
+ * to not break existing setups */
+ if (al_size_4k < MD_32kB_SECT/8)
+ goto err;
+
+ in_core->al_stripe_size_4k = al_stripe_size_4k;
+ in_core->al_stripes = al_stripes;
+ in_core->al_size_4k = al_size_4k;
+
+ return 0;
+err:
+ dev_err(DEV, "invalid activity log striping: al_stripes=%u, al_stripe_size_4k=%u\n",
+ al_stripes, al_stripe_size_4k);
+ return -EINVAL;
+}
+
/**
* drbd_md_read() - Reads in the meta data super block
* @mdev: DRBD device.
* @bdev: Device from which the meta data should be read in.
*
- * Return 0 (NO_ERROR) on success, and an enum drbd_ret_code in case
+ * Return NO_ERROR on success, and an enum drbd_ret_code in case
* something goes wrong.
+ *
+ * Called exactly once during drbd_adm_attach()
*/
int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
{
@@ -2940,6 +2994,10 @@ int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
if (!buffer)
goto out;

+ /* First, figure out where our meta data superblock is located. */
+ bdev->md.meta_dev_idx = bdev->disk_conf->meta_dev_idx;
+ bdev->md.md_offset = drbd_md_ss(bdev);
+
if (drbd_md_sync_page_io(mdev, bdev, bdev->md.md_offset, READ)) {
/* NOTE: can't do normal error processing here as this is
called BEFORE disk is attached */
@@ -2957,40 +3015,43 @@ int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
rv = ERR_MD_UNCLEAN;
goto err;
}
+
+ rv = ERR_MD_INVALID;
if (magic != DRBD_MD_MAGIC_08) {
if (magic == DRBD_MD_MAGIC_07)
dev_err(DEV, "Found old (0.7) meta data magic. Did you \"drbdadm create-md\"?\n");
else
dev_err(DEV, "Meta data magic not found. Did you \"drbdadm create-md\"?\n");
- rv = ERR_MD_INVALID;
goto err;
}
+
+ if (check_activity_log_stripe_size(mdev, buffer, &bdev->md))
+ goto err;
+
if (be32_to_cpu(buffer->al_offset) != bdev->md.al_offset) {
dev_err(DEV, "unexpected al_offset: %d (expected %d)\n",
be32_to_cpu(buffer->al_offset), bdev->md.al_offset);
- rv = ERR_MD_INVALID;
goto err;
}
if (be32_to_cpu(buffer->bm_offset) != bdev->md.bm_offset) {
dev_err(DEV, "unexpected bm_offset: %d (expected %d)\n",
be32_to_cpu(buffer->bm_offset), bdev->md.bm_offset);
- rv = ERR_MD_INVALID;
goto err;
}
if (be32_to_cpu(buffer->md_size_sect) != bdev->md.md_size_sect) {
dev_err(DEV, "unexpected md_size: %u (expected %u)\n",
be32_to_cpu(buffer->md_size_sect), bdev->md.md_size_sect);
- rv = ERR_MD_INVALID;
goto err;
}

if (be32_to_cpu(buffer->bm_bytes_per_bit) != BM_BLOCK_SIZE) {
dev_err(DEV, "unexpected bm_bytes_per_bit: %u (expected %u)\n",
be32_to_cpu(buffer->bm_bytes_per_bit), BM_BLOCK_SIZE);
- rv = ERR_MD_INVALID;
goto err;
}

+ rv = NO_ERROR;
+
bdev->md.la_size_sect = be64_to_cpu(buffer->la_size);
for (i = UI_CURRENT; i < UI_SIZE; i++)
bdev->md.uuid[i] = be64_to_cpu(buffer->uuid[i]);
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 581f680..104b7ce 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -727,24 +727,23 @@ static void drbd_md_set_sector_offsets(struct drbd_conf *mdev,
rcu_read_lock();
meta_dev_idx = rcu_dereference(bdev->disk_conf)->meta_dev_idx;

+ bdev->md.md_offset = drbd_md_ss(bdev);
+
switch (meta_dev_idx) {
default:
/* v07 style fixed size indexed meta data */
bdev->md.md_size_sect = MD_128MB_SECT;
- bdev->md.md_offset = drbd_md_ss__(mdev, bdev);
bdev->md.al_offset = MD_4kB_SECT;
bdev->md.bm_offset = MD_4kB_SECT + al_size_sect;
break;
case DRBD_MD_INDEX_FLEX_EXT:
/* just occupy the full device; unit: sectors */
bdev->md.md_size_sect = drbd_get_capacity(bdev->md_bdev);
- bdev->md.md_offset = 0;
bdev->md.al_offset = MD_4kB_SECT;
bdev->md.bm_offset = MD_4kB_SECT + al_size_sect;
break;
case DRBD_MD_INDEX_INTERNAL:
case DRBD_MD_INDEX_FLEX_INT:
- bdev->md.md_offset = drbd_md_ss__(mdev, bdev);
/* al size is still fixed */
bdev->md.al_offset = -al_size_sect;
/* we need (slightly less than) ~ this much bitmap sectors: */
--
1.7.9.5

2013-03-19 17:20:06

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 08/18] drbd: drbd_al_being_io: short circuit to reduce latency

From: Lars Ellenberg <[email protected]>

A request hitting an already "hot" extent should proceed right away,
even if some other requests need to wait for pending transactions.

Without that short-circuit, several simultaneous make_request contexts
race for committing the transaction, possibly penalizing the innocent.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 82199d9..1d7244d 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -256,6 +256,7 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool dele
unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) >> (AL_EXTENT_SHIFT-9);
unsigned enr;
+ bool need_transaction = false;
bool locked = false;

/* When called through generic_make_request(), we must delegate
@@ -273,8 +274,17 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool dele
D_ASSERT(first <= last);
D_ASSERT(atomic_read(&mdev->local_cnt) > 0);

- for (enr = first; enr <= last; enr++)
- wait_event(mdev->al_wait, _al_get(mdev, enr) != NULL);
+ for (enr = first; enr <= last; enr++) {
+ struct lc_element *al_ext;
+ wait_event(mdev->al_wait, (al_ext = _al_get(mdev, enr)) != NULL);
+ if (al_ext->lc_number != enr)
+ need_transaction = true;
+ }
+
+ /* If *this* request was to an already active extent,
+ * we're done, even if there are pending changes. */
+ if (!need_transaction)
+ return;

/* Serialize multiple transactions.
* This uses test_and_set_bit, memory barrier is implicit.
--
1.7.9.5

2013-03-19 17:20:04

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 07/18] drbd: Clarify when activity log I/O is delegated to the worker thread

From: Lars Ellenberg <[email protected]>

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 49 ++++++++++++++++++++----------------
drivers/block/drbd/drbd_int.h | 2 +-
drivers/block/drbd/drbd_receiver.c | 2 +-
drivers/block/drbd/drbd_req.c | 2 +-
drivers/block/drbd/drbd_worker.c | 2 +-
5 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index c79625a..82199d9 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -104,7 +104,7 @@ struct update_al_work {
int err;
};

-static int al_write_transaction(struct drbd_conf *mdev);
+static int al_write_transaction(struct drbd_conf *mdev, bool delegate);

void *drbd_md_get_buffer(struct drbd_conf *mdev)
{
@@ -246,7 +246,10 @@ static struct lc_element *_al_get(struct drbd_conf *mdev, unsigned int enr)
return al_ext;
}

-void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i)
+/*
+ * @delegate: delegate activity log I/O to the worker thread
+ */
+void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool delegate)
{
/* for bios crossing activity log extent boundaries,
* we may need to activate two extents in one go */
@@ -255,6 +258,17 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i)
unsigned enr;
bool locked = false;

+ /* When called through generic_make_request(), we must delegate
+ * activity log I/O to the worker thread: a further request
+ * submitted via generic_make_request() within the same task
+ * would be queued on current->bio_list, and would only start
+ * after this function returns (see generic_make_request()).
+ *
+ * However, if we *are* the worker, we must not delegate to ourselves.
+ */
+
+ if (delegate)
+ BUG_ON(current == mdev->tconn->worker.task);

D_ASSERT(first <= last);
D_ASSERT(atomic_read(&mdev->local_cnt) > 0);
@@ -270,13 +284,6 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i)
(locked = lc_try_lock_for_transaction(mdev->act_log)));

if (locked) {
- /* drbd_al_write_transaction(mdev,al_ext,enr);
- * recurses into generic_make_request(), which
- * disallows recursion, bios being serialized on the
- * current->bio_tail list now.
- * we have to delegate updates to the activity log
- * to the worker thread. */
-
/* Double check: it may have been committed by someone else,
* while we have been waiting for the lock. */
if (mdev->act_log->pending_changes) {
@@ -287,7 +294,7 @@ void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i)
rcu_read_unlock();

if (write_al_updates) {
- al_write_transaction(mdev);
+ al_write_transaction(mdev, delegate);
mdev->al_writ_cnt++;
}

@@ -495,20 +502,18 @@ static int w_al_write_transaction(struct drbd_work *w, int unused)
/* Calls from worker context (see w_restart_disk_io()) need to write the
transaction directly. Others came through generic_make_request(),
those need to delegate it to the worker. */
-static int al_write_transaction(struct drbd_conf *mdev)
+static int al_write_transaction(struct drbd_conf *mdev, bool delegate)
{
- struct update_al_work al_work;
-
- if (current == mdev->tconn->worker.task)
+ if (delegate) {
+ struct update_al_work al_work;
+ init_completion(&al_work.event);
+ al_work.w.cb = w_al_write_transaction;
+ al_work.w.mdev = mdev;
+ drbd_queue_work_front(&mdev->tconn->sender_work, &al_work.w);
+ wait_for_completion(&al_work.event);
+ return al_work.err;
+ } else
return _al_write_transaction(mdev);
-
- init_completion(&al_work.event);
- al_work.w.cb = w_al_write_transaction;
- al_work.w.mdev = mdev;
- drbd_queue_work_front(&mdev->tconn->sender_work, &al_work.w);
- wait_for_completion(&al_work.event);
-
- return al_work.err;
}

static int _try_lc_del(struct drbd_conf *mdev, struct lc_element *al_ext)
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 6eecdec..453fccf 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1598,7 +1598,7 @@ extern const char *drbd_conn_str(enum drbd_conns s);
extern const char *drbd_role_str(enum drbd_role s);

/* drbd_actlog.c */
-extern void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i);
+extern void drbd_al_begin_io(struct drbd_conf *mdev, struct drbd_interval *i, bool delegate);
extern void drbd_al_complete_io(struct drbd_conf *mdev, struct drbd_interval *i);
extern void drbd_rs_complete_io(struct drbd_conf *mdev, sector_t sector);
extern int drbd_rs_begin_io(struct drbd_conf *mdev, sector_t sector);
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 8172a2c..1921871 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -2265,7 +2265,7 @@ static int receive_Data(struct drbd_tconn *tconn, struct packet_info *pi)
drbd_set_out_of_sync(mdev, peer_req->i.sector, peer_req->i.size);
peer_req->flags |= EE_CALL_AL_COMPLETE_IO;
peer_req->flags &= ~EE_MAY_SET_IN_SYNC;
- drbd_al_begin_io(mdev, &peer_req->i);
+ drbd_al_begin_io(mdev, &peer_req->i, true);
}

err = drbd_submit_peer_request(mdev, peer_req, rw, DRBD_FAULT_DT_WR);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 2b8303a..7d1ff1a 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1054,7 +1054,7 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
if (rw == WRITE && req->private_bio && req->i.size
&& !test_bit(AL_SUSPENDED, &mdev->flags)) {
req->rq_state |= RQ_IN_ACT_LOG;
- drbd_al_begin_io(mdev, &req->i);
+ drbd_al_begin_io(mdev, &req->i, true);
}

spin_lock_irq(&mdev->tconn->req_lock);
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 34b5d5d..f41e224 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -1411,7 +1411,7 @@ int w_restart_disk_io(struct drbd_work *w, int cancel)
struct drbd_conf *mdev = w->mdev;

if (bio_data_dir(req->master_bio) == WRITE && req->rq_state & RQ_IN_ACT_LOG)
- drbd_al_begin_io(mdev, &req->i);
+ drbd_al_begin_io(mdev, &req->i, false);

drbd_req_make_private_bio(req, req->master_bio);
req->private_bio->bi_bdev = mdev->ldev->backing_bdev;
--
1.7.9.5

2013-03-19 17:20:01

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 10/18] drbd: prepare to queue write requests on a submit worker

From: Lars Ellenberg <[email protected]>

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_int.h | 13 +++++++++++++
drivers/block/drbd/drbd_main.c | 25 ++++++++++++++++++++++++-
drivers/block/drbd/drbd_nl.c | 1 +
drivers/block/drbd/drbd_req.c | 29 +++++++++++++++++++++++++++++
4 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 453fccf..a6b71b6 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -894,6 +894,14 @@ struct drbd_tconn { /* is a resource from the config file */
} send;
};

+struct submit_worker {
+ struct workqueue_struct *wq;
+ struct work_struct worker;
+
+ spinlock_t lock;
+ struct list_head writes;
+};
+
struct drbd_conf {
struct drbd_tconn *tconn;
int vnr; /* volume number within the connection */
@@ -1034,6 +1042,10 @@ struct drbd_conf {
atomic_t ap_in_flight; /* App sectors in flight (waiting for ack) */
unsigned int peer_max_bio_size;
unsigned int local_max_bio_size;
+
+ /* any requests that would block in drbd_make_request()
+ * are deferred to this single-threaded work queue */
+ struct submit_worker submit;
};

static inline struct drbd_conf *minor_to_mdev(unsigned int minor)
@@ -1440,6 +1452,7 @@ extern void conn_free_crypto(struct drbd_tconn *tconn);
extern int proc_details;

/* drbd_req */
+extern void do_submit(struct work_struct *ws);
extern void __drbd_make_request(struct drbd_conf *, struct bio *, unsigned long);
extern void drbd_make_request(struct request_queue *q, struct bio *bio);
extern int drbd_read_remote(struct drbd_conf *mdev, struct drbd_request *req);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 3d212b9..c84226e 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -45,7 +45,7 @@
#include <linux/reboot.h>
#include <linux/notifier.h>
#include <linux/kthread.h>
-
+#include <linux/workqueue.h>
#define __KERNEL_SYSCALLS__
#include <linux/unistd.h>
#include <linux/vmalloc.h>
@@ -2300,6 +2300,7 @@ static void drbd_cleanup(void)
idr_for_each_entry(&minors, mdev, i) {
idr_remove(&minors, mdev_to_minor(mdev));
idr_remove(&mdev->tconn->volumes, mdev->vnr);
+ destroy_workqueue(mdev->submit.wq);
del_gendisk(mdev->vdisk);
/* synchronize_rcu(); No other threads running at this point */
kref_put(&mdev->kref, &drbd_minor_destroy);
@@ -2589,6 +2590,21 @@ void conn_destroy(struct kref *kref)
kfree(tconn);
}

+int init_submitter(struct drbd_conf *mdev)
+{
+ /* opencoded create_singlethread_workqueue(),
+ * to be able to say "drbd%d", ..., minor */
+ mdev->submit.wq = alloc_workqueue("drbd%u_submit",
+ WQ_UNBOUND | WQ_MEM_RECLAIM, 1, mdev->minor);
+ if (!mdev->submit.wq)
+ return -ENOMEM;
+
+ INIT_WORK(&mdev->submit.worker, do_submit);
+ spin_lock_init(&mdev->submit.lock);
+ INIT_LIST_HEAD(&mdev->submit.writes);
+ return 0;
+}
+
enum drbd_ret_code conn_new_minor(struct drbd_tconn *tconn, unsigned int minor, int vnr)
{
struct drbd_conf *mdev;
@@ -2679,6 +2695,13 @@ enum drbd_ret_code conn_new_minor(struct drbd_tconn *tconn, unsigned int minor,
drbd_msg_put_info("requested volume exists already");
goto out_idr_remove_vol;
}
+
+ if (init_submitter(mdev)) {
+ err = ERR_NOMEM;
+ drbd_msg_put_info("unable to create submit workqueue");
+ goto out_idr_remove_vol;
+ }
+
add_disk(disk);
kref_init(&mdev->kref); /* one ref for both idrs and the the add_disk */

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 974ea47..bcf900b 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -3173,6 +3173,7 @@ static enum drbd_ret_code adm_delete_minor(struct drbd_conf *mdev)
CS_VERBOSE + CS_WAIT_COMPLETE);
idr_remove(&mdev->tconn->volumes, mdev->vnr);
idr_remove(&minors, mdev_to_minor(mdev));
+ destroy_workqueue(mdev->submit.wq);
del_gendisk(mdev->vdisk);
synchronize_rcu();
kref_put(&mdev->kref, &drbd_minor_destroy);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 96d5968..4af709e 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1160,6 +1160,35 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
drbd_send_and_submit(mdev, req);
}

+void __drbd_make_request_from_worker(struct drbd_conf *mdev, struct drbd_request *req)
+{
+ const int rw = bio_rw(req->master_bio);
+
+ if (rw == WRITE && req->private_bio && req->i.size
+ && !test_bit(AL_SUSPENDED, &mdev->flags)) {
+ drbd_al_begin_io(mdev, &req->i, false);
+ req->rq_state |= RQ_IN_ACT_LOG;
+ }
+ drbd_send_and_submit(mdev, req);
+}
+
+
+void do_submit(struct work_struct *ws)
+{
+ struct drbd_conf *mdev = container_of(ws, struct drbd_conf, submit.worker);
+ LIST_HEAD(writes);
+ struct drbd_request *req, *tmp;
+
+ spin_lock(&mdev->submit.lock);
+ list_splice_init(&mdev->submit.writes, &writes);
+ spin_unlock(&mdev->submit.lock);
+
+ list_for_each_entry_safe(req, tmp, &writes, tl_requests) {
+ list_del_init(&req->tl_requests);
+ __drbd_make_request_from_worker(mdev, req);
+ }
+}
+
void drbd_make_request(struct request_queue *q, struct bio *bio)
{
struct drbd_conf *mdev = (struct drbd_conf *) q->queuedata;
--
1.7.9.5

2013-03-19 17:19:59

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 09/18] drbd: split __drbd_make_request in before and after drbd_al_begin_io

From: Lars Ellenberg <[email protected]>

This is in preparation to be able to defer requests that need to wait
for an activity log transaction to a submitter workqueue.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_req.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 7d1ff1a..96d5968 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -34,14 +34,14 @@
static bool drbd_may_do_local_read(struct drbd_conf *mdev, sector_t sector, int size);

/* Update disk stats at start of I/O request */
-static void _drbd_start_io_acct(struct drbd_conf *mdev, struct drbd_request *req, struct bio *bio)
+static void _drbd_start_io_acct(struct drbd_conf *mdev, struct drbd_request *req)
{
- const int rw = bio_data_dir(bio);
+ const int rw = bio_data_dir(req->master_bio);
int cpu;
cpu = part_stat_lock();
part_round_stats(cpu, &mdev->vdisk->part0);
part_stat_inc(cpu, &mdev->vdisk->part0, ios[rw]);
- part_stat_add(cpu, &mdev->vdisk->part0, sectors[rw], bio_sectors(bio));
+ part_stat_add(cpu, &mdev->vdisk->part0, sectors[rw], req->i.size >> 9);
(void) cpu; /* The macro invocations above want the cpu argument, I do not like
the compiler warning about cpu only assigned but never used... */
part_inc_in_flight(&mdev->vdisk->part0, rw);
@@ -1020,12 +1020,16 @@ drbd_submit_req_private_bio(struct drbd_request *req)
bio_endio(bio, -EIO);
}

-void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long start_time)
+/* returns the new drbd_request pointer, if the caller is expected to
+ * drbd_send_and_submit() it (to save latency), or NULL if we queued the
+ * request on the submitter thread.
+ * Returns ERR_PTR(-ENOMEM) if we cannot allocate a drbd_request.
+ */
+struct drbd_request *
+drbd_request_prepare(struct drbd_conf *mdev, struct bio *bio, unsigned long start_time)
{
- const int rw = bio_rw(bio);
- struct bio_and_error m = { NULL, };
+ const int rw = bio_data_dir(bio);
struct drbd_request *req;
- bool no_remote = false;

/* allocate outside of all locks; */
req = drbd_req_new(mdev, bio);
@@ -1035,7 +1039,7 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
* if user cannot handle io errors, that's not our business. */
dev_err(DEV, "could not kmalloc() req\n");
bio_endio(bio, -ENOMEM);
- return;
+ return ERR_PTR(-ENOMEM);
}
req->start_time = start_time;

@@ -1057,6 +1061,15 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
drbd_al_begin_io(mdev, &req->i, true);
}

+ return req;
+}
+
+static void drbd_send_and_submit(struct drbd_conf *mdev, struct drbd_request *req)
+{
+ const int rw = bio_rw(req->master_bio);
+ struct bio_and_error m = { NULL, };
+ bool no_remote = false;
+
spin_lock_irq(&mdev->tconn->req_lock);
if (rw == WRITE) {
/* This may temporarily give up the req_lock,
@@ -1079,7 +1092,7 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
}

/* Update disk stats */
- _drbd_start_io_acct(mdev, req, bio);
+ _drbd_start_io_acct(mdev, req);

/* We fail READ/READA early, if we can not serve it.
* We must do this before req is registered on any lists.
@@ -1137,7 +1150,14 @@ out:

if (m.bio)
complete_master_bio(mdev, &m);
- return;
+}
+
+void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long start_time)
+{
+ struct drbd_request *req = drbd_request_prepare(mdev, bio, start_time);
+ if (IS_ERR_OR_NULL(req))
+ return;
+ drbd_send_and_submit(mdev, req);
}

void drbd_make_request(struct request_queue *q, struct bio *bio)
--
1.7.9.5

2013-03-19 17:21:14

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 05/18] drbd: mechanically rename la_size to la_size_sect

From: Lars Ellenberg <[email protected]>

Make it obvious that this value is in units of 512 Byte sectors.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_main.c | 6 +++---
drivers/block/drbd/drbd_nl.c | 16 ++++++++--------
drivers/block/drbd/drbd_receiver.c | 2 +-
3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index b9bfb10..d8bbb41 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2839,7 +2839,7 @@ void conn_md_sync(struct drbd_tconn *tconn)

/* aligned 4kByte */
struct meta_data_on_disk {
- u64 la_size; /* last agreed size. */
+ u64 la_size_sect; /* last agreed size. */
u64 uuid[UI_SIZE]; /* UUIDs. */
u64 device_uuid;
u64 reserved_u64_1;
@@ -2890,7 +2890,7 @@ void drbd_md_sync(struct drbd_conf *mdev)

memset(buffer, 0, sizeof(*buffer));

- buffer->la_size = cpu_to_be64(drbd_get_capacity(mdev->this_bdev));
+ buffer->la_size_sect = cpu_to_be64(drbd_get_capacity(mdev->this_bdev));
for (i = UI_CURRENT; i < UI_SIZE; i++)
buffer->uuid[i] = cpu_to_be64(mdev->ldev->md.uuid[i]);
buffer->flags = cpu_to_be32(mdev->ldev->md.flags);
@@ -3052,7 +3052,7 @@ int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)

rv = NO_ERROR;

- bdev->md.la_size_sect = be64_to_cpu(buffer->la_size);
+ bdev->md.la_size_sect = be64_to_cpu(buffer->la_size_sect);
for (i = UI_CURRENT; i < UI_SIZE; i++)
bdev->md.uuid[i] = be64_to_cpu(buffer->uuid[i]);
bdev->md.flags = be32_to_cpu(buffer->flags);
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 5621df8..d5211b0 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -819,7 +819,7 @@ void drbd_resume_io(struct drbd_conf *mdev)
enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds_flags flags) __must_hold(local)
{
sector_t prev_first_sect, prev_size; /* previous meta location */
- sector_t la_size, u_size;
+ sector_t la_size_sect, u_size;
sector_t size;
char ppb[10];

@@ -842,7 +842,7 @@ enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds

prev_first_sect = drbd_md_first_sector(mdev->ldev);
prev_size = mdev->ldev->md.md_size_sect;
- la_size = mdev->ldev->md.la_size_sect;
+ la_size_sect = mdev->ldev->md.la_size_sect;

/* TODO: should only be some assert here, not (re)init... */
drbd_md_set_sector_offsets(mdev, mdev->ldev);
@@ -878,7 +878,7 @@ enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds
if (rv == dev_size_error)
goto out;

- la_size_changed = (la_size != mdev->ldev->md.la_size_sect);
+ la_size_changed = (la_size_sect != mdev->ldev->md.la_size_sect);

md_moved = prev_first_sect != drbd_md_first_sector(mdev->ldev)
|| prev_size != mdev->ldev->md.md_size_sect;
@@ -900,9 +900,9 @@ enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds
drbd_md_mark_dirty(mdev);
}

- if (size > la_size)
+ if (size > la_size_sect)
rv = grew;
- if (size < la_size)
+ if (size < la_size_sect)
rv = shrunk;
out:
lc_unlock(mdev->act_log);
@@ -917,7 +917,7 @@ drbd_new_dev_size(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,
sector_t u_size, int assume_peer_has_space)
{
sector_t p_size = mdev->p_size; /* partner's disk size. */
- sector_t la_size = bdev->md.la_size_sect; /* last agreed size. */
+ sector_t la_size_sect = bdev->md.la_size_sect; /* last agreed size. */
sector_t m_size; /* my size */
sector_t size = 0;

@@ -931,8 +931,8 @@ drbd_new_dev_size(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,
if (p_size && m_size) {
size = min_t(sector_t, p_size, m_size);
} else {
- if (la_size) {
- size = la_size;
+ if (la_size_sect) {
+ size = la_size_sect;
if (m_size && m_size < size)
size = m_size;
if (p_size && p_size < size)
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index a9eccfc..8172a2c 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3992,7 +3992,7 @@ static int receive_state(struct drbd_tconn *tconn, struct packet_info *pi)

clear_bit(DISCARD_MY_DATA, &mdev->flags);

- drbd_md_sync(mdev); /* update connected indicator, la_size, ... */
+ drbd_md_sync(mdev); /* update connected indicator, la_size_sect, ... */

return 0;
}
--
1.7.9.5

2013-03-19 17:21:12

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 06/18] drbd: read meta data early, base on-disk offsets on super block

From: Lars Ellenberg <[email protected]>

We used to calculate all on-disk meta data offsets, and then compare
the stored offsets, basically treating them as magic numbers.

Now with the activity log striping, the activity log size is no longer
fixed. We need to first read the super block, then base the activity
log and bitmap offsets on the stored offsets/al stripe settings.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 11 +++-
drivers/block/drbd/drbd_main.c | 131 +++++++++++++++++++++++++++++++-------
drivers/block/drbd/drbd_nl.c | 15 ++---
drivers/block/drbd/drbd_worker.c | 3 +-
4 files changed, 123 insertions(+), 37 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 7e7680e..c79625a 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -168,7 +168,11 @@ static int _drbd_md_sync_page_io(struct drbd_conf *mdev,
bio->bi_end_io = drbd_md_io_complete;
bio->bi_rw = rw;

- if (!get_ldev_if_state(mdev, D_ATTACHING)) { /* Corresponding put_ldev in drbd_md_io_complete() */
+ if (!(rw & WRITE) && mdev->state.disk == D_DISKLESS && mdev->ldev == NULL)
+ /* special case, drbd_md_read() during drbd_adm_attach(): no get_ldev */
+ ;
+ else if (!get_ldev_if_state(mdev, D_ATTACHING)) {
+ /* Corresponding put_ldev in drbd_md_io_complete() */
dev_err(DEV, "ASSERT FAILED: get_ldev_if_state() == 1 in _drbd_md_sync_page_io()\n");
err = -ENODEV;
goto out;
@@ -199,9 +203,10 @@ int drbd_md_sync_page_io(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,

BUG_ON(!bdev->md_bdev);

- dev_dbg(DEV, "meta_data io: %s [%d]:%s(,%llus,%s)\n",
+ dev_dbg(DEV, "meta_data io: %s [%d]:%s(,%llus,%s) %pS\n",
current->comm, current->pid, __func__,
- (unsigned long long)sector, (rw & WRITE) ? "WRITE" : "READ");
+ (unsigned long long)sector, (rw & WRITE) ? "WRITE" : "READ",
+ (void*)_RET_IP_ );

if (sector < drbd_md_first_sector(bdev) ||
sector + 7 > drbd_md_last_sector(bdev))
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index d8bbb41..3d212b9 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2971,6 +2971,86 @@ err:
return -EINVAL;
}

+static int check_offsets_and_sizes(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
+{
+ sector_t capacity = drbd_get_capacity(bdev->md_bdev);
+ struct drbd_md *in_core = &bdev->md;
+ s32 on_disk_al_sect;
+ s32 on_disk_bm_sect;
+
+ /* The on-disk size of the activity log, calculated from offsets, and
+ * the size of the activity log calculated from the stripe settings,
+ * should match.
+ * Though we could relax this a bit: it is ok, if the striped activity log
+ * fits in the available on-disk activity log size.
+ * Right now, that would break how resize is implemented.
+ * TODO: make drbd_determine_dev_size() (and the drbdmeta tool) aware
+ * of possible unused padding space in the on disk layout. */
+ if (in_core->al_offset < 0) {
+ if (in_core->bm_offset > in_core->al_offset)
+ goto err;
+ on_disk_al_sect = -in_core->al_offset;
+ on_disk_bm_sect = in_core->al_offset - in_core->bm_offset;
+ } else {
+ if (in_core->al_offset != MD_4kB_SECT)
+ goto err;
+ if (in_core->bm_offset < in_core->al_offset + in_core->al_size_4k * MD_4kB_SECT)
+ goto err;
+
+ on_disk_al_sect = in_core->bm_offset - MD_4kB_SECT;
+ on_disk_bm_sect = in_core->md_size_sect - in_core->bm_offset;
+ }
+
+ /* old fixed size meta data is exactly that: fixed. */
+ if (in_core->meta_dev_idx >= 0) {
+ if (in_core->md_size_sect != MD_128MB_SECT
+ || in_core->al_offset != MD_4kB_SECT
+ || in_core->bm_offset != MD_4kB_SECT + MD_32kB_SECT
+ || in_core->al_stripes != 1
+ || in_core->al_stripe_size_4k != MD_32kB_SECT/8)
+ goto err;
+ }
+
+ if (capacity < in_core->md_size_sect)
+ goto err;
+ if (capacity - in_core->md_size_sect < drbd_md_first_sector(bdev))
+ goto err;
+
+ /* should be aligned, and at least 32k */
+ if ((on_disk_al_sect & 7) || (on_disk_al_sect < MD_32kB_SECT))
+ goto err;
+
+ /* should fit (for now: exactly) into the available on-disk space;
+ * overflow prevention is in check_activity_log_stripe_size() above. */
+ if (on_disk_al_sect != in_core->al_size_4k * MD_4kB_SECT)
+ goto err;
+
+ /* again, should be aligned */
+ if (in_core->bm_offset & 7)
+ goto err;
+
+ /* FIXME check for device grow with flex external meta data? */
+
+ /* can the available bitmap space cover the last agreed device size? */
+ if (on_disk_bm_sect < (in_core->la_size_sect+7)/MD_4kB_SECT/8/512)
+ goto err;
+
+ return 0;
+
+err:
+ dev_err(DEV, "meta data offsets don't make sense: idx=%d "
+ "al_s=%u, al_sz4k=%u, al_offset=%d, bm_offset=%d, "
+ "md_size_sect=%u, la_size=%llu, md_capacity=%llu\n",
+ in_core->meta_dev_idx,
+ in_core->al_stripes, in_core->al_stripe_size_4k,
+ in_core->al_offset, in_core->bm_offset, in_core->md_size_sect,
+ (unsigned long long)in_core->la_size_sect,
+ (unsigned long long)capacity);
+
+ return -EINVAL;
+}
+
+
/**
* drbd_md_read() - Reads in the meta data super block
* @mdev: DRBD device.
@@ -2979,7 +3059,8 @@ err:
* Return NO_ERROR on success, and an enum drbd_ret_code in case
* something goes wrong.
*
- * Called exactly once during drbd_adm_attach()
+ * Called exactly once during drbd_adm_attach(), while still being D_DISKLESS,
+ * even before @bdev is assigned to @mdev->ldev.
*/
int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
{
@@ -2987,14 +3068,15 @@ int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
u32 magic, flags;
int i, rv = NO_ERROR;

- if (!get_ldev_if_state(mdev, D_ATTACHING))
- return ERR_IO_MD_DISK;
+ if (mdev->state.disk != D_DISKLESS)
+ return ERR_DISK_CONFIGURED;

buffer = drbd_md_get_buffer(mdev);
if (!buffer)
- goto out;
+ return ERR_NOMEM;

- /* First, figure out where our meta data superblock is located. */
+ /* First, figure out where our meta data superblock is located,
+ * and read it. */
bdev->md.meta_dev_idx = bdev->disk_conf->meta_dev_idx;
bdev->md.md_offset = drbd_md_ss(bdev);

@@ -3025,14 +3107,29 @@ int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
goto err;
}

- if (check_activity_log_stripe_size(mdev, buffer, &bdev->md))
+ if (be32_to_cpu(buffer->bm_bytes_per_bit) != BM_BLOCK_SIZE) {
+ dev_err(DEV, "unexpected bm_bytes_per_bit: %u (expected %u)\n",
+ be32_to_cpu(buffer->bm_bytes_per_bit), BM_BLOCK_SIZE);
goto err;
+ }

- if (be32_to_cpu(buffer->al_offset) != bdev->md.al_offset) {
- dev_err(DEV, "unexpected al_offset: %d (expected %d)\n",
- be32_to_cpu(buffer->al_offset), bdev->md.al_offset);
+
+ /* convert to in_core endian */
+ bdev->md.la_size_sect = be64_to_cpu(buffer->la_size_sect);
+ for (i = UI_CURRENT; i < UI_SIZE; i++)
+ bdev->md.uuid[i] = be64_to_cpu(buffer->uuid[i]);
+ bdev->md.flags = be32_to_cpu(buffer->flags);
+ bdev->md.device_uuid = be64_to_cpu(buffer->device_uuid);
+
+ bdev->md.md_size_sect = be32_to_cpu(buffer->md_size_sect);
+ bdev->md.al_offset = be32_to_cpu(buffer->al_offset);
+ bdev->md.bm_offset = be32_to_cpu(buffer->bm_offset);
+
+ if (check_activity_log_stripe_size(mdev, buffer, &bdev->md))
goto err;
- }
+ if (check_offsets_and_sizes(mdev, bdev))
+ goto err;
+
if (be32_to_cpu(buffer->bm_offset) != bdev->md.bm_offset) {
dev_err(DEV, "unexpected bm_offset: %d (expected %d)\n",
be32_to_cpu(buffer->bm_offset), bdev->md.bm_offset);
@@ -3044,20 +3141,8 @@ int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
goto err;
}

- if (be32_to_cpu(buffer->bm_bytes_per_bit) != BM_BLOCK_SIZE) {
- dev_err(DEV, "unexpected bm_bytes_per_bit: %u (expected %u)\n",
- be32_to_cpu(buffer->bm_bytes_per_bit), BM_BLOCK_SIZE);
- goto err;
- }
-
rv = NO_ERROR;

- bdev->md.la_size_sect = be64_to_cpu(buffer->la_size_sect);
- for (i = UI_CURRENT; i < UI_SIZE; i++)
- bdev->md.uuid[i] = be64_to_cpu(buffer->uuid[i]);
- bdev->md.flags = be32_to_cpu(buffer->flags);
- bdev->md.device_uuid = be64_to_cpu(buffer->device_uuid);
-
spin_lock_irq(&mdev->tconn->req_lock);
if (mdev->state.conn < C_CONNECTED) {
unsigned int peer;
@@ -3069,8 +3154,6 @@ int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)

err:
drbd_md_put_buffer(mdev);
- out:
- put_ldev(mdev);

return rv;
}
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index d5211b0..974ea47 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -721,7 +721,7 @@ static void drbd_md_set_sector_offsets(struct drbd_conf *mdev,
struct drbd_backing_dev *bdev)
{
sector_t md_size_sect = 0;
- unsigned int al_size_sect = MD_32kB_SECT;
+ unsigned int al_size_sect = bdev->md.al_size_4k * 8;

bdev->md.md_offset = drbd_md_ss(bdev);

@@ -1413,8 +1413,11 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
goto fail;
}

- /* RT - for drbd_get_max_capacity() DRBD_MD_INDEX_FLEX_INT */
- drbd_md_set_sector_offsets(mdev, nbc);
+ /* Read our meta data super block early.
+ * This also sets other on-disk offsets. */
+ retcode = drbd_md_read(mdev, nbc);
+ if (retcode != NO_ERROR)
+ goto fail;

if (drbd_get_max_capacity(nbc) < new_disk_conf->disk_size) {
dev_err(DEV, "max capacity %llu smaller than disk size %llu\n",
@@ -1481,8 +1484,6 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
if (!get_ldev_if_state(mdev, D_ATTACHING))
goto force_diskless;

- drbd_md_set_sector_offsets(mdev, nbc);
-
if (!mdev->bitmap) {
if (drbd_bm_init(mdev)) {
retcode = ERR_NOMEM;
@@ -1490,10 +1491,6 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
}
}

- retcode = drbd_md_read(mdev, nbc);
- if (retcode != NO_ERROR)
- goto force_diskless_dec;
-
if (mdev->state.conn < C_CONNECTED &&
mdev->state.role == R_PRIMARY &&
(mdev->ed_uuid & ~((u64)1)) != (nbc->md.uuid[UI_CURRENT] & ~((u64)1))) {
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 424dc7b..34b5d5d 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -89,7 +89,8 @@ void drbd_md_io_complete(struct bio *bio, int error)
md_io->done = 1;
wake_up(&mdev->misc_wait);
bio_put(bio);
- put_ldev(mdev);
+ if (mdev->ldev) /* special case: drbd_md_read() during drbd_adm_attach() */
+ put_ldev(mdev);
}

/* reads on behalf of the partner,
--
1.7.9.5

2013-03-19 17:21:47

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 01/18] drbd: cleanup bogus assert message

From: Lars Ellenberg <[email protected]>

This fixes ASSERT( mdev->state.disk == D_FAILED ) in drivers/block/drbd/drbd_main.c

When we detach from local disk, we let the local refcount hit zero twice.

First, we transition to D_FAILED, so we won't give out new references
to incoming requests; we still may give out *internal* references, though.
Once the refcount hits zero [1] while in D_FAILED, we queue a transition
to D_DISKLESS to our worker. We need to queue it, because we may be in
atomic context when putting the reference.
Once the transition to D_DISKLESS actually happened [2] from worker context,
we don't give out new internal references either.

Between hitting zero the first time [1] and actually transition to
D_DISKLESS [2], there may be a few very short lived internal get/put,
so we may hit zero more than once while being in D_FAILED, or even see a
race where a an internal get_ldev() happened while D_FAILED, but the
corresponding put_ldev() happens just after the transition to D_DISKLESS.

That's why we have the additional test_and_set_bit(GO_DISKLESS,);
and that's why the assert was placed wrong.
Since there was exactly one code path left to drbd_go_diskless(),
and that checks already for D_FAILED, drop that assert,
and fold in the drbd_queue_work().

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_int.h | 7 ++++---
drivers/block/drbd/drbd_main.c | 7 -------
2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 6b51afa..db504d0 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1148,7 +1148,6 @@ extern int drbd_bitmap_io_from_worker(struct drbd_conf *mdev,
char *why, enum bm_flag flags);
extern int drbd_bmio_set_n_write(struct drbd_conf *mdev);
extern int drbd_bmio_clear_n_write(struct drbd_conf *mdev);
-extern void drbd_go_diskless(struct drbd_conf *mdev);
extern void drbd_ldev_destroy(struct drbd_conf *mdev);

/* Meta data layout
@@ -2053,9 +2052,11 @@ static inline void put_ldev(struct drbd_conf *mdev)
if (mdev->state.disk == D_DISKLESS)
/* even internal references gone, safe to destroy */
drbd_ldev_destroy(mdev);
- if (mdev->state.disk == D_FAILED)
+ if (mdev->state.disk == D_FAILED) {
/* all application IO references gone. */
- drbd_go_diskless(mdev);
+ if (!test_and_set_bit(GO_DISKLESS, &mdev->flags))
+ drbd_queue_work(&mdev->tconn->sender_work, &mdev->go_diskless);
+ }
wake_up(&mdev->misc_wait);
}
}
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 8c13eeb..6224963 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3255,13 +3255,6 @@ static int w_go_diskless(struct drbd_work *w, int unused)
return 0;
}

-void drbd_go_diskless(struct drbd_conf *mdev)
-{
- D_ASSERT(mdev->state.disk == D_FAILED);
- if (!test_and_set_bit(GO_DISKLESS, &mdev->flags))
- drbd_queue_work(&mdev->tconn->sender_work, &mdev->go_diskless);
-}
-
/**
* drbd_queue_bitmap_io() - Queues an IO operation on the whole bitmap
* @mdev: DRBD device.
--
1.7.9.5

2013-03-19 17:21:45

by Philipp Reisner

[permalink] [raw]
Subject: [PATCH 02/18] drbd: cleanup ondisk meta data layout calculations and defines

From: Lars Ellenberg <[email protected]>

Add a comment about our meta data layout variants,
and rename a few defines (e.g. MD_RESERVED_SECT -> MD_128MB_SECT)
to make it clear that they are short hand for fixed constants,
and not arbitrarily to be redefined as one may see fit.

Properly pad struct meta_data_on_disk to 4kB,
and initialize to zero not only the first 512 Byte,
but all of it in drbd_md_sync().

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 28 ++++++++++---
drivers/block/drbd/drbd_bitmap.c | 13 +++++-
drivers/block/drbd/drbd_int.h | 86 ++++++++++++++++++++++----------------
drivers/block/drbd/drbd_main.c | 11 +++--
drivers/block/drbd/drbd_nl.c | 42 ++++++++++++++-----
5 files changed, 123 insertions(+), 57 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 92510f8..b230d91 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -209,7 +209,8 @@ int drbd_md_sync_page_io(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,
current->comm, current->pid, __func__,
(unsigned long long)sector, (rw & WRITE) ? "WRITE" : "READ");

- err = _drbd_md_sync_page_io(mdev, bdev, iop, sector, rw, MD_BLOCK_SIZE);
+ /* we do all our meta data IO in aligned 4k blocks. */
+ err = _drbd_md_sync_page_io(mdev, bdev, iop, sector, rw, 4096);
if (err) {
dev_err(DEV, "drbd_md_sync_page_io(,%llus,%s) failed with error %d\n",
(unsigned long long)sector, (rw & WRITE) ? "WRITE" : "READ", err);
@@ -350,6 +351,24 @@ static unsigned int rs_extent_to_bm_page(unsigned int rs_enr)
(BM_EXT_SHIFT - BM_BLOCK_SHIFT));
}

+static sector_t al_tr_number_to_on_disk_sector(struct drbd_conf *mdev)
+{
+ const unsigned int stripes = 1;
+ const unsigned int stripe_size_4kB = MD_32kB_SECT/MD_4kB_SECT;
+
+ /* transaction number, modulo on-disk ring buffer wrap around */
+ unsigned int t = mdev->al_tr_number % (stripe_size_4kB * stripes);
+
+ /* ... to aligned 4k on disk block */
+ t = ((t % stripes) * stripe_size_4kB) + t/stripes;
+
+ /* ... to 512 byte sector in activity log */
+ t *= 8;
+
+ /* ... plus offset to the on disk position */
+ return mdev->ldev->md.md_offset + mdev->ldev->md.al_offset + t;
+}
+
static int
_al_write_transaction(struct drbd_conf *mdev)
{
@@ -432,13 +451,12 @@ _al_write_transaction(struct drbd_conf *mdev)
if (mdev->al_tr_cycle >= mdev->act_log->nr_elements)
mdev->al_tr_cycle = 0;

- sector = mdev->ldev->md.md_offset
- + mdev->ldev->md.al_offset
- + mdev->al_tr_pos * (MD_BLOCK_SIZE>>9);
+ sector = al_tr_number_to_on_disk_sector(mdev);

crc = crc32c(0, buffer, 4096);
buffer->crc32c = cpu_to_be32(crc);

+ /* normal execution path goes through all three branches */
if (drbd_bm_write_hinted(mdev))
err = -EIO;
/* drbd_chk_io_error done already */
@@ -446,8 +464,6 @@ _al_write_transaction(struct drbd_conf *mdev)
err = -EIO;
drbd_chk_io_error(mdev, 1, DRBD_META_IO_ERROR);
} else {
- /* advance ringbuffer position and transaction counter */
- mdev->al_tr_pos = (mdev->al_tr_pos + 1) % (MD_AL_SECTORS*512/MD_BLOCK_SIZE);
mdev->al_tr_number++;
}

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 8dc2950..64fbb83 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -612,6 +612,17 @@ static void bm_memset(struct drbd_bitmap *b, size_t offset, int c, size_t len)
}
}

+/* For the layout, see comment above drbd_md_set_sector_offsets(). */
+static u64 drbd_md_on_disk_bits(struct drbd_backing_dev *ldev)
+{
+ u64 bitmap_sectors;
+ if (ldev->md.al_offset == 8)
+ bitmap_sectors = ldev->md.md_size_sect - ldev->md.bm_offset;
+ else
+ bitmap_sectors = ldev->md.al_offset - ldev->md.bm_offset;
+ return bitmap_sectors << (9 + 3);
+}
+
/*
* make sure the bitmap has enough room for the attached storage,
* if necessary, resize.
@@ -668,7 +679,7 @@ int drbd_bm_resize(struct drbd_conf *mdev, sector_t capacity, int set_new_bits)
words = ALIGN(bits, 64) >> LN2_BPL;

if (get_ldev(mdev)) {
- u64 bits_on_disk = ((u64)mdev->ldev->md.md_size_sect-MD_BM_OFFSET) << 12;
+ u64 bits_on_disk = drbd_md_on_disk_bits(mdev->ldev);
put_ldev(mdev);
if (bits > bits_on_disk) {
dev_info(DEV, "bits = %lu\n", bits);
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index db504d0..60c89e5 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -753,13 +753,8 @@ struct drbd_md {
u32 flags;
u32 md_size_sect;

- s32 al_offset; /* signed relative sector offset to al area */
+ s32 al_offset; /* signed relative sector offset to activity log */
s32 bm_offset; /* signed relative sector offset to bitmap */
-
- /* u32 al_nr_extents; important for restoring the AL
- * is stored into ldev->dc.al_extents, which in turn
- * gets applied to act_log->nr_elements
- */
};

struct drbd_backing_dev {
@@ -1009,7 +1004,6 @@ struct drbd_conf {
struct lru_cache *act_log; /* activity log */
unsigned int al_tr_number;
int al_tr_cycle;
- int al_tr_pos; /* position of the next transaction in the journal */
wait_queue_head_t seq_wait;
atomic_t packet_seq;
unsigned int peer_seq;
@@ -1151,21 +1145,41 @@ extern int drbd_bmio_clear_n_write(struct drbd_conf *mdev);
extern void drbd_ldev_destroy(struct drbd_conf *mdev);

/* Meta data layout
- We reserve a 128MB Block (4k aligned)
- * either at the end of the backing device
- * or on a separate meta data device. */
+ *
+ * We currently have two possible layouts.
+ * Offsets in (512 byte) sectors.
+ * external:
+ * |----------- md_size_sect ------------------|
+ * [ 4k superblock ][ activity log ][ Bitmap ]
+ * | al_offset == 8 |
+ * | bm_offset = al_offset + X |
+ * ==> bitmap sectors = md_size_sect - bm_offset
+ *
+ * Variants:
+ * old, indexed fixed size meta data:
+ *
+ * internal:
+ * |----------- md_size_sect ------------------|
+ * [data.....][ Bitmap ][ activity log ][ 4k superblock ][padding*]
+ * | al_offset < 0 |
+ * | bm_offset = al_offset - Y |
+ * ==> bitmap sectors = Y = al_offset - bm_offset
+ *
+ * [padding*] are zero or up to 7 unused 512 Byte sectors to the
+ * end of the device, so that the [4k superblock] will be 4k aligned.
+ *
+ * The activity log consists of 4k transaction blocks,
+ * which are written in a ring-buffer, or striped ring-buffer like fashion,
+ * which are writtensize used to be fixed 32kB,
+ * but is about to become configurable.
+ */

-/* The following numbers are sectors */
-/* Allows up to about 3.8TB, so if you want more,
+/* Our old fixed size meta data layout
+ * allows up to about 3.8TB, so if you want more,
* you need to use the "flexible" meta data format. */
-#define MD_RESERVED_SECT (128LU << 11) /* 128 MB, unit sectors */
-#define MD_AL_OFFSET 8 /* 8 Sectors after start of meta area */
-#define MD_AL_SECTORS 64 /* = 32 kB on disk activity log ring buffer */
-#define MD_BM_OFFSET (MD_AL_OFFSET + MD_AL_SECTORS)
-
-/* we do all meta data IO in 4k blocks */
-#define MD_BLOCK_SHIFT 12
-#define MD_BLOCK_SIZE (1<<MD_BLOCK_SHIFT)
+#define MD_128MB_SECT (128LLU << 11) /* 128 MB, unit sectors */
+#define MD_4kB_SECT 8
+#define MD_32kB_SECT 64

/* One activity log extent represents 4M of storage */
#define AL_EXTENT_SHIFT 22
@@ -1255,7 +1269,6 @@ struct bm_extent {

/* in one sector of the bitmap, we have this many activity_log extents. */
#define AL_EXT_PER_BM_SECT (1 << (BM_EXT_SHIFT - AL_EXTENT_SHIFT))
-#define BM_WORDS_PER_AL_EXT (1 << (AL_EXTENT_SHIFT-BM_BLOCK_SHIFT-LN2_BPL))

#define BM_BLOCKS_PER_BM_EXT_B (BM_EXT_SHIFT - BM_BLOCK_SHIFT)
#define BM_BLOCKS_PER_BM_EXT_MASK ((1<<BM_BLOCKS_PER_BM_EXT_B) - 1)
@@ -1275,16 +1288,18 @@ struct bm_extent {
*/

#define DRBD_MAX_SECTORS_32 (0xffffffffLU)
-#define DRBD_MAX_SECTORS_BM \
- ((MD_RESERVED_SECT - MD_BM_OFFSET) * (1LL<<(BM_EXT_SHIFT-9)))
-#if DRBD_MAX_SECTORS_BM < DRBD_MAX_SECTORS_32
-#define DRBD_MAX_SECTORS DRBD_MAX_SECTORS_BM
-#define DRBD_MAX_SECTORS_FLEX DRBD_MAX_SECTORS_BM
-#elif !defined(CONFIG_LBDAF) && BITS_PER_LONG == 32
+/* we have a certain meta data variant that has a fixed on-disk size of 128
+ * MiB, of which 4k are our "superblock", and 32k are the fixed size activity
+ * log, leaving this many sectors for the bitmap.
+ */
+
+#define DRBD_MAX_SECTORS_FIXED_BM \
+ ((MD_128MB_SECT - MD_32kB_SECT - MD_4kB_SECT) * (1LL<<(BM_EXT_SHIFT-9)))
+#if !defined(CONFIG_LBDAF) && BITS_PER_LONG == 32
#define DRBD_MAX_SECTORS DRBD_MAX_SECTORS_32
#define DRBD_MAX_SECTORS_FLEX DRBD_MAX_SECTORS_32
#else
-#define DRBD_MAX_SECTORS DRBD_MAX_SECTORS_BM
+#define DRBD_MAX_SECTORS DRBD_MAX_SECTORS_FIXED_BM
/* 16 TB in units of sectors */
#if BITS_PER_LONG == 32
/* adjust by one page worth of bitmap,
@@ -1792,10 +1807,10 @@ static inline sector_t drbd_md_last_sector(struct drbd_backing_dev *bdev)
switch (meta_dev_idx) {
case DRBD_MD_INDEX_INTERNAL:
case DRBD_MD_INDEX_FLEX_INT:
- return bdev->md.md_offset + MD_AL_OFFSET - 1;
+ return bdev->md.md_offset + MD_4kB_SECT -1;
case DRBD_MD_INDEX_FLEX_EXT:
default:
- return bdev->md.md_offset + bdev->md.md_size_sect;
+ return bdev->md.md_offset + bdev->md.md_size_sect -1;
}
}

@@ -1861,13 +1876,11 @@ static inline sector_t drbd_md_ss__(struct drbd_conf *mdev,
rcu_read_unlock();

switch (meta_dev_idx) {
- default: /* external, some index */
- return MD_RESERVED_SECT * meta_dev_idx;
+ default: /* external, some index; this is the old fixed size layout */
+ return MD_128MB_SECT * meta_dev_idx;
case DRBD_MD_INDEX_INTERNAL:
/* with drbd08, internal meta data is always "flexible" */
case DRBD_MD_INDEX_FLEX_INT:
- /* sizeof(struct md_on_disk_07) == 4k
- * position: last 4k aligned block of 4k size */
if (!bdev->backing_bdev) {
if (__ratelimit(&drbd_ratelimit_state)) {
dev_err(DEV, "bdev->backing_bdev==NULL\n");
@@ -1875,8 +1888,9 @@ static inline sector_t drbd_md_ss__(struct drbd_conf *mdev,
}
return 0;
}
- return (drbd_get_capacity(bdev->backing_bdev) & ~7ULL)
- - MD_AL_OFFSET;
+ /* sizeof(struct md_on_disk_07) == 4k
+ * position: last 4k aligned block of 4k size */
+ return (drbd_get_capacity(bdev->backing_bdev) & ~7ULL) - 8;
case DRBD_MD_INDEX_FLEX_EXT:
return 0;
}
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 6224963..6c4f0ff 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2837,6 +2837,7 @@ void conn_md_sync(struct drbd_tconn *tconn)
rcu_read_unlock();
}

+/* aligned 4kByte */
struct meta_data_on_disk {
u64 la_size; /* last agreed size. */
u64 uuid[UI_SIZE]; /* UUIDs. */
@@ -2846,13 +2847,13 @@ struct meta_data_on_disk {
u32 magic;
u32 md_size_sect;
u32 al_offset; /* offset to this block */
- u32 al_nr_extents; /* important for restoring the AL */
+ u32 al_nr_extents; /* important for restoring the AL (userspace) */
/* `-- act_log->nr_elements <-- ldev->dc.al_extents */
u32 bm_offset; /* offset to the bitmap, from here */
u32 bm_bytes_per_bit; /* BM_BLOCK_SIZE */
u32 la_peer_max_bio_size; /* last peer max_bio_size */
- u32 reserved_u32[3];

+ u8 reserved_u8[4096 - (7*8 + 8*4)];
} __packed;

/**
@@ -2865,6 +2866,10 @@ void drbd_md_sync(struct drbd_conf *mdev)
sector_t sector;
int i;

+ /* Don't accidentally change the DRBD meta data layout. */
+ BUILD_BUG_ON(UI_SIZE != 4);
+ BUILD_BUG_ON(sizeof(struct meta_data_on_disk) != 4096);
+
del_timer(&mdev->md_sync_timer);
/* timer may be rearmed by drbd_md_mark_dirty() now. */
if (!test_and_clear_bit(MD_DIRTY, &mdev->flags))
@@ -2879,7 +2884,7 @@ void drbd_md_sync(struct drbd_conf *mdev)
if (!buffer)
goto out;

- memset(buffer, 0, 512);
+ memset(buffer, 0, sizeof(*buffer));

buffer->la_size = cpu_to_be64(drbd_get_capacity(mdev->this_bdev));
for (i = UI_CURRENT; i < UI_SIZE; i++)
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 2af26fc..581f680 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -696,12 +696,32 @@ out:
return 0;
}

-/* initializes the md.*_offset members, so we are able to find
- * the on disk meta data */
+/* Initializes the md.*_offset members, so we are able to find
+ * the on disk meta data.
+ *
+ * We currently have two possible layouts:
+ * external:
+ * |----------- md_size_sect ------------------|
+ * [ 4k superblock ][ activity log ][ Bitmap ]
+ * | al_offset == 8 |
+ * | bm_offset = al_offset + X |
+ * ==> bitmap sectors = md_size_sect - bm_offset
+ *
+ * internal:
+ * |----------- md_size_sect ------------------|
+ * [data.....][ Bitmap ][ activity log ][ 4k superblock ]
+ * | al_offset < 0 |
+ * | bm_offset = al_offset - Y |
+ * ==> bitmap sectors = Y = al_offset - bm_offset
+ *
+ * Activity log size used to be fixed 32kB,
+ * but is about to become configurable.
+ */
static void drbd_md_set_sector_offsets(struct drbd_conf *mdev,
struct drbd_backing_dev *bdev)
{
sector_t md_size_sect = 0;
+ unsigned int al_size_sect = MD_32kB_SECT;
int meta_dev_idx;

rcu_read_lock();
@@ -710,23 +730,23 @@ static void drbd_md_set_sector_offsets(struct drbd_conf *mdev,
switch (meta_dev_idx) {
default:
/* v07 style fixed size indexed meta data */
- bdev->md.md_size_sect = MD_RESERVED_SECT;
+ bdev->md.md_size_sect = MD_128MB_SECT;
bdev->md.md_offset = drbd_md_ss__(mdev, bdev);
- bdev->md.al_offset = MD_AL_OFFSET;
- bdev->md.bm_offset = MD_BM_OFFSET;
+ bdev->md.al_offset = MD_4kB_SECT;
+ bdev->md.bm_offset = MD_4kB_SECT + al_size_sect;
break;
case DRBD_MD_INDEX_FLEX_EXT:
/* just occupy the full device; unit: sectors */
bdev->md.md_size_sect = drbd_get_capacity(bdev->md_bdev);
bdev->md.md_offset = 0;
- bdev->md.al_offset = MD_AL_OFFSET;
- bdev->md.bm_offset = MD_BM_OFFSET;
+ bdev->md.al_offset = MD_4kB_SECT;
+ bdev->md.bm_offset = MD_4kB_SECT + al_size_sect;
break;
case DRBD_MD_INDEX_INTERNAL:
case DRBD_MD_INDEX_FLEX_INT:
bdev->md.md_offset = drbd_md_ss__(mdev, bdev);
/* al size is still fixed */
- bdev->md.al_offset = -MD_AL_SECTORS;
+ bdev->md.al_offset = -al_size_sect;
/* we need (slightly less than) ~ this much bitmap sectors: */
md_size_sect = drbd_get_capacity(bdev->backing_bdev);
md_size_sect = ALIGN(md_size_sect, BM_SECT_PER_EXT);
@@ -735,11 +755,11 @@ static void drbd_md_set_sector_offsets(struct drbd_conf *mdev,

/* plus the "drbd meta data super block",
* and the activity log; */
- md_size_sect += MD_BM_OFFSET;
+ md_size_sect += MD_4kB_SECT + al_size_sect;

bdev->md.md_size_sect = md_size_sect;
/* bitmap offset is adjusted by 'super' block size */
- bdev->md.bm_offset = -md_size_sect + MD_AL_OFFSET;
+ bdev->md.bm_offset = -md_size_sect + MD_4kB_SECT;
break;
}
rcu_read_unlock();
@@ -1416,7 +1436,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
min_md_device_sectors = (2<<10);
} else {
max_possible_sectors = DRBD_MAX_SECTORS;
- min_md_device_sectors = MD_RESERVED_SECT * (new_disk_conf->meta_dev_idx + 1);
+ min_md_device_sectors = MD_128MB_SECT * (new_disk_conf->meta_dev_idx + 1);
}

if (drbd_get_capacity(nbc->md_bdev) < min_md_device_sectors) {
--
1.7.9.5

2013-03-22 17:17:14

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 00/18] RFC: Non blocking submit for activity log misses

On Tue, Mar 19 2013, Philipp Reisner wrote:
> The Issues
>
> Since the beginning DRBD was written with the assumption that the write
> pattern has spacial locality. (This assumption was driven from the fact,
> that rotating media performs better if you do not send the head too far too
> often)
>
> Backed by this assumption a caller that submits a request that is outside of
> the current active set, was blocked until the active set was changed.
> (Changing the active set is a synchronous write operation to the meta-data
> area on the backing storage = "an AL-update" in DRBD-speak)
>
> A second effect was that DRBD's meta-data was located in a very narrow
> area. When DRBD is used on top of a RAID0 stripe set, this causes all
> AL-updates to got to the same disk.
>
>
> The Proposed Solution
>
> This patch series improves DRBD's behavior. A submitter is no longer blocked
> in the case of a AL-miss. For this a dedicated submitter worker is introduced
> (patch 13).
>
> In order to better distribute the AL-updates to more disks in a stripe set
> this patch series also introduces an optional striped layout of the part
> of the meta-data that holds the AL-updates (patch 4).
>
>
> The Results
>
> This of course drastically improves DRBD's performance if the write pattern
> does not have any spacial locality. E.g. random writes spread out over the
> whole device.
>
> In the test systems we have SSDs with are able to do up to 50000 writes per
> second. The test does random distributed writes over a work set size of
> 128GiB with IO depths from 1 to 1024.
>
> At an IO depth of 64:
> without this patch we observed ~100 IOPs.
> With this patches we observed about 20000 IOPs.
>
> Please find charts of the results here:
> http://blogs.linbit.com/p/469/843-random-writes-faster/

Patchset doesn't apply to current tree. 10/18 gets into problems, last
hunk of drbd_main.c. What is this based on?

--
Jens Axboe

2013-03-22 20:03:37

by Lars Ellenberg

[permalink] [raw]
Subject: Re: [Drbd-dev] [PATCH 00/18] RFC: Non blocking submit for activity log misses

On Fri, Mar 22, 2013 at 11:17:06AM -0600, Jens Axboe wrote:
> On Tue, Mar 19 2013, Philipp Reisner wrote:

> > This patch series improves DRBD's behavior. A submitter is no longer blocked
> > in the case of a AL-miss. For this a dedicated submitter worker is introduced
> > (patch 13).

> Patchset doesn't apply to current tree. 10/18 gets into problems, last
> hunk of drbd_main.c. What is this based on?

I think it is just missing

commit 56de210245487ef1f1416c8ec9e581ebdd0d32ec
Author: Tejun Heo <[email protected]>
Date: Wed Feb 27 17:04:01 2013 -0800

drbd: convert to idr_alloc()

Convert to the much saner new idr interface.

I'll post an updated 10/18 right away.

This RfC batch was produced from a tree
based more-or-less on your for-3.9-drivers, I think.

We will rebase to your for-3.10/drivers shortly,
and then also post an additional batch of general fixes.

Lars

2013-03-22 20:22:30

by Lars Ellenberg

[permalink] [raw]
Subject: Re: [Drbd-dev] [PATCH 10/18, update] drbd: prepare to queue write requests on a submit worker

From: Lars Ellenberg <[email protected]>

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_int.h | 13 +++++++++++++
drivers/block/drbd/drbd_main.c | 26 +++++++++++++++++++++++++-
drivers/block/drbd/drbd_nl.c | 1 +
drivers/block/drbd/drbd_req.c | 29 +++++++++++++++++++++++++++++
4 files changed, 68 insertions(+), 1 deletion(-)

Updated
=======
to apply against current upstream,
because of 56de210 drbd: convert to idr_alloc()
affected the context of the last chunk of drbd_main.c; now last two
chunks: needed to re-add the out_idr_remove_vol cleanup goto.

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 453fccf..a6b71b6 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -894,6 +894,14 @@ struct drbd_tconn { /* is a resource from the config file */
} send;
};

+struct submit_worker {
+ struct workqueue_struct *wq;
+ struct work_struct worker;
+
+ spinlock_t lock;
+ struct list_head writes;
+};
+
struct drbd_conf {
struct drbd_tconn *tconn;
int vnr; /* volume number within the connection */
@@ -1034,6 +1042,10 @@ struct drbd_conf {
atomic_t ap_in_flight; /* App sectors in flight (waiting for ack) */
unsigned int peer_max_bio_size;
unsigned int local_max_bio_size;
+
+ /* any requests that would block in drbd_make_request()
+ * are deferred to this single-threaded work queue */
+ struct submit_worker submit;
};

static inline struct drbd_conf *minor_to_mdev(unsigned int minor)
@@ -1440,6 +1452,7 @@ extern void conn_free_crypto(struct drbd_tconn *tconn);
extern int proc_details;

/* drbd_req */
+extern void do_submit(struct work_struct *ws);
extern void __drbd_make_request(struct drbd_conf *, struct bio *, unsigned long);
extern void drbd_make_request(struct request_queue *q, struct bio *bio);
extern int drbd_read_remote(struct drbd_conf *mdev, struct drbd_request *req);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 3571ea1..8faf33e 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -45,7 +45,7 @@
#include <linux/reboot.h>
#include <linux/notifier.h>
#include <linux/kthread.h>
-
+#include <linux/workqueue.h>
#define __KERNEL_SYSCALLS__
#include <linux/unistd.h>
#include <linux/vmalloc.h>
@@ -2300,6 +2300,7 @@ static void drbd_cleanup(void)
idr_for_each_entry(&minors, mdev, i) {
idr_remove(&minors, mdev_to_minor(mdev));
idr_remove(&mdev->tconn->volumes, mdev->vnr);
+ destroy_workqueue(mdev->submit.wq);
del_gendisk(mdev->vdisk);
/* synchronize_rcu(); No other threads running at this point */
kref_put(&mdev->kref, &drbd_minor_destroy);
@@ -2589,6 +2590,21 @@ void conn_destroy(struct kref *kref)
kfree(tconn);
}

+int init_submitter(struct drbd_conf *mdev)
+{
+ /* opencoded create_singlethread_workqueue(),
+ * to be able to say "drbd%d", ..., minor */
+ mdev->submit.wq = alloc_workqueue("drbd%u_submit",
+ WQ_UNBOUND | WQ_MEM_RECLAIM, 1, mdev->minor);
+ if (!mdev->submit.wq)
+ return -ENOMEM;
+
+ INIT_WORK(&mdev->submit.worker, do_submit);
+ spin_lock_init(&mdev->submit.lock);
+ INIT_LIST_HEAD(&mdev->submit.writes);
+ return 0;
+}
+
enum drbd_ret_code conn_new_minor(struct drbd_tconn *tconn, unsigned int minor, int vnr)
{
struct drbd_conf *mdev;
@@ -2678,6 +2694,12 @@ enum drbd_ret_code conn_new_minor(struct drbd_tconn *tconn, unsigned int minor,
goto out_idr_remove_minor;
}

+ if (init_submitter(mdev)) {
+ err = ERR_NOMEM;
+ drbd_msg_put_info("unable to create submit workqueue");
+ goto out_idr_remove_vol;
+ }
+
add_disk(disk);
kref_init(&mdev->kref); /* one ref for both idrs and the the add_disk */

@@ -2688,6 +2710,8 @@ enum drbd_ret_code conn_new_minor(struct drbd_tconn *tconn, unsigned int minor,

return NO_ERROR;

+out_idr_remove_vol:
+ idr_remove(&tconn->volumes, vnr_got);
out_idr_remove_minor:
idr_remove(&minors, minor_got);
synchronize_rcu();
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 6606e94..0e3b5e5 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -3174,6 +3174,7 @@ static enum drbd_ret_code adm_delete_minor(struct drbd_conf *mdev)
CS_VERBOSE + CS_WAIT_COMPLETE);
idr_remove(&mdev->tconn->volumes, mdev->vnr);
idr_remove(&minors, mdev_to_minor(mdev));
+ destroy_workqueue(mdev->submit.wq);
del_gendisk(mdev->vdisk);
synchronize_rcu();
kref_put(&mdev->kref, &drbd_minor_destroy);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 96d5968..4af709e 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1160,6 +1160,35 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
drbd_send_and_submit(mdev, req);
}

+void __drbd_make_request_from_worker(struct drbd_conf *mdev, struct drbd_request *req)
+{
+ const int rw = bio_rw(req->master_bio);
+
+ if (rw == WRITE && req->private_bio && req->i.size
+ && !test_bit(AL_SUSPENDED, &mdev->flags)) {
+ drbd_al_begin_io(mdev, &req->i, false);
+ req->rq_state |= RQ_IN_ACT_LOG;
+ }
+ drbd_send_and_submit(mdev, req);
+}
+
+
+void do_submit(struct work_struct *ws)
+{
+ struct drbd_conf *mdev = container_of(ws, struct drbd_conf, submit.worker);
+ LIST_HEAD(writes);
+ struct drbd_request *req, *tmp;
+
+ spin_lock(&mdev->submit.lock);
+ list_splice_init(&mdev->submit.writes, &writes);
+ spin_unlock(&mdev->submit.lock);
+
+ list_for_each_entry_safe(req, tmp, &writes, tl_requests) {
+ list_del_init(&req->tl_requests);
+ __drbd_make_request_from_worker(mdev, req);
+ }
+}
+
void drbd_make_request(struct request_queue *q, struct bio *bio)
{
struct drbd_conf *mdev = (struct drbd_conf *) q->queuedata;
--
1.7.9.5

2013-03-23 00:16:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [Drbd-dev] [PATCH 10/18, update] drbd: prepare to queue write requests on a submit worker

On Fri, Mar 22 2013, [email protected] wrote:
> From: Lars Ellenberg <[email protected]>
>
> Signed-off-by: Philipp Reisner <[email protected]>
> Signed-off-by: Lars Ellenberg <[email protected]>
> ---
> drivers/block/drbd/drbd_int.h | 13 +++++++++++++
> drivers/block/drbd/drbd_main.c | 26 +++++++++++++++++++++++++-
> drivers/block/drbd/drbd_nl.c | 1 +
> drivers/block/drbd/drbd_req.c | 29 +++++++++++++++++++++++++++++
> 4 files changed, 68 insertions(+), 1 deletion(-)
>
> Updated
> =======
> to apply against current upstream,
> because of 56de210 drbd: convert to idr_alloc()
> affected the context of the last chunk of drbd_main.c; now last two
> chunks: needed to re-add the out_idr_remove_vol cleanup goto.

That's better - updated series and pushed out for testing.

--
Jens Axboe