LinuxLists.cc - [PATCH 00/30] DBRD updates

2016-04-25 12:23:00

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 00/30] DBRD updates

Hi Jens,

apart from the usual maintenance and bug fixes this time comes
support for WRITE_SAME and lots of improvemnts for DISCARD.

Overview:
As replication technology we want to use DISCARDs only if they
really zero the backing storage. Thin LVM does it but claims
not to do it. To make that reasonably usable with DRBD we
added this "discard_zeroes_if_aligned" hack^H^H^H^H configure
option. Please see the commit messages for all the details.

Please add it to your for-4.7/drivers branch.
Thanks!

Fabian Frederick (1):
drbd: code cleanups without semantic changes

Lars Ellenberg (24):
drbd: bitmap bulk IO: do not always suspend IO
drbd: change bitmap write-out when leaving resync states
drbd: adjust assert in w_bitmap_io to account for
BM_LOCKED_CHANGE_ALLOWED
drbd: fix regression: protocol A sometimes synchronous, C sometimes
double-latency
drbd: fix for truncated minor number in callback command line
drbd: allow parallel flushes for multi-volume resources
drbd: when receiving P_TRIM, zero-out partial unaligned chunks
drbd: possibly disable discard support, if backend has
discard_zeroes_data=0
drbd: zero-out partial unaligned discards on local backend
drbd: allow larger max_discard_sectors
drbd: finish resync on sync source only by notification from sync
target
drbd: introduce unfence-peer handler
drbd: don't forget error completion when "unsuspending" IO
drbd: if there is no good data accessible, writes should be IO errors
drbd: only restart frozen disk io when D_UP_TO_DATE
drbd: discard_zeroes_if_aligned allows "thin" resync for
discard_zeroes_data=0
drbd: report sizes if rejecting too small peer disk
drbd: introduce WRITE_SAME support
drbd: sync_handshake: handle identical uuids with current (frozen)
Primary
drbd: disallow promotion during resync handshake, avoid deadlock and
hard reset
drbd: bump current uuid when resuming IO with diskless peer
drbd: finally report ms, not jiffies, in log message
drbd: al_write_transaction: skip re-scanning of bitmap page pointer
array
drbd: correctly handle failed crypto_alloc_hash

Philipp Reisner (4):
drbd: Kill code duplication
drbd: Implement handling of thinly provisioned storage on resync
target nodes
drbd: Introduce new disk config option rs-discard-granularity
drbd: Create the protocol feature THIN_RESYNC

Roland Kammerer (1):
drbd: get rid of empty statement in is_valid_state

drivers/block/drbd/drbd_actlog.c | 29 +-
drivers/block/drbd/drbd_bitmap.c | 84 ++++--
drivers/block/drbd/drbd_debugfs.c | 13 +-
drivers/block/drbd/drbd_int.h | 49 +++-
drivers/block/drbd/drbd_interval.h | 14 +-
drivers/block/drbd/drbd_main.c | 115 +++++++-
drivers/block/drbd/drbd_nl.c | 282 +++++++++++++++-----
drivers/block/drbd/drbd_proc.c | 30 +--
drivers/block/drbd/drbd_protocol.h | 77 +++++-
drivers/block/drbd/drbd_receiver.c | 534 ++++++++++++++++++++++++++++++-------
drivers/block/drbd/drbd_req.c | 84 ++++--
drivers/block/drbd/drbd_req.h | 5 +-
drivers/block/drbd/drbd_state.c | 61 ++++-
drivers/block/drbd/drbd_state.h | 2 +-
drivers/block/drbd/drbd_strings.c | 8 +-
drivers/block/drbd/drbd_worker.c | 85 +++++-
include/linux/drbd.h | 10 +-
include/linux/drbd_genl.h | 7 +-
include/linux/drbd_limits.h | 15 +-
19 files changed, 1205 insertions(+), 299 deletions(-)

--
1.9.1

2016-04-25 12:16:27

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 21/30] drbd: report sizes if rejecting too small peer disk

From: Lars Ellenberg <[email protected]>

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_receiver.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 078c4d98..99f4519 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3939,6 +3939,7 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info
device->p_size = p_size;

if (get_ldev(device)) {
+ sector_t new_size, cur_size;
rcu_read_lock();
my_usize = rcu_dereference(device->ldev->disk_conf)->disk_size;
rcu_read_unlock();
@@ -3955,11 +3956,13 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info

/* Never shrink a device with usable data during connect.
But allow online shrinking if we are connected. */
- if (drbd_new_dev_size(device, device->ldev, p_usize, 0) <
- drbd_get_capacity(device->this_bdev) &&
+ new_size = drbd_new_dev_size(device, device->ldev, p_usize, 0);
+ cur_size = drbd_get_capacity(device->this_bdev);
+ if (new_size < cur_size &&
device->state.disk >= D_OUTDATED &&
device->state.conn < C_CONNECTED) {
- drbd_err(device, "The peer's disk size is too small!\n");
+ drbd_err(device, "The peer's disk size is too small! (%llu < %llu sectors)\n",
+ (unsigned long long)new_size, (unsigned long long)cur_size);
conn_request_state(peer_device->connection, NS(conn, C_DISCONNECTING), CS_HARD);
put_ldev(device);
return -EIO;
--
1.9.1

2016-04-25 12:16:33

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 20/30] drbd: discard_zeroes_if_aligned allows "thin" resync for discard_zeroes_data=0

From: Lars Ellenberg <[email protected]>

Even if discard_zeroes_data != 0,
if discard_zeroes_if_aligned is set, we assume we can reliably
zero-out/discard using the drbd_issue_peer_discard() helper.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_nl.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index a703a0e..48fa6c3 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1408,9 +1408,12 @@ static void sanitize_disk_conf(struct drbd_device *device, struct disk_conf *dis
if (disk_conf->al_extents > drbd_al_extents_max(nbc))
disk_conf->al_extents = drbd_al_extents_max(nbc);

- if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) {
- disk_conf->rs_discard_granularity = 0; /* disable feature */
- drbd_info(device, "rs_discard_granularity feature disabled\n");
+ if (!blk_queue_discard(q)
+ || (!q->limits.discard_zeroes_data && !disk_conf->discard_zeroes_if_aligned)) {
+ if (disk_conf->rs_discard_granularity) {
+ disk_conf->rs_discard_granularity = 0; /* disable feature */
+ drbd_info(device, "rs_discard_granularity feature disabled\n");
+ }
}

if (disk_conf->rs_discard_granularity) {
--
1.9.1

2016-04-25 12:16:40

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 27/30] drbd: get rid of empty statement in is_valid_state

From: Roland Kammerer <[email protected]>

This should silence a warning about an empty statement. Thanks to Fabian
Frederick <[email protected]> who sent a patch I modified to be smaller and
avoids an additional indent level.

Signed-off-by: Roland Kammerer <[email protected]>
Signed-off-by: Philipp Reisner <[email protected]>
---
drivers/block/drbd/drbd_state.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index aca68a5..eea0c4a 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -814,7 +814,7 @@ is_valid_state(struct drbd_device *device, union drbd_state ns)
}

if (rv <= 0)
- /* already found a reason to abort */;
+ goto out; /* already found a reason to abort */
else if (ns.role == R_SECONDARY && device->open_cnt)
rv = SS_DEVICE_IN_USE;

@@ -862,6 +862,7 @@ is_valid_state(struct drbd_device *device, union drbd_state ns)
else if (ns.conn >= C_CONNECTED && ns.pdsk == D_UNKNOWN)
rv = SS_CONNECTED_OUTDATES;

+out:
rcu_read_unlock();

return rv;
--
1.9.1

2016-04-25 12:16:38

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 01/30] drbd: bitmap bulk IO: do not always suspend IO

From: Lars Ellenberg <[email protected]>

The intention was to only suspend IO if some normal bitmap operation is
supposed to be locked out, not always. If the bulk operation is flaged
as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_main.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index fa20977..802d729 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3585,18 +3585,20 @@ void drbd_queue_bitmap_io(struct drbd_device *device,
int drbd_bitmap_io(struct drbd_device *device, int (*io_fn)(struct drbd_device *),
char *why, enum bm_flag flags)
{
+ /* Only suspend io, if some operation is supposed to be locked out */
+ const bool do_suspend_io = flags & (BM_DONT_CLEAR|BM_DONT_SET|BM_DONT_TEST);
int rv;

D_ASSERT(device, current != first_peer_device(device)->connection->worker.task);

- if ((flags & BM_LOCKED_SET_ALLOWED) == 0)
+ if (do_suspend_io)
drbd_suspend_io(device);

drbd_bm_lock(device, why, flags);
rv = io_fn(device);
drbd_bm_unlock(device);

- if ((flags & BM_LOCKED_SET_ALLOWED) == 0)
+ if (do_suspend_io)
drbd_resume_io(device);

return rv;
--
1.9.1

2016-04-25 12:17:16

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 14/30] drbd: allow larger max_discard_sectors

From: Lars Ellenberg <[email protected]>

Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
al-extents available, and allow up to half of that to be
discarded in one bio.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 2 +-
drivers/block/drbd/drbd_int.h | 8 ++++----
include/linux/drbd_limits.h | 3 +--
3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 10459a1..1664762 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -256,7 +256,7 @@ bool drbd_al_begin_io_fastpath(struct drbd_device *device, struct drbd_interval
unsigned first = i->sector >> (AL_EXTENT_SHIFT-9);
unsigned last = i->size == 0 ? first : (i->sector + (i->size >> 9) - 1) >> (AL_EXTENT_SHIFT-9);

- D_ASSERT(device, (unsigned)(last - first) <= 1);
+ D_ASSERT(device, first <= last);
D_ASSERT(device, atomic_read(&device->local_cnt) > 0);

/* FIXME figure out a fast path for bios crossing AL extent boundaries */
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index d818e7d..d82e531 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1347,10 +1347,10 @@ struct bm_extent {
#define DRBD_MAX_SIZE_H80_PACKET (1U << 15) /* Header 80 only allows packets up to 32KiB data */
#define DRBD_MAX_BIO_SIZE_P95 (1U << 17) /* Protocol 95 to 99 allows bios up to 128KiB */

-/* For now, don't allow more than one activity log extent worth of data
- * to be discarded in one go. We may need to rework drbd_al_begin_io()
- * to allow for even larger discard ranges */
-#define DRBD_MAX_DISCARD_SIZE AL_EXTENT_SIZE
+/* For now, don't allow more than half of what we can "activate" in one
+ * activity log transaction to be discarded in one go. We may need to rework
+ * drbd_al_begin_io() to allow for even larger discard ranges */
+#define DRBD_MAX_DISCARD_SIZE (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
#define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)

extern int drbd_bm_init(struct drbd_device *device);
diff --git a/include/linux/drbd_limits.h b/include/linux/drbd_limits.h
index a351c40..ddac684 100644
--- a/include/linux/drbd_limits.h
+++ b/include/linux/drbd_limits.h
@@ -126,8 +126,7 @@
#define DRBD_RESYNC_RATE_DEF 250
#define DRBD_RESYNC_RATE_SCALE 'k' /* kilobytes */

- /* less than 7 would hit performance unnecessarily. */
-#define DRBD_AL_EXTENTS_MIN 7
+#define DRBD_AL_EXTENTS_MIN 67
/* we use u16 as "slot number", (u16)~0 is "FREE".
* If you use >= 292 kB on-disk ring buffer,
* this is the maximum you can use: */
--
1.9.1

2016-04-25 12:17:14

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 25/30] drbd: bump current uuid when resuming IO with diskless peer

From: Lars Ellenberg <[email protected]>

Scenario, starting with normal operation
Connected Primary/Secondary UpToDate/UpToDate
NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
... more failures happen, secondary loses it's disk,
but eventually is able to re-establish the replication link ...
Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)

We used to just resume/resent suspended requests,
without bumping the UUID.

Which will lead to problems later, when we want to re-attach the disk on
the peer, without first disconnecting, or if we experience additional
failures, because we now have diverging data without being able to
recognize it.

Make sure we also bump the current data generation UUID,
if we notice "peer disk unknown" -> "peer disk known bad".

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_state.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 7562c5c..a1b5e6c9 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1637,6 +1637,26 @@ static void broadcast_state_change(struct drbd_state_change *state_change)
#undef REMEMBER_STATE_CHANGE
}

+/* takes old and new peer disk state */
+static bool lost_contact_to_peer_data(enum drbd_disk_state os, enum drbd_disk_state ns)
+{
+ if ((os >= D_INCONSISTENT && os != D_UNKNOWN && os != D_OUTDATED)
+ && (ns < D_INCONSISTENT || ns == D_UNKNOWN || ns == D_OUTDATED))
+ return true;
+
+ /* Scenario, starting with normal operation
+ * Connected Primary/Secondary UpToDate/UpToDate
+ * NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
+ * ...
+ * Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)
+ */
+ if (os == D_UNKNOWN
+ && (ns == D_DISKLESS || ns == D_FAILED || ns == D_OUTDATED))
+ return true;
+
+ return false;
+}
+
/**
* after_state_ch() - Perform after state change actions that may sleep
* @device: DRBD device.
@@ -1708,6 +1728,13 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,
idr_for_each_entry(&connection->peer_devices, peer_device, vnr)
clear_bit(NEW_CUR_UUID, &peer_device->device->flags);
rcu_read_unlock();
+
+ /* We should actively create a new uuid, _before_
+ * we resume/resent, if the peer is diskless
+ * (recovery from a multiple error scenario).
+ * Currently, this happens with a slight delay
+ * below when checking lost_contact_to_peer_data() ...
+ */
_tl_restart(connection, RESEND);
_conn_request_state(connection,
(union drbd_state) { { .susp_fen = 1 } },
@@ -1751,12 +1778,7 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,
BM_LOCKED_TEST_ALLOWED);

/* Lost contact to peer's copy of the data */
- if ((os.pdsk >= D_INCONSISTENT &&
- os.pdsk != D_UNKNOWN &&
- os.pdsk != D_OUTDATED)
- && (ns.pdsk < D_INCONSISTENT ||
- ns.pdsk == D_UNKNOWN ||
- ns.pdsk == D_OUTDATED)) {
+ if (lost_contact_to_peer_data(os.pdsk, ns.pdsk)) {
if (get_ldev(device)) {
if ((ns.role == R_PRIMARY || ns.peer == R_PRIMARY) &&
device->ldev->md.uuid[UI_BITMAP] == 0 && ns.disk >= D_UP_TO_DATE) {
--
1.9.1

2016-04-25 12:17:11

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 18/30] drbd: if there is no good data accessible, writes should be IO errors

From: Lars Ellenberg <[email protected]>

If DRBD lost all path to good data,
and the on-no-data-accessible policy is OND_SUSPEND_IO,
all pending and new IO requests are suspended (will block).

If that setting is OND_IO_ERROR, IO will still be completed.
READ to "clean" areas (e.g. on an D_INCONSISTENT device,
and bitmap indicates a block is already in sync) will succeed.
READ to "unclean" areas (bitmap indicates block is out-of-sync),
will return EIO.

If we are already D_DISKLESS (or D_FAILED), we also return EIO.

Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
after replication link loss, new WRITE requests still went through OK.

The would also set the "out-of-sync" bit on their way, so READ after
WRITE would still return EIO. Also, the data generation UUIDs had not
been bumped, we would cause data divergence, without being able to
detect it on the next sync handshake, given the right sequence of events
in a multiple error scenario and "improper" order of recovery actions.

The right thing to do is to return EIO for all new writes,
unless we have access to good, current, D_UP_TO_DATE data.

The "established best practices" way to avoid these situations in the
first place is to set OND_SUSPEND_IO, or even do a hard-reset from
the pri-on-incon-degr policy helper hook.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_req.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 7e441ff..8260dfa6 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1258,6 +1258,22 @@ drbd_request_prepare(struct drbd_device *device, struct bio *bio, unsigned long
return NULL;
}

+/* Require at least one path to current data.
+ * We don't want to allow writes on C_STANDALONE D_INCONSISTENT:
+ * We would not allow to read what was written,
+ * we would not have bumped the data generation uuids,
+ * we would cause data divergence for all the wrong reasons.
+ *
+ * If we don't see at least one D_UP_TO_DATE, we will fail this request,
+ * which either returns EIO, or, if OND_SUSPEND_IO is set, suspends IO,
+ * and queues for retry later.
+ */
+static bool may_do_writes(struct drbd_device *device)
+{
+ const union drbd_dev_state s = device->state;
+ return s.disk == D_UP_TO_DATE || s.pdsk == D_UP_TO_DATE;
+}
+
static void drbd_send_and_submit(struct drbd_device *device, struct drbd_request *req)
{
struct drbd_resource *resource = device->resource;
@@ -1312,6 +1328,12 @@ static void drbd_send_and_submit(struct drbd_device *device, struct drbd_request
}

if (rw == WRITE) {
+ if (req->private_bio && !may_do_writes(device)) {
+ bio_put(req->private_bio);
+ req->private_bio = NULL;
+ put_ldev(device);
+ goto nodata;
+ }
if (!drbd_process_write_request(req))
no_remote = true;
} else {
--
1.9.1

2016-04-25 12:17:10

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 09/30] drbd: fix for truncated minor number in callback command line

From: Lars Ellenberg <[email protected]>

The command line parameter the kernel module uses to communicate the
device minor to userland helper is flawed in a way that the device
indentifier "minor-%d" is being truncated to minors with a maximum
of 5 digits.

But DRBD 8.4 allows 2^20 == 1048576 minors,
thus a minimum of 7 digits must be supported.

Reported by Veit Wahlich on drbd-dev.

Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_nl.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index ee4642f..e63c5c4 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -343,7 +343,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
(char[20]) { }, /* address family */
(char[60]) { }, /* address */
NULL };
- char mb[12];
+ char mb[14];
char *argv[] = {usermode_helper, cmd, mb, NULL };
struct drbd_connection *connection = first_peer_device(device)->connection;
struct sib_info sib;
@@ -352,7 +352,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
if (current == connection->worker.task)
set_bit(CALLBACK_PENDING, &connection->flags);

- snprintf(mb, 12, "minor-%d", device_to_minor(device));
+ snprintf(mb, 14, "minor-%d", device_to_minor(device));
setup_khelper_env(connection, envp);

/* The helper may take some time.
--
1.9.1

2016-04-25 12:17:08

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 16/30] drbd: introduce unfence-peer handler

From: Lars Ellenberg <[email protected]>

When resync is finished, we already call the "after-resync-target"
handler (on the former sync target, obviously), once per volume.

Paired with the before-resync-target handler, you can create snapshots,
before the resync causes the volumes to become inconsistent,
and discard those snapshots again, once they are no longer needed.

It was also overloaded to be paired with the "fence-peer" handler,
to "unfence" once the volumes are up-to-date and known good.

This has some disadvantages, though: we call "fence-peer" for the whole
connection (once for the group of volumes), but would call unfence as
side-effect of after-resync-target once for each volume.

Also, we fence on a (current, or about to become) Primary,
which will later become the sync-source.

Calling unfence only as a side effect of the after-resync-target
handler opens a race window, between a new fence on the Primary
(SyncTarget) and the unfence on the SyncTarget, which is difficult to
close without some kind of "cluster wide lock" in those handlers.

We would not need those handlers if we could still communicate.
Which makes trying to aquire a cluster wide lock from those handlers
seem like a very bad idea.

This introduces the "unfence-peer" handler, which will be called
per connection (once for the group of volumes), just like the fence
handler, only once all volumes are back in sync, and on the SyncSource.

Which is expected to be the node that previously called "fence", the
node that is currently allowed to be Primary, and thus the only node
that could trigger a new "fence" that could race with this unfence.

Which makes us not need any cluster wide synchronization here,
serializing two scripts running on the same node is trivial.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_int.h | 1 +
drivers/block/drbd/drbd_nl.c | 2 +-
drivers/block/drbd/drbd_worker.c | 28 ++++++++++++++++++++++++++--
3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 451a745..cb42f6c 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1494,6 +1494,7 @@ extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
int force);
extern bool conn_try_outdate_peer(struct drbd_connection *connection);
extern void conn_try_outdate_peer_async(struct drbd_connection *connection);
+extern int conn_khelper(struct drbd_connection *connection, char *cmd);
extern int drbd_khelper(struct drbd_device *device, char *cmd);

/* drbd_worker.c */
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index f335549..f16084a 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -387,7 +387,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
return ret;
}

-static int conn_khelper(struct drbd_connection *connection, char *cmd)
+int conn_khelper(struct drbd_connection *connection, char *cmd)
{
char *envp[] = { "HOME=/",
"TERM=linux",
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index fa63c22..f9e142d 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -839,6 +839,7 @@ static void ping_peer(struct drbd_device *device)

int drbd_resync_finished(struct drbd_device *device)
{
+ struct drbd_connection *connection = first_peer_device(device)->connection;
unsigned long db, dt, dbdt;
unsigned long n_oos;
union drbd_state os, ns;
@@ -860,8 +861,7 @@ int drbd_resync_finished(struct drbd_device *device)
if (dw) {
dw->w.cb = w_resync_finished;
dw->device = device;
- drbd_queue_work(&first_peer_device(device)->connection->sender_work,
- &dw->w);
+ drbd_queue_work(&connection->sender_work, &dw->w);
return 1;
}
drbd_err(device, "Warn failed to drbd_rs_del_all() and to kmalloc(dw).\n");
@@ -974,6 +974,30 @@ int drbd_resync_finished(struct drbd_device *device)
_drbd_set_state(device, ns, CS_VERBOSE, NULL);
out_unlock:
spin_unlock_irq(&device->resource->req_lock);
+
+ /* If we have been sync source, and have an effective fencing-policy,
+ * once *all* volumes are back in sync, call "unfence". */
+ if (os.conn == C_SYNC_SOURCE) {
+ enum drbd_disk_state disk_state = D_MASK;
+ enum drbd_disk_state pdsk_state = D_MASK;
+ enum drbd_fencing_p fp = FP_DONT_CARE;
+
+ rcu_read_lock();
+ fp = rcu_dereference(device->ldev->disk_conf)->fencing;
+ if (fp != FP_DONT_CARE) {
+ struct drbd_peer_device *peer_device;
+ int vnr;
+ idr_for_each_entry(&connection->peer_devices, peer_device, vnr) {
+ struct drbd_device *device = peer_device->device;
+ disk_state = min_t(enum drbd_disk_state, disk_state, device->state.disk);
+ pdsk_state = min_t(enum drbd_disk_state, pdsk_state, device->state.pdsk);
+ }
+ }
+ rcu_read_unlock();
+ if (disk_state == D_UP_TO_DATE && pdsk_state == D_UP_TO_DATE)
+ conn_khelper(connection, "unfence-peer");
+ }
+
put_ldev(device);
out:
device->rs_total = 0;
--
1.9.1

2016-04-25 12:16:30

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 06/30] drbd: Create the protocol feature THIN_RESYNC

If thinly provisioned volumes are used, during a resync the sync source
tries to find out if a block is deallocated. If it is deallocated, then
the resync target uses block_dev_issue_zeroout() on the range in
question.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_protocol.h | 1 +
drivers/block/drbd/drbd_receiver.c | 5 ++++-
drivers/block/drbd/drbd_worker.c | 13 ++++++++++++-
3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_protocol.h b/drivers/block/drbd/drbd_protocol.h
index e5e74e3..7acc2e0 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -165,6 +165,7 @@ struct p_block_req {
*/

#define FF_TRIM 1
+#define FF_THIN_RESYNC 2

struct p_connection_features {
u32 protocol_min;
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 3a6c2ec..4cfc721 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -48,7 +48,7 @@
#include "drbd_req.h"
#include "drbd_vli.h"

-#define PRO_FEATURES (FF_TRIM)
+#define PRO_FEATURES (FF_TRIM | FF_THIN_RESYNC)

struct packet_info {
enum drbd_packet cmd;
@@ -4979,6 +4979,9 @@ static int drbd_do_features(struct drbd_connection *connection)
drbd_info(connection, "Agreed to%ssupport TRIM on protocol level\n",
connection->agreed_features & FF_TRIM ? " " : " not ");

+ drbd_info(connection, "Agreed to%ssupport THIN_RESYNC on protocol level\n",
+ connection->agreed_features & FF_THIN_RESYNC ? " " : " not ");
+
return 1;

incompat:
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 01d74ee2..fa63c22 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -582,6 +582,7 @@ static int make_resync_request(struct drbd_device *const device, int cancel)
int number, rollback_i, size;
int align, requeue = 0;
int i = 0;
+ int discard_granularity = 0;

if (unlikely(cancel))
return 0;
@@ -601,6 +602,12 @@ static int make_resync_request(struct drbd_device *const device, int cancel)
return 0;
}

+ if (connection->agreed_features & FF_THIN_RESYNC) {
+ rcu_read_lock();
+ discard_granularity = rcu_dereference(device->ldev->disk_conf)->rs_discard_granularity;
+ rcu_read_unlock();
+ }
+
max_bio_size = queue_max_hw_sectors(device->rq_queue) << 9;
number = drbd_rs_number_requests(device);
if (number <= 0)
@@ -665,6 +672,9 @@ next_sector:
if (sector & ((1<<(align+3))-1))
break;

+ if (discard_granularity && size == discard_granularity)
+ break;
+
/* do not cross extent boundaries */
if (((bit+1) & BM_BLOCKS_PER_BM_EXT_MASK) == 0)
break;
@@ -711,7 +721,8 @@ next_sector:
int err;

inc_rs_pending(device);
- err = drbd_send_drequest(peer_device, P_RS_DATA_REQUEST,
+ err = drbd_send_drequest(peer_device,
+ size == discard_granularity ? P_RS_THIN_REQ : P_RS_DATA_REQUEST,
sector, size, ID_SYNCER);
if (err) {
drbd_err(device, "drbd_send_drequest() failed, aborting...\n");
--
1.9.1

2016-04-25 12:18:31

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 03/30] drbd: Kill code duplication

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_nl.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 1fd1dcc..bb71e73 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1348,6 +1348,14 @@ static bool write_ordering_changed(struct disk_conf *a, struct disk_conf *b)
a->disk_drain != b->disk_drain;
}

+static void sanitize_disk_conf(struct disk_conf *disk_conf, struct drbd_backing_dev *nbc)
+{
+ if (disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
+ disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
+ if (disk_conf->al_extents > drbd_al_extents_max(nbc))
+ disk_conf->al_extents = drbd_al_extents_max(nbc);
+}
+
int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
{
struct drbd_config_context adm_ctx;
@@ -1395,10 +1403,7 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
if (!expect(new_disk_conf->resync_rate >= 1))
new_disk_conf->resync_rate = 1;

- if (new_disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
- new_disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
- if (new_disk_conf->al_extents > drbd_al_extents_max(device->ldev))
- new_disk_conf->al_extents = drbd_al_extents_max(device->ldev);
+ sanitize_disk_conf(new_disk_conf, device->ldev);

if (new_disk_conf->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
new_disk_conf->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;
@@ -1693,10 +1698,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
if (retcode != NO_ERROR)
goto fail;

- if (new_disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
- new_disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
- if (new_disk_conf->al_extents > drbd_al_extents_max(nbc))
- new_disk_conf->al_extents = drbd_al_extents_max(nbc);
+ sanitize_disk_conf(new_disk_conf, nbc);

if (drbd_get_max_capacity(nbc) < new_disk_conf->disk_size) {
drbd_err(device, "max capacity %llu smaller than disk size %llu\n",
--
1.9.1

2016-04-25 12:18:34

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 26/30] drbd: code cleanups without semantic changes

From: Fabian Frederick <[email protected]>

This contains various cosmetic fixes ranging from simple typos to
const-ifying, and using booleans properly.

Original commit messages from Fabian's patch set:
drbd: debugfs: constify drbd_version_fops
drbd: use seq_put instead of seq_print where possible
drbd: include linux/uaccess.h instead of asm/uaccess.h
drbd: use const char * const for drbd strings
drbd: kerneldoc warning fix in w_e_end_data_req()
drbd: use unsigned for one bit fields
drbd: use bool for peer is_ states
drbd: fix typo
drbd: use | for bitmask combination
drbd: use true/false for bool
drbd: fix drbd_bm_init() comments
drbd: introduce peer state union
drbd: fix maybe_pull_ahead() locking comments
drbd: use bool for growing
drbd: remove redundant declarations
drbd: replace if/BUG by BUG_ON

Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Roland Kammerer <[email protected]>
---
drivers/block/drbd/drbd_bitmap.c | 6 +++---
drivers/block/drbd/drbd_debugfs.c | 2 +-
drivers/block/drbd/drbd_int.h | 4 +---
drivers/block/drbd/drbd_interval.h | 14 +++++++-------
drivers/block/drbd/drbd_main.c | 2 +-
drivers/block/drbd/drbd_nl.c | 14 ++++++++------
drivers/block/drbd/drbd_proc.c | 30 +++++++++++++++---------------
drivers/block/drbd/drbd_receiver.c | 8 ++++----
drivers/block/drbd/drbd_req.c | 2 +-
drivers/block/drbd/drbd_state.c | 4 +---
drivers/block/drbd/drbd_state.h | 2 +-
drivers/block/drbd/drbd_strings.c | 8 ++++----
drivers/block/drbd/drbd_worker.c | 9 ++++-----
include/linux/drbd.h | 8 ++++++++
14 files changed, 59 insertions(+), 54 deletions(-)

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 92d6fc0..17e5e60 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -427,8 +427,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap *b, unsigned long want)
}

/*
- * called on driver init only. TODO call when a device is created.
- * allocates the drbd_bitmap, and stores it in device->bitmap.
+ * allocates the drbd_bitmap and stores it in device->bitmap.
*/
int drbd_bm_init(struct drbd_device *device)
{
@@ -633,7 +632,8 @@ int drbd_bm_resize(struct drbd_device *device, sector_t capacity, int set_new_bi
unsigned long bits, words, owords, obits;
unsigned long want, have, onpages; /* number of pages */
struct page **npages, **opages = NULL;
- int err = 0, growing;
+ int err = 0;
+ bool growing;

if (!expect(b))
return -ENOMEM;
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index 8a90812..be91a8d 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -903,7 +903,7 @@ static int drbd_version_open(struct inode *inode, struct file *file)
return single_open(file, drbd_version_show, NULL);
}

-static struct file_operations drbd_version_fops = {
+static const struct file_operations drbd_version_fops = {
.owner = THIS_MODULE,
.open = drbd_version_open,
.llseek = seq_lseek,
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index cb47809..acb1462 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1499,7 +1499,7 @@ extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
int force);
extern bool conn_try_outdate_peer(struct drbd_connection *connection);
extern void conn_try_outdate_peer_async(struct drbd_connection *connection);
-extern int conn_khelper(struct drbd_connection *connection, char *cmd);
+extern enum drbd_peer_state conn_khelper(struct drbd_connection *connection, char *cmd);
extern int drbd_khelper(struct drbd_device *device, char *cmd);

/* drbd_worker.c */
@@ -1648,8 +1648,6 @@ void drbd_bump_write_ordering(struct drbd_resource *resource, struct drbd_backin
/* drbd_proc.c */
extern struct proc_dir_entry *drbd_proc;
extern const struct file_operations drbd_proc_fops;
-extern const char *drbd_conn_str(enum drbd_conns s);
-extern const char *drbd_role_str(enum drbd_role s);

/* drbd_actlog.c */
extern bool drbd_al_begin_io_prepare(struct drbd_device *device, struct drbd_interval *i);
diff --git a/drivers/block/drbd/drbd_interval.h b/drivers/block/drbd/drbd_interval.h
index f210543..23c5a94 100644
--- a/drivers/block/drbd/drbd_interval.h
+++ b/drivers/block/drbd/drbd_interval.h
@@ -6,13 +6,13 @@

struct drbd_interval {
struct rb_node rb;
- sector_t sector; /* start sector of the interval */
- unsigned int size; /* size in bytes */
- sector_t end; /* highest interval end in subtree */
- int local:1 /* local or remote request? */;
- int waiting:1; /* someone is waiting for this to complete */
- int completed:1; /* this has been completed already;
- * ignore for conflict detection */
+ sector_t sector; /* start sector of the interval */
+ unsigned int size; /* size in bytes */
+ sector_t end; /* highest interval end in subtree */
+ unsigned int local:1 /* local or remote request? */;
+ unsigned int waiting:1; /* someone is waiting for completion */
+ unsigned int completed:1; /* this has been completed already;
+ * ignore for conflict detection */
};

static inline void drbd_clear_interval(struct drbd_interval *i)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index e3aff25..5c8f6c6 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -31,7 +31,7 @@
#include <linux/module.h>
#include <linux/jiffies.h>
#include <linux/drbd.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
#include <asm/types.h>
#include <net/sock.h>
#include <linux/ctype.h>
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 3704fad..adcb8b0 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -387,7 +387,7 @@ int drbd_khelper(struct drbd_device *device, char *cmd)
return ret;
}

-int conn_khelper(struct drbd_connection *connection, char *cmd)
+enum drbd_peer_state conn_khelper(struct drbd_connection *connection, char *cmd)
{
char *envp[] = { "HOME=/",
"TERM=linux",
@@ -503,17 +503,17 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
r = conn_khelper(connection, "fence-peer");

switch ((r>>8) & 0xff) {
- case 3: /* peer is inconsistent */
+ case P_INCONSISTENT: /* peer is inconsistent */
ex_to_string = "peer is inconsistent or worse";
mask.pdsk = D_MASK;
val.pdsk = D_INCONSISTENT;
break;
- case 4: /* peer got outdated, or was already outdated */
+ case P_OUTDATED: /* peer got outdated, or was already outdated */
ex_to_string = "peer was fenced";
mask.pdsk = D_MASK;
val.pdsk = D_OUTDATED;
break;
- case 5: /* peer was down */
+ case P_DOWN: /* peer was down */
if (conn_highest_disk(connection) == D_UP_TO_DATE) {
/* we will(have) create(d) a new UUID anyways... */
ex_to_string = "peer is unreachable, assumed to be dead";
@@ -523,7 +523,7 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
ex_to_string = "peer unreachable, doing nothing since disk != UpToDate";
}
break;
- case 6: /* Peer is primary, voluntarily outdate myself.
+ case P_PRIMARY: /* Peer is primary, voluntarily outdate myself.
* This is useful when an unconnected R_SECONDARY is asked to
* become R_PRIMARY, but finds the other peer being active. */
ex_to_string = "peer is active";
@@ -531,7 +531,9 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
mask.disk = D_MASK;
val.disk = D_OUTDATED;
break;
- case 7:
+ case P_FENCING:
+ /* THINK: do we need to handle this
+ * like case 4, or more like case 5? */
if (fp != FP_STONITH)
drbd_err(connection, "fence-peer() = 7 && fencing != Stonith !!!\n");
ex_to_string = "peer was stonithed";
diff --git a/drivers/block/drbd/drbd_proc.c b/drivers/block/drbd/drbd_proc.c
index 6537b25..be2b93f 100644
--- a/drivers/block/drbd/drbd_proc.c
+++ b/drivers/block/drbd/drbd_proc.c
@@ -25,7 +25,7 @@

#include <linux/module.h>

-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/proc_fs.h>
@@ -122,18 +122,18 @@ static void drbd_syncer_progress(struct drbd_device *device, struct seq_file *se

x = res/50;
y = 20-x;
- seq_printf(seq, "\t[");
+ seq_puts(seq, "\t[");
for (i = 1; i < x; i++)
- seq_printf(seq, "=");
- seq_printf(seq, ">");
+ seq_putc(seq, '=');
+ seq_putc(seq, '>');
for (i = 0; i < y; i++)
seq_printf(seq, ".");
- seq_printf(seq, "] ");
+ seq_puts(seq, "] ");

if (state.conn == C_VERIFY_S || state.conn == C_VERIFY_T)
- seq_printf(seq, "verified:");
+ seq_puts(seq, "verified:");
else
- seq_printf(seq, "sync'ed:");
+ seq_puts(seq, "sync'ed:");
seq_printf(seq, "%3u.%u%% ", res / 10, res % 10);

/* if more than a few GB, display in MB */
@@ -146,7 +146,7 @@ static void drbd_syncer_progress(struct drbd_device *device, struct seq_file *se
(unsigned long) Bit2KB(rs_left),
(unsigned long) Bit2KB(rs_total));

- seq_printf(seq, "\n\t");
+ seq_puts(seq, "\n\t");

/* see drivers/md/md.c
* We do not want to overflow, so the order of operands and
@@ -175,9 +175,9 @@ static void drbd_syncer_progress(struct drbd_device *device, struct seq_file *se
rt / 3600, (rt % 3600) / 60, rt % 60);

dbdt = Bit2KB(db/dt);
- seq_printf(seq, " speed: ");
+ seq_puts(seq, " speed: ");
seq_printf_with_thousands_grouping(seq, dbdt);
- seq_printf(seq, " (");
+ seq_puts(seq, " (");
/* ------------------------- ~3s average ------------------------ */
if (proc_details >= 1) {
/* this is what drbd_rs_should_slow_down() uses */
@@ -188,7 +188,7 @@ static void drbd_syncer_progress(struct drbd_device *device, struct seq_file *se
db = device->rs_mark_left[i] - rs_left;
dbdt = Bit2KB(db/dt);
seq_printf_with_thousands_grouping(seq, dbdt);
- seq_printf(seq, " -- ");
+ seq_puts(seq, " -- ");
}

/* --------------------- long term average ---------------------- */
@@ -200,11 +200,11 @@ static void drbd_syncer_progress(struct drbd_device *device, struct seq_file *se
db = rs_total - rs_left;
dbdt = Bit2KB(db/dt);
seq_printf_with_thousands_grouping(seq, dbdt);
- seq_printf(seq, ")");
+ seq_putc(seq, ')');

if (state.conn == C_SYNC_TARGET ||
state.conn == C_VERIFY_S) {
- seq_printf(seq, " want: ");
+ seq_puts(seq, " want: ");
seq_printf_with_thousands_grouping(seq, device->c_sync_rate);
}
seq_printf(seq, " K/sec%s\n", stalled ? " (stalled)" : "");
@@ -231,7 +231,7 @@ static void drbd_syncer_progress(struct drbd_device *device, struct seq_file *se
(unsigned long long)bm_bits * BM_SECT_PER_BIT);
if (stop_sector != 0 && stop_sector != ULLONG_MAX)
seq_printf(seq, " stop sector: %llu", stop_sector);
- seq_printf(seq, "\n");
+ seq_putc(seq, '\n');
}
}

@@ -276,7 +276,7 @@ static int drbd_seq_show(struct seq_file *seq, void *v)
rcu_read_lock();
idr_for_each_entry(&drbd_devices, device, i) {
if (prev_i != i - 1)
- seq_printf(seq, "\n");
+ seq_putc(seq, '\n');
prev_i = i;

state = device->state;
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 8e7afa3..5c06286 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -25,7 +25,7 @@

#include <linux/module.h>

-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
#include <net/sock.h>

#include <linux/drbd.h>
@@ -2286,13 +2286,13 @@ static inline int overlaps(sector_t s1, int l1, sector_t s2, int l2)
static bool overlapping_resync_write(struct drbd_device *device, struct drbd_peer_request *peer_req)
{
struct drbd_peer_request *rs_req;
- bool rv = 0;
+ bool rv = false;

spin_lock_irq(&device->resource->req_lock);
list_for_each_entry(rs_req, &device->sync_ee, w.list) {
if (overlaps(peer_req->i.sector, peer_req->i.size,
rs_req->i.sector, rs_req->i.size)) {
- rv = 1;
+ rv = true;
break;
}
}
@@ -2666,7 +2666,7 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
}

out_interrupted:
- drbd_may_finish_epoch(connection, peer_req->epoch, EV_PUT + EV_CLEANUP);
+ drbd_may_finish_epoch(connection, peer_req->epoch, EV_PUT | EV_CLEANUP);
put_ldev(device);
drbd_free_peer_req(device, peer_req);
return err;
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 72bb1b1..784b02a 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1000,7 +1000,7 @@ static void complete_conflicting_writes(struct drbd_request *req)
finish_wait(&device->misc_wait, &wait);
}

-/* called within req_lock and rcu_read_lock() */
+/* called within req_lock */
static void maybe_pull_ahead(struct drbd_device *device)
{
struct drbd_connection *connection = first_peer_device(device)->connection;
diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index a1b5e6c9..aca68a5 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -2196,9 +2196,7 @@ conn_set_state(struct drbd_connection *connection, union drbd_state mask, union
ns.disk = os.disk;

rv = _drbd_set_state(device, ns, flags, NULL);
- if (rv < SS_SUCCESS)
- BUG();
-
+ BUG_ON(rv < SS_SUCCESS);
ns.i = device->state.i;
ns_max.role = max_role(ns.role, ns_max.role);
ns_max.peer = max_role(ns.peer, ns_max.peer);
diff --git a/drivers/block/drbd/drbd_state.h b/drivers/block/drbd/drbd_state.h
index bd98953..6c9d5d4 100644
--- a/drivers/block/drbd/drbd_state.h
+++ b/drivers/block/drbd/drbd_state.h
@@ -140,7 +140,7 @@ extern void drbd_resume_al(struct drbd_device *device);
extern bool conn_all_vols_unconf(struct drbd_connection *connection);

/**
- * drbd_request_state() - Reqest a state change
+ * drbd_request_state() - Request a state change
* @device: DRBD device.
* @mask: mask of state bits to change.
* @val: value of new state bits.
diff --git a/drivers/block/drbd/drbd_strings.c b/drivers/block/drbd/drbd_strings.c
index 80b0f63..0eeab14 100644
--- a/drivers/block/drbd/drbd_strings.c
+++ b/drivers/block/drbd/drbd_strings.c
@@ -26,7 +26,7 @@
#include <linux/drbd.h>
#include "drbd_strings.h"

-static const char *drbd_conn_s_names[] = {
+static const char * const drbd_conn_s_names[] = {
[C_STANDALONE] = "StandAlone",
[C_DISCONNECTING] = "Disconnecting",
[C_UNCONNECTED] = "Unconnected",
@@ -53,13 +53,13 @@ static const char *drbd_conn_s_names[] = {
[C_BEHIND] = "Behind",
};

-static const char *drbd_role_s_names[] = {
+static const char * const drbd_role_s_names[] = {
[R_PRIMARY] = "Primary",
[R_SECONDARY] = "Secondary",
[R_UNKNOWN] = "Unknown"
};

-static const char *drbd_disk_s_names[] = {
+static const char * const drbd_disk_s_names[] = {
[D_DISKLESS] = "Diskless",
[D_ATTACHING] = "Attaching",
[D_FAILED] = "Failed",
@@ -71,7 +71,7 @@ static const char *drbd_disk_s_names[] = {
[D_UP_TO_DATE] = "UpToDate",
};

-static const char *drbd_state_sw_errors[] = {
+static const char * const drbd_state_sw_errors[] = {
[-SS_TWO_PRIMARIES] = "Multiple primaries not allowed by config",
[-SS_NO_UP_TO_DATE_DISK] = "Need access to UpToDate data",
[-SS_NO_LOCAL_DISK] = "Can not resync without local disk",
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index ed422a5..9f3c550 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -173,8 +173,8 @@ void drbd_peer_request_endio(struct bio *bio)
{
struct drbd_peer_request *peer_req = bio->bi_private;
struct drbd_device *device = peer_req->peer_device->device;
- int is_write = bio_data_dir(bio) == WRITE;
- int is_discard = !!(bio->bi_rw & REQ_DISCARD);
+ bool is_write = bio_data_dir(bio) == WRITE;
+ bool is_discard = !!(bio->bi_rw & REQ_DISCARD);

if (bio->bi_error && __ratelimit(&drbd_ratelimit_state))
drbd_warn(device, "%s: error=%d s=%llus\n",
@@ -1038,7 +1038,6 @@ static void move_to_net_ee_or_free(struct drbd_device *device, struct drbd_peer_

/**
* w_e_end_data_req() - Worker callback, to send a P_DATA_REPLY packet in response to a P_DATA_REQUEST
- * @device: DRBD device.
* @w: work object.
* @cancel: The connection will be closed anyways
*/
@@ -1699,7 +1698,7 @@ static bool use_checksum_based_resync(struct drbd_connection *connection, struct
rcu_read_unlock();
return connection->agreed_pro_version >= 89 && /* supported? */
connection->csums_tfm && /* configured? */
- (csums_after_crash_only == 0 /* use for each resync? */
+ (csums_after_crash_only == false /* use for each resync? */
|| test_bit(CRASHED_PRIMARY, &device->flags)); /* or only after Primary crash? */
}

@@ -1834,7 +1833,7 @@ void drbd_start_resync(struct drbd_device *device, enum drbd_conns side)
device->bm_resync_fo = 0;
device->use_csums = use_checksum_based_resync(connection, device);
} else {
- device->use_csums = 0;
+ device->use_csums = false;
}

/* Since protocol 96, we must serialize drbd_gen_and_send_sync_uuid
diff --git a/include/linux/drbd.h b/include/linux/drbd.h
index d6b3c99..2b26156 100644
--- a/include/linux/drbd.h
+++ b/include/linux/drbd.h
@@ -370,6 +370,14 @@ enum drbd_notification_type {
NOTIFY_FLAGS = NOTIFY_CONTINUES,
};

+enum drbd_peer_state {
+ P_INCONSISTENT = 3,
+ P_OUTDATED = 4,
+ P_DOWN = 5,
+ P_PRIMARY = 6,
+ P_FENCING = 7,
+};
+
#define UUID_JUST_CREATED ((__u64)4)

enum write_ordering_e {
--
1.9.1

2016-04-25 12:18:30

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 12/30] drbd: possibly disable discard support, if backend has discard_zeroes_data=0

From: Lars Ellenberg <[email protected]>

Now that we have the discard_zeroes_if_aligned setting, we should also
check it when setting up our queue parameters on the primary,
not only on the receiving side.

We announce discard support,
UNLESS

* we are connected to a peer that does not support TRIM
on the DRBD protocol level. Otherwise, it would either discard, or
do a fallback to zero-out, depending on its backend and configuration.

* our local backend does not support discards,
or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no).

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_nl.c | 80 ++++++++++++++++++++++++++++++--------------
1 file changed, 55 insertions(+), 25 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 4a0b184..f335549 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1154,6 +1154,59 @@ static int drbd_check_al_size(struct drbd_device *device, struct disk_conf *dc)
return 0;
}

+static void blk_queue_discard_granularity(struct request_queue *q, unsigned int granularity)
+{
+ q->limits.discard_granularity = granularity;
+}
+static void decide_on_discard_support(struct drbd_device *device,
+ struct request_queue *q,
+ struct request_queue *b,
+ bool discard_zeroes_if_aligned)
+{
+ /* q = drbd device queue (device->rq_queue)
+ * b = backing device queue (device->ldev->backing_bdev->bd_disk->queue),
+ * or NULL if diskless
+ */
+ struct drbd_connection *connection = first_peer_device(device)->connection;
+ bool can_do = b ? blk_queue_discard(b) : true;
+
+ if (can_do && b && !b->limits.discard_zeroes_data && !discard_zeroes_if_aligned) {
+ can_do = false;
+ drbd_info(device, "discard_zeroes_data=0 and discard_zeroes_if_aligned=no: disabling discards\n");
+ }
+ if (can_do && connection->cstate >= C_CONNECTED && !(connection->agreed_features & FF_TRIM)) {
+ can_do = false;
+ drbd_info(connection, "peer DRBD too old, does not support TRIM: disabling discards\n");
+ }
+ if (can_do) {
+ /* We don't care for the granularity, really.
+ * Stacking limits below should fix it for the local
+ * device. Whether or not it is a suitable granularity
+ * on the remote device is not our problem, really. If
+ * you care, you need to use devices with similar
+ * topology on all peers. */
+ blk_queue_discard_granularity(q, 512);
+ q->limits.max_discard_sectors = DRBD_MAX_DISCARD_SECTORS;
+ queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+ } else {
+ queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
+ blk_queue_discard_granularity(q, 0);
+ q->limits.max_discard_sectors = 0;
+ }
+}
+
+static void fixup_discard_if_not_supported(struct request_queue *q)
+{
+ /* To avoid confusion, if this queue does not support discard, clear
+ * max_discard_sectors, which is what lsblk -D reports to the user.
+ * Older kernels got this wrong in "stack limits".
+ * */
+ if (!blk_queue_discard(q)) {
+ blk_queue_max_discard_sectors(q, 0);
+ blk_queue_discard_granularity(q, 0);
+ }
+}
+
static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backing_dev *bdev,
unsigned int max_bio_size)
{
@@ -1183,26 +1236,8 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
/* This is the workaround for "bio would need to, but cannot, be split" */
blk_queue_max_segments(q, max_segments ? max_segments : BLK_MAX_SEGMENTS);
blk_queue_segment_boundary(q, PAGE_SIZE-1);
-
+ decide_on_discard_support(device, q, b, discard_zeroes_if_aligned);
if (b) {
- struct drbd_connection *connection = first_peer_device(device)->connection;
-
- blk_queue_max_discard_sectors(q, DRBD_MAX_DISCARD_SECTORS);
-
- if (blk_queue_discard(b) && (b->limits.discard_zeroes_data || discard_zeroes_if_aligned) &&
- (connection->cstate < C_CONNECTED || connection->agreed_features & FF_TRIM)) {
- /* We don't care, stacking below should fix it for the local device.
- * Whether or not it is a suitable granularity on the remote device
- * is not our problem, really. If you care, you need to
- * use devices with similar topology on all peers. */
- q->limits.discard_granularity = 512;
- queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
- } else {
- blk_queue_max_discard_sectors(q, 0);
- queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
- q->limits.discard_granularity = 0;
- }
-
blk_queue_stack_limits(q, b);

if (q->backing_dev_info.ra_pages != b->backing_dev_info.ra_pages) {
@@ -1212,12 +1247,7 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
q->backing_dev_info.ra_pages = b->backing_dev_info.ra_pages;
}
}
- /* To avoid confusion, if this queue does not support discard, clear
- * max_discard_sectors, which is what lsblk -D reports to the user. */
- if (!blk_queue_discard(q)) {
- blk_queue_max_discard_sectors(q, 0);
- q->limits.discard_granularity = 0;
- }
+ fixup_discard_if_not_supported(q);
}

void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev)
--
1.9.1

2016-04-25 12:18:28

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 23/30] drbd: sync_handshake: handle identical uuids with current (frozen) Primary

From: Lars Ellenberg <[email protected]>

If in a two-primary scenario, we lost our peer, freeze IO,
and are still frozen (no UUID rotation) when the peer comes back
as Secondary after a hard crash, we will see identical UUIDs.

The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
arbitration, but that would cause the still running (but frozen) Primary
to become SyncTarget (which it typically refuses), and the handshake is
declined.

Fix: check current roles.
If we have *one* current primary, the Primary wins.
(rule_nr = 41)

Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
to determine if rule_nr = 41 can be applied.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_receiver.c | 47 +++++++++++++++++++++++++++++++++++---
1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 1320bb8..8e7afa3 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3181,7 +3181,8 @@ static void drbd_uuid_dump(struct drbd_device *device, char *text, u64 *uuid,
-1091 requires proto 91
-1096 requires proto 96
*/
-static int drbd_uuid_compare(struct drbd_device *const device, int *rule_nr) __must_hold(local)
+
+static int drbd_uuid_compare(struct drbd_device *const device, enum drbd_role const peer_role, int *rule_nr) __must_hold(local)
{
struct drbd_peer_device *const peer_device = first_peer_device(device);
struct drbd_connection *const connection = peer_device ? peer_device->connection : NULL;
@@ -3261,8 +3262,39 @@ static int drbd_uuid_compare(struct drbd_device *const device, int *rule_nr) __m
* next bit (weight 2) is set when peer was primary */
*rule_nr = 40;

+ /* Neither has the "crashed primary" flag set,
+ * only a replication link hickup. */
+ if (rct == 0)
+ return 0;
+
+ /* Current UUID equal and no bitmap uuid; does not necessarily
+ * mean this was a "simultaneous hard crash", maybe IO was
+ * frozen, so no UUID-bump happened.
+ * This is a protocol change, overload DRBD_FF_WSAME as flag
+ * for "new-enough" peer DRBD version. */
+ if (device->state.role == R_PRIMARY || peer_role == R_PRIMARY) {
+ *rule_nr = 41;
+ if (!(connection->agreed_features & DRBD_FF_WSAME)) {
+ drbd_warn(peer_device, "Equivalent unrotated UUIDs, but current primary present.\n");
+ return -(0x10000 | PRO_VERSION_MAX | (DRBD_FF_WSAME << 8));
+ }
+ if (device->state.role == R_PRIMARY && peer_role == R_PRIMARY) {
+ /* At least one has the "crashed primary" bit set,
+ * both are primary now, but neither has rotated its UUIDs?
+ * "Can not happen." */
+ drbd_err(peer_device, "Equivalent unrotated UUIDs, but both are primary. Can not resolve this.\n");
+ return -100;
+ }
+ if (device->state.role == R_PRIMARY)
+ return 1;
+ return -1;
+ }
+
+ /* Both are secondary.
+ * Really looks like recovery from simultaneous hard crash.
+ * Check which had been primary before, and arbitrate. */
switch (rct) {
- case 0: /* !self_pri && !peer_pri */ return 0;
+ case 0: /* !self_pri && !peer_pri */ return 0; /* already handled */
case 1: /* self_pri && !peer_pri */ return 1;
case 2: /* !self_pri && peer_pri */ return -1;
case 3: /* self_pri && peer_pri */
@@ -3389,7 +3421,7 @@ static enum drbd_conns drbd_sync_handshake(struct drbd_peer_device *peer_device,
drbd_uuid_dump(device, "peer", device->p_uuid,
device->p_uuid[UI_SIZE], device->p_uuid[UI_FLAGS]);

- hg = drbd_uuid_compare(device, &rule_nr);
+ hg = drbd_uuid_compare(device, peer_role, &rule_nr);
spin_unlock_irq(&device->ldev->md.uuid_lock);

drbd_info(device, "uuid_compare()=%d by rule %d\n", hg, rule_nr);
@@ -3398,6 +3430,15 @@ static enum drbd_conns drbd_sync_handshake(struct drbd_peer_device *peer_device,
drbd_alert(device, "Unrelated data, aborting!\n");
return C_MASK;
}
+ if (hg < -0x10000) {
+ int proto, fflags;
+ hg = -hg;
+ proto = hg & 0xff;
+ fflags = (hg >> 8) & 0xff;
+ drbd_alert(device, "To resolve this both sides have to support at least protocol %d and feature flags 0x%x\n",
+ proto, fflags);
+ return C_MASK;
+ }
if (hg < -1000) {
drbd_alert(device, "To resolve this both sides have to support at least protocol %d\n", -hg - 1000);
return C_MASK;
--
1.9.1

2016-04-25 12:18:26

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 07/30] drbd: adjust assert in w_bitmap_io to account for BM_LOCKED_CHANGE_ALLOWED

From: Lars Ellenberg <[email protected]>

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_main.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 3cecc4f..3127bba 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3521,7 +3521,12 @@ static int w_bitmap_io(struct drbd_work *w, int unused)
struct bm_io_work *work = &device->bm_io_work;
int rv = -EIO;

- D_ASSERT(device, atomic_read(&device->ap_bio_cnt) == 0);
+ if (work->flags != BM_LOCKED_CHANGE_ALLOWED) {
+ int cnt = atomic_read(&device->ap_bio_cnt);
+ if (cnt)
+ drbd_err(device, "FIXME: ap_bio_cnt %d, expected 0; queued for '%s'\n",
+ cnt, work->why);
+ }

if (get_ldev(device)) {
drbd_bm_lock(device, work->why, work->flags);
--
1.9.1

2016-04-25 12:19:45

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 13/30] drbd: zero-out partial unaligned discards on local backend

From: Lars Ellenberg <[email protected]>

For consistency, also zero-out partial unaligned chunks of discard
requests on the local backend.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_int.h | 2 ++
drivers/block/drbd/drbd_req.c | 29 +++++++++++++++++++++++------
2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 8cc2955..d818e7d 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1553,6 +1553,8 @@ extern void start_resync_timer_fn(unsigned long data);
extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);

/* drbd_receiver.c */
+extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
+ sector_t start, unsigned int nr_sectors, bool discard);
extern int drbd_receiver(struct drbd_thread *thi);
extern int drbd_ack_receiver(struct drbd_thread *thi);
extern void drbd_send_ping_wf(struct work_struct *ws);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 6dbf1f1..7e441ff 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1156,6 +1156,16 @@ static int drbd_process_write_request(struct drbd_request *req)
return remote;
}

+static void drbd_process_discard_req(struct drbd_request *req)
+{
+ int err = drbd_issue_discard_or_zero_out(req->device,
+ req->i.sector, req->i.size >> 9, true);
+
+ if (err)
+ req->private_bio->bi_error = -EIO;
+ bio_endio(req->private_bio);
+}
+
static void
drbd_submit_req_private_bio(struct drbd_request *req)
{
@@ -1176,6 +1186,8 @@ drbd_submit_req_private_bio(struct drbd_request *req)
: rw == READ ? DRBD_FAULT_DT_RD
: DRBD_FAULT_DT_RA))
bio_io_error(bio);
+ else if (bio->bi_rw & REQ_DISCARD)
+ drbd_process_discard_req(req);
else
generic_make_request(bio);
put_ldev(device);
@@ -1227,18 +1239,23 @@ drbd_request_prepare(struct drbd_device *device, struct bio *bio, unsigned long
/* Update disk stats */
_drbd_start_io_acct(device, req);

+ /* process discards always from our submitter thread */
+ if (bio->bi_rw & REQ_DISCARD)
+ goto queue_for_submitter_thread;
+
if (rw == WRITE && req->private_bio && req->i.size
&& !test_bit(AL_SUSPENDED, &device->flags)) {
- if (!drbd_al_begin_io_fastpath(device, &req->i)) {
- atomic_inc(&device->ap_actlog_cnt);
- drbd_queue_write(device, req);
- return NULL;
- }
+ if (!drbd_al_begin_io_fastpath(device, &req->i))
+ goto queue_for_submitter_thread;
req->rq_state |= RQ_IN_ACT_LOG;
req->in_actlog_jif = jiffies;
}
-
return req;
+
+ queue_for_submitter_thread:
+ atomic_inc(&device->ap_actlog_cnt);
+ drbd_queue_write(device, req);
+ return NULL;
}

static void drbd_send_and_submit(struct drbd_device *device, struct drbd_request *req)
--
1.9.1

2016-04-25 12:20:24

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 19/30] drbd: only restart frozen disk io when D_UP_TO_DATE

From: Lars Ellenberg <[email protected]>

When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
backend that is being attached as D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 59c6467..24422e8 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1675,7 +1675,7 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,
what = RESEND;

if ((os.disk == D_ATTACHING || os.disk == D_NEGOTIATING) &&
- conn_lowest_disk(connection) > D_NEGOTIATING)
+ conn_lowest_disk(connection) == D_UP_TO_DATE)
what = RESTART_FROZEN_DISK_IO;

if (resource->susp_nod && what != NOTHING) {
--
1.9.1

2016-04-25 12:20:22

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 28/30] drbd: finally report ms, not jiffies, in log message

From: Lars Ellenberg <[email protected]>

Also skip the message unless bitmap IO took longer than 5 ms.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_bitmap.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 17e5e60..801b8f3 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -1121,10 +1121,14 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
kref_put(&ctx->kref, &drbd_bm_aio_ctx_destroy);

/* summary for global bitmap IO */
- if (flags == 0)
- drbd_info(device, "bitmap %s of %u pages took %lu jiffies\n",
- (flags & BM_AIO_READ) ? "READ" : "WRITE",
- count, jiffies - now);
+ if (flags == 0) {
+ unsigned int ms = jiffies_to_msecs(jiffies - now);
+ if (ms > 5) {
+ drbd_info(device, "bitmap %s of %u pages took %u ms\n",
+ (flags & BM_AIO_READ) ? "READ" : "WRITE",
+ count, ms);
+ }
+ }

if (ctx->error) {
drbd_alert(device, "we had at least one MD IO ERROR during bitmap IO\n");
--
1.9.1

2016-04-25 12:20:19

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 10/30] drbd: allow parallel flushes for multi-volume resources

From: Lars Ellenberg <[email protected]>

To maintain write-order fidelity accros all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.

We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_receiver.c | 113 +++++++++++++++++++++++++++++--------
1 file changed, 88 insertions(+), 25 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 4cfc721..a2e7ba9 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1204,13 +1204,83 @@ static int drbd_recv_header(struct drbd_connection *connection, struct packet_in
return err;
}

-static void drbd_flush(struct drbd_connection *connection)
+/* This is blkdev_issue_flush, but asynchronous.
+ * We want to submit to all component volumes in parallel,
+ * then wait for all completions.
+ */
+struct issue_flush_context {
+ atomic_t pending;
+ int error;
+ struct completion done;
+};
+struct one_flush_context {
+ struct drbd_device *device;
+ struct issue_flush_context *ctx;
+};
+
+void one_flush_endio(struct bio *bio)
{
- int rv;
- struct drbd_peer_device *peer_device;
- int vnr;
+ struct one_flush_context *octx = bio->bi_private;
+ struct drbd_device *device = octx->device;
+ struct issue_flush_context *ctx = octx->ctx;
+
+ if (bio->bi_error) {
+ ctx->error = bio->bi_error;
+ drbd_info(device, "local disk FLUSH FAILED with status %d\n", bio->bi_error);
+ }
+ kfree(octx);
+ bio_put(bio);
+
+ clear_bit(FLUSH_PENDING, &device->flags);
+ put_ldev(device);
+ kref_put(&device->kref, drbd_destroy_device);
+
+ if (atomic_dec_and_test(&ctx->pending))
+ complete(&ctx->done);
+}
+
+static void submit_one_flush(struct drbd_device *device, struct issue_flush_context *ctx)
+{
+ struct bio *bio = bio_alloc(GFP_NOIO, 0);
+ struct one_flush_context *octx = kmalloc(sizeof(*octx), GFP_NOIO);
+ if (!bio || !octx) {
+ drbd_warn(device, "Could not allocate a bio, CANNOT ISSUE FLUSH\n");
+ /* FIXME: what else can I do now? disconnecting or detaching
+ * really does not help to improve the state of the world, either.
+ */
+ kfree(octx);
+ if (bio)
+ bio_put(bio);

+ ctx->error = -ENOMEM;
+ put_ldev(device);
+ kref_put(&device->kref, drbd_destroy_device);
+ return;
+ }
+
+ octx->device = device;
+ octx->ctx = ctx;
+ bio->bi_bdev = device->ldev->backing_bdev;
+ bio->bi_private = octx;
+ bio->bi_end_io = one_flush_endio;
+
+ device->flush_jif = jiffies;
+ set_bit(FLUSH_PENDING, &device->flags);
+ atomic_inc(&ctx->pending);
+ submit_bio(WRITE_FLUSH, bio);
+}
+
+static void drbd_flush(struct drbd_connection *connection)
+{
if (connection->resource->write_ordering >= WO_BDEV_FLUSH) {
+ struct drbd_peer_device *peer_device;
+ struct issue_flush_context ctx;
+ int vnr;
+
+ atomic_set(&ctx.pending, 1);
+ ctx.error = 0;
+ init_completion(&ctx.done);
+
rcu_read_lock();
idr_for_each_entry(&connection->peer_devices, peer_device, vnr) {
struct drbd_device *device = peer_device->device;
@@ -1220,31 +1290,24 @@ static void drbd_flush(struct drbd_connection *connection)
kref_get(&device->kref);
rcu_read_unlock();

- /* Right now, we have only this one synchronous code path
- * for flushes between request epochs.
- * We may want to make those asynchronous,
- * or at least parallelize the flushes to the volume devices.
- */
- device->flush_jif = jiffies;
- set_bit(FLUSH_PENDING, &device->flags);
- rv = blkdev_issue_flush(device->ldev->backing_bdev,
- GFP_NOIO, NULL);
- clear_bit(FLUSH_PENDING, &device->flags);
- if (rv) {
- drbd_info(device, "local disk flush failed with status %d\n", rv);
- /* would rather check on EOPNOTSUPP, but that is not reliable.
- * don't try again for ANY return value != 0
- * if (rv == -EOPNOTSUPP) */
- drbd_bump_write_ordering(connection->resource, NULL, WO_DRAIN_IO);
- }
- put_ldev(device);
- kref_put(&device->kref, drbd_destroy_device);
+ submit_one_flush(device, &ctx);

rcu_read_lock();
- if (rv)
- break;
}
rcu_read_unlock();
+
+ /* Do we want to add a timeout,
+ * if disk-timeout is set? */
+ if (!atomic_dec_and_test(&ctx.pending))
+ wait_for_completion(&ctx.done);
+
+ if (ctx.error) {
+ /* would rather check on EOPNOTSUPP, but that is not reliable.
+ * don't try again for ANY return value != 0
+ * if (rv == -EOPNOTSUPP) */
+ /* Any error is already reported by bio_endio callback. */
+ drbd_bump_write_ordering(connection->resource, NULL, WO_DRAIN_IO);
+ }
}
}

--
1.9.1

2016-04-25 12:20:17

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 02/30] drbd: change bitmap write-out when leaving resync states

From: Lars Ellenberg <[email protected]>

When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.

When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().

The bitmap write-out for resync -> ahead/behind was missing completely before.

Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_receiver.c | 8 +++++---
drivers/block/drbd/drbd_state.c | 9 +++++++--
2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 050aaa1..8b30ab5 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -4783,9 +4783,11 @@ static int drbd_disconnected(struct drbd_peer_device *peer_device)

drbd_md_sync(device);

- /* serialize with bitmap writeout triggered by the state change,
- * if any. */
- wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
+ if (get_ldev(device)) {
+ drbd_bitmap_io(device, &drbd_bm_write_copy_pages,
+ "write from disconnected", BM_LOCKED_CHANGE_ALLOWED);
+ put_ldev(device);
+ }

/* tcp_close and release of sendpage pages can be deferred. I don't
* want to use SO_LINGER, because apparently it can be deferred for
diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 5a7ef78..59c6467 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -1934,12 +1934,17 @@ static void after_state_ch(struct drbd_device *device, union drbd_state os,

/* This triggers bitmap writeout of potentially still unwritten pages
* if the resync finished cleanly, or aborted because of peer disk
- * failure, or because of connection loss.
+ * failure, or on transition from resync back to AHEAD/BEHIND.
+ *
+ * Connection loss is handled in drbd_disconnected() by the receiver.
+ *
* For resync aborted because of local disk failure, we cannot do
* any bitmap writeout anymore.
+ *
* No harm done if some bits change during this phase.
*/
- if (os.conn > C_CONNECTED && ns.conn <= C_CONNECTED && get_ldev(device)) {
+ if ((os.conn > C_CONNECTED && os.conn < C_AHEAD) &&
+ (ns.conn == C_CONNECTED || ns.conn >= C_AHEAD) && get_ldev(device)) {
drbd_queue_bitmap_io(device, &drbd_bm_write_copy_pages, NULL,
"write from resync_finished", BM_LOCKED_CHANGE_ALLOWED);
put_ldev(device);
--
1.9.1

2016-04-25 12:20:15

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 08/30] drbd: fix regression: protocol A sometimes synchronous, C sometimes double-latency

From: Lars Ellenberg <[email protected]>

Regression introduced with 8.4.5
drbd: application writes may set-in-sync in protocol != C

Overwriting the same block (LBA) while a former version is still
"in-flight" to the peer (to be exact: we did not receive the
P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that
former version to be acknowledged by the peer.

In synchronous and quasi-synchronous protocols C and B,
this may double the latency on overwrites.

With protocol A, which is supposed to be asynchronous and only wait for
local completion, it is even worse: it would make overwrites
quasi-synchronous, they would be hit by the full RTT, which protocol A
was specifically meant to avoid, and possibly the additional time it
takes to drain the buffers first.

Particularly bad for databases, or anything else that
does frequent updates to the same blocks (various file system meta data).

No impact if >= rtt passes between updates to the same block.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_req.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 2255dcf..6dbf1f1 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -977,16 +977,20 @@ static void complete_conflicting_writes(struct drbd_request *req)
sector_t sector = req->i.sector;
int size = req->i.size;

- i = drbd_find_overlap(&device->write_requests, sector, size);
- if (!i)
- return;
-
for (;;) {
- prepare_to_wait(&device->misc_wait, &wait, TASK_UNINTERRUPTIBLE);
- i = drbd_find_overlap(&device->write_requests, sector, size);
- if (!i)
+ drbd_for_each_overlap(i, &device->write_requests, sector, size) {
+ /* Ignore, if already completed to upper layers. */
+ if (i->completed)
+ continue;
+ /* Handle the first found overlap. After the schedule
+ * we have to restart the tree walk. */
break;
+ }
+ if (!i) /* if any */
+ break;
+
/* Indicate to wake up device->misc_wait on progress. */
+ prepare_to_wait(&device->misc_wait, &wait, TASK_UNINTERRUPTIBLE);
i->waiting = true;
spin_unlock_irq(&device->resource->req_lock);
schedule();
--
1.9.1

2016-04-25 12:20:14

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

As long as the value is 0 the feature is disabled. With setting
it to a positive value, DRBD limits and aligns its resync requests
to the rs-discard-granularity setting. If the sync source detects
all zeros in such a block, the resync target discards the range
on disk.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_nl.c | 32 +++++++++++++++++++++++++++++---
include/linux/drbd_genl.h | 6 +++---
include/linux/drbd_limits.h | 6 ++++++
3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index bb71e73..ee4642f 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1348,12 +1348,38 @@ static bool write_ordering_changed(struct disk_conf *a, struct disk_conf *b)
a->disk_drain != b->disk_drain;
}

-static void sanitize_disk_conf(struct disk_conf *disk_conf, struct drbd_backing_dev *nbc)
+static void sanitize_disk_conf(struct drbd_device *device, struct disk_conf *disk_conf,
+ struct drbd_backing_dev *nbc)
{
+ struct request_queue * const q = nbc->backing_bdev->bd_disk->queue;
+
if (disk_conf->al_extents < DRBD_AL_EXTENTS_MIN)
disk_conf->al_extents = DRBD_AL_EXTENTS_MIN;
if (disk_conf->al_extents > drbd_al_extents_max(nbc))
disk_conf->al_extents = drbd_al_extents_max(nbc);
+
+ if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) {
+ disk_conf->rs_discard_granularity = 0; /* disable feature */
+ drbd_info(device, "rs_discard_granularity feature disabled\n");
+ }
+
+ if (disk_conf->rs_discard_granularity) {
+ int orig_value = disk_conf->rs_discard_granularity;
+ int remainder;
+
+ if (q->limits.discard_granularity > disk_conf->rs_discard_granularity)
+ disk_conf->rs_discard_granularity = q->limits.discard_granularity;
+
+ remainder = disk_conf->rs_discard_granularity % q->limits.discard_granularity;
+ disk_conf->rs_discard_granularity += remainder;
+
+ if (disk_conf->rs_discard_granularity > q->limits.max_discard_sectors << 9)
+ disk_conf->rs_discard_granularity = q->limits.max_discard_sectors << 9;
+
+ if (disk_conf->rs_discard_granularity != orig_value)
+ drbd_info(device, "rs_discard_granularity changed to %d\n",
+ disk_conf->rs_discard_granularity);
+ }
}

int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
@@ -1403,7 +1429,7 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
if (!expect(new_disk_conf->resync_rate >= 1))
new_disk_conf->resync_rate = 1;

- sanitize_disk_conf(new_disk_conf, device->ldev);
+ sanitize_disk_conf(device, new_disk_conf, device->ldev);

if (new_disk_conf->c_plan_ahead > DRBD_C_PLAN_AHEAD_MAX)
new_disk_conf->c_plan_ahead = DRBD_C_PLAN_AHEAD_MAX;
@@ -1698,7 +1724,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
if (retcode != NO_ERROR)
goto fail;

- sanitize_disk_conf(new_disk_conf, nbc);
+ sanitize_disk_conf(device, new_disk_conf, nbc);

if (drbd_get_max_capacity(nbc) < new_disk_conf->disk_size) {
drbd_err(device, "max capacity %llu smaller than disk size %llu\n",
diff --git a/include/linux/drbd_genl.h b/include/linux/drbd_genl.h
index 2d0e5ad..ab649d8 100644
--- a/include/linux/drbd_genl.h
+++ b/include/linux/drbd_genl.h
@@ -123,14 +123,14 @@ GENL_struct(DRBD_NLA_DISK_CONF, 3, disk_conf,
__u32_field_def(13, DRBD_GENLA_F_MANDATORY, c_fill_target, DRBD_C_FILL_TARGET_DEF)
__u32_field_def(14, DRBD_GENLA_F_MANDATORY, c_max_rate, DRBD_C_MAX_RATE_DEF)
__u32_field_def(15, DRBD_GENLA_F_MANDATORY, c_min_rate, DRBD_C_MIN_RATE_DEF)
+ __u32_field_def(20, DRBD_GENLA_F_MANDATORY, disk_timeout, DRBD_DISK_TIMEOUT_DEF)
+ __u32_field_def(21, 0 /* OPTIONAL */, read_balancing, DRBD_READ_BALANCING_DEF)
+ __u32_field_def(25, 0 /* OPTIONAL */, rs_discard_granularity, DRBD_RS_DISCARD_GRANULARITY_DEF)

__flg_field_def(16, DRBD_GENLA_F_MANDATORY, disk_barrier, DRBD_DISK_BARRIER_DEF)
__flg_field_def(17, DRBD_GENLA_F_MANDATORY, disk_flushes, DRBD_DISK_FLUSHES_DEF)
__flg_field_def(18, DRBD_GENLA_F_MANDATORY, disk_drain, DRBD_DISK_DRAIN_DEF)
__flg_field_def(19, DRBD_GENLA_F_MANDATORY, md_flushes, DRBD_MD_FLUSHES_DEF)
- __u32_field_def(20, DRBD_GENLA_F_MANDATORY, disk_timeout, DRBD_DISK_TIMEOUT_DEF)
- __u32_field_def(21, 0 /* OPTIONAL */, read_balancing, DRBD_READ_BALANCING_DEF)
- /* 9: __u32_field_def(22, DRBD_GENLA_F_MANDATORY, unplug_watermark, DRBD_UNPLUG_WATERMARK_DEF) */
__flg_field_def(23, 0 /* OPTIONAL */, al_updates, DRBD_AL_UPDATES_DEF)
)

diff --git a/include/linux/drbd_limits.h b/include/linux/drbd_limits.h
index 8ac8c5d..2e4aad8 100644
--- a/include/linux/drbd_limits.h
+++ b/include/linux/drbd_limits.h
@@ -230,4 +230,10 @@
#define DRBD_SOCKET_CHECK_TIMEO_MAX DRBD_PING_TIMEO_MAX
#define DRBD_SOCKET_CHECK_TIMEO_DEF 0
#define DRBD_SOCKET_CHECK_TIMEO_SCALE '1'
+
+#define DRBD_RS_DISCARD_GRANULARITY_MIN 0
+#define DRBD_RS_DISCARD_GRANULARITY_MAX (1<<20) /* 1MiByte */
+#define DRBD_RS_DISCARD_GRANULARITY_DEF 0 /* disabled by default */
+#define DRBD_RS_DISCARD_GRANULARITY_SCALE '1' /* bytes */
+
#endif
--
1.9.1

2016-04-25 12:20:12

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 30/30] drbd: correctly handle failed crypto_alloc_hash

From: Lars Ellenberg <[email protected]>

crypto_alloc_hash returns an ERR_PTR(), not NULL.

Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
on an errno in the cleanup path.

Reported-by: Insu Yun <[email protected]>

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_receiver.c | 3 ++-
include/linux/drbd.h | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 5c06286..80a6aff 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3668,7 +3668,8 @@ static int receive_protocol(struct drbd_connection *connection, struct packet_in
*/

peer_integrity_tfm = crypto_alloc_ahash(integrity_alg, 0, CRYPTO_ALG_ASYNC);
- if (!peer_integrity_tfm) {
+ if (IS_ERR(peer_integrity_tfm)) {
+ peer_integrity_tfm = NULL;
drbd_err(connection, "peer data-integrity-alg %s not supported\n",
integrity_alg);
goto disconnect;
diff --git a/include/linux/drbd.h b/include/linux/drbd.h
index 2b26156..002611c 100644
--- a/include/linux/drbd.h
+++ b/include/linux/drbd.h
@@ -51,7 +51,7 @@
#endif

extern const char *drbd_buildtag(void);
-#define REL_VERSION "8.4.6"
+#define REL_VERSION "8.4.7"
#define API_VERSION 1
#define PRO_VERSION_MIN 86
#define PRO_VERSION_MAX 101
--
1.9.1

2016-04-25 12:20:10

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 04/30] drbd: Implement handling of thinly provisioned storage on resync target nodes

If during resync we read only zeroes for a range of sectors assume
that these secotors can be discarded on the sync target node.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_int.h | 5 +++
drivers/block/drbd/drbd_main.c | 18 ++++++++
drivers/block/drbd/drbd_protocol.h | 4 ++
drivers/block/drbd/drbd_receiver.c | 88 ++++++++++++++++++++++++++++++++++++--
drivers/block/drbd/drbd_worker.c | 29 ++++++++++++-
5 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 7a1cf7e..1a93f4f 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -471,6 +471,9 @@ enum {
/* this originates from application on peer
* (not some resync or verify or other DRBD internal request) */
__EE_APPLICATION,
+
+ /* If it contains only 0 bytes, send back P_RS_DEALLOCATED */
+ __EE_RS_THIN_REQ,
};
#define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
#define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
@@ -485,6 +488,7 @@ enum {
#define EE_SUBMITTED (1<<__EE_SUBMITTED)
#define EE_WRITE (1<<__EE_WRITE)
#define EE_APPLICATION (1<<__EE_APPLICATION)
+#define EE_RS_THIN_REQ (1<<__EE_RS_THIN_REQ)

/* flag bits per device */
enum {
@@ -1123,6 +1127,7 @@ extern int drbd_send_ov_request(struct drbd_peer_device *, sector_t sector, int
extern int drbd_send_bitmap(struct drbd_device *device);
extern void drbd_send_sr_reply(struct drbd_peer_device *, enum drbd_state_rv retcode);
extern void conn_send_sr_reply(struct drbd_connection *connection, enum drbd_state_rv retcode);
+extern int drbd_send_rs_deallocated(struct drbd_peer_device *, struct drbd_peer_request *);
extern void drbd_backing_dev_free(struct drbd_device *device, struct drbd_backing_dev *ldev);
extern void drbd_device_cleanup(struct drbd_device *device);
void drbd_print_uuids(struct drbd_device *device, const char *text);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 802d729..3cecc4f 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1377,6 +1377,22 @@ int drbd_send_ack_ex(struct drbd_peer_device *peer_device, enum drbd_packet cmd,
cpu_to_be64(block_id));
}

+int drbd_send_rs_deallocated(struct drbd_peer_device *peer_device,
+ struct drbd_peer_request *peer_req)
+{
+ struct drbd_socket *sock;
+ struct p_block_desc *p;
+
+ sock = &peer_device->connection->data;
+ p = drbd_prepare_command(peer_device, sock);
+ if (!p)
+ return -EIO;
+ p->sector = cpu_to_be64(peer_req->i.sector);
+ p->blksize = cpu_to_be32(peer_req->i.size);
+ p->pad = 0;
+ return drbd_send_command(peer_device, sock, P_RS_DEALLOCATED, sizeof(*p), NULL, 0);
+}
+
int drbd_send_drequest(struct drbd_peer_device *peer_device, int cmd,
sector_t sector, int size, u64 block_id)
{
@@ -3681,6 +3697,8 @@ const char *cmdname(enum drbd_packet cmd)
[P_CONN_ST_CHG_REPLY] = "conn_st_chg_reply",
[P_RETRY_WRITE] = "retry_write",
[P_PROTOCOL_UPDATE] = "protocol_update",
+ [P_RS_THIN_REQ] = "rs_thin_req",
+ [P_RS_DEALLOCATED] = "rs_deallocated",

/* enum drbd_packet, but not commands - obsoleted flags:
* P_MAY_IGNORE
diff --git a/drivers/block/drbd/drbd_protocol.h b/drivers/block/drbd/drbd_protocol.h
index ef92453..e5e74e3 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -60,6 +60,10 @@ enum drbd_packet {
* which is why I chose TRIM here, to disambiguate. */
P_TRIM = 0x31,

+ /* Only use these two if both support FF_THIN_RESYNC */
+ P_RS_THIN_REQ = 0x32, /* Request a block for resync or reply P_RS_DEALLOCATED */
+ P_RS_DEALLOCATED = 0x33, /* Contains only zeros on sync source node */
+
P_MAY_IGNORE = 0x100, /* Flag to test if (cmd > P_MAY_IGNORE) ... */
P_MAX_OPT_CMD = 0x101,

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 8b30ab5..3a6c2ec 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1417,9 +1417,15 @@ int drbd_submit_peer_request(struct drbd_device *device,
* so we can find it to present it in debugfs */
peer_req->submit_jif = jiffies;
peer_req->flags |= EE_SUBMITTED;
- spin_lock_irq(&device->resource->req_lock);
- list_add_tail(&peer_req->w.list, &device->active_ee);
- spin_unlock_irq(&device->resource->req_lock);
+
+ /* If this was a resync request from receive_rs_deallocated(),
+ * it is already on the sync_ee list */
+ if (list_empty(&peer_req->w.list)) {
+ spin_lock_irq(&device->resource->req_lock);
+ list_add_tail(&peer_req->w.list, &device->active_ee);
+ spin_unlock_irq(&device->resource->req_lock);
+ }
+
if (blkdev_issue_zeroout(device->ldev->backing_bdev,
sector, data_size >> 9, GFP_NOIO, false))
peer_req->flags |= EE_WAS_ERROR;
@@ -2574,6 +2580,7 @@ static int receive_DataRequest(struct drbd_connection *connection, struct packet
case P_DATA_REQUEST:
drbd_send_ack_rp(peer_device, P_NEG_DREPLY, p);
break;
+ case P_RS_THIN_REQ:
case P_RS_DATA_REQUEST:
case P_CSUM_RS_REQUEST:
case P_OV_REQUEST:
@@ -2613,6 +2620,12 @@ static int receive_DataRequest(struct drbd_connection *connection, struct packet
peer_req->flags |= EE_APPLICATION;
goto submit;

+ case P_RS_THIN_REQ:
+ /* If at some point in the future we have a smart way to
+ find out if this data block is completely deallocated,
+ then we would do something smarter here than reading
+ the block... */
+ peer_req->flags |= EE_RS_THIN_REQ;
case P_RS_DATA_REQUEST:
peer_req->w.cb = w_e_end_rsdata_req;
fault_type = DRBD_FAULT_RS_RD;
@@ -4587,6 +4600,72 @@ static int receive_out_of_sync(struct drbd_connection *connection, struct packet
return 0;
}

+static int receive_rs_deallocated(struct drbd_connection *connection, struct packet_info *pi)
+{
+ struct drbd_peer_device *peer_device;
+ struct p_block_desc *p = pi->data;
+ struct drbd_device *device;
+ sector_t sector;
+ int size, err = 0;
+
+ peer_device = conn_peer_device(connection, pi->vnr);
+ if (!peer_device)
+ return -EIO;
+ device = peer_device->device;
+
+ sector = be64_to_cpu(p->sector);
+ size = be32_to_cpu(p->blksize);
+
+ dec_rs_pending(device);
+
+ if (get_ldev(device)) {
+ struct drbd_peer_request *peer_req;
+ const int rw = REQ_WRITE | REQ_DISCARD;
+
+ peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector,
+ size, false, GFP_NOIO);
+ if (!peer_req) {
+ put_ldev(device);
+ return -ENOMEM;
+ }
+
+ peer_req->w.cb = e_end_resync_block;
+ peer_req->submit_jif = jiffies;
+ peer_req->flags |= EE_IS_TRIM;
+
+ spin_lock_irq(&device->resource->req_lock);
+ list_add_tail(&peer_req->w.list, &device->sync_ee);
+ spin_unlock_irq(&device->resource->req_lock);
+
+ atomic_add(pi->size >> 9, &device->rs_sect_ev);
+ err = drbd_submit_peer_request(device, peer_req, rw, DRBD_FAULT_RS_WR);
+
+ if (err) {
+ spin_lock_irq(&device->resource->req_lock);
+ list_del(&peer_req->w.list);
+ spin_unlock_irq(&device->resource->req_lock);
+
+ drbd_free_peer_req(device, peer_req);
+ put_ldev(device);
+ err = 0;
+ goto fail;
+ }
+
+ inc_unacked(device);
+
+ /* No put_ldev() here. Gets called in drbd_endio_write_sec_final(),
+ as well as drbd_rs_complete_io() */
+ } else {
+ fail:
+ drbd_rs_complete_io(device, sector);
+ drbd_send_ack_ex(peer_device, P_NEG_ACK, sector, size, ID_SYNCER);
+ }
+
+ atomic_add(size >> 9, &device->rs_sect_in);
+
+ return err;
+}
+
struct data_cmd {
int expect_payload;
size_t pkt_size;
@@ -4614,11 +4693,14 @@ static struct data_cmd drbd_cmd_handler[] = {
[P_OV_REQUEST] = { 0, sizeof(struct p_block_req), receive_DataRequest },
[P_OV_REPLY] = { 1, sizeof(struct p_block_req), receive_DataRequest },
[P_CSUM_RS_REQUEST] = { 1, sizeof(struct p_block_req), receive_DataRequest },
+ [P_RS_THIN_REQ] = { 0, sizeof(struct p_block_req), receive_DataRequest },
[P_DELAY_PROBE] = { 0, sizeof(struct p_delay_probe93), receive_skip },
[P_OUT_OF_SYNC] = { 0, sizeof(struct p_block_desc), receive_out_of_sync },
[P_CONN_ST_CHG_REQ] = { 0, sizeof(struct p_req_state), receive_req_conn_state },
[P_PROTOCOL_UPDATE] = { 1, sizeof(struct p_protocol), receive_protocol },
[P_TRIM] = { 0, sizeof(struct p_trim), receive_Data },
+ [P_RS_DEALLOCATED] = { 0, sizeof(struct p_block_desc), receive_rs_deallocated },
+
};

static void drbdd(struct drbd_connection *connection)
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 4d87499..01d74ee2 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -1035,6 +1035,30 @@ int w_e_end_data_req(struct drbd_work *w, int cancel)
return err;
}

+static bool all_zero(struct drbd_peer_request *peer_req)
+{
+ struct page *page = peer_req->pages;
+ unsigned int len = peer_req->i.size;
+
+ page_chain_for_each(page) {
+ unsigned int l = min_t(unsigned int, len, PAGE_SIZE);
+ unsigned int i, words = l / sizeof(long);
+ unsigned long *d;
+
+ d = kmap_atomic(page);
+ for (i = 0; i < words; i++) {
+ if (d[i]) {
+ kunmap_atomic(d);
+ return false;
+ }
+ }
+ kunmap_atomic(d);
+ len -= l;
+ }
+
+ return true;
+}
+
/**
* w_e_end_rsdata_req() - Worker callback to send a P_RS_DATA_REPLY packet in response to a P_RS_DATA_REQUEST
* @w: work object.
@@ -1063,7 +1087,10 @@ int w_e_end_rsdata_req(struct drbd_work *w, int cancel)
} else if (likely((peer_req->flags & EE_WAS_ERROR) == 0)) {
if (likely(device->state.pdsk >= D_INCONSISTENT)) {
inc_rs_pending(device);
- err = drbd_send_block(peer_device, P_RS_DATA_REPLY, peer_req);
+ if (peer_req->flags & EE_RS_THIN_REQ && all_zero(peer_req))
+ err = drbd_send_rs_deallocated(peer_device, peer_req);
+ else
+ err = drbd_send_block(peer_device, P_RS_DATA_REPLY, peer_req);
} else {
if (__ratelimit(&drbd_ratelimit_state))
drbd_err(device, "Not sending RSDataReply, "
--
1.9.1

2016-04-25 12:20:09

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 15/30] drbd: finish resync on sync source only by notification from sync target

From: Lars Ellenberg <[email protected]>

If the replication link breaks exactly during "resync finished" detection,
finishing too early on the sync source could again lead to UUIDs rotated
too fast, and potentially a spurious full resync on next handshake.

Always wait for explicit resync finished state change notification from
the sync target.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 16 ++++++++++++----
drivers/block/drbd/drbd_int.h | 19 ++++++++++++++-----
2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 1664762..4e07cff 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -768,10 +768,18 @@ static bool lazy_bitmap_update_due(struct drbd_device *device)

static void maybe_schedule_on_disk_bitmap_update(struct drbd_device *device, bool rs_done)
{
- if (rs_done)
- set_bit(RS_DONE, &device->flags);
- /* and also set RS_PROGRESS below */
- else if (!lazy_bitmap_update_due(device))
+ if (rs_done) {
+ struct drbd_connection *connection = first_peer_device(device)->connection;
+ if (connection->agreed_pro_version <= 95 ||
+ is_sync_target_state(device->state.conn))
+ set_bit(RS_DONE, &device->flags);
+ /* and also set RS_PROGRESS below */
+
+ /* Else: rather wait for explicit notification via receive_state,
+ * to avoid uuids-rotated-too-fast causing full resync
+ * in next handshake, in case the replication link breaks
+ * at the most unfortunate time... */
+ } else if (!lazy_bitmap_update_due(device))
return;

drbd_device_post_work(device, RS_PROGRESS);
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index d82e531..451a745 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -2102,13 +2102,22 @@ static inline void _sub_unacked(struct drbd_device *device, int n, const char *f
ERR_IF_CNT_IS_NEGATIVE(unacked_cnt, func, line);
}

+static inline bool is_sync_target_state(enum drbd_conns connection_state)
+{
+ return connection_state == C_SYNC_TARGET ||
+ connection_state == C_PAUSED_SYNC_T;
+}
+
+static inline bool is_sync_source_state(enum drbd_conns connection_state)
+{
+ return connection_state == C_SYNC_SOURCE ||
+ connection_state == C_PAUSED_SYNC_S;
+}
+
static inline bool is_sync_state(enum drbd_conns connection_state)
{
- return
- (connection_state == C_SYNC_SOURCE
- || connection_state == C_SYNC_TARGET
- || connection_state == C_PAUSED_SYNC_S
- || connection_state == C_PAUSED_SYNC_T);
+ return is_sync_source_state(connection_state) ||
+ is_sync_target_state(connection_state);
}

/**
--
1.9.1

2016-04-25 12:20:07

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 11/30] drbd: when receiving P_TRIM, zero-out partial unaligned chunks

From: Lars Ellenberg <[email protected]>

We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.

The relevant use case is LVM/DM thin.

If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.

While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.

To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".

We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".

Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.

LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.

We can work around this by checking the alignment first.
For unaligned (wrt. alignment and granularity) or too small discards,
we zero-out the initial (and/or) trailing unaligned partial chunks,
but discard all the aligned full chunks.

At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".

Arguably it should behave this way internally, by default,
and we'll try to make that happen.

But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.

Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.

Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.

We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_int.h | 2 +-
drivers/block/drbd/drbd_nl.c | 15 ++--
drivers/block/drbd/drbd_receiver.c | 140 ++++++++++++++++++++++++++++++-------
include/linux/drbd_genl.h | 1 +
include/linux/drbd_limits.h | 6 ++
5 files changed, 134 insertions(+), 30 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 1a93f4f..8cc2955 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1488,7 +1488,7 @@ enum determine_dev_size {
extern enum determine_dev_size
drbd_determine_dev_size(struct drbd_device *, enum dds_flags, struct resize_parms *) __must_hold(local);
extern void resync_after_online_grow(struct drbd_device *);
-extern void drbd_reconsider_max_bio_size(struct drbd_device *device, struct drbd_backing_dev *bdev);
+extern void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev);
extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
enum drbd_role new_role,
int force);
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index e63c5c4..4a0b184 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1161,13 +1161,17 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
unsigned int max_hw_sectors = max_bio_size >> 9;
unsigned int max_segments = 0;
struct request_queue *b = NULL;
+ struct disk_conf *dc;
+ bool discard_zeroes_if_aligned = true;

if (bdev) {
b = bdev->backing_bdev->bd_disk->queue;

max_hw_sectors = min(queue_max_hw_sectors(b), max_bio_size >> 9);
rcu_read_lock();
- max_segments = rcu_dereference(device->ldev->disk_conf)->max_bio_bvecs;
+ dc = rcu_dereference(device->ldev->disk_conf);
+ max_segments = dc->max_bio_bvecs;
+ discard_zeroes_if_aligned = dc->discard_zeroes_if_aligned;
rcu_read_unlock();

blk_set_stacking_limits(&q->limits);
@@ -1185,7 +1189,7 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi

blk_queue_max_discard_sectors(q, DRBD_MAX_DISCARD_SECTORS);

- if (blk_queue_discard(b) &&
+ if (blk_queue_discard(b) && (b->limits.discard_zeroes_data || discard_zeroes_if_aligned) &&
(connection->cstate < C_CONNECTED || connection->agreed_features & FF_TRIM)) {
/* We don't care, stacking below should fix it for the local device.
* Whether or not it is a suitable granularity on the remote device
@@ -1216,7 +1220,7 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
}
}

-void drbd_reconsider_max_bio_size(struct drbd_device *device, struct drbd_backing_dev *bdev)
+void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev)
{
unsigned int now, new, local, peer;

@@ -1488,6 +1492,9 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
if (write_ordering_changed(old_disk_conf, new_disk_conf))
drbd_bump_write_ordering(device->resource, NULL, WO_BDEV_FLUSH);

+ if (old_disk_conf->discard_zeroes_if_aligned != new_disk_conf->discard_zeroes_if_aligned)
+ drbd_reconsider_queue_parameters(device, device->ldev);
+
drbd_md_sync(device);

if (device->state.conn >= C_CONNECTED) {
@@ -1866,7 +1873,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
device->read_cnt = 0;
device->writ_cnt = 0;

- drbd_reconsider_max_bio_size(device, device->ldev);
+ drbd_reconsider_queue_parameters(device, device->ldev);

/* If I am currently not R_PRIMARY,
* but meta data primary indicator is set,
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index a2e7ba9..078c4d98 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1442,6 +1442,108 @@ void drbd_bump_write_ordering(struct drbd_resource *resource, struct drbd_backin
drbd_info(resource, "Method to ensure write ordering: %s\n", write_ordering_str[resource->write_ordering]);
}

+/*
+ * We *may* ignore the discard-zeroes-data setting, if so configured.
+ *
+ * Assumption is that it "discard_zeroes_data=0" is only because the backend
+ * may ignore partial unaligned discards.
+ *
+ * LVM/DM thin as of at least
+ * LVM version: 2.02.115(2)-RHEL7 (2015-01-28)
+ * Library version: 1.02.93-RHEL7 (2015-01-28)
+ * Driver version: 4.29.0
+ * still behaves this way.
+ *
+ * For unaligned (wrt. alignment and granularity) or too small discards,
+ * we zero-out the initial (and/or) trailing unaligned partial chunks,
+ * but discard all the aligned full chunks.
+ *
+ * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
+ */
+int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, unsigned int nr_sectors, bool discard)
+{
+ struct block_device *bdev = device->ldev->backing_bdev;
+ struct request_queue *q = bdev_get_queue(bdev);
+ sector_t tmp, nr;
+ unsigned int max_discard_sectors, granularity;
+ int alignment;
+ int err = 0;
+
+ if (!discard)
+ goto zero_out;
+
+ /* Zero-sector (unknown) and one-sector granularities are the same. */
+ granularity = max(q->limits.discard_granularity >> 9, 1U);
+ alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
+
+ max_discard_sectors = min(q->limits.max_discard_sectors, (1U << 22));
+ max_discard_sectors -= max_discard_sectors % granularity;
+ if (unlikely(!max_discard_sectors))
+ goto zero_out;
+
+ if (nr_sectors < granularity)
+ goto zero_out;
+
+ tmp = start;
+ if (sector_div(tmp, granularity) != alignment) {
+ if (nr_sectors < 2*granularity)
+ goto zero_out;
+ /* start + gran - (start + gran - align) % gran */
+ tmp = start + granularity - alignment;
+ tmp = start + granularity - sector_div(tmp, granularity);
+
+ nr = tmp - start;
+ err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, 0);
+ nr_sectors -= nr;
+ start = tmp;
+ }
+ while (nr_sectors >= granularity) {
+ nr = min_t(sector_t, nr_sectors, max_discard_sectors);
+ err |= blkdev_issue_discard(bdev, start, nr, GFP_NOIO, 0);
+ nr_sectors -= nr;
+ start += nr;
+ }
+ zero_out:
+ if (nr_sectors) {
+ err |= blkdev_issue_zeroout(bdev, start, nr_sectors, GFP_NOIO, 0);
+ }
+ return err != 0;
+}
+
+static bool can_do_reliable_discards(struct drbd_device *device)
+{
+ struct request_queue *q = bdev_get_queue(device->ldev->backing_bdev);
+ struct disk_conf *dc;
+ bool can_do;
+
+ if (!blk_queue_discard(q))
+ return false;
+
+ if (q->limits.discard_zeroes_data)
+ return true;
+
+ rcu_read_lock();
+ dc = rcu_dereference(device->ldev->disk_conf);
+ can_do = dc->discard_zeroes_if_aligned;
+ rcu_read_unlock();
+ return can_do;
+}
+
+void drbd_issue_peer_discard(struct drbd_device *device, struct drbd_peer_request *peer_req)
+{
+ /* If the backend cannot discard, or does not guarantee
+ * read-back zeroes in discarded ranges, we fall back to
+ * zero-out. Unless configuration specifically requested
+ * otherwise. */
+ if (!can_do_reliable_discards(device))
+ peer_req->flags |= EE_IS_TRIM_USE_ZEROOUT;
+
+ if (drbd_issue_discard_or_zero_out(device, peer_req->i.sector,
+ peer_req->i.size >> 9, !(peer_req->flags & EE_IS_TRIM_USE_ZEROOUT)))
+ peer_req->flags |= EE_WAS_ERROR;
+ drbd_endio_write_sec_final(peer_req);
+}
+
/**
* drbd_submit_peer_request()
* @device: DRBD device.
@@ -1472,7 +1574,13 @@ int drbd_submit_peer_request(struct drbd_device *device,
unsigned nr_pages = (data_size + PAGE_SIZE -1) >> PAGE_SHIFT;
int err = -ENOMEM;

- if (peer_req->flags & EE_IS_TRIM_USE_ZEROOUT) {
+ /* TRIM/DISCARD: for now, always use the helper function
+ * blkdev_issue_zeroout(..., discard=true).
+ * It's synchronous, but it does the right thing wrt. bio splitting.
+ * Correctness first, performance later. Next step is to code an
+ * asynchronous variant of the same.
+ */
+ if (peer_req->flags & EE_IS_TRIM) {
/* wait for all pending IO completions, before we start
* zeroing things out. */
conn_wait_active_ee_empty(peer_req->peer_device->connection);
@@ -1489,19 +1597,10 @@ int drbd_submit_peer_request(struct drbd_device *device,
spin_unlock_irq(&device->resource->req_lock);
}

- if (blkdev_issue_zeroout(device->ldev->backing_bdev,
- sector, data_size >> 9, GFP_NOIO, false))
- peer_req->flags |= EE_WAS_ERROR;
- drbd_endio_write_sec_final(peer_req);
+ drbd_issue_peer_discard(device, peer_req);
return 0;
}

- /* Discards don't have any payload.
- * But the scsi layer still expects a bio_vec it can use internally,
- * see sd_setup_discard_cmnd() and blk_add_request_payload(). */
- if (peer_req->flags & EE_IS_TRIM)
- nr_pages = 1;
-
/* In most cases, we will only need one bio. But in case the lower
* level restrictions happen to be different at this offset on this
* side than those of the sending peer, we may need to submit the
@@ -1527,11 +1626,6 @@ next_bio:
bios = bio;
++n_bios;

- if (rw & REQ_DISCARD) {
- bio->bi_iter.bi_size = data_size;
- goto submit;
- }
-
page_chain_for_each(page) {
unsigned len = min_t(unsigned, data_size, PAGE_SIZE);
if (!bio_add_page(bio, page, len, 0)) {
@@ -1553,7 +1647,6 @@ next_bio:
--nr_pages;
}
D_ASSERT(device, data_size == 0);
-submit:
D_ASSERT(device, page == NULL);

atomic_set(&peer_req->pending_bios, n_bios);
@@ -2413,10 +2506,7 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
dp_flags = be32_to_cpu(p->dp_flags);
rw |= wire_flags_to_bio(dp_flags);
if (pi->cmd == P_TRIM) {
- struct request_queue *q = bdev_get_queue(device->ldev->backing_bdev);
peer_req->flags |= EE_IS_TRIM;
- if (!blk_queue_discard(q))
- peer_req->flags |= EE_IS_TRIM_USE_ZEROOUT;
D_ASSERT(peer_device, peer_req->i.size > 0);
D_ASSERT(peer_device, rw & REQ_DISCARD);
D_ASSERT(peer_device, peer_req->pages == NULL);
@@ -2487,7 +2577,7 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
* and we wait for all pending requests, respectively wait for
* active_ee to become empty in drbd_submit_peer_request();
* better not add ourselves here. */
- if ((peer_req->flags & EE_IS_TRIM_USE_ZEROOUT) == 0)
+ if ((peer_req->flags & EE_IS_TRIM) == 0)
list_add_tail(&peer_req->w.list, &device->active_ee);
spin_unlock_irq(&device->resource->req_lock);

@@ -3903,14 +3993,14 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info
}

device->peer_max_bio_size = be32_to_cpu(p->max_bio_size);
- /* Leave drbd_reconsider_max_bio_size() before drbd_determine_dev_size().
+ /* Leave drbd_reconsider_queue_parameters() before drbd_determine_dev_size().
In case we cleared the QUEUE_FLAG_DISCARD from our queue in
- drbd_reconsider_max_bio_size(), we can be sure that after
+ drbd_reconsider_queue_parameters(), we can be sure that after
drbd_determine_dev_size() no REQ_DISCARDs are in the queue. */

ddsf = be16_to_cpu(p->dds_flags);
if (get_ldev(device)) {
- drbd_reconsider_max_bio_size(device, device->ldev);
+ drbd_reconsider_queue_parameters(device, device->ldev);
dd = drbd_determine_dev_size(device, ddsf, NULL);
put_ldev(device);
if (dd == DS_ERROR)
@@ -3930,7 +4020,7 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info
* However, if he sends a zero current size,
* take his (user-capped or) backing disk size anyways.
*/
- drbd_reconsider_max_bio_size(device, NULL);
+ drbd_reconsider_queue_parameters(device, NULL);
drbd_set_my_capacity(device, p_csize ?: p_usize ?: p_size);
}

diff --git a/include/linux/drbd_genl.h b/include/linux/drbd_genl.h
index ab649d8..c934d3a 100644
--- a/include/linux/drbd_genl.h
+++ b/include/linux/drbd_genl.h
@@ -132,6 +132,7 @@ GENL_struct(DRBD_NLA_DISK_CONF, 3, disk_conf,
__flg_field_def(18, DRBD_GENLA_F_MANDATORY, disk_drain, DRBD_DISK_DRAIN_DEF)
__flg_field_def(19, DRBD_GENLA_F_MANDATORY, md_flushes, DRBD_MD_FLUSHES_DEF)
__flg_field_def(23, 0 /* OPTIONAL */, al_updates, DRBD_AL_UPDATES_DEF)
+ __flg_field_def(24, 0 /* OPTIONAL */, discard_zeroes_if_aligned, DRBD_DISCARD_ZEROES_IF_ALIGNED)
)

GENL_struct(DRBD_NLA_RESOURCE_OPTS, 4, res_opts,
diff --git a/include/linux/drbd_limits.h b/include/linux/drbd_limits.h
index 2e4aad8..a351c40 100644
--- a/include/linux/drbd_limits.h
+++ b/include/linux/drbd_limits.h
@@ -210,6 +210,12 @@
#define DRBD_MD_FLUSHES_DEF 1
#define DRBD_TCP_CORK_DEF 1
#define DRBD_AL_UPDATES_DEF 1
+/* We used to ignore the discard_zeroes_data setting.
+ * To not change established (and expected) behaviour,
+ * by default assume that, for discard_zeroes_data=0,
+ * we can make that an effective discard_zeroes_data=1,
+ * if we only explicitly zero-out unaligned partial chunks. */
+#define DRBD_DISCARD_ZEROES_IF_ALIGNED 1

#define DRBD_ALLOW_TWO_PRIMARIES_DEF 0
#define DRBD_ALWAYS_ASBP_DEF 0
--
1.9.1

2016-04-25 12:22:43

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 22/30] drbd: introduce WRITE_SAME support

From: Lars Ellenberg <[email protected]>

We will support WRITE_SAME, if
* all peers support WRITE_SAME (both in kernel and DRBD version),
* all peer devices support WRITE_SAME
* logical_block_size is identical on all peers.

We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 9 ++-
drivers/block/drbd/drbd_debugfs.c | 11 +--
drivers/block/drbd/drbd_int.h | 13 ++--
drivers/block/drbd/drbd_main.c | 82 +++++++++++++++++++---
drivers/block/drbd/drbd_nl.c | 88 +++++++++++++++++++++---
drivers/block/drbd/drbd_protocol.h | 74 ++++++++++++++++++--
drivers/block/drbd/drbd_receiver.c | 137 +++++++++++++++++++++++++++----------
drivers/block/drbd/drbd_req.c | 13 ++--
drivers/block/drbd/drbd_req.h | 5 +-
drivers/block/drbd/drbd_worker.c | 8 ++-
10 files changed, 360 insertions(+), 80 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 4e07cff..99a2b92 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -838,6 +838,13 @@ static int update_sync_bits(struct drbd_device *device,
return count;
}

+static bool plausible_request_size(int size)
+{
+ return size > 0
+ && size <= DRBD_MAX_BATCH_BIO_SIZE
+ && IS_ALIGNED(size, 512);
+}
+
/* clear the bit corresponding to the piece of storage in question:
* size byte of data starting from sector. Only clear a bits of the affected
* one ore more _aligned_ BM_BLOCK_SIZE blocks.
@@ -857,7 +864,7 @@ int __drbd_change_sync(struct drbd_device *device, sector_t sector, int size,
if ((mode == SET_OUT_OF_SYNC) && size == 0)
return 0;

- if (size <= 0 || !IS_ALIGNED(size, 512) || size > DRBD_MAX_DISCARD_SIZE) {
+ if (!plausible_request_size(size)) {
drbd_err(device, "%s: sector=%llus size=%d nonsense!\n",
drbd_change_sync_fname[mode],
(unsigned long long)sector, size);
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index 4de95bb..8a90812 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -237,14 +237,9 @@ static void seq_print_peer_request_flags(struct seq_file *m, struct drbd_peer_re
seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");

- if (f & EE_IS_TRIM) {
- seq_putc(m, sep);
- sep = '|';
- if (f & EE_IS_TRIM_USE_ZEROOUT)
- seq_puts(m, "zero-out");
- else
- seq_puts(m, "trim");
- }
+ if (f & EE_IS_TRIM)
+ __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, &sep, "zero-out", "trim");
+ seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
seq_putc(m, '\n');
}

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index cb42f6c..cb47809 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -468,6 +468,9 @@ enum {
/* this is/was a write request */
__EE_WRITE,

+ /* this is/was a write same request */
+ __EE_WRITE_SAME,
+
/* this originates from application on peer
* (not some resync or verify or other DRBD internal request) */
__EE_APPLICATION,
@@ -487,6 +490,7 @@ enum {
#define EE_IN_INTERVAL_TREE (1<<__EE_IN_INTERVAL_TREE)
#define EE_SUBMITTED (1<<__EE_SUBMITTED)
#define EE_WRITE (1<<__EE_WRITE)
+#define EE_WRITE_SAME (1<<__EE_WRITE_SAME)
#define EE_APPLICATION (1<<__EE_APPLICATION)
#define EE_RS_THIN_REQ (1<<__EE_RS_THIN_REQ)

@@ -1350,8 +1354,8 @@ struct bm_extent {
/* For now, don't allow more than half of what we can "activate" in one
* activity log transaction to be discarded in one go. We may need to rework
* drbd_al_begin_io() to allow for even larger discard ranges */
-#define DRBD_MAX_DISCARD_SIZE (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
-#define DRBD_MAX_DISCARD_SECTORS (DRBD_MAX_DISCARD_SIZE >> 9)
+#define DRBD_MAX_BATCH_BIO_SIZE (AL_UPDATES_PER_TRANSACTION/2*AL_EXTENT_SIZE)
+#define DRBD_MAX_BBIO_SECTORS (DRBD_MAX_BATCH_BIO_SIZE >> 9)

extern int drbd_bm_init(struct drbd_device *device);
extern int drbd_bm_resize(struct drbd_device *device, sector_t sectors, int set_new_bits);
@@ -1488,7 +1492,8 @@ enum determine_dev_size {
extern enum determine_dev_size
drbd_determine_dev_size(struct drbd_device *, enum dds_flags, struct resize_parms *) __must_hold(local);
extern void resync_after_online_grow(struct drbd_device *);
-extern void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev);
+extern void drbd_reconsider_queue_parameters(struct drbd_device *device,
+ struct drbd_backing_dev *bdev, struct o_qlim *o);
extern enum drbd_state_rv drbd_set_role(struct drbd_device *device,
enum drbd_role new_role,
int force);
@@ -1569,7 +1574,7 @@ extern int drbd_submit_peer_request(struct drbd_device *,
extern int drbd_free_peer_reqs(struct drbd_device *, struct list_head *);
extern struct drbd_peer_request *drbd_alloc_peer_req(struct drbd_peer_device *, u64,
sector_t, unsigned int,
- bool,
+ unsigned int,
gfp_t) __must_hold(local);
extern void __drbd_free_peer_req(struct drbd_device *, struct drbd_peer_request *,
int);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 3127bba..e3aff25 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -920,6 +920,31 @@ void drbd_gen_and_send_sync_uuid(struct drbd_peer_device *peer_device)
}
}

+/* communicated if (agreed_features & DRBD_FF_WSAME) */
+void assign_p_sizes_qlim(struct drbd_device *device, struct p_sizes *p, struct request_queue *q)
+{
+ if (q) {
+ p->qlim->physical_block_size = cpu_to_be32(queue_physical_block_size(q));
+ p->qlim->logical_block_size = cpu_to_be32(queue_logical_block_size(q));
+ p->qlim->alignment_offset = cpu_to_be32(queue_alignment_offset(q));
+ p->qlim->io_min = cpu_to_be32(queue_io_min(q));
+ p->qlim->io_opt = cpu_to_be32(queue_io_opt(q));
+ p->qlim->discard_enabled = blk_queue_discard(q);
+ p->qlim->discard_zeroes_data = queue_discard_zeroes_data(q);
+ p->qlim->write_same_capable = !!q->limits.max_write_same_sectors;
+ } else {
+ q = device->rq_queue;
+ p->qlim->physical_block_size = cpu_to_be32(queue_physical_block_size(q));
+ p->qlim->logical_block_size = cpu_to_be32(queue_logical_block_size(q));
+ p->qlim->alignment_offset = 0;
+ p->qlim->io_min = cpu_to_be32(queue_io_min(q));
+ p->qlim->io_opt = cpu_to_be32(queue_io_opt(q));
+ p->qlim->discard_enabled = 0;
+ p->qlim->discard_zeroes_data = 0;
+ p->qlim->write_same_capable = 0;
+ }
+}
+
int drbd_send_sizes(struct drbd_peer_device *peer_device, int trigger_reply, enum dds_flags flags)
{
struct drbd_device *device = peer_device->device;
@@ -928,29 +953,37 @@ int drbd_send_sizes(struct drbd_peer_device *peer_device, int trigger_reply, enu
sector_t d_size, u_size;
int q_order_type;
unsigned int max_bio_size;
+ unsigned int packet_size;
+
+ sock = &peer_device->connection->data;
+ p = drbd_prepare_command(peer_device, sock);
+ if (!p)
+ return -EIO;

+ packet_size = sizeof(*p);
+ if (peer_device->connection->agreed_features & DRBD_FF_WSAME)
+ packet_size += sizeof(p->qlim[0]);
+
+ memset(p, 0, packet_size);
if (get_ldev_if_state(device, D_NEGOTIATING)) {
- D_ASSERT(device, device->ldev->backing_bdev);
+ struct request_queue *q = bdev_get_queue(device->ldev->backing_bdev);
d_size = drbd_get_max_capacity(device->ldev);
rcu_read_lock();
u_size = rcu_dereference(device->ldev->disk_conf)->disk_size;
rcu_read_unlock();
q_order_type = drbd_queue_order_type(device);
- max_bio_size = queue_max_hw_sectors(device->ldev->backing_bdev->bd_disk->queue) << 9;
+ max_bio_size = queue_max_hw_sectors(q) << 9;
max_bio_size = min(max_bio_size, DRBD_MAX_BIO_SIZE);
+ assign_p_sizes_qlim(device, p, q);
put_ldev(device);
} else {
d_size = 0;
u_size = 0;
q_order_type = QUEUE_ORDERED_NONE;
max_bio_size = DRBD_MAX_BIO_SIZE; /* ... multiple BIOs per peer_request */
+ assign_p_sizes_qlim(device, p, NULL);
}

- sock = &peer_device->connection->data;
- p = drbd_prepare_command(peer_device, sock);
- if (!p)
- return -EIO;
-
if (peer_device->connection->agreed_pro_version <= 94)
max_bio_size = min(max_bio_size, DRBD_MAX_SIZE_H80_PACKET);
else if (peer_device->connection->agreed_pro_version < 100)
@@ -962,7 +995,8 @@ int drbd_send_sizes(struct drbd_peer_device *peer_device, int trigger_reply, enu
p->max_bio_size = cpu_to_be32(max_bio_size);
p->queue_order_type = cpu_to_be16(q_order_type);
p->dds_flags = cpu_to_be16(flags);
- return drbd_send_command(peer_device, sock, P_SIZES, sizeof(*p), NULL, 0);
+
+ return drbd_send_command(peer_device, sock, P_SIZES, packet_size, NULL, 0);
}

/**
@@ -1577,6 +1611,9 @@ static int _drbd_send_bio(struct drbd_peer_device *peer_device, struct bio *bio)
? 0 : MSG_MORE);
if (err)
return err;
+ /* REQ_WRITE_SAME has only one segment */
+ if (bio->bi_rw & REQ_WRITE_SAME)
+ break;
}
return 0;
}
@@ -1595,6 +1632,9 @@ static int _drbd_send_zc_bio(struct drbd_peer_device *peer_device, struct bio *b
bio_iter_last(bvec, iter) ? 0 : MSG_MORE);
if (err)
return err;
+ /* REQ_WRITE_SAME has only one segment */
+ if (bio->bi_rw & REQ_WRITE_SAME)
+ break;
}
return 0;
}
@@ -1625,6 +1665,7 @@ static u32 bio_flags_to_wire(struct drbd_connection *connection, unsigned long b
return (bi_rw & REQ_SYNC ? DP_RW_SYNC : 0) |
(bi_rw & REQ_FUA ? DP_FUA : 0) |
(bi_rw & REQ_FLUSH ? DP_FLUSH : 0) |
+ (bi_rw & REQ_WRITE_SAME ? DP_WSAME : 0) |
(bi_rw & REQ_DISCARD ? DP_DISCARD : 0);
else
return bi_rw & REQ_SYNC ? DP_RW_SYNC : 0;
@@ -1638,6 +1679,8 @@ int drbd_send_dblock(struct drbd_peer_device *peer_device, struct drbd_request *
struct drbd_device *device = peer_device->device;
struct drbd_socket *sock;
struct p_data *p;
+ struct p_wsame *wsame = NULL;
+ void *digest_out;
unsigned int dp_flags = 0;
int digest_size;
int err;
@@ -1673,12 +1716,29 @@ int drbd_send_dblock(struct drbd_peer_device *peer_device, struct drbd_request *
err = __send_command(peer_device->connection, device->vnr, sock, P_TRIM, sizeof(*t), NULL, 0);
goto out;
}
+ if (dp_flags & DP_WSAME) {
+ /* this will only work if DRBD_FF_WSAME is set AND the
+ * handshake agreed that all nodes and backend devices are
+ * WRITE_SAME capable and agree on logical_block_size */
+ wsame = (struct p_wsame*)p;
+ digest_out = wsame + 1;
+ wsame->size = cpu_to_be32(req->i.size);
+ } else
+ digest_out = p + 1;

/* our digest is still only over the payload.
* TRIM does not carry any payload. */
if (digest_size)
- drbd_csum_bio(peer_device->connection->integrity_tfm, req->master_bio, p + 1);
- err = __send_command(peer_device->connection, device->vnr, sock, P_DATA, sizeof(*p) + digest_size, NULL, req->i.size);
+ drbd_csum_bio(peer_device->connection->integrity_tfm, req->master_bio, digest_out);
+ if (wsame) {
+ err =
+ __send_command(peer_device->connection, device->vnr, sock, P_WSAME,
+ sizeof(*wsame) + digest_size, NULL,
+ bio_iovec(req->master_bio).bv_len);
+ } else
+ err =
+ __send_command(peer_device->connection, device->vnr, sock, P_DATA,
+ sizeof(*p) + digest_size, NULL, req->i.size);
if (!err) {
/* For protocol A, we have to memcpy the payload into
* socket buffers, as we may complete right away
@@ -3658,6 +3718,8 @@ const char *cmdname(enum drbd_packet cmd)
* one PRO_VERSION */
static const char *cmdnames[] = {
[P_DATA] = "Data",
+ [P_WSAME] = "WriteSame",
+ [P_TRIM] = "Trim",
[P_DATA_REPLY] = "DataReply",
[P_RS_DATA_REPLY] = "RSDataReply",
[P_BARRIER] = "Barrier",
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 48fa6c3..3704fad 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1174,6 +1174,17 @@ static void blk_queue_discard_granularity(struct request_queue *q, unsigned int
{
q->limits.discard_granularity = granularity;
}
+
+static unsigned int drbd_max_discard_sectors(struct drbd_connection *connection)
+{
+ /* when we introduced REQ_WRITE_SAME support, we also bumped
+ * our maximum supported batch bio size used for discards. */
+ if (connection->agreed_features & DRBD_FF_WSAME)
+ return DRBD_MAX_BBIO_SECTORS;
+ /* before, with DRBD <= 8.4.6, we only allowed up to one AL_EXTENT_SIZE. */
+ return AL_EXTENT_SIZE >> 9;
+}
+
static void decide_on_discard_support(struct drbd_device *device,
struct request_queue *q,
struct request_queue *b,
@@ -1190,7 +1201,7 @@ static void decide_on_discard_support(struct drbd_device *device,
can_do = false;
drbd_info(device, "discard_zeroes_data=0 and discard_zeroes_if_aligned=no: disabling discards\n");
}
- if (can_do && connection->cstate >= C_CONNECTED && !(connection->agreed_features & FF_TRIM)) {
+ if (can_do && connection->cstate >= C_CONNECTED && !(connection->agreed_features & DRBD_FF_TRIM)) {
can_do = false;
drbd_info(connection, "peer DRBD too old, does not support TRIM: disabling discards\n");
}
@@ -1202,7 +1213,7 @@ static void decide_on_discard_support(struct drbd_device *device,
* you care, you need to use devices with similar
* topology on all peers. */
blk_queue_discard_granularity(q, 512);
- q->limits.max_discard_sectors = DRBD_MAX_DISCARD_SECTORS;
+ q->limits.max_discard_sectors = drbd_max_discard_sectors(connection);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
} else {
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
@@ -1223,8 +1234,67 @@ static void fixup_discard_if_not_supported(struct request_queue *q)
}
}

+static void decide_on_write_same_support(struct drbd_device *device,
+ struct request_queue *q,
+ struct request_queue *b, struct o_qlim *o)
+{
+ struct drbd_peer_device *peer_device = first_peer_device(device);
+ struct drbd_connection *connection = peer_device->connection;
+ bool can_do = b ? b->limits.max_write_same_sectors : true;
+
+ if (can_do && connection->cstate >= C_CONNECTED && !(connection->agreed_features & DRBD_FF_WSAME)) {
+ can_do = false;
+ drbd_info(peer_device, "peer does not support WRITE_SAME\n");
+ }
+
+ if (o) {
+ /* logical block size; queue_logical_block_size(NULL) is 512 */
+ unsigned int peer_lbs = be32_to_cpu(o->logical_block_size);
+ unsigned int me_lbs_b = queue_logical_block_size(b);
+ unsigned int me_lbs = queue_logical_block_size(q);
+
+ if (me_lbs_b != me_lbs) {
+ drbd_warn(device,
+ "logical block size of local backend does not match (drbd:%u, backend:%u); was this a late attach?\n",
+ me_lbs, me_lbs_b);
+ /* rather disable write same than trigger some BUG_ON later in the scsi layer. */
+ can_do = false;
+ }
+ if (me_lbs_b != peer_lbs) {
+ drbd_warn(peer_device, "logical block sizes do not match (me:%u, peer:%u); this may cause problems.\n",
+ me_lbs, peer_lbs);
+ if (can_do) {
+ drbd_dbg(peer_device, "logical block size mismatch: WRITE_SAME disabled.\n");
+ can_do = false;
+ }
+ me_lbs = max(me_lbs, me_lbs_b);
+ /* We cannot change the logical block size of an in-use queue.
+ * We can only hope that access happens to be properly aligned.
+ * If not, the peer will likely produce an IO error, and detach. */
+ if (peer_lbs > me_lbs) {
+ if (device->state.role != R_PRIMARY) {
+ blk_queue_logical_block_size(q, peer_lbs);
+ drbd_warn(peer_device, "logical block size set to %u\n", peer_lbs);
+ } else {
+ drbd_warn(peer_device,
+ "current Primary must NOT adjust logical block size (%u -> %u); hope for the best.\n",
+ me_lbs, peer_lbs);
+ }
+ }
+ }
+ if (can_do && !o->write_same_capable) {
+ /* If we introduce an open-coded write-same loop on the receiving side,
+ * the peer would present itself as "capable". */
+ drbd_dbg(peer_device, "WRITE_SAME disabled (peer device not capable)\n");
+ can_do = false;
+ }
+ }
+
+ blk_queue_max_write_same_sectors(q, can_do ? DRBD_MAX_BBIO_SECTORS : 0);
+}
+
static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backing_dev *bdev,
- unsigned int max_bio_size)
+ unsigned int max_bio_size, struct o_qlim *o)
{
struct request_queue * const q = device->rq_queue;
unsigned int max_hw_sectors = max_bio_size >> 9;
@@ -1244,15 +1314,15 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
rcu_read_unlock();

blk_set_stacking_limits(&q->limits);
- blk_queue_max_write_same_sectors(q, 0);
}

- blk_queue_logical_block_size(q, 512);
blk_queue_max_hw_sectors(q, max_hw_sectors);
/* This is the workaround for "bio would need to, but cannot, be split" */
blk_queue_max_segments(q, max_segments ? max_segments : BLK_MAX_SEGMENTS);
blk_queue_segment_boundary(q, PAGE_SIZE-1);
decide_on_discard_support(device, q, b, discard_zeroes_if_aligned);
+ decide_on_write_same_support(device, q, b, o);
+
if (b) {
blk_queue_stack_limits(q, b);

@@ -1266,7 +1336,7 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
fixup_discard_if_not_supported(q);
}

-void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev)
+void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_backing_dev *bdev, struct o_qlim *o)
{
unsigned int now, new, local, peer;

@@ -1309,7 +1379,7 @@ void drbd_reconsider_queue_parameters(struct drbd_device *device, struct drbd_ba
if (new != now)
drbd_info(device, "max BIO size = %u\n", new);

- drbd_setup_queue_param(device, bdev, new);
+ drbd_setup_queue_param(device, bdev, new, o);
}

/* Starts the worker thread */
@@ -1542,7 +1612,7 @@ int drbd_adm_disk_opts(struct sk_buff *skb, struct genl_info *info)
drbd_bump_write_ordering(device->resource, NULL, WO_BDEV_FLUSH);

if (old_disk_conf->discard_zeroes_if_aligned != new_disk_conf->discard_zeroes_if_aligned)
- drbd_reconsider_queue_parameters(device, device->ldev);
+ drbd_reconsider_queue_parameters(device, device->ldev, NULL);

drbd_md_sync(device);

@@ -1922,7 +1992,7 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
device->read_cnt = 0;
device->writ_cnt = 0;

- drbd_reconsider_queue_parameters(device, device->ldev);
+ drbd_reconsider_queue_parameters(device, device->ldev, NULL);

/* If I am currently not R_PRIMARY,
* but meta data primary indicator is set,
diff --git a/drivers/block/drbd/drbd_protocol.h b/drivers/block/drbd/drbd_protocol.h
index 7acc2e0..d695e44 100644
--- a/drivers/block/drbd/drbd_protocol.h
+++ b/drivers/block/drbd/drbd_protocol.h
@@ -64,6 +64,11 @@ enum drbd_packet {
P_RS_THIN_REQ = 0x32, /* Request a block for resync or reply P_RS_DEALLOCATED */
P_RS_DEALLOCATED = 0x33, /* Contains only zeros on sync source node */

+ /* REQ_WRITE_SAME.
+ * On a receiving side without REQ_WRITE_SAME,
+ * we may fall back to an opencoded loop instead. */
+ P_WSAME = 0x34,
+
P_MAY_IGNORE = 0x100, /* Flag to test if (cmd > P_MAY_IGNORE) ... */
P_MAX_OPT_CMD = 0x101,

@@ -110,8 +115,11 @@ struct p_header100 {
u32 pad;
} __packed;

-/* these defines must not be changed without changing the protocol version */
-#define DP_HARDBARRIER 1 /* depricated */
+/* These defines must not be changed without changing the protocol version.
+ * New defines may only be introduced together with protocol version bump or
+ * new protocol feature flags.
+ */
+#define DP_HARDBARRIER 1 /* no longer used */
#define DP_RW_SYNC 2 /* equals REQ_SYNC */
#define DP_MAY_SET_IN_SYNC 4
#define DP_UNPLUG 8 /* not used anymore */
@@ -120,6 +128,7 @@ struct p_header100 {
#define DP_DISCARD 64 /* equals REQ_DISCARD */
#define DP_SEND_RECEIVE_ACK 128 /* This is a proto B write request */
#define DP_SEND_WRITE_ACK 256 /* This is a proto C write request */
+#define DP_WSAME 512 /* equiv. REQ_WRITE_SAME */

struct p_data {
u64 sector; /* 64 bits sector number */
@@ -133,6 +142,11 @@ struct p_trim {
u32 size; /* == bio->bi_size */
} __packed;

+struct p_wsame {
+ struct p_data p_data;
+ u32 size; /* == bio->bi_size */
+} __packed;
+
/*
* commands which share a struct:
* p_block_ack:
@@ -164,8 +178,23 @@ struct p_block_req {
* ReportParams
*/

-#define FF_TRIM 1
-#define FF_THIN_RESYNC 2
+/* supports TRIM/DISCARD on the "wire" protocol */
+#define DRBD_FF_TRIM 1
+
+/* Detect all-zeros during resync, and rather TRIM/UNMAP/DISCARD those blocks
+ * instead of fully allocate a supposedly thin volume on initial resync */
+#define DRBD_FF_THIN_RESYNC 2
+
+/* supports REQ_WRITE_SAME on the "wire" protocol.
+ * Note: this flag is overloaded,
+ * its presence also
+ * - indicates support for 128 MiB "batch bios",
+ * max discard size of 128 MiB
+ * instead of 4M before that.
+ * - indicates that we exchange additional settings in p_sizes
+ * drbd_send_sizes()/receive_sizes()
+ */
+#define DRBD_FF_WSAME 4

struct p_connection_features {
u32 protocol_min;
@@ -240,6 +269,40 @@ struct p_rs_uuid {
u64 uuid;
} __packed;

+/* optional queue_limits if (agreed_features & DRBD_FF_WSAME)
+ * see also struct queue_limits, as of late 2015 */
+struct o_qlim {
+ /* we don't need it yet, but we may as well communicate it now */
+ u32 physical_block_size;
+
+ /* so the original in struct queue_limits is unsigned short,
+ * but I'd have to put in padding anyways. */
+ u32 logical_block_size;
+
+ /* One incoming bio becomes one DRBD request,
+ * which may be translated to several bio on the receiving side.
+ * We don't need to communicate chunk/boundary/segment ... limits.
+ */
+
+ /* various IO hints may be useful with "diskless client" setups */
+ u32 alignment_offset;
+ u32 io_min;
+ u32 io_opt;
+
+ /* We may need to communicate integrity stuff at some point,
+ * but let's not get ahead of ourselves. */
+
+ /* Backend discard capabilities.
+ * Receiving side uses "blkdev_issue_discard()", no need to communicate
+ * more specifics. If the backend cannot do discards, the DRBD peer
+ * may fall back to blkdev_issue_zeroout().
+ */
+ u8 discard_enabled;
+ u8 discard_zeroes_data;
+ u8 write_same_capable;
+ u8 _pad;
+} __packed;
+
struct p_sizes {
u64 d_size; /* size of disk */
u64 u_size; /* user requested size */
@@ -247,6 +310,9 @@ struct p_sizes {
u32 max_bio_size; /* Maximal size of a BIO */
u16 queue_order_type; /* not yet implemented in DRBD*/
u16 dds_flags; /* use enum dds_flags here. */
+
+ /* optional queue_limits if (agreed_features & DRBD_FF_WSAME) */
+ struct o_qlim qlim[0];
} __packed;

struct p_state {
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 99f4519..1320bb8 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -48,7 +48,7 @@
#include "drbd_req.h"
#include "drbd_vli.h"

-#define PRO_FEATURES (FF_TRIM | FF_THIN_RESYNC)
+#define PRO_FEATURES (DRBD_FF_TRIM|DRBD_FF_THIN_RESYNC|DRBD_FF_WSAME)

struct packet_info {
enum drbd_packet cmd;
@@ -361,14 +361,17 @@ You must not have the req_lock:
drbd_wait_ee_list_empty()
*/

+/* normal: payload_size == request size (bi_size)
+ * w_same: payload_size == logical_block_size
+ * trim: payload_size == 0 */
struct drbd_peer_request *
drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t sector,
- unsigned int data_size, bool has_payload, gfp_t gfp_mask) __must_hold(local)
+ unsigned int request_size, unsigned int payload_size, gfp_t gfp_mask) __must_hold(local)
{
struct drbd_device *device = peer_device->device;
struct drbd_peer_request *peer_req;
struct page *page = NULL;
- unsigned nr_pages = (data_size + PAGE_SIZE -1) >> PAGE_SHIFT;
+ unsigned nr_pages = (payload_size + PAGE_SIZE -1) >> PAGE_SHIFT;

if (drbd_insert_fault(device, DRBD_FAULT_AL_EE))
return NULL;
@@ -380,7 +383,7 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
return NULL;
}

- if (has_payload && data_size) {
+ if (nr_pages) {
page = drbd_alloc_pages(peer_device, nr_pages,
gfpflags_allow_blocking(gfp_mask));
if (!page)
@@ -390,7 +393,7 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
memset(peer_req, 0, sizeof(*peer_req));
INIT_LIST_HEAD(&peer_req->w.list);
drbd_clear_interval(&peer_req->i);
- peer_req->i.size = data_size;
+ peer_req->i.size = request_size;
peer_req->i.sector = sector;
peer_req->submit_jif = jiffies;
peer_req->peer_device = peer_device;
@@ -1529,7 +1532,7 @@ static bool can_do_reliable_discards(struct drbd_device *device)
return can_do;
}

-void drbd_issue_peer_discard(struct drbd_device *device, struct drbd_peer_request *peer_req)
+static void drbd_issue_peer_discard(struct drbd_device *device, struct drbd_peer_request *peer_req)
{
/* If the backend cannot discard, or does not guarantee
* read-back zeroes in discarded ranges, we fall back to
@@ -1544,6 +1547,18 @@ void drbd_issue_peer_discard(struct drbd_device *device, struct drbd_peer_reques
drbd_endio_write_sec_final(peer_req);
}

+static void drbd_issue_peer_wsame(struct drbd_device *device,
+ struct drbd_peer_request *peer_req)
+{
+ struct block_device *bdev = device->ldev->backing_bdev;
+ sector_t s = peer_req->i.sector;
+ sector_t nr = peer_req->i.size >> 9;
+ if (blkdev_issue_write_same(bdev, s, nr, GFP_NOIO, peer_req->pages))
+ peer_req->flags |= EE_WAS_ERROR;
+ drbd_endio_write_sec_final(peer_req);
+}
+
+
/**
* drbd_submit_peer_request()
* @device: DRBD device.
@@ -1580,7 +1595,7 @@ int drbd_submit_peer_request(struct drbd_device *device,
* Correctness first, performance later. Next step is to code an
* asynchronous variant of the same.
*/
- if (peer_req->flags & EE_IS_TRIM) {
+ if (peer_req->flags & (EE_IS_TRIM|EE_WRITE_SAME)) {
/* wait for all pending IO completions, before we start
* zeroing things out. */
conn_wait_active_ee_empty(peer_req->peer_device->connection);
@@ -1597,7 +1612,10 @@ int drbd_submit_peer_request(struct drbd_device *device,
spin_unlock_irq(&device->resource->req_lock);
}

- drbd_issue_peer_discard(device, peer_req);
+ if (peer_req->flags & EE_IS_TRIM)
+ drbd_issue_peer_discard(device, peer_req);
+ else /* EE_WRITE_SAME */
+ drbd_issue_peer_wsame(device, peer_req);
return 0;
}

@@ -1770,8 +1788,26 @@ static int receive_Barrier(struct drbd_connection *connection, struct packet_inf
return 0;
}

+/* quick wrapper in case payload size != request_size (write same) */
+static void drbd_csum_ee_size(struct crypto_ahash *h,
+ struct drbd_peer_request *r, void *d,
+ unsigned int payload_size)
+{
+ unsigned int tmp = r->i.size;
+ r->i.size = payload_size;
+ drbd_csum_ee(h, r, d);
+ r->i.size = tmp;
+}
+
/* used from receive_RSDataReply (recv_resync_read)
- * and from receive_Data */
+ * and from receive_Data.
+ * data_size: actual payload ("data in")
+ * for normal writes that is bi_size.
+ * for discards, that is zero.
+ * for write same, it is logical_block_size.
+ * both trim and write same have the bi_size ("data len to be affected")
+ * as extra argument in the packet header.
+ */
static struct drbd_peer_request *
read_in_block(struct drbd_peer_device *peer_device, u64 id, sector_t sector,
struct packet_info *pi) __must_hold(local)
@@ -1786,6 +1822,7 @@ read_in_block(struct drbd_peer_device *peer_device, u64 id, sector_t sector,
void *dig_vv = peer_device->connection->int_dig_vv;
unsigned long *data;
struct p_trim *trim = (pi->cmd == P_TRIM) ? pi->data : NULL;
+ struct p_trim *wsame = (pi->cmd == P_WSAME) ? pi->data : NULL;

digest_size = 0;
if (!trim && peer_device->connection->peer_integrity_tfm) {
@@ -1800,38 +1837,60 @@ read_in_block(struct drbd_peer_device *peer_device, u64 id, sector_t sector,
data_size -= digest_size;
}

+ /* assume request_size == data_size, but special case trim and wsame. */
+ ds = data_size;
if (trim) {
- D_ASSERT(peer_device, data_size == 0);
- data_size = be32_to_cpu(trim->size);
+ if (!expect(data_size == 0))
+ return NULL;
+ ds = be32_to_cpu(trim->size);
+ } else if (wsame) {
+ if (data_size != queue_logical_block_size(device->rq_queue)) {
+ drbd_err(peer_device, "data size (%u) != drbd logical block size (%u)\n",
+ data_size, queue_logical_block_size(device->rq_queue));
+ return NULL;
+ }
+ if (data_size != bdev_logical_block_size(device->ldev->backing_bdev)) {
+ drbd_err(peer_device, "data size (%u) != backend logical block size (%u)\n",
+ data_size, bdev_logical_block_size(device->ldev->backing_bdev));
+ return NULL;
+ }
+ ds = be32_to_cpu(wsame->size);
}

- if (!expect(IS_ALIGNED(data_size, 512)))
+ if (!expect(IS_ALIGNED(ds, 512)))
return NULL;
- /* prepare for larger trim requests. */
- if (!trim && !expect(data_size <= DRBD_MAX_BIO_SIZE))
+ if (trim || wsame) {
+ if (!expect(ds <= (DRBD_MAX_BBIO_SECTORS << 9)))
+ return NULL;
+ } else if (!expect(ds <= DRBD_MAX_BIO_SIZE))
return NULL;

/* even though we trust out peer,
* we sometimes have to double check. */
- if (sector + (data_size>>9) > capacity) {
+ if (sector + (ds>>9) > capacity) {
drbd_err(device, "request from peer beyond end of local disk: "
"capacity: %llus < sector: %llus + size: %u\n",
(unsigned long long)capacity,
- (unsigned long long)sector, data_size);
+ (unsigned long long)sector, ds);
return NULL;
}

/* GFP_NOIO, because we must not cause arbitrary write-out: in a DRBD
* "criss-cross" setup, that might cause write-out on some other DRBD,
* which in turn might block on the other node at this very place. */
- peer_req = drbd_alloc_peer_req(peer_device, id, sector, data_size, trim == NULL, GFP_NOIO);
+ peer_req = drbd_alloc_peer_req(peer_device, id, sector, ds, data_size, GFP_NOIO);
if (!peer_req)
return NULL;

peer_req->flags |= EE_WRITE;
- if (trim)
+ if (trim) {
+ peer_req->flags |= EE_IS_TRIM;
return peer_req;
+ }
+ if (wsame)
+ peer_req->flags |= EE_WRITE_SAME;

+ /* receive payload size bytes into page chain */
ds = data_size;
page = peer_req->pages;
page_chain_for_each(page) {
@@ -1851,7 +1910,7 @@ read_in_block(struct drbd_peer_device *peer_device, u64 id, sector_t sector,
}

if (digest_size) {
- drbd_csum_ee(peer_device->connection->peer_integrity_tfm, peer_req, dig_vv);
+ drbd_csum_ee_size(peer_device->connection->peer_integrity_tfm, peer_req, dig_vv, data_size);
if (memcmp(dig_in, dig_vv, digest_size)) {
drbd_err(device, "Digest integrity check FAILED: %llus +%u\n",
(unsigned long long)sector, data_size);
@@ -2506,7 +2565,6 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
dp_flags = be32_to_cpu(p->dp_flags);
rw |= wire_flags_to_bio(dp_flags);
if (pi->cmd == P_TRIM) {
- peer_req->flags |= EE_IS_TRIM;
D_ASSERT(peer_device, peer_req->i.size > 0);
D_ASSERT(peer_device, rw & REQ_DISCARD);
D_ASSERT(peer_device, peer_req->pages == NULL);
@@ -2573,11 +2631,11 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
update_peer_seq(peer_device, peer_seq);
spin_lock_irq(&device->resource->req_lock);
}
- /* if we use the zeroout fallback code, we process synchronously
- * and we wait for all pending requests, respectively wait for
+ /* TRIM and WRITE_SAME are processed synchronously,
+ * we wait for all pending requests, respectively wait for
* active_ee to become empty in drbd_submit_peer_request();
* better not add ourselves here. */
- if ((peer_req->flags & EE_IS_TRIM) == 0)
+ if ((peer_req->flags & (EE_IS_TRIM|EE_WRITE_SAME)) == 0)
list_add_tail(&peer_req->w.list, &device->active_ee);
spin_unlock_irq(&device->resource->req_lock);

@@ -2759,7 +2817,7 @@ static int receive_DataRequest(struct drbd_connection *connection, struct packet
* "criss-cross" setup, that might cause write-out on some other DRBD,
* which in turn might block on the other node at this very place. */
peer_req = drbd_alloc_peer_req(peer_device, p->block_id, sector, size,
- true /* has real payload */, GFP_NOIO);
+ size, GFP_NOIO);
if (!peer_req) {
put_ldev(device);
return -ENOMEM;
@@ -3920,6 +3978,7 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info
struct drbd_peer_device *peer_device;
struct drbd_device *device;
struct p_sizes *p = pi->data;
+ struct o_qlim *o = (connection->agreed_features & DRBD_FF_WSAME) ? p->qlim : NULL;
enum determine_dev_size dd = DS_UNCHANGED;
sector_t p_size, p_usize, p_csize, my_usize;
int ldsc = 0; /* local disk size changed */
@@ -4003,7 +4062,7 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info

ddsf = be16_to_cpu(p->dds_flags);
if (get_ldev(device)) {
- drbd_reconsider_queue_parameters(device, device->ldev);
+ drbd_reconsider_queue_parameters(device, device->ldev, o);
dd = drbd_determine_dev_size(device, ddsf, NULL);
put_ldev(device);
if (dd == DS_ERROR)
@@ -4023,7 +4082,7 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info
* However, if he sends a zero current size,
* take his (user-capped or) backing disk size anyways.
*/
- drbd_reconsider_queue_parameters(device, NULL);
+ drbd_reconsider_queue_parameters(device, NULL, o);
drbd_set_my_capacity(device, p_csize ?: p_usize ?: p_size);
}

@@ -4779,7 +4838,7 @@ static int receive_rs_deallocated(struct drbd_connection *connection, struct pac
const int rw = REQ_WRITE | REQ_DISCARD;

peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector,
- size, false, GFP_NOIO);
+ size, 0, GFP_NOIO);
if (!peer_req) {
put_ldev(device);
return -ENOMEM;
@@ -4824,7 +4883,7 @@ static int receive_rs_deallocated(struct drbd_connection *connection, struct pac

struct data_cmd {
int expect_payload;
- size_t pkt_size;
+ unsigned int pkt_size;
int (*fn)(struct drbd_connection *, struct packet_info *);
};

@@ -4856,7 +4915,7 @@ static struct data_cmd drbd_cmd_handler[] = {
[P_PROTOCOL_UPDATE] = { 1, sizeof(struct p_protocol), receive_protocol },
[P_TRIM] = { 0, sizeof(struct p_trim), receive_Data },
[P_RS_DEALLOCATED] = { 0, sizeof(struct p_block_desc), receive_rs_deallocated },
-
+ [P_WSAME] = { 1, sizeof(struct p_wsame), receive_Data },
};

static void drbdd(struct drbd_connection *connection)
@@ -4866,7 +4925,7 @@ static void drbdd(struct drbd_connection *connection)
int err;

while (get_t_state(&connection->receiver) == RUNNING) {
- struct data_cmd *cmd;
+ struct data_cmd const *cmd;

drbd_thread_current_set_cpu(&connection->receiver);
update_receiver_timing_details(connection, drbd_recv_header);
@@ -4881,11 +4940,18 @@ static void drbdd(struct drbd_connection *connection)
}

shs = cmd->pkt_size;
+ if (pi.cmd == P_SIZES && connection->agreed_features & DRBD_FF_WSAME)
+ shs += sizeof(struct o_qlim);
if (pi.size > shs && !cmd->expect_payload) {
drbd_err(connection, "No payload expected %s l:%d\n",
cmdname(pi.cmd), pi.size);
goto err_out;
}
+ if (pi.size < shs) {
+ drbd_err(connection, "%s: unexpected packet size, expected:%d received:%d\n",
+ cmdname(pi.cmd), (int)shs, pi.size);
+ goto err_out;
+ }

if (shs) {
update_receiver_timing_details(connection, drbd_recv_all_warn);
@@ -5132,11 +5198,12 @@ static int drbd_do_features(struct drbd_connection *connection)
drbd_info(connection, "Handshake successful: "
"Agreed network protocol version %d\n", connection->agreed_pro_version);

- drbd_info(connection, "Agreed to%ssupport TRIM on protocol level\n",
- connection->agreed_features & FF_TRIM ? " " : " not ");
-
- drbd_info(connection, "Agreed to%ssupport THIN_RESYNC on protocol level\n",
- connection->agreed_features & FF_THIN_RESYNC ? " " : " not ");
+ drbd_info(connection, "Feature flags enabled on protocol level: 0x%x%s%s%s.\n",
+ connection->agreed_features,
+ connection->agreed_features & DRBD_FF_TRIM ? " TRIM" : "",
+ connection->agreed_features & DRBD_FF_THIN_RESYNC ? " THIN_RESYNC" : "",
+ connection->agreed_features & DRBD_FF_WSAME ? " WRITE_SAME" :
+ connection->agreed_features ? "" : " none");

return 1;

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 8260dfa6..72bb1b1 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -47,8 +47,7 @@ static void _drbd_end_io_acct(struct drbd_device *device, struct drbd_request *r
&device->vdisk->part0, req->start_jif);
}

-static struct drbd_request *drbd_req_new(struct drbd_device *device,
- struct bio *bio_src)
+static struct drbd_request *drbd_req_new(struct drbd_device *device, struct bio *bio_src)
{
struct drbd_request *req;

@@ -58,10 +57,12 @@ static struct drbd_request *drbd_req_new(struct drbd_device *device,
memset(req, 0, sizeof(*req));

drbd_req_make_private_bio(req, bio_src);
- req->rq_state = bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0;
- req->device = device;
- req->master_bio = bio_src;
- req->epoch = 0;
+ req->rq_state = (bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0)
+ | (bio_src->bi_rw & REQ_WRITE_SAME ? RQ_WSAME : 0)
+ | (bio_src->bi_rw & REQ_DISCARD ? RQ_UNMAP : 0);
+ req->device = device;
+ req->master_bio = bio_src;
+ req->epoch = 0;

drbd_clear_interval(&req->i);
req->i.sector = bio_src->bi_iter.bi_sector;
diff --git a/drivers/block/drbd/drbd_req.h b/drivers/block/drbd/drbd_req.h
index bb2ef78..eb49e7f 100644
--- a/drivers/block/drbd/drbd_req.h
+++ b/drivers/block/drbd/drbd_req.h
@@ -206,6 +206,8 @@ enum drbd_req_state_bits {

/* Set when this is a write, clear for a read */
__RQ_WRITE,
+ __RQ_WSAME,
+ __RQ_UNMAP,

/* Should call drbd_al_complete_io() for this request... */
__RQ_IN_ACT_LOG,
@@ -241,10 +243,11 @@ enum drbd_req_state_bits {
#define RQ_NET_OK (1UL << __RQ_NET_OK)
#define RQ_NET_SIS (1UL << __RQ_NET_SIS)

-/* 0x1f8 */
#define RQ_NET_MASK (((1UL << __RQ_NET_MAX)-1) & ~RQ_LOCAL_MASK)

#define RQ_WRITE (1UL << __RQ_WRITE)
+#define RQ_WSAME (1UL << __RQ_WSAME)
+#define RQ_UNMAP (1UL << __RQ_UNMAP)
#define RQ_IN_ACT_LOG (1UL << __RQ_IN_ACT_LOG)
#define RQ_POSTPONED (1UL << __RQ_POSTPONED)
#define RQ_COMPLETION_SUSP (1UL << __RQ_COMPLETION_SUSP)
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index f9e142d..ed422a5 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -320,6 +320,10 @@ void drbd_csum_bio(struct crypto_ahash *tfm, struct bio *bio, void *digest)
sg_set_page(&sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
ahash_request_set_crypt(req, &sg, NULL, sg.length);
crypto_ahash_update(req);
+ /* REQ_WRITE_SAME has only one segment,
+ * checksum the payload only once. */
+ if (bio->bi_rw & REQ_WRITE_SAME)
+ break;
}
ahash_request_set_crypt(req, NULL, digest, 0);
crypto_ahash_final(req);
@@ -387,7 +391,7 @@ static int read_for_csum(struct drbd_peer_device *peer_device, sector_t sector,
/* GFP_TRY, because if there is no memory available right now, this may
* be rescheduled for later. It is "only" background resync, after all. */
peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER /* unused */, sector,
- size, true /* has real payload */, GFP_TRY);
+ size, size, GFP_TRY);
if (!peer_req)
goto defer;

@@ -602,7 +606,7 @@ static int make_resync_request(struct drbd_device *const device, int cancel)
return 0;
}

- if (connection->agreed_features & FF_THIN_RESYNC) {
+ if (connection->agreed_features & DRBD_FF_THIN_RESYNC) {
rcu_read_lock();
discard_granularity = rcu_dereference(device->ldev->disk_conf)->rs_discard_granularity;
rcu_read_unlock();
--
1.9.1

2016-04-25 12:23:28

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 24/30] drbd: disallow promotion during resync handshake, avoid deadlock and hard reset

From: Lars Ellenberg <[email protected]>

We already serialize connection state changes,
and other, non-connection state changes (role changes)
while we are establishing a connection.

But if we have an established connection,
then trigger a resync handshake (by primary --force or similar),
until now we just had to be "lucky".

Consider this sequence (e.g. deployment scenario):
create-md; up;
-> Connected Secondary/Secondary Inconsistent/Inconsistent
then do a racy primary --force on both peers.

block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
*** HERE things go wrong. ***
block drbd0: role( Secondary -> Primary )
block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: Becoming sync target due to disk states.
block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )

The problem here is that the local promotion happens before the sync handshake
triggered by the remote promotion was completed. Some assumptions elsewhere
become wrong, and when the expected resync handshake is then received and
processed, we get stuck in a deadlock, which can only be recovered by reboot :-(

Fix: if we know the peer has good data,
and our own disk is present, but NOT good,
and there is no resync going on yet,
we expect a sync handshake to happen "soon".
So reject a racy promotion with SS_IN_TRANSIENT_STATE.

Result:
... as above ...
block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
*** local promotion being postponed until ... ***
block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
...
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000
block drbd0: conn( WFSyncUUID -> SyncTarget )
*** ... after the resync handshake ***
block drbd0: role( Secondary -> Primary )

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_state.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/block/drbd/drbd_state.c b/drivers/block/drbd/drbd_state.c
index 24422e8..7562c5c 100644
--- a/drivers/block/drbd/drbd_state.c
+++ b/drivers/block/drbd/drbd_state.c
@@ -906,6 +906,15 @@ is_valid_soft_transition(union drbd_state os, union drbd_state ns, struct drbd_c
(ns.conn >= C_CONNECTED && os.conn == C_WF_REPORT_PARAMS)))
rv = SS_IN_TRANSIENT_STATE;

+ /* Do not promote during resync handshake triggered by "force primary".
+ * This is a hack. It should really be rejected by the peer during the
+ * cluster wide state change request. */
+ if (os.role != R_PRIMARY && ns.role == R_PRIMARY
+ && ns.pdsk == D_UP_TO_DATE
+ && ns.disk != D_UP_TO_DATE && ns.disk != D_DISKLESS
+ && (ns.conn <= C_WF_SYNC_UUID || ns.conn != os.conn))
+ rv = SS_IN_TRANSIENT_STATE;
+
if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) && os.conn < C_CONNECTED)
rv = SS_NEED_CONNECTION;

--
1.9.1

2016-04-25 12:23:26

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 17/30] drbd: don't forget error completion when "unsuspending" IO

From: Lars Ellenberg <[email protected]>

Possibly sequence of events:
SyncTarget is made Primary, then loses replication link
(only path to good data on SyncSource).

Behavior is then controlled by the on-no-data-accessible policy,
which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).

If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
(IO suspended due to fencing policy) flag, do NOT set the susp_nod
(IO suspended due to no data) flag.

But we forgot to call the IO error completion for all pending,
suspended, requests.

While at it, also add a race check for a theoretically possible
race with a new handshake (network hickup), we may be able to
re-send requests, and can avoid passing IO errors up the stack.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_nl.c | 48 +++++++++++++++++++++++++++++---------------
1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index f16084a..a703a0e 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -442,19 +442,17 @@ static enum drbd_fencing_p highest_fencing_policy(struct drbd_connection *connec
}
rcu_read_unlock();

- if (fp == FP_NOT_AVAIL) {
- /* IO Suspending works on the whole resource.
- Do it only for one device. */
- vnr = 0;
- peer_device = idr_get_next(&connection->peer_devices, &vnr);
- drbd_change_state(peer_device->device, CS_VERBOSE | CS_HARD, NS(susp_fen, 0));
- }
-
return fp;
}

+static bool resource_is_supended(struct drbd_resource *resource)
+{
+ return resource->susp || resource->susp_fen || resource->susp_nod;
+}
+
bool conn_try_outdate_peer(struct drbd_connection *connection)
{
+ struct drbd_resource * const resource = connection->resource;
unsigned int connect_cnt;
union drbd_state mask = { };
union drbd_state val = { };
@@ -462,21 +460,41 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
char *ex_to_string;
int r;

- spin_lock_irq(&connection->resource->req_lock);
+ spin_lock_irq(&resource->req_lock);
if (connection->cstate >= C_WF_REPORT_PARAMS) {
drbd_err(connection, "Expected cstate < C_WF_REPORT_PARAMS\n");
- spin_unlock_irq(&connection->resource->req_lock);
+ spin_unlock_irq(&resource->req_lock);
return false;
}

connect_cnt = connection->connect_cnt;
- spin_unlock_irq(&connection->resource->req_lock);
+ spin_unlock_irq(&resource->req_lock);

fp = highest_fencing_policy(connection);
switch (fp) {
case FP_NOT_AVAIL:
drbd_warn(connection, "Not fencing peer, I'm not even Consistent myself.\n");
- goto out;
+ spin_lock_irq(&resource->req_lock);
+ if (connection->cstate < C_WF_REPORT_PARAMS) {
+ _conn_request_state(connection,
+ (union drbd_state) { { .susp_fen = 1 } },
+ (union drbd_state) { { .susp_fen = 0 } },
+ CS_VERBOSE | CS_HARD | CS_DC_SUSP);
+ /* We are no longer suspended due to the fencing policy.
+ * We may still be suspended due to the on-no-data-accessible policy.
+ * If that was OND_IO_ERROR, fail pending requests. */
+ if (!resource_is_supended(resource))
+ _tl_restart(connection, CONNECTION_LOST_WHILE_PENDING);
+ }
+ /* Else: in case we raced with a connection handshake,
+ * let the handshake figure out if we maybe can RESEND,
+ * and do not resume/fail pending requests here.
+ * Worst case is we stay suspended for now, which may be
+ * resolved by either re-establishing the replication link, or
+ * the next link failure, or eventually the administrator. */
+ spin_unlock_irq(&resource->req_lock);
+ return false;
+
case FP_DONT_CARE:
return true;
default: ;
@@ -529,13 +547,11 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
drbd_info(connection, "fence-peer helper returned %d (%s)\n",
(r>>8) & 0xff, ex_to_string);

- out:
-
/* Not using
conn_request_state(connection, mask, val, CS_VERBOSE);
here, because we might were able to re-establish the connection in the
meantime. */
- spin_lock_irq(&connection->resource->req_lock);
+ spin_lock_irq(&resource->req_lock);
if (connection->cstate < C_WF_REPORT_PARAMS && !test_bit(STATE_SENT, &connection->flags)) {
if (connection->connect_cnt != connect_cnt)
/* In case the connection was established and droped
@@ -544,7 +560,7 @@ bool conn_try_outdate_peer(struct drbd_connection *connection)
else
_conn_request_state(connection, mask, val, CS_VERBOSE);
}
- spin_unlock_irq(&connection->resource->req_lock);
+ spin_unlock_irq(&resource->req_lock);

return conn_highest_pdsk(connection) <= D_OUTDATED;
}
--
1.9.1

2016-04-25 12:23:24

by Philipp Reisner

[permalink] [raw]

Subject: [PATCH 29/30] drbd: al_write_transaction: skip re-scanning of bitmap page pointer array

From: Lars Ellenberg <[email protected]>

For larger devices, the array of bitmap page pointers can grow very
large (8000 pointers per TB of storage).

For each activity log transaction, we need to flush the associated
bitmap pages to stable storage. Currently, we just "mark" the respective
pages while setting up the transaction, then tell the bitmap code to
write out all marked pages, but skip unchanged pages.

But one such transaction can affect only a small number of bitmap pages,
there is no need to scan the full array of several (ten-)thousand
page pointers to find the few marked ones.

Instead, remember the index numbers of the few affected pages,
and later only re-check those to skip duplicates and unchanged ones.

Signed-off-by: Philipp Reisner <[email protected]>
Signed-off-by: Lars Ellenberg <[email protected]>
---
drivers/block/drbd/drbd_actlog.c | 2 ++
drivers/block/drbd/drbd_bitmap.c | 66 +++++++++++++++++++++++++++++++---------
drivers/block/drbd/drbd_int.h | 1 +
3 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 99a2b92..8305615 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -339,6 +339,8 @@ static int __al_write_transaction(struct drbd_device *device, struct al_transact

i = 0;

+ drbd_bm_reset_al_hints(device);
+
/* Even though no one can start to change this list
* once we set the LC_LOCKED -- from drbd_al_begin_io(),
* lc_try_lock_for_transaction() --, someone may still
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 801b8f3..b1c2a57 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -96,6 +96,13 @@ struct drbd_bitmap {
struct page **bm_pages;
spinlock_t bm_lock;

+ /* exclusively to be used by __al_write_transaction(),
+ * drbd_bm_mark_for_writeout() and
+ * and drbd_bm_write_hinted() -> bm_rw() called from there.
+ */
+ unsigned int n_bitmap_hints;
+ unsigned int al_bitmap_hints[AL_UPDATES_PER_TRANSACTION];
+
/* see LIMITATIONS: above */

unsigned long bm_set; /* nr of set bits; THINK maybe atomic_t? */
@@ -242,6 +249,11 @@ static void bm_set_page_need_writeout(struct page *page)
set_bit(BM_PAGE_NEED_WRITEOUT, &page_private(page));
}

+void drbd_bm_reset_al_hints(struct drbd_device *device)
+{
+ device->bitmap->n_bitmap_hints = 0;
+}
+
/**
* drbd_bm_mark_for_writeout() - mark a page with a "hint" to be considered for writeout
* @device: DRBD device.
@@ -253,6 +265,7 @@ static void bm_set_page_need_writeout(struct page *page)
*/
void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr)
{
+ struct drbd_bitmap *b = device->bitmap;
struct page *page;
if (page_nr >= device->bitmap->bm_number_of_pages) {
drbd_warn(device, "BAD: page_nr: %u, number_of_pages: %u\n",
@@ -260,7 +273,9 @@ void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr)
return;
}
page = device->bitmap->bm_pages[page_nr];
- set_bit(BM_PAGE_HINT_WRITEOUT, &page_private(page));
+ BUG_ON(b->n_bitmap_hints >= ARRAY_SIZE(b->al_bitmap_hints));
+ if (!test_and_set_bit(BM_PAGE_HINT_WRITEOUT, &page_private(page)))
+ b->al_bitmap_hints[b->n_bitmap_hints++] = page_nr;
}

static int bm_test_page_unchanged(struct page *page)
@@ -1030,7 +1045,7 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
{
struct drbd_bm_aio_ctx *ctx;
struct drbd_bitmap *b = device->bitmap;
- int num_pages, i, count = 0;
+ unsigned int num_pages, i, count = 0;
unsigned long now;
char ppb[10];
int err = 0;
@@ -1078,16 +1093,37 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
now = jiffies;

/* let the layers below us try to merge these bios... */
- for (i = 0; i < num_pages; i++) {
- /* ignore completely unchanged pages */
- if (lazy_writeout_upper_idx && i == lazy_writeout_upper_idx)
- break;
- if (!(flags & BM_AIO_READ)) {
- if ((flags & BM_AIO_WRITE_HINTED) &&
- !test_and_clear_bit(BM_PAGE_HINT_WRITEOUT,
- &page_private(b->bm_pages[i])))
- continue;

+ if (flags & BM_AIO_READ) {
+ for (i = 0; i < num_pages; i++) {
+ atomic_inc(&ctx->in_flight);
+ bm_page_io_async(ctx, i);
+ ++count;
+ cond_resched();
+ }
+ } else if (flags & BM_AIO_WRITE_HINTED) {
+ /* ASSERT: BM_AIO_WRITE_ALL_PAGES is not set. */
+ unsigned int hint;
+ for (hint = 0; hint < b->n_bitmap_hints; hint++) {
+ i = b->al_bitmap_hints[hint];
+ if (i >= num_pages) /* == -1U: no hint here. */
+ continue;
+ /* Several AL-extents may point to the same page. */
+ if (!test_and_clear_bit(BM_PAGE_HINT_WRITEOUT,
+ &page_private(b->bm_pages[i])))
+ continue;
+ /* Has it even changed? */
+ if (bm_test_page_unchanged(b->bm_pages[i]))
+ continue;
+ atomic_inc(&ctx->in_flight);
+ bm_page_io_async(ctx, i);
+ ++count;
+ }
+ } else {
+ for (i = 0; i < num_pages; i++) {
+ /* ignore completely unchanged pages */
+ if (lazy_writeout_upper_idx && i == lazy_writeout_upper_idx)
+ break;
if (!(flags & BM_AIO_WRITE_ALL_PAGES) &&
bm_test_page_unchanged(b->bm_pages[i])) {
dynamic_drbd_dbg(device, "skipped bm write for idx %u\n", i);
@@ -1100,11 +1136,11 @@ static int bm_rw(struct drbd_device *device, const unsigned int flags, unsigned
dynamic_drbd_dbg(device, "skipped bm lazy write for idx %u\n", i);
continue;
}
+ atomic_inc(&ctx->in_flight);
+ bm_page_io_async(ctx, i);
+ ++count;
+ cond_resched();
}
- atomic_inc(&ctx->in_flight);
- bm_page_io_async(ctx, i);
- ++count;
- cond_resched();
}

/*
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index acb1462..c83393c 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1378,6 +1378,7 @@ extern int drbd_bm_e_weight(struct drbd_device *device, unsigned long enr);
extern int drbd_bm_read(struct drbd_device *device) __must_hold(local);
extern void drbd_bm_mark_for_writeout(struct drbd_device *device, int page_nr);
extern int drbd_bm_write(struct drbd_device *device) __must_hold(local);
+extern void drbd_bm_reset_al_hints(struct drbd_device *device) __must_hold(local);
extern int drbd_bm_write_hinted(struct drbd_device *device) __must_hold(local);
extern int drbd_bm_write_lazy(struct drbd_device *device, unsigned upper_idx) __must_hold(local);
extern int drbd_bm_write_all(struct drbd_device *device) __must_hold(local);
--
1.9.1

2016-04-25 15:29:00

by Bart Van Assche

[permalink] [raw]

Subject: Re: [Drbd-dev] [PATCH 04/30] drbd: Implement handling of thinly provisioned storage on resync target nodes

On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> If during resync we read only zeroes for a range of sectors assume
> that these secotors can be discarded on the sync target node.

Hello Phil,

With which interconnect(s) has this patch been tested? I'm afraid that
for high-speed interconnects this patch will slow down I/O instead of
making it faster because all_zero() examines all data before it is sent.

Bart.

2016-04-25 15:35:33

by Bart Van Assche

[permalink] [raw]

Subject: Re: [Drbd-dev] [PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> As long as the value is 0 the feature is disabled. With setting
> it to a positive value, DRBD limits and aligns its resync requests
> to the rs-discard-granularity setting. If the sync source detects
> all zeros in such a block, the resync target discards the range
> on disk.

Hello Phil,

Can you explain why rs-discard-granularity is configurable instead of
e.g. setting it to the least common multiple of the discard
granularities of the underlying block devices at both sides?

Thanks,

Bart.

2016-04-25 16:37:38

by Bart Van Assche

[permalink] [raw]

Subject: Re: [Drbd-dev] [PATCH 11/30] drbd: when receiving P_TRIM, zero-out partial unaligned chunks

On 04/25/2016 05:13 AM, Philipp Reisner wrote:
> + while (nr_sectors >= granularity) {
> + nr = min_t(sector_t, nr_sectors, max_discard_sectors);
> + err |= blkdev_issue_discard(bdev, start, nr, GFP_NOIO, 0);
> + nr_sectors -= nr;
> + start += nr;
> + }

Hello Phil,

In blk_bio_discard_split() the following statement protects against
block drivers for which max_discard_sectors is not a multiple of the
discard granularity:

max_discard_sectors -= max_discard_sectors % granularity;

Do we need something similar in the above loop?

To Jens: should the drbd_issue_discard_or_zero_out() function go
upstream or should rather what this function does be integrated in
blkdev_issue_zeroout() as is done by the patch series "[PATCH v2 0/6]
Make blkdev_issue_discard() submit aligned discard requests"
(http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/23801)?

Thanks,

Bart.

2016-04-25 16:42:36

by Philipp Reisner

[permalink] [raw]

Subject: Re: [Drbd-dev] [PATCH 04/30] drbd: Implement handling of thinly provisioned storage on resync target nodes

Am Montag, 25. April 2016, 08:28:45 schrieb Bart Van Assche:
> On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> > If during resync we read only zeroes for a range of sectors assume
> > that these secotors can be discarded on the sync target node.
>
> Hello Phil,
>
> With which interconnect(s) has this patch been tested? I'm afraid that
> for high-speed interconnects this patch will slow down I/O instead of
> making it faster because all_zero() examines all data before it is sent.
>

Hi Bart,

that it might make things slower is true for sure. The benefit it
provides is to de-allocate blocks on the secondary obviously.
The whole feature is optional and it is off by default.

Obviously we want to have a generic interface like SEEK_HOLE/
SEEK_DATA for block devices, but that does not exist as of today.

best regards,
Phil

2016-04-25 16:42:34

by Philipp Reisner

[permalink] [raw]

Subject: Re: [Drbd-dev] [PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

Am Montag, 25. April 2016, 08:35:26 schrieb Bart Van Assche:
> On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> > As long as the value is 0 the feature is disabled. With setting
> > it to a positive value, DRBD limits and aligns its resync requests
> > to the rs-discard-granularity setting. If the sync source detects
> > all zeros in such a block, the resync target discards the range
> > on disk.
>
> Hello Phil,
>
> Can you explain why rs-discard-granularity is configurable instead of
> e.g. setting it to the least common multiple of the discard
> granularities of the underlying block devices at both sides?
>
> Thanks,
>

Hi Bart,

we had this idea as well. It seems that real world devices like larger
discards better than smaller discards. The other motivation was that
a device mapper logical volume might change it on the fly...
So we think it is best to delegate the decision on the discard chunk
size to user space.

best regards,
Phil

2016-04-25 19:04:28

by Bart Van Assche

[permalink] [raw]

Subject: Re: [Drbd-dev] [PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

On 04/25/2016 09:42 AM, Philipp Reisner wrote:
> Am Montag, 25. April 2016, 08:35:26 schrieb Bart Van Assche:
>> On 04/25/2016 05:10 AM, Philipp Reisner wrote:
>>> As long as the value is 0 the feature is disabled. With setting
>>> it to a positive value, DRBD limits and aligns its resync requests
>>> to the rs-discard-granularity setting. If the sync source detects
>>> all zeros in such a block, the resync target discards the range
>>> on disk.
>>
>> Can you explain why rs-discard-granularity is configurable instead of
>> e.g. setting it to the least common multiple of the discard
>> granularities of the underlying block devices at both sides?
>
> we had this idea as well. It seems that real world devices like larger
> discards better than smaller discards. The other motivation was that
> a device mapper logical volume might change it on the fly...
> So we think it is best to delegate the decision on the discard chunk
> size to user space.

Hello Phil,

Are you aware that for aligned discard requests the discard granularity
does not affect the size of discard requests at all?

Regarding LVM volumes: if the discard granularity for such volumes can
change on the fly, shouldn't I/O be quiesced by the LVM kernel driver
before it changes the discard granularity? I think that increasing
discard granularity while I/O is in progress should be considered as a bug.

Bart.

2016-04-25 19:49:22

by Philipp Reisner

[permalink] [raw]

Subject: Re: [Drbd-dev] [PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

Am Montag, 25. April 2016, 11:48:30 schrieb Bart Van Assche:
> On 04/25/2016 09:42 AM, Philipp Reisner wrote:
> > Am Montag, 25. April 2016, 08:35:26 schrieb Bart Van Assche:
> >> On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> >>> As long as the value is 0 the feature is disabled. With setting
> >>> it to a positive value, DRBD limits and aligns its resync requests
> >>> to the rs-discard-granularity setting. If the sync source detects
> >>> all zeros in such a block, the resync target discards the range
> >>> on disk.
> >>
> >> Can you explain why rs-discard-granularity is configurable instead of
> >> e.g. setting it to the least common multiple of the discard
> >> granularities of the underlying block devices at both sides?
> >
> > we had this idea as well. It seems that real world devices like larger
> > discards better than smaller discards. The other motivation was that
> > a device mapper logical volume might change it on the fly...
> > So we think it is best to delegate the decision on the discard chunk
> > size to user space.
>
> Hello Phil,
>
> Are you aware that for aligned discard requests the discard granularity
> does not affect the size of discard requests at all?
>
> Regarding LVM volumes: if the discard granularity for such volumes can
> change on the fly, shouldn't I/O be quiesced by the LVM kernel driver
> before it changes the discard granularity? I think that increasing
> discard granularity while I/O is in progress should be considered as a bug.
>
> Bart.

Hi Bart,

I worked on this about 6 month ago, sorry for not having all the details
at the top of my head immediately. I think it came back now:
We need to announce the discard granularity when we create the device/minor.
At might it might be that there is no connection to the peer node. So we
are left with information about the discard granularity of the local
backing device only.
Therefore we decided to delegate it to the user/admin to provide the
discard granularity for the resync process.

best regards,
phil

2016-04-25 20:32:22

by Lars Ellenberg

[permalink] [raw]

Subject: [PATCH 11/30] drbd: when receiving P_TRIM, zero-out partial unaligned chunks

On Mon, Apr 25, 2016 at 09:37:28AM -0700, Bart Van Assche wrote:
> On 04/25/2016 05:13 AM, Philipp Reisner wrote:
> >+ while (nr_sectors >= granularity) {
> >+ nr = min_t(sector_t, nr_sectors, max_discard_sectors);
> >+ err |= blkdev_issue_discard(bdev, start, nr, GFP_NOIO, 0);
> >+ nr_sectors -= nr;
> >+ start += nr;
> >+ }
>
> Hello Phil,
>
> In blk_bio_discard_split() the following statement protects against
> block drivers for which max_discard_sectors is not a multiple of the
> discard granularity:
>
> max_discard_sectors -= max_discard_sectors % granularity;
>
> Do we need something similar in the above loop?

Right. Did not realize that was "legal".

> To Jens: should the drbd_issue_discard_or_zero_out() function go
> upstream or should rather what this function does be integrated in
> blkdev_issue_zeroout() as is done by the patch series
> "[PATCH v2 0/6] Make blkdev_issue_discard() submit aligned discard requests"
> (http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/23801)?

Thanks,
Lars

2016-04-25 20:37:11

by Lars Ellenberg

[permalink] [raw]

Subject: [PATCH 05/30] drbd: Introduce new disk config option rs-discard-granularity

On Mon, Apr 25, 2016 at 02:49:11PM -0500, Philipp Reisner wrote:
> Am Montag, 25. April 2016, 11:48:30 schrieb Bart Van Assche:
> > On 04/25/2016 09:42 AM, Philipp Reisner wrote:
> > > Am Montag, 25. April 2016, 08:35:26 schrieb Bart Van Assche:
> > >> On 04/25/2016 05:10 AM, Philipp Reisner wrote:
> > >>> As long as the value is 0 the feature is disabled. With setting
> > >>> it to a positive value, DRBD limits and aligns its resync requests
> > >>> to the rs-discard-granularity setting. If the sync source detects
> > >>> all zeros in such a block, the resync target discards the range
> > >>> on disk.
> > >>
> > >> Can you explain why rs-discard-granularity is configurable instead of
> > >> e.g. setting it to the least common multiple of the discard
> > >> granularities of the underlying block devices at both sides?
> > >
> > > we had this idea as well. It seems that real world devices like larger
> > > discards better than smaller discards. The other motivation was that
> > > a device mapper logical volume might change it on the fly...
> > > So we think it is best to delegate the decision on the discard chunk
> > > size to user space.
> >
> > Hello Phil,
> >
> > Are you aware that for aligned discard requests the discard granularity
> > does not affect the size of discard requests at all?
> >
> > Regarding LVM volumes: if the discard granularity for such volumes can
> > change on the fly, shouldn't I/O be quiesced by the LVM kernel driver
> > before it changes the discard granularity? I think that increasing
> > discard granularity while I/O is in progress should be considered as a bug.
> >
> > Bart.
>
> Hi Bart,
>
> I worked on this about 6 month ago, sorry for not having all the details
> at the top of my head immediately. I think it came back now:
> We need to announce the discard granularity when we create the device/minor.
> At might it might be that there is no connection to the peer node. So we
> are left with information about the discard granularity of the local
> backing device only.
> Therefore we decided to delegate it to the user/admin to provide the
> discard granularity for the resync process.

Also, even though it may be technically possible to discard at 512 Byte
granularity, you may want to have the resync only actually do it for
bigger chunks. for $reasons.

Lars

2016-04-25 21:23:30

by Lars Ellenberg

[permalink] [raw]

Subject: Re: [PATCH 11/30] drbd: when receiving P_TRIM, zero-out partial unaligned chunks

On Mon, Apr 25, 2016 at 02:07:38PM +0200, Philipp Reisner wrote:
> + max_discard_sectors = min(q->limits.max_discard_sectors, (1U << 22));
> + max_discard_sectors -= max_discard_sectors % granularity;

^^^^ there

On Mon, Apr 25, 2016 at 10:32:17PM +0200, Lars Ellenberg wrote:
> On Mon, Apr 25, 2016 at 09:37:28AM -0700, Bart Van Assche wrote:
> > On 04/25/2016 05:13 AM, Philipp Reisner wrote:
> > >+ while (nr_sectors >= granularity) {
> > >+ nr = min_t(sector_t, nr_sectors, max_discard_sectors);
> > >+ err |= blkdev_issue_discard(bdev, start, nr, GFP_NOIO, 0);
> > >+ nr_sectors -= nr;
> > >+ start += nr;
> > >+ }
> >
> > Hello Phil,
> >
> > In blk_bio_discard_split() the following statement protects against
> > block drivers for which max_discard_sectors is not a multiple of the
> > discard granularity:
> >
> > max_discard_sectors -= max_discard_sectors % granularity;
> >
> > Do we need something similar in the above loop?
>
> Right. Did not realize that was "legal".

Wait. We already have that. see above.

Lars