LinuxLists.cc - [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

2019-11-14 10:59:46

Subject: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

Hi!

So, after the feedback I got from v1 [1] I've sent out a pull-request
for the OSDs [2] which encodes require_osd_release into the OSDMap
client data. This allows the client to figure out which ceph release
the OSDs cluster is running and decide whether or not it's safe to use
the copy-from Op for copy_file_range.

This new patchset I'm sending simply adds enough functionality to the
kernel client so that it can take advantage of this OSD patch:

0001 - adds the ability to decode TYPE_MSGR2 addresses. This is a
required functionality for enabling SERVER_NAUTILUS in the
client. I hope I got the new format right, as I couldn't figure
out what the hard-coded values (see comments) really mean.

0002 - allows the client to retrieve the new require_osd_release field
from the OSDMap if available. This patch also adds SERVER_MIMIC,
SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
which TBH I'm not sure if that's a safe thing to do -- the only
issue I've seen was that Nautilus requires the ability to decode
TYPE_MSGR2 address, but I may have missed others.

0003 - debug code to add require_osd_release to the osdmap debugfs file.

0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
if the OSDs are >= Octopus.

Also note that, as suggested by Ilya, I've dropped the patch that would
change the default mount options to 'copyfrom'.

These patches have been tested with the xfstests generic test suite, and
with a couple of other (local) tests that exercise the cephfs
copy_file_range syscall. I didn't saw any issues, but as I said above,
I'm not really sure if adding the SERVER_* flags to the supported
features have other side effects.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://github.com/ceph/ceph/pull/31611

Cheers,
--
Luis

Luis Henriques (4):
ceph: add support for TYPE_MSGR2 address decode
ceph: get the require_osd_release field from the osdmap
ceph: add require_osd_release field to osdmap debugfs
ceph: add support for sending truncate_{seq,size} in 'copy-from' Op

fs/ceph/file.c | 10 +++++++-
include/linux/ceph/ceph_features.h | 10 ++++++--
include/linux/ceph/decode.h | 3 ++-
include/linux/ceph/osd_client.h | 1 +
include/linux/ceph/osdmap.h | 1 +
include/linux/ceph/rados.h | 23 ++++++++++++++++++
net/ceph/ceph_strings.c | 38 ++++++++++++++++++++++++++++++
net/ceph/debugfs.c | 2 ++
net/ceph/decode.c | 33 ++++++++++++++++++++++++--
net/ceph/osd_client.c | 7 +++++-
net/ceph/osdmap.c | 21 +++++++++++++++++
11 files changed, 142 insertions(+), 7 deletions(-)

2019-11-14 11:01:28

by Luis Henriques

[permalink] [raw]

Subject: [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode

The new format actually includes two addresses: one the new messenger v2,
and other for the legacy v1, which is the only one currently understood
by kernel clients. Add code to pick the legacy address and ignore the v2
one.

Signed-off-by: Luis Henriques <[email protected]>
---
include/linux/ceph/decode.h | 3 ++-
net/ceph/decode.c | 33 +++++++++++++++++++++++++++++++--
2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
index 450384fe487c..2a2f07dfb39c 100644
--- a/include/linux/ceph/decode.h
+++ b/include/linux/ceph/decode.h
@@ -219,7 +219,8 @@ static inline void ceph_encode_timespec64(struct ceph_timespec *tv,
* sockaddr_storage <-> ceph_sockaddr
*/
#define CEPH_ENTITY_ADDR_TYPE_NONE 0
-#define CEPH_ENTITY_ADDR_TYPE_LEGACY __cpu_to_le32(1)
+#define CEPH_ENTITY_ADDR_TYPE_LEGACY __cpu_to_le32(1) /* legacy msgr1 */
+#define CEPH_ENTITY_ADDR_TYPE_MSGR2 __cpu_to_le32(2) /* msgr2 protocol */

static inline void ceph_encode_banner_addr(struct ceph_entity_addr *a)
{
diff --git a/net/ceph/decode.c b/net/ceph/decode.c
index eea529595a7a..613a2bc6f805 100644
--- a/net/ceph/decode.c
+++ b/net/ceph/decode.c
@@ -67,16 +67,45 @@ ceph_decode_entity_addr_legacy(void **p, void *end,
return ret;
}

+static int
+ceph_decode_entity_addr_versioned_msgr2(void **p, void *end,
+ struct ceph_entity_addr *addr)
+{
+ struct ceph_entity_addr tmp_addr;
+ struct ceph_entity_addr *paddr = addr;
+ int ret = -EINVAL;
+
+ ceph_decode_skip_32(p, end, bad); /* hard-coded '2' */
+ ceph_decode_skip_8(p, end, bad); /* hard-coded '1' */
+
+ ret = ceph_decode_entity_addr_versioned(p, end, paddr);
+ if (ret)
+ goto bad;
+ /* If we already have a v1 address, simply skip over the other address */
+ if (paddr->type == CEPH_ENTITY_ADDR_TYPE_LEGACY)
+ paddr = &tmp_addr;
+
+ ceph_decode_skip_8(p, end, bad); /* hard-coded '1' */
+
+ ret = ceph_decode_entity_addr_versioned(p, end, paddr);
+
+bad:
+ return ret;
+}
+
int
ceph_decode_entity_addr(void **p, void *end, struct ceph_entity_addr *addr)
{
u8 marker;

ceph_decode_8_safe(p, end, marker, bad);
- if (marker == 1)
+ if (marker == CEPH_ENTITY_ADDR_TYPE_MSGR2)
+ return ceph_decode_entity_addr_versioned_msgr2(p, end, addr);
+ else if (marker == CEPH_ENTITY_ADDR_TYPE_LEGACY)
return ceph_decode_entity_addr_versioned(p, end, addr);
- else if (marker == 0)
+ else if (marker == CEPH_ENTITY_ADDR_TYPE_NONE)
return ceph_decode_entity_addr_legacy(p, end, addr);
+
bad:
return -EINVAL;
}

2019-11-14 11:01:56

by Luis Henriques

[permalink] [raw]

Subject: [RFC PATCH v2 2/4] ceph: get the require_osd_release field from the osdmap

Since Ceph Octopus, OSDs are encoding require_osd_release into the client
data part of the osdmap. This patch adds code to pick this extra field.

Signed-off-by: Luis Henriques <[email protected]>
---
include/linux/ceph/ceph_features.h | 10 ++++++++--
include/linux/ceph/osdmap.h | 1 +
net/ceph/osdmap.c | 21 +++++++++++++++++++++
3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index 39e6f4c57580..f329d1907dd7 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -9,6 +9,7 @@
*/
#define CEPH_FEATURE_INCARNATION_1 (0ull)
#define CEPH_FEATURE_INCARNATION_2 (1ull<<57) // CEPH_FEATURE_SERVER_JEWEL
+#define CEPH_FEATURE_INCARNATION_3 ((1ull<<57)|(1ull<<28)) // SERVER_MIMIC

#define DEFINE_CEPH_FEATURE(bit, incarnation, name) \
static const uint64_t CEPH_FEATURE_##name = (1ULL<<bit); \
@@ -76,6 +77,7 @@ DEFINE_CEPH_FEATURE( 0, 1, UID)
DEFINE_CEPH_FEATURE( 1, 1, NOSRCADDR)
DEFINE_CEPH_FEATURE_RETIRED( 2, 1, MONCLOCKCHECK, JEWEL, LUMINOUS)

+DEFINE_CEPH_FEATURE( 2, 3, SERVER_NAUTILUS)
DEFINE_CEPH_FEATURE( 3, 1, FLOCK)
DEFINE_CEPH_FEATURE( 4, 1, SUBSCRIBE2)
DEFINE_CEPH_FEATURE( 5, 1, MONNAMES)
@@ -92,6 +94,7 @@ DEFINE_CEPH_FEATURE(14, 2, SERVER_KRAKEN)
DEFINE_CEPH_FEATURE(15, 1, MONENC)
DEFINE_CEPH_FEATURE_RETIRED(16, 1, QUERY_T, JEWEL, LUMINOUS)

+DEFINE_CEPH_FEATURE(16, 3, SERVER_OCTOPUS)
DEFINE_CEPH_FEATURE_RETIRED(17, 1, INDEP_PG_MAP, JEWEL, LUMINOUS)

DEFINE_CEPH_FEATURE(18, 1, CRUSH_TUNABLES)
@@ -114,7 +117,7 @@ DEFINE_CEPH_FEATURE(25, 1, CRUSH_TUNABLES2)
DEFINE_CEPH_FEATURE(26, 1, CREATEPOOLID)
DEFINE_CEPH_FEATURE(27, 1, REPLY_CREATE_INODE)
DEFINE_CEPH_FEATURE_RETIRED(28, 1, OSD_HBMSGS, HAMMER, JEWEL)
-DEFINE_CEPH_FEATURE(28, 2, SERVER_M)
+DEFINE_CEPH_FEATURE(28, 2, SERVER_MIMIC)
DEFINE_CEPH_FEATURE(29, 1, MDSENC)
DEFINE_CEPH_FEATURE(30, 1, OSDHASHPSPOOL)
DEFINE_CEPH_FEATURE(31, 1, MON_SINGLE_PAXOS) // deprecate me
@@ -212,7 +215,10 @@ DEFINE_CEPH_FEATURE_DEPRECATED(63, 1, RESERVED_BROKEN, LUMINOUS) // client-facin
CEPH_FEATURE_CRUSH_TUNABLES5 | \
CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING | \
CEPH_FEATURE_MSG_ADDR2 | \
- CEPH_FEATURE_CEPHX_V2)
+ CEPH_FEATURE_CEPHX_V2 | \
+ CEPH_FEATURE_SERVER_MIMIC | \
+ CEPH_FEATURE_SERVER_NAUTILUS | \
+ CEPH_FEATURE_SERVER_OCTOPUS)

#define CEPH_FEATURES_REQUIRED_DEFAULT 0

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index e081b56f1c1d..0d8e7f5e3478 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -160,6 +160,7 @@ struct ceph_osdmap {

u32 flags; /* CEPH_OSDMAP_* */

+ u8 require_osd_release;
u32 max_osd; /* size of osd_state, _offload, _addr arrays */
u32 *osd_state; /* CEPH_OSD_* */
u32 *osd_weight; /* 0 = failed, 0x10000 = 100% normal */
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 4e0de14f80bb..29526fd61983 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1582,6 +1582,27 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
WARN_ON(!RB_EMPTY_ROOT(&map->pg_upmap_items));
}

+ if (struct_v >= 6)
+ /* crush version */
+ ceph_decode_skip_32(p, end, e_inval);
+ if (struct_v >= 7) {
+ /*
+ * skip removed_snaps and purged_snaps
+ * (snap_interval_set_t = 8 + 8)
+ */
+ ceph_decode_skip_set(p, end, 16, e_inval);
+ ceph_decode_skip_set(p, end, 16, e_inval);
+ }
+ if (struct_v >= 9) {
+ struct ceph_timespec ts;
+
+ /* last_up_change and last_in_change */
+ ceph_decode_copy_safe(p, end, &ts, sizeof(ts), e_inval);
+ ceph_decode_copy_safe(p, end, &ts, sizeof(ts), e_inval);
+ }
+ if (struct_v >= 10)
+ ceph_decode_8_safe(p, end, map->require_osd_release, e_inval);
+
/* ignore the rest */
*p = end;

2019-11-14 12:22:14

by Jeff Layton

[permalink] [raw]

Subject: Re: [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode

On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> The new format actually includes two addresses: one the new messenger v2,
> and other for the legacy v1, which is the only one currently understood
> by kernel clients. Add code to pick the legacy address and ignore the v2
> one.
>
> Signed-off-by: Luis Henriques <[email protected]>
> ---
> include/linux/ceph/decode.h | 3 ++-
> net/ceph/decode.c | 33 +++++++++++++++++++++++++++++++--
> 2 files changed, 33 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
> index 450384fe487c..2a2f07dfb39c 100644
> --- a/include/linux/ceph/decode.h
> +++ b/include/linux/ceph/decode.h
> @@ -219,7 +219,8 @@ static inline void ceph_encode_timespec64(struct ceph_timespec *tv,
> * sockaddr_storage <-> ceph_sockaddr
> */
> #define CEPH_ENTITY_ADDR_TYPE_NONE 0
> -#define CEPH_ENTITY_ADDR_TYPE_LEGACY __cpu_to_le32(1)
> +#define CEPH_ENTITY_ADDR_TYPE_LEGACY __cpu_to_le32(1) /* legacy msgr1 */
> +#define CEPH_ENTITY_ADDR_TYPE_MSGR2 __cpu_to_le32(2) /* msgr2 protocol */
>
> static inline void ceph_encode_banner_addr(struct ceph_entity_addr *a)
> {
> diff --git a/net/ceph/decode.c b/net/ceph/decode.c
> index eea529595a7a..613a2bc6f805 100644
> --- a/net/ceph/decode.c
> +++ b/net/ceph/decode.c
> @@ -67,16 +67,45 @@ ceph_decode_entity_addr_legacy(void **p, void *end,
> return ret;
> }
>
> +static int
> +ceph_decode_entity_addr_versioned_msgr2(void **p, void *end,
> + struct ceph_entity_addr *addr)
> +{
> + struct ceph_entity_addr tmp_addr;
> + struct ceph_entity_addr *paddr = addr;
> + int ret = -EINVAL;
> +
> + ceph_decode_skip_32(p, end, bad); /* hard-coded '2' */
> + ceph_decode_skip_8(p, end, bad); /* hard-coded '1' */
> +
> + ret = ceph_decode_entity_addr_versioned(p, end, paddr);
> + if (ret)
> + goto bad;
> + /* If we already have a v1 address, simply skip over the other address */
> + if (paddr->type == CEPH_ENTITY_ADDR_TYPE_LEGACY)
> + paddr = &tmp_addr;
> +
> + ceph_decode_skip_8(p, end, bad); /* hard-coded '1' */
> +
> + ret = ceph_decode_entity_addr_versioned(p, end, paddr);
> +
> +bad:
> + return ret;
> +}
> +
> int
> ceph_decode_entity_addr(void **p, void *end, struct ceph_entity_addr *addr)
> {
> u8 marker;
>
> ceph_decode_8_safe(p, end, marker, bad);
> - if (marker == 1)
> + if (marker == CEPH_ENTITY_ADDR_TYPE_MSGR2)
> + return ceph_decode_entity_addr_versioned_msgr2(p, end, addr);
> + else if (marker == CEPH_ENTITY_ADDR_TYPE_LEGACY)
> return ceph_decode_entity_addr_versioned(p, end, addr);
> - else if (marker == 0)
> + else if (marker == CEPH_ENTITY_ADDR_TYPE_NONE)
> return ceph_decode_entity_addr_legacy(p, end, addr);

You're decoding a byte into "marker" and then comparing that to a __le32
value. They almost certainly won't match correctly on a BE arch.

> +
> bad:
> return -EINVAL;
> }

--
Jeff Layton <[email protected]>

2019-11-14 13:04:42

by Jeff Layton

[permalink] [raw]

Subject: Re: [RFC PATCH v2 2/4] ceph: get the require_osd_release field from the osdmap

On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> Since Ceph Octopus, OSDs are encoding require_osd_release into the client
> data part of the osdmap. This patch adds code to pick this extra field.
>
> Signed-off-by: Luis Henriques <[email protected]>
> ---
> include/linux/ceph/ceph_features.h | 10 ++++++++--
> include/linux/ceph/osdmap.h | 1 +
> net/ceph/osdmap.c | 21 +++++++++++++++++++++
> 3 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
> index 39e6f4c57580..f329d1907dd7 100644
> --- a/include/linux/ceph/ceph_features.h
> +++ b/include/linux/ceph/ceph_features.h
> @@ -9,6 +9,7 @@
> */
> #define CEPH_FEATURE_INCARNATION_1 (0ull)
> #define CEPH_FEATURE_INCARNATION_2 (1ull<<57) // CEPH_FEATURE_SERVER_JEWEL
> +#define CEPH_FEATURE_INCARNATION_3 ((1ull<<57)|(1ull<<28)) // SERVER_MIMIC
>
> #define DEFINE_CEPH_FEATURE(bit, incarnation, name) \
> static const uint64_t CEPH_FEATURE_##name = (1ULL<<bit); \
> @@ -76,6 +77,7 @@ DEFINE_CEPH_FEATURE( 0, 1, UID)
> DEFINE_CEPH_FEATURE( 1, 1, NOSRCADDR)
> DEFINE_CEPH_FEATURE_RETIRED( 2, 1, MONCLOCKCHECK, JEWEL, LUMINOUS)
>
> +DEFINE_CEPH_FEATURE( 2, 3, SERVER_NAUTILUS)
> DEFINE_CEPH_FEATURE( 3, 1, FLOCK)
> DEFINE_CEPH_FEATURE( 4, 1, SUBSCRIBE2)
> DEFINE_CEPH_FEATURE( 5, 1, MONNAMES)
> @@ -92,6 +94,7 @@ DEFINE_CEPH_FEATURE(14, 2, SERVER_KRAKEN)
> DEFINE_CEPH_FEATURE(15, 1, MONENC)
> DEFINE_CEPH_FEATURE_RETIRED(16, 1, QUERY_T, JEWEL, LUMINOUS)
>
> +DEFINE_CEPH_FEATURE(16, 3, SERVER_OCTOPUS)
> DEFINE_CEPH_FEATURE_RETIRED(17, 1, INDEP_PG_MAP, JEWEL, LUMINOUS)
>
> DEFINE_CEPH_FEATURE(18, 1, CRUSH_TUNABLES)
> @@ -114,7 +117,7 @@ DEFINE_CEPH_FEATURE(25, 1, CRUSH_TUNABLES2)
> DEFINE_CEPH_FEATURE(26, 1, CREATEPOOLID)
> DEFINE_CEPH_FEATURE(27, 1, REPLY_CREATE_INODE)
> DEFINE_CEPH_FEATURE_RETIRED(28, 1, OSD_HBMSGS, HAMMER, JEWEL)
> -DEFINE_CEPH_FEATURE(28, 2, SERVER_M)
> +DEFINE_CEPH_FEATURE(28, 2, SERVER_MIMIC)
> DEFINE_CEPH_FEATURE(29, 1, MDSENC)
> DEFINE_CEPH_FEATURE(30, 1, OSDHASHPSPOOL)
> DEFINE_CEPH_FEATURE(31, 1, MON_SINGLE_PAXOS) // deprecate me
> @@ -212,7 +215,10 @@ DEFINE_CEPH_FEATURE_DEPRECATED(63, 1, RESERVED_BROKEN, LUMINOUS) // client-facin
> CEPH_FEATURE_CRUSH_TUNABLES5 | \
> CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING | \
> CEPH_FEATURE_MSG_ADDR2 | \
> - CEPH_FEATURE_CEPHX_V2)
> + CEPH_FEATURE_CEPHX_V2 | \
> + CEPH_FEATURE_SERVER_MIMIC | \
> + CEPH_FEATURE_SERVER_NAUTILUS | \
> + CEPH_FEATURE_SERVER_OCTOPUS)
>

As you mentioned in the cover letter, that doesn't seem likely to be
safe to just enable like this. I'm pretty sure the kclient is missing
some things that are encompassed by these bits.

Unfortunately none of that seems to be clearly documented anywhere, so
you're probably stuck walking through the ceph tree to see why the
server daemons are checking these flags.

> #define CEPH_FEATURES_REQUIRED_DEFAULT 0
>
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index e081b56f1c1d..0d8e7f5e3478 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -160,6 +160,7 @@ struct ceph_osdmap {
>
> u32 flags; /* CEPH_OSDMAP_* */
>
> + u8 require_osd_release;
> u32 max_osd; /* size of osd_state, _offload, _addr arrays */
> u32 *osd_state; /* CEPH_OSD_* */
> u32 *osd_weight; /* 0 = failed, 0x10000 = 100% normal */
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 4e0de14f80bb..29526fd61983 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1582,6 +1582,27 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
> WARN_ON(!RB_EMPTY_ROOT(&map->pg_upmap_items));
> }
>
> + if (struct_v >= 6)
> + /* crush version */
> + ceph_decode_skip_32(p, end, e_inval);
> + if (struct_v >= 7) {
> + /*
> + * skip removed_snaps and purged_snaps
> + * (snap_interval_set_t = 8 + 8)
> + */
> + ceph_decode_skip_set(p, end, 16, e_inval);
> + ceph_decode_skip_set(p, end, 16, e_inval);
> + }
> + if (struct_v >= 9) {
> + struct ceph_timespec ts;
> +
> + /* last_up_change and last_in_change */
> + ceph_decode_copy_safe(p, end, &ts, sizeof(ts), e_inval);
> + ceph_decode_copy_safe(p, end, &ts, sizeof(ts), e_inval);
> + }
> + if (struct_v >= 10)
> + ceph_decode_8_safe(p, end, map->require_osd_release, e_inval);
> +
> /* ignore the rest */
> *p = end;
>

--
Jeff Layton <[email protected]>

2019-11-14 13:19:13

by Jeff Layton

[permalink] [raw]

Subject: Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> Hi!
>
> So, after the feedback I got from v1 [1] I've sent out a pull-request
> for the OSDs [2] which encodes require_osd_release into the OSDMap
> client data. This allows the client to figure out which ceph release
> the OSDs cluster is running and decide whether or not it's safe to use
> the copy-from Op for copy_file_range.
>
> This new patchset I'm sending simply adds enough functionality to the
> kernel client so that it can take advantage of this OSD patch:
>
> 0001 - adds the ability to decode TYPE_MSGR2 addresses. This is a
> required functionality for enabling SERVER_NAUTILUS in the
> client. I hope I got the new format right, as I couldn't figure
> out what the hard-coded values (see comments) really mean.
>

nit: the first 3 patch subject lines should probably be prefixed with
"libceph:"

> 0002 - allows the client to retrieve the new require_osd_release field
> from the OSDMap if available. This patch also adds SERVER_MIMIC,
> SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> which TBH I'm not sure if that's a safe thing to do -- the only
> issue I've seen was that Nautilus requires the ability to decode
> TYPE_MSGR2 address, but I may have missed others.
>

Yes, this needs to be done with care. We have to ensure that the server
side isn't assuming that the client supports something that it doesn't.
I think that means just trawling through the code and verifying whether
this is safe.

> 0003 - debug code to add require_osd_release to the osdmap debugfs file.
>
> 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> if the OSDs are >= Octopus.
>
> Also note that, as suggested by Ilya, I've dropped the patch that would
> change the default mount options to 'copyfrom'.
>
> These patches have been tested with the xfstests generic test suite, and
> with a couple of other (local) tests that exercise the cephfs
> copy_file_range syscall. I didn't saw any issues, but as I said above,
> I'm not really sure if adding the SERVER_* flags to the supported
> features have other side effects.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://github.com/ceph/ceph/pull/31611
>

I'm just getting caught up on the discussion here, but why was it
decided to do it this way instead of just adding a new OSD
"copy-from-no-truncseq" operation? Once you tried it once and an OSD
didn't support it, you could just give up on using it any longer? That
seems a lot simpler than trying to monkey with feature bits.

--
Jeff Layton <[email protected]>

2019-11-14 13:29:17

by Sage Weil

[permalink] [raw]

Subject: Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

On Thu, 14 Nov 2019, Jeff Layton wrote:
> On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > Hi!
> >
> > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > client data. This allows the client to figure out which ceph release
> > the OSDs cluster is running and decide whether or not it's safe to use
> > the copy-from Op for copy_file_range.
> >
> > This new patchset I'm sending simply adds enough functionality to the
> > kernel client so that it can take advantage of this OSD patch:
> >
> > 0001 - adds the ability to decode TYPE_MSGR2 addresses. This is a
> > required functionality for enabling SERVER_NAUTILUS in the
> > client. I hope I got the new format right, as I couldn't figure
> > out what the hard-coded values (see comments) really mean.
> >
>
> nit: the first 3 patch subject lines should probably be prefixed with
> "libceph:"
>
> > 0002 - allows the client to retrieve the new require_osd_release field
> > from the OSDMap if available. This patch also adds SERVER_MIMIC,
> > SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> > which TBH I'm not sure if that's a safe thing to do -- the only
> > issue I've seen was that Nautilus requires the ability to decode
> > TYPE_MSGR2 address, but I may have missed others.
> >
>
> Yes, this needs to be done with care. We have to ensure that the server
> side isn't assuming that the client supports something that it doesn't.
> I think that means just trawling through the code and verifying whether
> this is safe.
>
> > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> >
> > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> > if the OSDs are >= Octopus.
> >
> > Also note that, as suggested by Ilya, I've dropped the patch that would
> > change the default mount options to 'copyfrom'.
> >
> > These patches have been tested with the xfstests generic test suite, and
> > with a couple of other (local) tests that exercise the cephfs
> > copy_file_range syscall. I didn't saw any issues, but as I said above,
> > I'm not really sure if adding the SERVER_* flags to the supported
> > features have other side effects.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://github.com/ceph/ceph/pull/31611
> >
>
> I'm just getting caught up on the discussion here, but why was it
> decided to do it this way instead of just adding a new OSD
> "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> didn't support it, you could just give up on using it any longer? That
> seems a lot simpler than trying to monkey with feature bits.

I don't remember the original discussion either, but in retrospect that
does seem much simpler--especially since hte client is conditioning
sending this based on the the require_osd_release. It seems like passing
a flag to the copy-from op would be more reasonable instead of conditional
feature-based behavior.

Greg, do you remember why we ended up here?

sage

2019-11-14 14:14:29

by Ilya Dryomov

[permalink] [raw]

Subject: Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

On Thu, Nov 14, 2019 at 2:28 PM Sage Weil <[email protected]> wrote:
>
> On Thu, 14 Nov 2019, Jeff Layton wrote:
> > On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > > Hi!
> > >
> > > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > > client data. This allows the client to figure out which ceph release
> > > the OSDs cluster is running and decide whether or not it's safe to use
> > > the copy-from Op for copy_file_range.
> > >
> > > This new patchset I'm sending simply adds enough functionality to the
> > > kernel client so that it can take advantage of this OSD patch:
> > >
> > > 0001 - adds the ability to decode TYPE_MSGR2 addresses. This is a
> > > required functionality for enabling SERVER_NAUTILUS in the
> > > client. I hope I got the new format right, as I couldn't figure
> > > out what the hard-coded values (see comments) really mean.
> > >
> >
> > nit: the first 3 patch subject lines should probably be prefixed with
> > "libceph:"
> >
> > > 0002 - allows the client to retrieve the new require_osd_release field
> > > from the OSDMap if available. This patch also adds SERVER_MIMIC,
> > > SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> > > which TBH I'm not sure if that's a safe thing to do -- the only
> > > issue I've seen was that Nautilus requires the ability to decode
> > > TYPE_MSGR2 address, but I may have missed others.
> > >
> >
> > Yes, this needs to be done with care. We have to ensure that the server
> > side isn't assuming that the client supports something that it doesn't.
> > I think that means just trawling through the code and verifying whether
> > this is safe.
> >
> > > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> > >
> > > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> > > if the OSDs are >= Octopus.
> > >
> > > Also note that, as suggested by Ilya, I've dropped the patch that would
> > > change the default mount options to 'copyfrom'.
> > >
> > > These patches have been tested with the xfstests generic test suite, and
> > > with a couple of other (local) tests that exercise the cephfs
> > > copy_file_range syscall. I didn't saw any issues, but as I said above,
> > > I'm not really sure if adding the SERVER_* flags to the supported
> > > features have other side effects.
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > [2] https://github.com/ceph/ceph/pull/31611
> > >
> >
> > I'm just getting caught up on the discussion here, but why was it
> > decided to do it this way instead of just adding a new OSD
> > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > didn't support it, you could just give up on using it any longer? That
> > seems a lot simpler than trying to monkey with feature bits.
>
> I don't remember the original discussion either, but in retrospect that
> does seem much simpler--especially since hte client is conditioning
> sending this based on the the require_osd_release. It seems like passing
> a flag to the copy-from op would be more reasonable instead of conditional
> feature-based behavior.

Yeah, I suggested adding require_osd_release to the client portion just
because we are running into it more and more: Objecter relies on it for
RESEND_ON_SPLIT for example. It needs to be accessible so that patches
like that can be carried over to the kernel client without workarounds.

copy-from in its existing form is another example. AFAIU the problem
is that copy-from op doesn't reject unknown flags. Luis added a flag
in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
nautilus and older releases, potentially leading to data corruption.

Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
copy-from2 op that would behave just like copy-from, but reject unknown
flags to avoid similar compatibility issues in the future is probably
the best thing we can do from the client perspective.

Thanks,

Ilya

2019-11-14 14:21:17

by Sage Weil

[permalink] [raw]

Subject: Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > I'm just getting caught up on the discussion here, but why was it
> > > decided to do it this way instead of just adding a new OSD
> > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > didn't support it, you could just give up on using it any longer? That
> > > seems a lot simpler than trying to monkey with feature bits.
> >
> > I don't remember the original discussion either, but in retrospect that
> > does seem much simpler--especially since hte client is conditioning
> > sending this based on the the require_osd_release. It seems like passing
> > a flag to the copy-from op would be more reasonable instead of conditional
> > feature-based behavior.
>
> Yeah, I suggested adding require_osd_release to the client portion just
> because we are running into it more and more: Objecter relies on it for
> RESEND_ON_SPLIT for example. It needs to be accessible so that patches
> like that can be carried over to the kernel client without workarounds.
>
> copy-from in its existing form is another example. AFAIU the problem
> is that copy-from op doesn't reject unknown flags. Luis added a flag
> in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> nautilus and older releases, potentially leading to data corruption.
>
> Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> copy-from2 op that would behave just like copy-from, but reject unknown
> flags to avoid similar compatibility issues in the future is probably
> the best thing we can do from the client perspective.

Yeah, I think copy-from2 is the best path. I think that means we should
revert what we merged to ceph.git a few weeks back, Luis! Sorry we didn't
put all the pieces together before...

sage

2019-11-14 15:26:04

by Luis Henriques

[permalink] [raw]

Subject: Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

On Thu, Nov 14, 2019 at 02:17:44PM +0000, Sage Weil wrote:
> On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > > I'm just getting caught up on the discussion here, but why was it
> > > > decided to do it this way instead of just adding a new OSD
> > > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > > didn't support it, you could just give up on using it any longer? That
> > > > seems a lot simpler than trying to monkey with feature bits.
> > >
> > > I don't remember the original discussion either, but in retrospect that
> > > does seem much simpler--especially since hte client is conditioning
> > > sending this based on the the require_osd_release. It seems like passing
> > > a flag to the copy-from op would be more reasonable instead of conditional
> > > feature-based behavior.
> >
> > Yeah, I suggested adding require_osd_release to the client portion just
> > because we are running into it more and more: Objecter relies on it for
> > RESEND_ON_SPLIT for example. It needs to be accessible so that patches
> > like that can be carried over to the kernel client without workarounds.
> >
> > copy-from in its existing form is another example. AFAIU the problem
> > is that copy-from op doesn't reject unknown flags. Luis added a flag
> > in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> > nautilus and older releases, potentially leading to data corruption.
> >
> > Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> > copy-from2 op that would behave just like copy-from, but reject unknown
> > flags to avoid similar compatibility issues in the future is probably
> > the best thing we can do from the client perspective.
>
> Yeah, I think copy-from2 is the best path. I think that means we should
> revert what we merged to ceph.git a few weeks back, Luis! Sorry we didn't
> put all the pieces together before...

Well, that's an unexpected turn. I'm not disagreeing with that decision
but since my initial pull request was done one year ago (almost to the
day!), it's a bit disappointing to see that in the end I'm back to
square one :-)

I guess that the PR I mentioned in the cover letter can also be dropped,
as it's not really usable by the kernel client (at least not until it
fully supports all the features up to SERVER_OCTOPUS).

Anyway, thanks everyone for the review.

Cheers,
--
Lu?s

2019-11-14 15:50:25

by Ilya Dryomov

[permalink] [raw]

Subject: Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

On Thu, Nov 14, 2019 at 4:24 PM Luis Henriques <[email protected]> wrote:
>
> On Thu, Nov 14, 2019 at 02:17:44PM +0000, Sage Weil wrote:
> > On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > > > I'm just getting caught up on the discussion here, but why was it
> > > > > decided to do it this way instead of just adding a new OSD
> > > > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > > > didn't support it, you could just give up on using it any longer? That
> > > > > seems a lot simpler than trying to monkey with feature bits.
> > > >
> > > > I don't remember the original discussion either, but in retrospect that
> > > > does seem much simpler--especially since hte client is conditioning
> > > > sending this based on the the require_osd_release. It seems like passing
> > > > a flag to the copy-from op would be more reasonable instead of conditional
> > > > feature-based behavior.
> > >
> > > Yeah, I suggested adding require_osd_release to the client portion just
> > > because we are running into it more and more: Objecter relies on it for
> > > RESEND_ON_SPLIT for example. It needs to be accessible so that patches
> > > like that can be carried over to the kernel client without workarounds.
> > >
> > > copy-from in its existing form is another example. AFAIU the problem
> > > is that copy-from op doesn't reject unknown flags. Luis added a flag
> > > in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> > > nautilus and older releases, potentially leading to data corruption.
> > >
> > > Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> > > copy-from2 op that would behave just like copy-from, but reject unknown
> > > flags to avoid similar compatibility issues in the future is probably
> > > the best thing we can do from the client perspective.
> >
> > Yeah, I think copy-from2 is the best path. I think that means we should
> > revert what we merged to ceph.git a few weeks back, Luis! Sorry we didn't
> > put all the pieces together before...
>
> Well, that's an unexpected turn. I'm not disagreeing with that decision
> but since my initial pull request was done one year ago (almost to the
> day!), it's a bit disappointing to see that in the end I'm back to
> square one :-)

Well, I think literally every line from that PR will still go in, just
wrapped in a new OSD op. Backwards compatibility is hard...

>
> I guess that the PR I mentioned in the cover letter can also be dropped,
> as it's not really usable by the kernel client (at least not until it
> fully supports all the features up to SERVER_OCTOPUS).

No, some form of https://github.com/ceph/ceph/pull/31611 should go in.
I'm pretty certain it will come up at some point in the future, even if
the new field isn't immediately usable today. Someone porting a change
to the kernel client a couple years from now will thank you for it ;)

Thanks,

Ilya

2019-11-14 18:30:03

by Gregory Farnum

[permalink] [raw]

Subject: Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs

On Thu, Nov 14, 2019 at 5:28 AM Sage Weil <[email protected]> wrote:
>
> On Thu, 14 Nov 2019, Jeff Layton wrote:
> > On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > > Hi!
> > >
> > > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > > client data. This allows the client to figure out which ceph release
> > > the OSDs cluster is running and decide whether or not it's safe to use
> > > the copy-from Op for copy_file_range.
> > >
> > > This new patchset I'm sending simply adds enough functionality to the
> > > kernel client so that it can take advantage of this OSD patch:
> > >
> > > 0001 - adds the ability to decode TYPE_MSGR2 addresses. This is a
> > > required functionality for enabling SERVER_NAUTILUS in the
> > > client. I hope I got the new format right, as I couldn't figure
> > > out what the hard-coded values (see comments) really mean.
> > >
> >
> > nit: the first 3 patch subject lines should probably be prefixed with
> > "libceph:"
> >
> > > 0002 - allows the client to retrieve the new require_osd_release field
> > > from the OSDMap if available. This patch also adds SERVER_MIMIC,
> > > SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> > > which TBH I'm not sure if that's a safe thing to do -- the only
> > > issue I've seen was that Nautilus requires the ability to decode
> > > TYPE_MSGR2 address, but I may have missed others.
> > >
> >
> > Yes, this needs to be done with care. We have to ensure that the server
> > side isn't assuming that the client supports something that it doesn't.
> > I think that means just trawling through the code and verifying whether
> > this is safe.
> >
> > > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> > >
> > > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> > > if the OSDs are >= Octopus.
> > >
> > > Also note that, as suggested by Ilya, I've dropped the patch that would
> > > change the default mount options to 'copyfrom'.
> > >
> > > These patches have been tested with the xfstests generic test suite, and
> > > with a couple of other (local) tests that exercise the cephfs
> > > copy_file_range syscall. I didn't saw any issues, but as I said above,
> > > I'm not really sure if adding the SERVER_* flags to the supported
> > > features have other side effects.
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > [2] https://github.com/ceph/ceph/pull/31611
> > >
> >
> > I'm just getting caught up on the discussion here, but why was it
> > decided to do it this way instead of just adding a new OSD
> > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > didn't support it, you could just give up on using it any longer? That
> > seems a lot simpler than trying to monkey with feature bits.
>
> I don't remember the original discussion either, but in retrospect that
> does seem much simpler--especially since hte client is conditioning
> sending this based on the the require_osd_release. It seems like passing
> a flag to the copy-from op would be more reasonable instead of conditional
> feature-based behavior.
>
> Greg, do you remember why we ended up here?

Well, I can look it up. We discussed it being a different op in
February ("OSD 'copy-from' operation and truncate_seq value") and...it
looks like that conversation ended with that being the plan?

I can't see why it changed in the making though, and everyone seems to
have forgotten about it at the next pass.
-Greg

>
> sage
>