Hi!
(Sorry for the long cover letter!)
Since the fix for [1] has finally been merged and should be available in
the next (Octopus) ceph release, I'm trying to clean-up my kernel client
patch that tries to find out whether or not it's safe to use the
'copy-from' RADOS operation for copy_file_range.
So, the fix for [1] was to modify the 'copy-from' operation to allow
clients to optionally (using the CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ
flag) send the extra truncate_seq and truncate_size parameters. Since
only Octopus will have this fix (no backports planned), the client
simply needs to ensure the OSDs being used have SERVER_OCTOPUS in their
features.
My initial solution was to add an extra test in __submit_request,
looping all the request ops and checking if the connection has the
required features for that operation. Obviously, at the moment only the
copy-from operation has a restriction but I guess others may be added in
the future. I believe that doing this at this point (__submit_request)
allows to cover cases where a cluster is being upgraded to Octopus and
we have different OSDs running with different feature bits.
Unfortunately, this solution is racy because the connection state
machine may be changing and the peer_features field isn't yet set. For
example: if the connection to an OSD is being re-open when we're about
to check the features, the con->state will be CON_STATE_PREOPEN and the
con->peer_features will be 0. I tried to find ways to move the feature
check further down in the stack, but that can't be easily done without
adding more infrastructure. A solution that came to my mind was to add
a new con->ops, invoked in the context of ceph_con_workfn, under the
con->mutex. This callback could then verify the available features,
aborting the operation if needed.
Note that the race in this patchset doesn't seem to be a huge problem,
other than occasionally reverting to a VFS generic copy_file_range, as
-EOPNOTSUPP will be returned here. But it's still a race, and there are
probably other cases that I'm missing.
Anyway, maybe I'm missing an obvious solution for checking these OSD
features, but I'm open to any suggestions on other options (or some
feedback on the new callback in ceph_connection_operations option).
[1] https://tracker.ceph.com/issues/37378
Cheers,
--
Luis
Luis Henriques (2):
ceph: add support for sending truncate_{seq,size} in 'copy-from' Op
ceph: make 'copyfrom' a default mount option again
fs/ceph/file.c | 4 +++-
fs/ceph/super.c | 4 ++--
fs/ceph/super.h | 4 +---
include/linux/ceph/ceph_features.h | 6 ++++-
include/linux/ceph/osd_client.h | 1 +
include/linux/ceph/rados.h | 1 +
net/ceph/osd_client.c | 37 +++++++++++++++++++++++++++++-
7 files changed, 49 insertions(+), 8 deletions(-)
Now that we're able to detect whether an OSD can correctly handle
'copy-from' without corrupting the destination file, we can make the
'copyfrom' mount option the default again. This effectively reverts
commit 6f9718fe41c3 ("ceph: make 'nocopyfrom' a default mount option").
Signed-off-by: Luis Henriques <[email protected]>
---
fs/ceph/super.c | 4 ++--
fs/ceph/super.h | 4 +---
2 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index edfd643a8205..c761be9eecbf 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -584,8 +584,8 @@ static int ceph_show_options(struct seq_file *m, struct dentry *root)
seq_puts(m, ",noacl");
#endif
- if ((fsopt->flags & CEPH_MOUNT_OPT_NOCOPYFROM) == 0)
- seq_puts(m, ",copyfrom");
+ if (fsopt->flags & CEPH_MOUNT_OPT_NOCOPYFROM)
+ seq_puts(m, ",nocopyfrom");
if (fsopt->mds_namespace)
seq_show_option(m, "mds_namespace", fsopt->mds_namespace);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index f98d9247f9cb..4cbcaee6e670 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -44,9 +44,7 @@
#define CEPH_MOUNT_OPT_NOQUOTADF (1<<13) /* no root dir quota in statfs */
#define CEPH_MOUNT_OPT_NOCOPYFROM (1<<14) /* don't use RADOS 'copy-from' op */
-#define CEPH_MOUNT_OPT_DEFAULT \
- (CEPH_MOUNT_OPT_DCACHE | \
- CEPH_MOUNT_OPT_NOCOPYFROM)
+#define CEPH_MOUNT_OPT_DEFAULT CEPH_MOUNT_OPT_DCACHE
#define ceph_set_mount_opt(fsc, opt) \
(fsc)->mount_options->flags |= CEPH_MOUNT_OPT_##opt;
By default, doing an object copy in Ceph will result in not only the
data being copied but also the truncate_seq and truncate_size values.
This may make sense in generic RADOS object copies, but for the specific
case of performing a file copy will result in data corruption in the
destination file.
In order to fix this, the 'copy-from' operation has been modified so
that it could receive the two extra parameters for the destination
object truncate_seq and truncate_size. This patch adds support for
these extra parameters to the kernel client. Unfortunately, this
operation modification is available in Ceph Octopus only, so it is
necessary to ensure that the OSD doing the copy does indeed support this
feature.
Link: https://tracker.ceph.com/issues/37378
Signed-off-by: Luis Henriques <[email protected]>
---
fs/ceph/file.c | 4 +++-
include/linux/ceph/ceph_features.h | 6 ++++-
include/linux/ceph/osd_client.h | 1 +
include/linux/ceph/rados.h | 1 +
net/ceph/osd_client.c | 37 +++++++++++++++++++++++++++++-
5 files changed, 46 insertions(+), 3 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d277f71abe0b..e21a8eaabeb1 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2075,7 +2075,9 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
CEPH_OSD_OP_FLAG_FADVISE_NOCACHE,
&dst_oid, &dst_oloc,
CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
- CEPH_OSD_OP_FLAG_FADVISE_DONTNEED, 0);
+ CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
+ dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
+ CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
if (err) {
dout("ceph_osdc_copy_from returned %d\n", err);
if (!ret)
diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index 39e6f4c57580..232257f6b60c 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -9,6 +9,7 @@
*/
#define CEPH_FEATURE_INCARNATION_1 (0ull)
#define CEPH_FEATURE_INCARNATION_2 (1ull<<57) // CEPH_FEATURE_SERVER_JEWEL
+#define CEPH_FEATURE_INCARNATION_3 ((1ull<<57)|(1ull<<28)) // SERVER_MIMIC
#define DEFINE_CEPH_FEATURE(bit, incarnation, name) \
static const uint64_t CEPH_FEATURE_##name = (1ULL<<bit); \
@@ -76,6 +77,7 @@ DEFINE_CEPH_FEATURE( 0, 1, UID)
DEFINE_CEPH_FEATURE( 1, 1, NOSRCADDR)
DEFINE_CEPH_FEATURE_RETIRED( 2, 1, MONCLOCKCHECK, JEWEL, LUMINOUS)
+DEFINE_CEPH_FEATURE( 2, 3, SERVER_NAUTILUS)
DEFINE_CEPH_FEATURE( 3, 1, FLOCK)
DEFINE_CEPH_FEATURE( 4, 1, SUBSCRIBE2)
DEFINE_CEPH_FEATURE( 5, 1, MONNAMES)
@@ -92,6 +94,7 @@ DEFINE_CEPH_FEATURE(14, 2, SERVER_KRAKEN)
DEFINE_CEPH_FEATURE(15, 1, MONENC)
DEFINE_CEPH_FEATURE_RETIRED(16, 1, QUERY_T, JEWEL, LUMINOUS)
+DEFINE_CEPH_FEATURE(16, 3, SERVER_OCTOPUS)
DEFINE_CEPH_FEATURE_RETIRED(17, 1, INDEP_PG_MAP, JEWEL, LUMINOUS)
DEFINE_CEPH_FEATURE(18, 1, CRUSH_TUNABLES)
@@ -212,7 +215,8 @@ DEFINE_CEPH_FEATURE_DEPRECATED(63, 1, RESERVED_BROKEN, LUMINOUS) // client-facin
CEPH_FEATURE_CRUSH_TUNABLES5 | \
CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING | \
CEPH_FEATURE_MSG_ADDR2 | \
- CEPH_FEATURE_CEPHX_V2)
+ CEPH_FEATURE_CEPHX_V2 | \
+ CEPH_FEATURE_SERVER_OCTOPUS)
#define CEPH_FEATURES_REQUIRED_DEFAULT 0
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index eaffbdddf89a..5a62dbd3f4c2 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -534,6 +534,7 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
struct ceph_object_id *dst_oid,
struct ceph_object_locator *dst_oloc,
u32 dst_fadvise_flags,
+ u32 truncate_seq, u64 truncate_size,
u8 copy_from_flags);
/* watch/notify */
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 3eb0e55665b4..fc70e68231b3 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -446,6 +446,7 @@ enum {
CEPH_OSD_COPY_FROM_FLAG_MAP_SNAP_CLONE = 8, /* map snap direct to
* cloneid */
CEPH_OSD_COPY_FROM_FLAG_RWORDERED = 16, /* order with write */
+ CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ = 32, /* send truncate_{seq,size} */
};
enum {
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index ba45b074a362..ade27f5fa777 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2272,6 +2272,32 @@ static void maybe_request_map(struct ceph_osd_client *osdc)
ceph_monc_renew_subs(&osdc->client->monc);
}
+/*
+ * This function will check, for each OSD operation in the request, if the
+ * required support features are available in the connection.
+ */
+static bool check_con_features(struct ceph_connection *con,
+ struct ceph_osd_request *req)
+{
+ int i;
+
+ for (i = 0; i < req->r_num_ops; i++) {
+ switch (req->r_ops[i].op) {
+ case CEPH_OSD_OP_COPY_FROM:
+ /*
+ * 'copy-from' implementation had a bug in the OSDs
+ * before Octopus release where file data would get
+ * corructed when truncated
+ */
+ if (!CEPH_HAVE_FEATURE(con->peer_features,
+ SERVER_OCTOPUS))
+ return false;
+ break;
+ }
+ }
+ return true;
+}
+
static void complete_request(struct ceph_osd_request *req, int err);
static void send_map_check(struct ceph_osd_request *req);
@@ -2336,6 +2362,10 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
}
mutex_lock(&osd->lock);
+ if (!check_con_features(&osd->o_con, req)) {
+ err = -EOPNOTSUPP;
+ need_send = false;
+ }
/*
* Assign the tid atomically with send_request() to protect
* multiple writes to the same object from racing with each
@@ -5315,6 +5345,7 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
struct ceph_object_locator *src_oloc,
u32 src_fadvise_flags,
u32 dst_fadvise_flags,
+ u32 truncate_seq, u64 truncate_size,
u8 copy_from_flags)
{
struct ceph_osd_req_op *op;
@@ -5335,6 +5366,8 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
end = p + PAGE_SIZE;
ceph_encode_string(&p, end, src_oid->name, src_oid->name_len);
encode_oloc(&p, end, src_oloc);
+ ceph_encode_32(&p, truncate_seq);
+ ceph_encode_64(&p, truncate_size);
op->indata_len = PAGE_SIZE - (end - p);
ceph_osd_data_pages_init(&op->copy_from.osd_data, pages,
@@ -5350,6 +5383,7 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
struct ceph_object_id *dst_oid,
struct ceph_object_locator *dst_oloc,
u32 dst_fadvise_flags,
+ u32 truncate_seq, u64 truncate_size,
u8 copy_from_flags)
{
struct ceph_osd_request *req;
@@ -5366,7 +5400,8 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
ret = osd_req_op_copy_from_init(req, src_snapid, src_version, src_oid,
src_oloc, src_fadvise_flags,
- dst_fadvise_flags, copy_from_flags);
+ dst_fadvise_flags, truncate_seq,
+ truncate_size, copy_from_flags);
if (ret)
goto out;
On Fri, Nov 8, 2019 at 3:15 PM Luis Henriques <[email protected]> wrote:
>
> Hi!
>
> (Sorry for the long cover letter!)
This is exactly what cover letters are for!
>
> Since the fix for [1] has finally been merged and should be available in
> the next (Octopus) ceph release, I'm trying to clean-up my kernel client
> patch that tries to find out whether or not it's safe to use the
> 'copy-from' RADOS operation for copy_file_range.
>
> So, the fix for [1] was to modify the 'copy-from' operation to allow
> clients to optionally (using the CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ
> flag) send the extra truncate_seq and truncate_size parameters. Since
> only Octopus will have this fix (no backports planned), the client
> simply needs to ensure the OSDs being used have SERVER_OCTOPUS in their
> features.
>
> My initial solution was to add an extra test in __submit_request,
> looping all the request ops and checking if the connection has the
> required features for that operation. Obviously, at the moment only the
> copy-from operation has a restriction but I guess others may be added in
> the future. I believe that doing this at this point (__submit_request)
> allows to cover cases where a cluster is being upgraded to Octopus and
> we have different OSDs running with different feature bits.
>
> Unfortunately, this solution is racy because the connection state
> machine may be changing and the peer_features field isn't yet set. For
> example: if the connection to an OSD is being re-open when we're about
> to check the features, the con->state will be CON_STATE_PREOPEN and the
> con->peer_features will be 0. I tried to find ways to move the feature
> check further down in the stack, but that can't be easily done without
> adding more infrastructure. A solution that came to my mind was to add
> a new con->ops, invoked in the context of ceph_con_workfn, under the
> con->mutex. This callback could then verify the available features,
> aborting the operation if needed.
>
> Note that the race in this patchset doesn't seem to be a huge problem,
> other than occasionally reverting to a VFS generic copy_file_range, as
> -EOPNOTSUPP will be returned here. But it's still a race, and there are
> probably other cases that I'm missing.
>
> Anyway, maybe I'm missing an obvious solution for checking these OSD
> features, but I'm open to any suggestions on other options (or some
> feedback on the new callback in ceph_connection_operations option).
>
> [1] https://tracker.ceph.com/issues/37378
If the OSD checked for unknown flags, like newer syscalls do, it would
be super easy, but it looks like it doesn't.
An obvious solution is to look at require_osd_release in osdmap, but we
don't decode that in the kernel because it lives the OSD portion of the
osdmap. We could add that and consider the fact that the client now
needs to decode more than just the client portion a design mistake.
I'm not sure what can of worms does that open and if copy-from alone is
worth it though. Perhaps that field could be moved to (or a copy of it
be replicated in) the client portion of the osdmap starting with
octopus? We seem to be running into it on the client side more and
more...
Given the track record of this feature (the fix for the most recently
discovered data corrupting bug hasn't even merged yet), I would be very
hesitant to turn it back on by default even if we figure out a good
solution for the feature check. In my opinion, it should stay opt-in.
Thanks,
Ilya
On Fri, Nov 08, 2019 at 04:15:35PM +0100, Ilya Dryomov wrote:
> On Fri, Nov 8, 2019 at 3:15 PM Luis Henriques <[email protected]> wrote:
> >
> > Hi!
> >
> > (Sorry for the long cover letter!)
>
> This is exactly what cover letters are for!
>
> >
> > Since the fix for [1] has finally been merged and should be available in
> > the next (Octopus) ceph release, I'm trying to clean-up my kernel client
> > patch that tries to find out whether or not it's safe to use the
> > 'copy-from' RADOS operation for copy_file_range.
> >
> > So, the fix for [1] was to modify the 'copy-from' operation to allow
> > clients to optionally (using the CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ
> > flag) send the extra truncate_seq and truncate_size parameters. Since
> > only Octopus will have this fix (no backports planned), the client
> > simply needs to ensure the OSDs being used have SERVER_OCTOPUS in their
> > features.
> >
> > My initial solution was to add an extra test in __submit_request,
> > looping all the request ops and checking if the connection has the
> > required features for that operation. Obviously, at the moment only the
> > copy-from operation has a restriction but I guess others may be added in
> > the future. I believe that doing this at this point (__submit_request)
> > allows to cover cases where a cluster is being upgraded to Octopus and
> > we have different OSDs running with different feature bits.
> >
> > Unfortunately, this solution is racy because the connection state
> > machine may be changing and the peer_features field isn't yet set. For
> > example: if the connection to an OSD is being re-open when we're about
> > to check the features, the con->state will be CON_STATE_PREOPEN and the
> > con->peer_features will be 0. I tried to find ways to move the feature
> > check further down in the stack, but that can't be easily done without
> > adding more infrastructure. A solution that came to my mind was to add
> > a new con->ops, invoked in the context of ceph_con_workfn, under the
> > con->mutex. This callback could then verify the available features,
> > aborting the operation if needed.
> >
> > Note that the race in this patchset doesn't seem to be a huge problem,
> > other than occasionally reverting to a VFS generic copy_file_range, as
> > -EOPNOTSUPP will be returned here. But it's still a race, and there are
> > probably other cases that I'm missing.
> >
> > Anyway, maybe I'm missing an obvious solution for checking these OSD
> > features, but I'm open to any suggestions on other options (or some
> > feedback on the new callback in ceph_connection_operations option).
> >
> > [1] https://tracker.ceph.com/issues/37378
>
> If the OSD checked for unknown flags, like newer syscalls do, it would
> be super easy, but it looks like it doesn't.
>
> An obvious solution is to look at require_osd_release in osdmap, but we
> don't decode that in the kernel because it lives the OSD portion of the
> osdmap. We could add that and consider the fact that the client now
> needs to decode more than just the client portion a design mistake.
> I'm not sure what can of worms does that open and if copy-from alone is
> worth it though. Perhaps that field could be moved to (or a copy of it
> be replicated in) the client portion of the osdmap starting with
> octopus? We seem to be running into it on the client side more and
> more...
I can't say I'm thrilled with the idea of going back to hack into the
OSDs code again, I was hoping to be able to handle this with the
information we already have on the connection peer_features field. It
took me *months* to have the OSD fix merged in so I'm not really
convinced a change to the osdmap would make it into Octopus :-)
(But I'll have a look at this and see if I can understand what moving or
replicating the field in the osdmap would really entail.)
> Given the track record of this feature (the fix for the most recently
> discovered data corrupting bug hasn't even merged yet), I would be very
> hesitant to turn it back on by default even if we figure out a good
> solution for the feature check. In my opinion, it should stay opt-in.
Ok, makes sense.
And thanks a lot for your feedback, Ilya.
Cheers,
--
Lu?s
On Fri, Nov 8, 2019 at 5:48 PM Luis Henriques <[email protected]> wrote:
>
> On Fri, Nov 08, 2019 at 04:15:35PM +0100, Ilya Dryomov wrote:
> > On Fri, Nov 8, 2019 at 3:15 PM Luis Henriques <[email protected]> wrote:
> > >
> > > Hi!
> > >
> > > (Sorry for the long cover letter!)
> >
> > This is exactly what cover letters are for!
> >
> > >
> > > Since the fix for [1] has finally been merged and should be available in
> > > the next (Octopus) ceph release, I'm trying to clean-up my kernel client
> > > patch that tries to find out whether or not it's safe to use the
> > > 'copy-from' RADOS operation for copy_file_range.
> > >
> > > So, the fix for [1] was to modify the 'copy-from' operation to allow
> > > clients to optionally (using the CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ
> > > flag) send the extra truncate_seq and truncate_size parameters. Since
> > > only Octopus will have this fix (no backports planned), the client
> > > simply needs to ensure the OSDs being used have SERVER_OCTOPUS in their
> > > features.
> > >
> > > My initial solution was to add an extra test in __submit_request,
> > > looping all the request ops and checking if the connection has the
> > > required features for that operation. Obviously, at the moment only the
> > > copy-from operation has a restriction but I guess others may be added in
> > > the future. I believe that doing this at this point (__submit_request)
> > > allows to cover cases where a cluster is being upgraded to Octopus and
> > > we have different OSDs running with different feature bits.
> > >
> > > Unfortunately, this solution is racy because the connection state
> > > machine may be changing and the peer_features field isn't yet set. For
> > > example: if the connection to an OSD is being re-open when we're about
> > > to check the features, the con->state will be CON_STATE_PREOPEN and the
> > > con->peer_features will be 0. I tried to find ways to move the feature
> > > check further down in the stack, but that can't be easily done without
> > > adding more infrastructure. A solution that came to my mind was to add
> > > a new con->ops, invoked in the context of ceph_con_workfn, under the
> > > con->mutex. This callback could then verify the available features,
> > > aborting the operation if needed.
> > >
> > > Note that the race in this patchset doesn't seem to be a huge problem,
> > > other than occasionally reverting to a VFS generic copy_file_range, as
> > > -EOPNOTSUPP will be returned here. But it's still a race, and there are
> > > probably other cases that I'm missing.
> > >
> > > Anyway, maybe I'm missing an obvious solution for checking these OSD
> > > features, but I'm open to any suggestions on other options (or some
> > > feedback on the new callback in ceph_connection_operations option).
> > >
> > > [1] https://tracker.ceph.com/issues/37378
> >
> > If the OSD checked for unknown flags, like newer syscalls do, it would
> > be super easy, but it looks like it doesn't.
> >
> > An obvious solution is to look at require_osd_release in osdmap, but we
> > don't decode that in the kernel because it lives the OSD portion of the
> > osdmap. We could add that and consider the fact that the client now
> > needs to decode more than just the client portion a design mistake.
> > I'm not sure what can of worms does that open and if copy-from alone is
> > worth it though. Perhaps that field could be moved to (or a copy of it
> > be replicated in) the client portion of the osdmap starting with
> > octopus? We seem to be running into it on the client side more and
> > more...
>
> I can't say I'm thrilled with the idea of going back to hack into the
> OSDs code again, I was hoping to be able to handle this with the
> information we already have on the connection peer_features field. It
> took me *months* to have the OSD fix merged in so I'm not really
> convinced a change to the osdmap would make it into Octopus :-)
>
> (But I'll have a look at this and see if I can understand what moving or
> replicating the field in the osdmap would really entail.)
Just to be clear: I'm not suggesting that you do it ;) More of an
observation that something that is buried deep in the OSD portion of
the osdmap is being needed increasingly by the clients.
Thanks,
Ilya
On Fri, 8 Nov 2019, Luis Henriques wrote:
> On Fri, Nov 08, 2019 at 04:15:35PM +0100, Ilya Dryomov wrote:
> > If the OSD checked for unknown flags, like newer syscalls do, it would
> > be super easy, but it looks like it doesn't.
> >
> > An obvious solution is to look at require_osd_release in osdmap, but we
> > don't decode that in the kernel because it lives the OSD portion of the
> > osdmap. We could add that and consider the fact that the client now
> > needs to decode more than just the client portion a design mistake.
> > I'm not sure what can of worms does that open and if copy-from alone is
> > worth it though. Perhaps that field could be moved to (or a copy of it
> > be replicated in) the client portion of the osdmap starting with
> > octopus? We seem to be running into it on the client side more and
> > more...
>
> I can't say I'm thrilled with the idea of going back to hack into the
> OSDs code again, I was hoping to be able to handle this with the
> information we already have on the connection peer_features field. It
> took me *months* to have the OSD fix merged in so I'm not really
> convinced a change to the osdmap would make it into Octopus :-)
>
> (But I'll have a look at this and see if I can understand what moving or
> replicating the field in the osdmap would really entail.)
Adding a copy of require_osd_release in the client portion of the map is
an easy thing to do (and probably where it should have gone in the first
place!). Let's do that!
sage
On Fri, Nov 08, 2019 at 04:57:27PM +0000, Sage Weil wrote:
> On Fri, 8 Nov 2019, Luis Henriques wrote:
> > On Fri, Nov 08, 2019 at 04:15:35PM +0100, Ilya Dryomov wrote:
> > > If the OSD checked for unknown flags, like newer syscalls do, it would
> > > be super easy, but it looks like it doesn't.
> > >
> > > An obvious solution is to look at require_osd_release in osdmap, but we
> > > don't decode that in the kernel because it lives the OSD portion of the
> > > osdmap. We could add that and consider the fact that the client now
> > > needs to decode more than just the client portion a design mistake.
> > > I'm not sure what can of worms does that open and if copy-from alone is
> > > worth it though. Perhaps that field could be moved to (or a copy of it
> > > be replicated in) the client portion of the osdmap starting with
> > > octopus? We seem to be running into it on the client side more and
> > > more...
> >
> > I can't say I'm thrilled with the idea of going back to hack into the
> > OSDs code again, I was hoping to be able to handle this with the
> > information we already have on the connection peer_features field. It
> > took me *months* to have the OSD fix merged in so I'm not really
> > convinced a change to the osdmap would make it into Octopus :-)
> >
> > (But I'll have a look at this and see if I can understand what moving or
> > replicating the field in the osdmap would really entail.)
>
> Adding a copy of require_osd_release in the client portion of the map is
> an easy thing to do (and probably where it should have gone in the first
> place!). Let's do that!
Yeah, after sending my reply to Ilya I took a quick look and it _seems_
to be as easy as adding a new encode(require_osd_release...) in the
OSDMap. And handle the versions, of course. Let me spend some time
playing with that and I'll try to come up with something during the next
few days.
Thanks for your feedback.
Cheers,
--
Lu?s
On Fri, 8 Nov 2019, Luis Henriques wrote:
> On Fri, Nov 08, 2019 at 04:57:27PM +0000, Sage Weil wrote:
> > On Fri, 8 Nov 2019, Luis Henriques wrote:
> > > On Fri, Nov 08, 2019 at 04:15:35PM +0100, Ilya Dryomov wrote:
> > > > If the OSD checked for unknown flags, like newer syscalls do, it would
> > > > be super easy, but it looks like it doesn't.
> > > >
> > > > An obvious solution is to look at require_osd_release in osdmap, but we
> > > > don't decode that in the kernel because it lives the OSD portion of the
> > > > osdmap. We could add that and consider the fact that the client now
> > > > needs to decode more than just the client portion a design mistake.
> > > > I'm not sure what can of worms does that open and if copy-from alone is
> > > > worth it though. Perhaps that field could be moved to (or a copy of it
> > > > be replicated in) the client portion of the osdmap starting with
> > > > octopus? We seem to be running into it on the client side more and
> > > > more...
> > >
> > > I can't say I'm thrilled with the idea of going back to hack into the
> > > OSDs code again, I was hoping to be able to handle this with the
> > > information we already have on the connection peer_features field. It
> > > took me *months* to have the OSD fix merged in so I'm not really
> > > convinced a change to the osdmap would make it into Octopus :-)
> > >
> > > (But I'll have a look at this and see if I can understand what moving or
> > > replicating the field in the osdmap would really entail.)
> >
> > Adding a copy of require_osd_release in the client portion of the map is
> > an easy thing to do (and probably where it should have gone in the first
> > place!). Let's do that!
>
> Yeah, after sending my reply to Ilya I took a quick look and it _seems_
> to be as easy as adding a new encode(require_osd_release...) in the
> OSDMap. And handle the versions, of course. Let me spend some time
> playing with that and I'll try to come up with something during the next
> few days.
- You'll need to add it for both OSDMap::Incremental and OSDMap
- You'll need to make the encoding condition by updating the block like
the one below from OSDMap::encode()
uint8_t v = 9;
if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
v = 3;
} else if (!HAVE_FEATURE(features, SERVER_MIMIC)) {
v = 6;
} else if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
v = 7;
}
to include a SERVER_OCTOPUS case too. Same goes for Incremental::encode()
sage
On Fri, Nov 08, 2019 at 05:22:28PM +0000, Sage Weil wrote:
> On Fri, 8 Nov 2019, Luis Henriques wrote:
> > On Fri, Nov 08, 2019 at 04:57:27PM +0000, Sage Weil wrote:
> > > On Fri, 8 Nov 2019, Luis Henriques wrote:
> > > > On Fri, Nov 08, 2019 at 04:15:35PM +0100, Ilya Dryomov wrote:
> > > > > If the OSD checked for unknown flags, like newer syscalls do, it would
> > > > > be super easy, but it looks like it doesn't.
> > > > >
> > > > > An obvious solution is to look at require_osd_release in osdmap, but we
> > > > > don't decode that in the kernel because it lives the OSD portion of the
> > > > > osdmap. We could add that and consider the fact that the client now
> > > > > needs to decode more than just the client portion a design mistake.
> > > > > I'm not sure what can of worms does that open and if copy-from alone is
> > > > > worth it though. Perhaps that field could be moved to (or a copy of it
> > > > > be replicated in) the client portion of the osdmap starting with
> > > > > octopus? We seem to be running into it on the client side more and
> > > > > more...
> > > >
> > > > I can't say I'm thrilled with the idea of going back to hack into the
> > > > OSDs code again, I was hoping to be able to handle this with the
> > > > information we already have on the connection peer_features field. It
> > > > took me *months* to have the OSD fix merged in so I'm not really
> > > > convinced a change to the osdmap would make it into Octopus :-)
> > > >
> > > > (But I'll have a look at this and see if I can understand what moving or
> > > > replicating the field in the osdmap would really entail.)
> > >
> > > Adding a copy of require_osd_release in the client portion of the map is
> > > an easy thing to do (and probably where it should have gone in the first
> > > place!). Let's do that!
> >
> > Yeah, after sending my reply to Ilya I took a quick look and it _seems_
> > to be as easy as adding a new encode(require_osd_release...) in the
> > OSDMap. And handle the versions, of course. Let me spend some time
> > playing with that and I'll try to come up with something during the next
> > few days.
>
> - You'll need to add it for both OSDMap::Incremental and OSDMap
> - You'll need to make the encoding condition by updating the block like
> the one below from OSDMap::encode()
>
> uint8_t v = 9;
> if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
> v = 3;
> } else if (!HAVE_FEATURE(features, SERVER_MIMIC)) {
> v = 6;
> } else if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
> v = 7;
> }
>
> to include a SERVER_OCTOPUS case too. Same goes for Incremental::encode()
Awesome, thanks! I'll give it a try, and test it with the appropriate
kernel client side changes to use this.
Cheers,
--
Lu?s
On Fri, Nov 08, 2019 at 05:31:01PM +0000, Luis Henriques wrote:
<snip>
> > - You'll need to add it for both OSDMap::Incremental and OSDMap
> > - You'll need to make the encoding condition by updating the block like
> > the one below from OSDMap::encode()
> >
> > uint8_t v = 9;
> > if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
> > v = 3;
> > } else if (!HAVE_FEATURE(features, SERVER_MIMIC)) {
> > v = 6;
> > } else if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
> > v = 7;
> > }
> >
> > to include a SERVER_OCTOPUS case too. Same goes for Incremental::encode()
>
> Awesome, thanks! I'll give it a try, and test it with the appropriate
> kernel client side changes to use this.
Ok, I've got the patch bellow for the OSD code, which IIRC should do
exactly what we want: duplicate the require_osd_release in the client
side.
Now, in order to quickly test this I've started adding flags to the
CEPH_FEATURES_SUPPORTED_DEFAULT definition. SERVER_MIMIC *seemed* to be
Ok, but once I've added SERVER_NAUTILUS I've realized that we'll need to
handle TYPE_MSGR2 address. Which is a _big_ thing. Is anyone already
looking into adding support for msgr v2 to the kernel client?
Cheers,
--
Lu?s
diff --git a/src/osd/OSDMap.cc b/src/osd/OSDMap.cc
index 6b5930743a33..b38d91d98fcf 100644
--- a/src/osd/OSDMap.cc
+++ b/src/osd/OSDMap.cc
@@ -555,13 +555,15 @@ void OSDMap::Incremental::encode(ceph::buffer::list& bl, uint64_t features) cons
ENCODE_START(8, 7, bl);
{
- uint8_t v = 8;
+ uint8_t v = 9;
if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
v = 3;
} else if (!HAVE_FEATURE(features, SERVER_MIMIC)) {
v = 5;
} else if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
v = 6;
+ } else if (!HAVE_FEATURE(features, SERVER_OCTOPUS)) {
+ v = 8;
}
ENCODE_START(v, 1, bl); // client-usable data
encode(fsid, bl);
@@ -611,6 +613,9 @@ void OSDMap::Incremental::encode(ceph::buffer::list& bl, uint64_t features) cons
encode(new_last_up_change, bl);
encode(new_last_in_change, bl);
}
+ if (v >= 9) {
+ encode(new_require_osd_release, bl);
+ }
ENCODE_FINISH(bl); // client-usable data
}
@@ -816,7 +821,7 @@ void OSDMap::Incremental::decode(ceph::buffer::list::const_iterator& bl)
return;
}
{
- DECODE_START(8, bl); // client-usable data
+ DECODE_START(9, bl); // client-usable data
decode(fsid, bl);
decode(epoch, bl);
decode(modified, bl);
@@ -2847,13 +2852,15 @@ void OSDMap::encode(ceph::buffer::list& bl, uint64_t features) const
{
// NOTE: any new encoding dependencies must be reflected by
// SIGNIFICANT_FEATURES
- uint8_t v = 9;
+ uint8_t v = 10;
if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
v = 3;
} else if (!HAVE_FEATURE(features, SERVER_MIMIC)) {
v = 6;
} else if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
v = 7;
+ } else if (!HAVE_FEATURE(features, SERVER_OCTOPUS)) {
+ v = 9;
}
ENCODE_START(v, 1, bl); // client-usable data
// base
@@ -2929,6 +2936,9 @@ void OSDMap::encode(ceph::buffer::list& bl, uint64_t features) const
encode(last_up_change, bl);
encode(last_in_change, bl);
}
+ if (v >= 10) {
+ encode(require_osd_release, bl);
+ }
ENCODE_FINISH(bl); // client-usable data
}
@@ -3170,7 +3180,7 @@ void OSDMap::decode(ceph::buffer::list::const_iterator& bl)
* Since we made it past that hurdle, we can use our normal paths.
*/
{
- DECODE_START(9, bl); // client-usable data
+ DECODE_START(10, bl); // client-usable data
// base
decode(fsid, bl);
decode(epoch, bl);
On Mon, Nov 11, 2019 at 5:30 PM Luis Henriques <[email protected]> wrote:
>
> On Fri, Nov 08, 2019 at 05:31:01PM +0000, Luis Henriques wrote:
> <snip>
> > > - You'll need to add it for both OSDMap::Incremental and OSDMap
> > > - You'll need to make the encoding condition by updating the block like
> > > the one below from OSDMap::encode()
> > >
> > > uint8_t v = 9;
> > > if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
> > > v = 3;
> > > } else if (!HAVE_FEATURE(features, SERVER_MIMIC)) {
> > > v = 6;
> > > } else if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
> > > v = 7;
> > > }
> > >
> > > to include a SERVER_OCTOPUS case too. Same goes for Incremental::encode()
> >
> > Awesome, thanks! I'll give it a try, and test it with the appropriate
> > kernel client side changes to use this.
>
> Ok, I've got the patch bellow for the OSD code, which IIRC should do
> exactly what we want: duplicate the require_osd_release in the client
> side.
>
> Now, in order to quickly test this I've started adding flags to the
> CEPH_FEATURES_SUPPORTED_DEFAULT definition. SERVER_MIMIC *seemed* to be
> Ok, but once I've added SERVER_NAUTILUS I've realized that we'll need to
> handle TYPE_MSGR2 address. Which is a _big_ thing. Is anyone already
> looking into adding support for msgr v2 to the kernel client?
It should be easy enough to hack around it for testing purposes.
I made some initial steps and hope to be able to dedicate the 5.6 cycle
to it.
Thanks,
Ilya
On Mon, Nov 11, 2019 at 09:51:47PM +0100, Ilya Dryomov wrote:
> On Mon, Nov 11, 2019 at 5:30 PM Luis Henriques <[email protected]> wrote:
> >
> > On Fri, Nov 08, 2019 at 05:31:01PM +0000, Luis Henriques wrote:
> > <snip>
> > > > - You'll need to add it for both OSDMap::Incremental and OSDMap
> > > > - You'll need to make the encoding condition by updating the block like
> > > > the one below from OSDMap::encode()
> > > >
> > > > uint8_t v = 9;
> > > > if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
> > > > v = 3;
> > > > } else if (!HAVE_FEATURE(features, SERVER_MIMIC)) {
> > > > v = 6;
> > > > } else if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
> > > > v = 7;
> > > > }
> > > >
> > > > to include a SERVER_OCTOPUS case too. Same goes for Incremental::encode()
> > >
> > > Awesome, thanks! I'll give it a try, and test it with the appropriate
> > > kernel client side changes to use this.
> >
> > Ok, I've got the patch bellow for the OSD code, which IIRC should do
> > exactly what we want: duplicate the require_osd_release in the client
> > side.
> >
> > Now, in order to quickly test this I've started adding flags to the
> > CEPH_FEATURES_SUPPORTED_DEFAULT definition. SERVER_MIMIC *seemed* to be
> > Ok, but once I've added SERVER_NAUTILUS I've realized that we'll need to
> > handle TYPE_MSGR2 address. Which is a _big_ thing. Is anyone already
> > looking into adding support for msgr v2 to the kernel client?
>
> It should be easy enough to hack around it for testing purposes.
>
> I made some initial steps and hope to be able to dedicate the 5.6 cycle
> to it.
Yeah, I'll give that a try; adding support for that new address type
shouldn't be a big deal. I was just wondering if that wasn't already
being handling by any new msgrv2 code under development. Thanks, Ilya.
Cheers,
--
Lu?s