Received: by 2002:a25:31c3:0:0:0:0:0 with SMTP id x186csp2640359ybx; Fri, 8 Nov 2019 07:16:36 -0800 (PST) X-Google-Smtp-Source: APXvYqzyJnWmC3hIXq6onCHJ0k80YviPJnKebPsyGbDe9Wnb6o38A/YMM1GU5O49fGOCl6TmUMYB X-Received: by 2002:a50:ab10:: with SMTP id s16mr10673786edc.118.1573226196708; Fri, 08 Nov 2019 07:16:36 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1573226196; cv=none; d=google.com; s=arc-20160816; b=uPPUsCEtix8hGQxpdWkiZrbD39cm4QQ7cFvzgLe4Skdt4qsjkOH6Qo9cwaE9c9WXwX rWoLYIcWGvducJ03Ca8/4AmQ+Qd0cNRtoLmJdBIlIOg76umoWVMQod8J/TWu7V1Jxc1f Qu8KA4+Kqs17D1K2unaYGrKxDkBzYmMqdZif5LPXUsrYfoyaOxgcKX6SFJhOr8qtOCNS RYB52kdsLt9jWGss8BqJ9Z2dKByKiF+YvD1C8QJG3Ox6F1WO4GTRMF8rR7QrRbRVRy3l WJ75TyWSOIUHqH0tsA4owMVopL7YG4r9lmC4RLmQ1D1vfeBYF5LjwYNZEuZEe/B6G/Kt P0RQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=nOrCgpvdg/KvlO/h5Wd7G5wPvTtcd3OWqo9gJ0rgrlo=; b=m1sXpodntKdHF+QZQ3SLbtNtSIjcFcNG8AAmzSGVcUraZHGpIYdssH/DJR4gq4tTc9 xiptlw/uFo3MeSRTAMZeHEtmUVtv4AxOhygbvA7xHf67nYsO4lHesUdsb3Az+e1BA01c ZET1nlmjODpMhNswr5CWWmwuwC8PRrFID29HKZP2ZWia+dS5+fuyUtmrQL3gQUII4DEH 960B3DpxjsojRkAGAt5RVMQskJnjFAeRgFdc9gbeRq+SE/KYEx3Xxy+eAeEbkFDYLvhi De4wk8dCTTYNF04SeHYLwpNknbCvGjN3bVn67hgwhKfWRej1ZUkdGDVja0iY2Xe0F65C Dg7g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=dWD4v22O; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id o19si366282edq.238.2019.11.08.07.16.13; Fri, 08 Nov 2019 07:16:36 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=dWD4v22O; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726743AbfKHPPP (ORCPT + 99 others); Fri, 8 Nov 2019 10:15:15 -0500 Received: from mail-io1-f68.google.com ([209.85.166.68]:42007 "EHLO mail-io1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726005AbfKHPPP (ORCPT ); Fri, 8 Nov 2019 10:15:15 -0500 Received: by mail-io1-f68.google.com with SMTP id g15so6719211iob.9; Fri, 08 Nov 2019 07:15:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=nOrCgpvdg/KvlO/h5Wd7G5wPvTtcd3OWqo9gJ0rgrlo=; b=dWD4v22OkhbbvMSwYUZ95Ok4srdhWnRsopBViIewC4JmRXRvOygT1/st1DPNL/km0a 4Qh7mw7c1/p9vTWr9caM7ju13HrbpcIoVQkwMyAfL/4369W722zY8lGpL1B2QXTT+3LT sX2CkMi9IH0pqzWqbL0+WiNyPQ761G4VtIBuoUH2dTEuNW4wLXkFIgT2/V133bdOW/GG 7KjOO3LrGgjISU35uVmmUYQnZbns+V0hH02lTWY2b8mGGB6CppsZYGHc9T3cfiK/22nM jVEit7LL33l8rThG0som+WHYLoAUvrS43aRSEjI+ZC4iSeSYdldAiswtdzA/I19jFv0s 6lwQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=nOrCgpvdg/KvlO/h5Wd7G5wPvTtcd3OWqo9gJ0rgrlo=; b=axB++Qh90eLd+x98FKnTvDaA8RZvZuGPIzyeXBcjEq7YNlOaAq9xThOSMY1M3MfJaY HdWmHVuCxsg8hg6rAxRCmxQu0ET1tNVV8jmK9wVWAbBA5jUmWhvRErVKSxS8F34tP8Hj vtqdHV4f6iF4N6NCIcMUPK07NHCJfQ7WPskzFefM2AYvl/jOw+InsuTNMVoX5DWVCE69 1+iWjpVDgRfEkSgbES8iWuiyHrrqECCrk9QkorUHt87YobpNwLIZfIygJNgLbcBp2+Y2 5Z29kLjbSiBKZfqKj8vW2RXu7WsWSJix0CmJ2q9udgFYeIkWujNPLl+9abMfcX8gjOHR wWlQ== X-Gm-Message-State: APjAAAWHJZVQ86cEgh4TLexiXY++AHoGlU2Siz3qzRyrfElFouyp3cSi DRXqHFNU482D38XM5B4bVHx9Ce2ATu0T4i/nRj4= X-Received: by 2002:a5d:94d8:: with SMTP id y24mr11174566ior.131.1573226112847; Fri, 08 Nov 2019 07:15:12 -0800 (PST) MIME-Version: 1.0 References: <20191108141555.31176-1-lhenriques@suse.com> In-Reply-To: <20191108141555.31176-1-lhenriques@suse.com> From: Ilya Dryomov Date: Fri, 8 Nov 2019 16:15:35 +0100 Message-ID: Subject: Re: [RFC PATCH 0/2] ceph: safely use 'copy-from' Op on Octopus OSDs To: Luis Henriques Cc: Jeff Layton , Sage Weil , "Yan, Zheng" , Ceph Development , LKML Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 8, 2019 at 3:15 PM Luis Henriques wrote: > > Hi! > > (Sorry for the long cover letter!) This is exactly what cover letters are for! > > Since the fix for [1] has finally been merged and should be available in > the next (Octopus) ceph release, I'm trying to clean-up my kernel client > patch that tries to find out whether or not it's safe to use the > 'copy-from' RADOS operation for copy_file_range. > > So, the fix for [1] was to modify the 'copy-from' operation to allow > clients to optionally (using the CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ > flag) send the extra truncate_seq and truncate_size parameters. Since > only Octopus will have this fix (no backports planned), the client > simply needs to ensure the OSDs being used have SERVER_OCTOPUS in their > features. > > My initial solution was to add an extra test in __submit_request, > looping all the request ops and checking if the connection has the > required features for that operation. Obviously, at the moment only the > copy-from operation has a restriction but I guess others may be added in > the future. I believe that doing this at this point (__submit_request) > allows to cover cases where a cluster is being upgraded to Octopus and > we have different OSDs running with different feature bits. > > Unfortunately, this solution is racy because the connection state > machine may be changing and the peer_features field isn't yet set. For > example: if the connection to an OSD is being re-open when we're about > to check the features, the con->state will be CON_STATE_PREOPEN and the > con->peer_features will be 0. I tried to find ways to move the feature > check further down in the stack, but that can't be easily done without > adding more infrastructure. A solution that came to my mind was to add > a new con->ops, invoked in the context of ceph_con_workfn, under the > con->mutex. This callback could then verify the available features, > aborting the operation if needed. > > Note that the race in this patchset doesn't seem to be a huge problem, > other than occasionally reverting to a VFS generic copy_file_range, as > -EOPNOTSUPP will be returned here. But it's still a race, and there are > probably other cases that I'm missing. > > Anyway, maybe I'm missing an obvious solution for checking these OSD > features, but I'm open to any suggestions on other options (or some > feedback on the new callback in ceph_connection_operations option). > > [1] https://tracker.ceph.com/issues/37378 If the OSD checked for unknown flags, like newer syscalls do, it would be super easy, but it looks like it doesn't. An obvious solution is to look at require_osd_release in osdmap, but we don't decode that in the kernel because it lives the OSD portion of the osdmap. We could add that and consider the fact that the client now needs to decode more than just the client portion a design mistake. I'm not sure what can of worms does that open and if copy-from alone is worth it though. Perhaps that field could be moved to (or a copy of it be replicated in) the client portion of the osdmap starting with octopus? We seem to be running into it on the client side more and more... Given the track record of this feature (the fix for the most recently discovered data corrupting bug hasn't even merged yet), I would be very hesitant to turn it back on by default even if we figure out a good solution for the feature check. In my opinion, it should stay opt-in. Thanks, Ilya