Received: by 2002:a05:6358:9144:b0:117:f937:c515 with SMTP id r4csp10741277rwr; Fri, 12 May 2023 12:15:03 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ5XP9OpVBcUIDjzYQmfRndfaHiu348AJmfsK8FQRF1WM0/frYYMqKxVg8lxZ0Je/CULv+u6 X-Received: by 2002:a05:6a00:cc8:b0:644:d220:64ac with SMTP id b8-20020a056a000cc800b00644d22064acmr30944695pfv.2.1683918903561; Fri, 12 May 2023 12:15:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1683918903; cv=none; d=google.com; s=arc-20160816; b=jSeUNhlTxXSfdDbpnvlmTBprZJhwnsfUIEXWXhJOSlEqGVzjRZFPwitZD49B/+uoba u0VHQSl50e54gz4r/meVJNshgbuYV87Lh6xbZduyegeydyRiqMrRPAXlhtgdGiO/p2VI 8fKWgS6aSbh6rxaQ5uvbEeYKPzMJBT5vOfG+iWoBOolGFojvmqy9vDHax4r9Brya0+T5 ++Mg6ILGJ8T6Pl9flC/ZDtZRS52Bh9I/Odu0bH7MKV/d5937PtVwjubQvD6OyzpCTb7L 2rYWJIr8DE/hBm/46ONr0qNX/ig+TfANHnLqbBS+CTo417rKJ4BWg55l4WQSdI2p585Z mpyA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:dkim-signature:date; bh=T7EDRGNNQxDhy7JG1X1Rb19Q2/wH1f0oPatD+5ixTQ0=; b=LFny6nhoinLLvRl/FsbfX9sw6VSyAhasSC+5dcFf6IRF7fSbwheFRbMfLDWE+O8K1O NqjXzpRxXWCFxWcNOV5UpgNQgXZM6I8csnWkQGFM5pZ0VktngSGnTyOnFOFe5fc/sHks dLEm13baUNxKCzqFaJOZP/UGnatbraQ+F8QtOtZt8JIiVi5gGg3WIzemt4MBE3jzMTQQ QV1H/XMjSuAVapgCiXqFmCgjYU0s+QvkTQGu3d/y8LS7uo97ujuLVDh+1+2+WA4gWF83 1B1gh4uHWIP0w6YOYscewChvQxxHsxb9JZk6GLKwFfNJuuB4LA1bpmvzOcTYJGrtNIwj 58Aw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=xCCG65Bw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x6-20020aa79a46000000b006478fe28452si8528211pfj.27.2023.05.12.12.14.47; Fri, 12 May 2023 12:15:03 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=xCCG65Bw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238654AbjELSga (ORCPT + 99 others); Fri, 12 May 2023 14:36:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52006 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238397AbjELSg0 (ORCPT ); Fri, 12 May 2023 14:36:26 -0400 Received: from out-58.mta0.migadu.com (out-58.mta0.migadu.com [91.218.175.58]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2DF5F30C4 for ; Fri, 12 May 2023 11:36:20 -0700 (PDT) Date: Fri, 12 May 2023 14:36:13 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1683916578; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=T7EDRGNNQxDhy7JG1X1Rb19Q2/wH1f0oPatD+5ixTQ0=; b=xCCG65Bw0cexF9Nd3/QXcOc4QPm5PIm/tJf9tNZJ/su0MlbxUQbD+p6omf5kA3Co7G8nL/ iKkpCUP16KIF+s/7KGW5iG29X/AMwQXWGMz1fiwMlg/iC9CpcDYGMn9MxqtwgqNuGFB1Fk YJyZ5UQbufDhClfNgTlxNnft73FPFzM= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Kent Overstreet To: Eric Biggers Cc: Lorenzo Stoakes , Christoph Hellwig , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-bcachefs@vger.kernel.org, Kent Overstreet , Andrew Morton , Uladzislau Rezki , linux-mm@kvack.org Subject: Re: [PATCH 07/32] mm: Bring back vmalloc_exec Message-ID: References: <20230509165657.1735798-1-kent.overstreet@linux.dev> <20230509165657.1735798-8-kent.overstreet@linux.dev> <20230510064849.GC1851@quark.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20230510064849.GC1851@quark.localdomain> X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, May 09, 2023 at 11:48:49PM -0700, Eric Biggers wrote: > What seems to be missing is any explanation for what we're actually getting from > this extremely unusual solution that cannot be gained any other way. What is > unique about bcachefs that it really needs something like this? Ok, as promised: Background: all metadata in bcachefs is a structured as key/value pairs, and there's a common key format for all keys. struct bkey { /* 3 byte header */ u8 u64s; /* size of k/v in u64s */ u8 format; /* packed/unpacked, needs_whiteout */ u8 type; /* value type */ u8 pad; /* * Order of fields below is for little endian, they're in * reverse order on big endian (and byte swabbed as necessary * when reading foreign endian metadata) * * Since field order matches byte order, the key can be treated * as one large multi word integer for doing comparisons: */ u96 version; /* nonces, send/recv support */ u32 size; /* size of extent keys */ /* Below are the field used for ordering/comparison: */ u32 snapshot; u64 offset; u64 inode; /* Value is stored inline with key */ struct bch_val v; }; sizeof(struct bkey) == 40. An extent value that has one pointer and no checksum is 8 bytes, with one pointer and one 32 bit checksum 16 bytes, for 56 bytes total (key included). But for a given btree node, most of the key fields will typically be redundandant. An extents leaf node might have extents for all one inode number or a small range of inode numbers, snapshots may or may not be in use, etc. - clearly some compression is desirable here. The key observation is that key compression is possible if we have a compression function that preserves order, and an order-preserving compression function is possible if it's allowed to fail. That means we do comparisons on packed keys, which lets us skip _most_ unpack operations, for btree node resorts and for lookups within a node. Packing works by defining a format with an offset and a bit width for each field, so e.g. if all keys in a btree node have the same inode number the packed format can specify that inode number and then a field width of 0 bits for the inode field. Continuing the arithmetic from before, a packed extent key will typically be only 8 or 16 bytes, or 24-32 including the val, which means bkey packing cuts our metadata size roughly in half. (It also makes our key format somewhat self describing and gives us a mechanism by which we could add or extend fields in the future). ----------------------------------------------------- As mentioned before, since packed bkeys are still multi-word integers we can do some important operations without unpacking, but to iterate over keys, compare packed & unpacked keys in resort, etc. - we'll still need to unpack, so we need this operation to be as fast as possible. bkey.c __bch2_bkey_unpack_key() is the unspecialized version written in C, that works on any archictecture. It loops over the fields in a bkey_format, pulling them out of the input words and adding back the field offsets. It's got the absolute minimum number of branches - one per field, when deciding to advance to the next input word - but it can't be branchless and it's a whole ton of shifts and bitops. dynamic codegen lets us produce unpack functions that are fully branchless and _much_ smaller. For any given btree node we'll have a format where multiple fields have 0 field with - i.e. those fields are always constants. That code goes away, and also if the format can be byte aligned we can eliminate shifts and bitopts. Code size for the dynamically compiled unpack functions is roughly 10% that of the unspecialized C version. I hope that addresses some of the "what is this even for" questions :) Cheers, Kent