Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp1621781rwb; Wed, 14 Dec 2022 12:36:01 -0800 (PST) X-Google-Smtp-Source: AA0mqf5Lm3njSwsD/+pcT+1TjwpoMk6u7hGxopGTGFIPL+xy5O3OzH4+l0oskBQnMzvgZ/De2ZiM X-Received: by 2002:a05:6a00:e86:b0:575:6a68:65f8 with SMTP id bo6-20020a056a000e8600b005756a6865f8mr23286097pfb.18.1671050161510; Wed, 14 Dec 2022 12:36:01 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671050161; cv=none; d=google.com; s=arc-20160816; b=f8CZXj98OFnDV+uJR/1qH00h7zzRLEswg57zBpsgWJkc5O+8fi0eciK9Xud+8AAi1o IXsEEgm4P8EQfu7Pe4H6EqzFyo38/9fvtJYENfxpLQmtScm+ZwijZsndIzwdwv/9PMqL 4A1zRBimN8JfJ9Djz2ZON7zZJ+YW3Xi5bH2RlYT+zkYrdENe9YEacSnLV6eeeA1Hnhr1 Omxtjd8lW29mIGNUJI8FYfF1E0vD6OfAIn8Fid+fnR1bbp4eKreISQBw64IO2HfI/iwl xnZ9URUnSo7/TsgHcGyD4K+H3AePhQNbEOw9nA4rTVM3QGL/ijq1S9AGZKRP3xa395zC fGQQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date; bh=kwrhXtgMYXeiwjp3se5QQZKhUTAzfpkzfSr2pEUJmfg=; b=ypmpKN5j1R9/mZU2QQFaVYXZZSzER+K+9wL0OKT8ir3gZxahExVn+E6fFmbwy+tmlI 5E8IC16wg26TGDDuJhDmXWyLghbIb9DH6zcT+djN7tCzCELp11i+vLJe1CwtRwtQh3mm eqOmpjcAfHGE/ZGvNHZHRxhcPwpwsrBWMA84pJPmNdEPa7+2xGADe/j7HPTjYyO2yX5V JVWqusecFsbIARq6RhHlekeT05D0vCrmefJVzMkoIYVVYvuaimWaEEFVQNqTcUqJ2Obx oLGcqP5LXTO2AZ7yirjEsqIFHc6DsZO889Hwzjsq/kGjl4NNGflMwbnxTUf9V0GbR+/v 8jYw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id z7-20020a056a00240700b005698856bb2bsi715246pfh.330.2022.12.14.12.35.52; Wed, 14 Dec 2022 12:36:01 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229722AbiLNUKd (ORCPT + 69 others); Wed, 14 Dec 2022 15:10:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46266 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229975AbiLNUJx (ORCPT ); Wed, 14 Dec 2022 15:09:53 -0500 Received: from fudo.makrotopia.org (fudo.makrotopia.org [IPv6:2a07:2ec0:3002::71]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BE1BC2C13D; Wed, 14 Dec 2022 12:02:12 -0800 (PST) Received: from local by fudo.makrotopia.org with esmtpsa (TLS1.3:TLS_AES_256_GCM_SHA384:256) (Exim 4.94.2) (envelope-from ) id 1p5Xwt-0001md-V8; Wed, 14 Dec 2022 21:01:52 +0100 Date: Wed, 14 Dec 2022 20:01:48 +0000 From: Daniel Golle To: Christoph Hellwig Cc: Richard Weinberger , Matthew Wilcox , Jens Axboe , "Martin K. Petersen" , Chaitanya Kulkarni , Wolfram Sang , linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH 1/4] init: move block device helpers from init/do_mounts.c Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,SPF_HELO_NONE, SPF_PASS,T_PDS_OTHER_BAD_TLD autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 14, 2022 at 08:43:29AM -0800, Christoph Hellwig wrote: > On Tue, Dec 13, 2022 at 12:45:27PM +0000, Daniel Golle wrote: > > > Yes, but a completely non-standard format that nests inside an > > > partition. > > > > The reason for this current discussion (see subject line) is exactly > > that you didn't like the newly introduced partition type GUID which > > then calls the newly introduced partition parser taking care of the > > uImage.FIT content of a partition. > > Which is the exact nesting I'm complaining about. Why do you need > to use your format inside a GPT partition table? The GPT partition table is typically written only once to an eMMC- based device in factory. Firmware images (typically uImage.FIT) are stored in partitions because there are sometimes two of them (for A/B dual-boot, or recovery/production dual-boot). As a working device firmware consists of kernel, dtb and rootfs, all these three things have to be written and used together, typically they also come together in one file for firmware upgrade (ie. rootfs appended to kernel, tarballs, or uImage.FIT containing all three of them). As the size of kernel and rootfs cannot be determined accurately at the time the device is made, having individual GPT partitions for kernel and rootfs ends up to either being a limit to future groth of the kernel image or wastes space by overestimating the kernel size. Changing the GPT partitioning when updating the device to match the exact sizes is also not an option as a damage to the GPT would then present a single point of failure (backup GPT also wouldn't help much here), so for dual-boot to actually be meaningful, we shouldn't ever write to any parts of the disk/flash which affect more than one of the dual-boot options. > What you're doing is bascially nesting a partition table format > inside another one, which doesn't make any sense at all. See the last paragraph of this message for good reasons why one would want to do exactly that. > > > This block driver (if built-into the kernel and relied upon to expose > > the block device used as root filesystem) will need to identify the > > lower device it should work on. And for that the helper functions such > > as devt_from_devname() need to be available for that driver. > > And devt_from_devname must not be used by more non-init code. It is > bad it got exposed at all, but new users are not acceptable. I assume that implementing anything similar using blk_lookup_devt in the driver itself is then also not acceptable, right? Yet another option would be to implement a way to acquire this information from device tree. Ie. have a reference to the disk device as well as an unsigned integer in the 'chosen' node which the bootloader can use to communicate this to the kernel. Example: chosen { bootdev = <&mmc0 3>; }; It's a bit more tricky for ubiblock or mtdblock devices because they don't have *any* representation in device tree at all at this point. In case of an MTD partition (for mtdblockX) we would just reference the mtd partition in the same way. To do this cleanly with NAND/UBI, I'd start with adding device-tree-based attaching of an MTD partition to UBI using a device-tree attribute 'compatible = "linux,ubi"' on the MTD partition. We could then have sub-nodes referencing specific UBI volumes, to select them for use with ubiblock but also for those nodes then being a valid reference for use with the to-be-introduced 'bootdev' attribute in 'chosen'. Does that sound acceptable from your perspective? > > > A block representation is the common denominator of all the > > above. Sure, I could implement splitting MTD devices according to > > uImage.FIT and then add MTD support to squashfs. Then implement > > splitting of UBI volumes and add UBI support to squashfs. > > Implementing MTD and/or UBI support would allow you to build a > kernel without CONFIG_BLOCK, which will save you a lot more than > the 64k you were whining about above. Even devices with NOR flash may still want support for removable block devices like USB pendrives or SD cards... Many home-routers got only 8MiB of NOR flash and yet come with USB 2.0 ports intended for a pendrive which is then shared via Samba. Also, heavily customzied per-device kernel builds would never scale up to support thousands of devices -- hence OpenWrt uses the exact same kernel build for many devices, which makes both, the build process and also debugging kernel bugs, much much easier (or even doable at all). > > > > None of this explains the silly nesting inside the GPT partition. > > > It is not needed for the any use cases and the root probem here. > > > > So where would you store the uImage (which will have to exist > > even to just load kernel and DTB in U-Boot, even without containing > > the root filesystem) on devices with eMMC then? > > Straight on the block device, where else? As the first few blocks are typically used for bootloader code and bootloader environment, we would then need to hard-code the offset(s) of the uImage.FIT on the block device. Imho this becomes messy and just using partitions seemed like a straight forward solution. And what about dual-boot systems where you have more than one firmware image? Hard-code more offsets? For each device? In a way, I was considering this by using blkdevparts= cmdline option instead of GPT, but > > > Are you suggesting to come up with an entirely new type of partition > > table only for that purpose? Which will require its own tools and > > implementation in both, U-Boot and Linux? What would be the benefit > > over just using GPT partitioning? > > Why do you need another layer of partitioning instead of storing > all your information either in the uImage, or in some other > partition format of your choice? The reason is the different life-cycle of the device main partition table, bootloader, bootloader environment, ... on one hand and each firmware image on a dual boot system on the other hand. Hence there is more than just one uImage: typically bootloader, bootloader environment, firmware A (uImage.FIT) and firmware B. Relace "A" and "B" with "recovery" and "production", depending on the dual-boot style implemented. Therefore re-writing the whole disk during firmware upgrades is not an option because it is risky, eg. in case of a sudden power failure we could end up with a hard-bricked system. So to me it makes sense that for a firmware upgrade, we write only to one partition and don't touch GPT or anything else on the device. So in case something goes wrong, the device will still boot, the bootloader will realize that the uImage.FIT in one partition is broken (uImage.FIT also comes with hashes to ensure image integrity) and it will load something else (from another partition) instead.