Message-ID: <218bd8f1-d382-4024-a90f-59b5fef5184a@efficios.com>
Date: Wed, 20 Mar 2024 12:26:47 -0400
X-Mailing-List: linux-kernel@vger.kernel.org
From: Mathieu Desnoyers
To: "carlos@redhat.com", DJ Delorie, Florian Weimer
Cc: Olivier Dion, Michael Jeanson, libc-alpha, paulmck, Peter Zijlstra,
 Boqun Feng, linux-kernel, Linus Torvalds, Dennis Zhou, Tejun Heo,
 Christoph Lameter, linux-mm
Subject: [RFC] A new per-cpu memory allocator for userspace in librseq

Hi!

When looking at what is missing to make librseq a generally usable
project to support per-cpu data structures in user-space, I noticed
that what we miss is a per-cpu memory allocator conceptually similar
to what the Linux kernel internally provides [1].

The per-CPU memory allocator is analogous to TLS (Thread-Local
Storage) memory: TLS provides thread-local storage, whereas the
per-CPU memory allocator provides CPU-local storage.

My goal is to improve locality and remove the need to waste precious
cache lines with padding when indexing per-cpu data as an array of
items.

So we decided to go ahead and implement a per-cpu allocator for
userspace in the librseq project [2,3] with the following
characteristics:

* Allocations are performed in memory pools (mempool). Allocation
  sizes are fixed powers of 2, configured at pool creation.

* Memory pools can be added to a pool set to allow allocation of
  variable size records.

* Allocating "items" from a memory pool allocates memory for all
  CPUs.

* The "stride" to index per-cpu data is user-configurable.
  Indexing per-cpu data from an allocated pointer is as simple as:

    (uintptr_t) ptr + (cpu * stride)

  where the multiplication is actually a shift because stride is a
  power of 2 constant.

* Pools consist of a linked list of "ranges" (a stride worth of item
  allocation), thus making the pool extensible when running out of
  space, up to a user-configurable limit.

* Freeing a pointer only requires the pointer to free as input (and
  the pool stride constant). Finding the range and pool associated
  with the pointer is done by applying a mask to the pointer. The
  memory mappings of the ranges are aligned to make this mask find
  the range base, and thus allow accessing the range structure placed
  in a header page immediately before.

One interesting problem we faced is what should be done to prevent
wasting memory due to allocation of useless pages in a system where
there are lots of configured CPUs, but very few are actually used by
the application due to a combination of cpu affinity, cpusets, and
cpu hotplug. Minimizing the amount of page allocation while offering
the ability to allocate zeroed (or pre-initialized) items is the crux
of this issue.

We thus came up with two approaches based on copy-on-write (COW) to
tackle this, which we call the "pool populate policy":

* RSEQ_MEMPOOL_POPULATE_COW_INIT (default):

  Rely on copy-on-write (COW) of per-cpu pages to populate per-cpu
  pages from the initial values pages on first write.

  The COW_INIT approach maps an extra "initial values" stride with
  each pool range as MAP_SHARED from a memfd. All per-cpu strides map
  these initial values as MAP_PRIVATE, so the first write access from
  an active CPU will trigger a COW page allocation.

  The downside of this scheme is that its use of MAP_SHARED is not
  compatible with using the pool from child processes after fork,
  and its use of COW is not compatible with shared memory use-cases.

* RSEQ_MEMPOOL_POPULATE_COW_ZERO:

  Rely on copy-on-write (COW) of per-cpu pages to populate per-cpu
  pages from the zero page on first write. As long as the user only
  uses malloc, zmalloc, or malloc_init with zeroed content to
  allocate items, it does not trigger COW of all per-cpu pages,
  leaving in place the zero page until an active CPU writes to its
  per-cpu item.

  The COW_ZERO approach maps the per-cpu strides as private anonymous
  memory, and therefore only triggers COW page allocation when a CPU
  writes over those zero pages.

  As a downside, this scheme will trigger COW page allocation for all
  possible CPUs when using malloc_init() to populate non-zeroed
  initial values for an item. Its upsides are that this scheme can be
  used across fork and eventually can be used over shared memory.

Another noteworthy feature is that this mempool allocator can also be
used as a global allocator. It has an optional "robust" attribute
which enables checks for memory corruption and double-free.

Users with more custom use-cases can register an "init" callback to
be called after each new range/cpu is allocated.

Feedback is welcome!

Thanks,

Mathieu

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/percpu.h
[2] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/include/rseq/mempool.h
[3] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/src/rseq-mempool.c

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
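
Illustration: the per-cpu indexing and the mask-based range lookup
described above can be sketched in a few lines of C. This is only a
minimal sketch, not the librseq API: every name here (the EXAMPLE_*
macros and example_* functions) is hypothetical, the 64 KiB stride is
an arbitrary choice, and the sketch assumes that each range mapping is
aligned on the stride and that the allocated pointer refers to the
CPU 0 copy of the item.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stride: 64 KiB per CPU. It must be a power of 2. */
#define EXAMPLE_STRIDE_SHIFT 16
#define EXAMPLE_STRIDE ((uintptr_t)1 << EXAMPLE_STRIDE_SHIFT)

/*
 * Address of the copy of an item that belongs to `cpu`.
 * Equivalent to (uintptr_t) ptr + (cpu * stride); because the stride
 * is a power of 2 constant, the multiplication is written as a shift.
 */
static inline void *example_percpu_ptr(void *ptr, unsigned int cpu)
{
	return (void *)((uintptr_t)ptr +
			((uintptr_t)cpu << EXAMPLE_STRIDE_SHIFT));
}

/*
 * Base address of the range containing `ptr`, found by masking off
 * the low bits. Assumes the range mapping is aligned on the stride.
 */
static inline uintptr_t example_range_base(const void *ptr)
{
	return (uintptr_t)ptr & ~(EXAMPLE_STRIDE - 1);
}

/* Offset of the item within its stride (the bits the mask removes). */
static inline uintptr_t example_item_offset(const void *ptr)
{
	return (uintptr_t)ptr & (EXAMPLE_STRIDE - 1);
}

/*
 * Address of a range header placed in the page immediately before
 * the range base. `page_size` is supplied by the caller and must be
 * a power of 2.
 */
static inline uintptr_t example_range_header(const void *ptr,
					     size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return example_range_base(ptr) - page_size;
}
```

With these definitions, a free path only needs the pointer and the
stride constant: example_range_base() recovers the range, and
example_range_header() locates the bookkeeping structure in front of
it, without any lookup table.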