From: Oriol Castejón
To: oss-security@lists.openwall.com
Date: Wed, 24 Apr 2024 16:46:08 +0000
Subject: [oss-security] CVE-2024-0582 - Linux kernel use-after-free vulnerability in io_uring, writeup and exploit strategy

Hi all,

a use-after-free vulnerability in the io_uring subsystem of the Linux
kernel (CVE-2024-0582) was identified in November 2023 by Jann Horn from
Google Project Zero, see:

https://bugs.chromium.org/p/project-zero/issues/detail?id=2504

The issue was introduced by the following commit, which was included
in version 6.4 of the Linux kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c56e022c0a27

The issue was fixed in the following commit, which was included in the
stable release 6.6.5:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c392cbecd8ec

Below are the details of the vulnerability, as well as an exploitation
strategy that was used successfully to
exploit the patch gap in Ubuntu. The
contents of this message (plus some images) were originally published
in the following blog post:

https://blog.exodusintel.com/2024/03/27/mind-the-patch-gap-exploiting-an-io_uring-vulnerability-in-ubuntu/

Additionally, a brief summary of the implemented fix, which was not
included in the original blog post, is provided at the end of this
message.


## Preliminaries

The io_uring interface is an asynchronous I/O API for Linux created by
Jens Axboe and introduced in the Linux kernel version 5.1. Its goal
is to improve the performance of applications with a high number of I/O
operations. It provides interfaces similar to functions like
`read()` and `write()`, for example, but requests are satisfied in an
asynchronous manner to avoid the context switching overhead caused by
blocking system calls.

The io_uring interface has been a bountiful target for a lot of
vulnerability research; it was disabled in ChromeOS and on production
Google servers, and restricted in Android. As such, there are many
blog posts that explain it in great detail. Some relevant
references are the following:

- [Put an io_uring on it - Exploiting the Linux Kernel](https://chomp.ie/Blog+Posts/Put+an+io_uring+on+it+-+Exploiting+the+Linux+Kernel),
  a writeup for an exploit targeting an io_uring operation that
  provides the same functionality (`IORING_OP_PROVIDE_BUFFERS`) as
  the one affected by the vulnerability discussed here
  (`IORING_REGISTER_PBUF_RING`), and that also has a broad overview
  of this subsystem.
- [CVE-2022-29582 An io_uring vulnerability](https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/),
  where a cross-cache exploit is described. While the exploit
  described in our blog post is not strictly speaking cross-cache,
  there is some similarity between the two exploit strategies.
  It also provides an explanation of slab caches and the page allocator
  relevant to our exploit strategy.
- [Escaping the Google kCTF Container with a Data-Only Exploit](https://h0mbre.github.io/kCTF_Data_Only_Exploit/),
  where a different strategy for a data-only exploit of an io_uring
  vulnerability is described.
- [Conquering the memory through io_uring - Analysis of CVE-2023-2598](https://anatomic.rip/cve-2023-2598/),
  a writeup of a vulnerability that yields a very similar exploit
  primitive to ours. In this case, however, the exploit strategy
  relies on manipulating a structure associated with a socket,
  instead of manipulating file structures.

In the next subsections we give an overview of the io_uring interface.
We pay special attention to the Provided Buffer Ring functionality,
which is relevant to the vulnerability discussed in this post. The
reader can also check "[What is io_uring?](https://unixism.net/loti/what_is_io_uring.html)",
as well as the above references, for alternative overviews of this
subsystem.


### The io_uring Interface

The basis of io_uring is a set of two ring buffers used for
communication between user and kernel space. These are:

- The *submission queue* (SQ), which contains submission queue
  entries (SQEs) describing a request for an I/O operation, such as
  reading or writing to a file, etc.
- The *completion queue* (CQ), which contains completion queue
  entries (CQEs) that correspond to SQEs that have been processed and
  completed.

This model allows a number of I/O requests to be performed
asynchronously using a single system call, while in a synchronous
manner each request would have typically corresponded to a single
system call. This reduces the overhead caused by blocking system
calls, thus improving performance.
Moreover, the use of shared
buffers also reduces the overhead, as no data has to be transferred
between user and kernel space.

The io_uring API consists of three system calls:

- `io_uring_setup()`
- `io_uring_register()`
- `io_uring_enter()`

#### The `io_uring_setup()` System Call

The `io_uring_setup()` system call sets up a context for an io_uring
instance, that is, a submission and a completion queue, each with the
indicated number of entries. Its prototype is the following:

```c
int io_uring_setup(u32 entries, struct io_uring_params *p);
```

Its arguments are:

- `entries`: It determines how many elements the SQ and CQ must have
  at the minimum.
- `p`: It can be used by the application to pass options to the
  kernel, and by the kernel to pass information to the application
  about the ring buffers.

On success, the return value of this system call is a file descriptor
that can be later used to perform operations on the io_uring instance.

#### The `io_uring_register()` System Call

The `io_uring_register()` system call allows registering resources,
such as user buffers, files, etc., for use in an io_uring instance.
Registering such resources makes the kernel map them, avoiding future
copies to and from userspace, thus improving performance. Its
prototype is the following:

```c
int io_uring_register(unsigned int fd, unsigned int opcode, void *arg,
                      unsigned int nr_args);
```

Its arguments are:

- `fd`: The io_uring file descriptor returned by the
  `io_uring_setup()` system call.
- `opcode`: The specific operation to be executed.
  It can have certain
  values such as `IORING_REGISTER_BUFFERS`, to register user buffers,
  or `IORING_UNREGISTER_BUFFERS`, to release the previously
  registered buffers.
- `arg`: Arguments passed to the operation being executed. Their type
  depends on the specific `opcode` being passed.
- `nr_args`: Number of arguments in `arg` being passed.

On success, the return value of this system call is either zero or a
positive value, depending on the `opcode` used.

##### Provided Buffer Rings

An application might need to have different types of registered
buffers for different I/O requests. Since kernel version 5.7, to
facilitate managing these different sets of buffers, io_uring allows
the application to register a pool of buffers that are identified by
a group ID. This is done using the `IORING_REGISTER_PBUF_RING` opcode
in the `io_uring_register()` system call.

More precisely, the application starts by allocating a set of buffers
that it wants to register. Then, it makes the
`io_uring_register()` system call with opcode
`IORING_REGISTER_PBUF_RING`, specifying a group ID with which these
buffers should be associated, a start address of the buffers, the
length of each buffer, the number of buffers, and a starting buffer
ID. This can be done for multiple sets of buffers, each one having a
different group ID.

Finally, when submitting a request, the application can use the
`IOSQE_BUFFER_SELECT` flag and provide the desired group ID to
indicate that a provided buffer ring from the corresponding set
should be used.
When the operation has been completed, the buffer ID
of the buffer used for the operation is passed to the application via
the corresponding CQE.

Provided buffer rings can be unregistered via the
`io_uring_register()` system call using the
`IORING_UNREGISTER_PBUF_RING` opcode.

##### User-mapped Provided Buffer Rings

In addition to the buffers allocated by the application, since kernel
version 6.4, io_uring allows a user to delegate the allocation of
provided buffer rings to the kernel. This is done using the
`IOU_PBUF_RING_MMAP` flag passed as an argument to
`io_uring_register()`. In this case, the application does not need
to previously allocate these buffers, and therefore the start address
of the buffers does not have to be passed to the system call. Then,
after `io_uring_register()` returns, the application can `mmap()` the
buffers into userspace with the offset set as:

```c
IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)
```

where `bgid` is the corresponding group ID.
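
As a rough illustration of the offset expression above, the following
userspace sketch computes the `mmap()` offset for a given group ID.
The helper name `pbuf_ring_mmap_offset()` is hypothetical and not part
of any io_uring API. The two constants are copied from
`include/uapi/linux/io_uring.h`, and `bgid` is treated as a 16-bit
value, matching the `__u16 bgid` field of `struct io_uring_buf_reg`.

```c
#include <stdint.h>

/* Values copied from include/uapi/linux/io_uring.h. */
#define IORING_OFF_PBUF_RING  0x80000000ULL
#define IORING_OFF_PBUF_SHIFT 16

/*
 * Hypothetical helper: offset to pass to mmap() in order to map the
 * kernel-allocated provided buffer ring of group `bgid`.
 */
static uint64_t pbuf_ring_mmap_offset(uint16_t bgid)
{
    /* Widen before shifting so the group ID is not truncated. */
    return IORING_OFF_PBUF_RING | ((uint64_t)bgid << IORING_OFF_PBUF_SHIFT);
}
```

For example, `pbuf_ring_mmap_offset(3)` evaluates to `0x80030000`. The
result would be passed as the offset argument of `mmap()`, together
with the io_uring file descriptor.
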
These offsets, as well as
others used to `mmap()` the io_uring data, are defined in
`include/uapi/linux/io_uring.h`:

```c
/*
 * Magic offsets for the application to mmap the data it needs
 */
#define IORING_OFF_SQ_RING      0ULL
#define IORING_OFF_CQ_RING      0x8000000ULL
#define IORING_OFF_SQES         0x10000000ULL
#define IORING_OFF_PBUF_RING    0x80000000ULL
#define IORING_OFF_PBUF_SHIFT   16
#define IORING_OFF_MMAP_MASK    0xf8000000ULL
```

The function that handles such an `mmap()` call is `io_uring_mmap()`:

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/io_uring.c#L3439

static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
{
    size_t sz = vma->vm_end - vma->vm_start;
    unsigned long pfn;
    void *ptr;

    ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz);
    if (IS_ERR(ptr))
        return PTR_ERR(ptr);

    pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
    return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
}
```

Note that `remap_pfn_range()` ultimately creates a mapping with the
`VM_PFNMAP` flag set, which means that the MM subsystem will treat
the base pages as raw page frame number mappings without an associated
page structure. In particular, the core kernel will not keep
reference counts of these pages, and keeping track of them is the
responsibility of the calling code (in this case, the io_uring
subsystem).


#### The `io_uring_enter()` System Call

The `io_uring_enter()` system call is used to initiate and complete
I/O using the SQ and CQ that have been previously set up via the
`io_uring_setup()` system call.
Its prototype is the following:

```c
int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags, sigset_t *sig);
```

Its arguments are:

- `fd`: The io_uring file descriptor returned by the
  `io_uring_setup()` system call.
- `to_submit`: Specifies the number of I/Os to submit from the SQ.
- `min_complete`: If `IORING_ENTER_GETEVENTS` is set in `flags`, the
  number of completion events to wait for before returning.
- `flags`: A bitmask value that allows specifying certain options,
  such as `IORING_ENTER_GETEVENTS`, `IORING_ENTER_SQ_WAKEUP`,
  `IORING_ENTER_SQ_WAIT`, etc.
- `sig`: A pointer to a signal mask. If it is not `NULL`, the system
  call replaces the current signal mask with the one pointed to by
  `sig`, and restores the original signal mask when events become
  available in the CQ.


## Vulnerability

The vulnerability can be triggered when an application registers a
provided buffer ring with the `IOU_PBUF_RING_MMAP` flag. In this
case, the kernel allocates the memory for the provided buffer ring,
instead of it being done by the application. To access the buffers,
the application has to `mmap()` them to get a virtual mapping. If the
application later unregisters the provided buffer ring using the
`IORING_UNREGISTER_PBUF_RING` opcode, the kernel frees this memory
and returns it to the page allocator. However, it does not have any
mechanism to check whether the memory has been previously unmapped in
userspace. If this has not been done, the application has a valid
memory mapping to freed pages that can be reallocated by the kernel
for other purposes. From this point, reading or writing to these
pages will trigger a use-after-free.

The following code blocks show the affected parts of functions
relevant to this vulnerability. Code snippets are demarcated by
reference markers denoted by [N]. Lines not relevant to this
vulnerability are replaced by a [Truncated] marker.
The code
corresponds to the Linux kernel version 6.5.3, which is
the version used in the Ubuntu kernel `6.5.0-15-generic`.

### Registering User-mapped Provided Buffer Rings

The handler of the `IORING_REGISTER_PBUF_RING` opcode for the
`io_uring_register()` system call is the
`io_register_pbuf_ring()` function, shown in the next listing.

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L537

int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
{
    struct io_uring_buf_reg reg;
    struct io_buffer_list *bl, *free_bl = NULL;
    int ret;

[1]

    if (copy_from_user(&reg, arg, sizeof(reg)))
        return -EFAULT;

[Truncated]

    if (!is_power_of_2(reg.ring_entries))
        return -EINVAL;

[2]

    /* cannot disambiguate full vs empty due to head/tail size */
    if (reg.ring_entries >= 65536)
        return -EINVAL;

    if (unlikely(reg.bgid < BGID_ARRAY && !ctx->io_bl)) {
        int ret = io_init_bl_list(ctx);
        if (ret)
            return ret;
    }

    bl = io_buffer_get_list(ctx, reg.bgid);
    if (bl) {
        /* if mapped buffer ring OR classic exists, don't allow */
        if (bl->is_mapped || !list_empty(&bl->buf_list))
            return -EEXIST;
    } else {

[3]

        free_bl = bl = kzalloc(sizeof(*bl), GFP_KERNEL);
        if (!bl)
            return -ENOMEM;
    }

[4]

    if (!(reg.flags & IOU_PBUF_RING_MMAP))
        ret = io_pin_pbuf_ring(&reg, bl);
    else
        ret = io_alloc_pbuf_ring(&reg, bl);

[Truncated]

    return ret;
}
```

The function starts by copying the provided arguments into an
`io_uring_buf_reg` structure `reg` [1]. Then, it checks that the
desired number of entries is a power of two and is strictly less than
65536 [2].
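
The two checks on `reg.ring_entries` can be reproduced in userspace
with the minimal sketch below. The function names are hypothetical
stand-ins: `is_power_of_2_u32()` imitates the kernel's
`is_power_of_2()` helper, and `ring_entries_valid()` and
`max_ring_entries()` exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's is_power_of_2() helper. */
static bool is_power_of_2_u32(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper mirroring the two checks on reg.ring_entries. */
static bool ring_entries_valid(uint32_t ring_entries)
{
    if (!is_power_of_2_u32(ring_entries))
        return false;
    /* Same bound as the kernel check: 65536 and above are rejected. */
    return ring_entries < 65536;
}

/* Largest accepted value, found by trying every candidate up to the bound. */
static uint32_t max_ring_entries(void)
{
    uint32_t best = 0;

    for (uint32_t n = 1; n <= 65536; n++)
        if (ring_entries_valid(n))
            best = n;
    return best;
}
```

Here `ring_entries_valid(3)` and `ring_entries_valid(65536)` both
return `false`, while `ring_entries_valid(32768)` returns `true`.
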
Note that this implies that the maximum number of allowed
entries is 32768.

Next, it checks whether a provided buffer list with the specified
group ID `reg.bgid` exists and, in case it does not, an
`io_buffer_list` structure is allocated and its address is stored in
the variable `bl` [3]. Finally, if the provided arguments have the
flag `IOU_PBUF_RING_MMAP` set, the `io_alloc_pbuf_ring()` function is
called [4], passing in the address of the structure `reg`, which
contains the arguments passed to the system call, and the pointer to
the allocated buffer list structure `bl`.

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L519

static int io_alloc_pbuf_ring(struct io_uring_buf_reg *reg,
                              struct io_buffer_list *bl)
{
    gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
    size_t ring_size;
    void *ptr;

[5]

    ring_size = reg->ring_entries * sizeof(struct io_uring_buf_ring);

[6]

    ptr = (void *) __get_free_pages(gfp, get_order(ring_size));
    if (!ptr)
        return -ENOMEM;

[7]

    bl->buf_ring = ptr;
    bl->is_mapped = 1;
    bl->is_mmap = 1;
    return 0;
}
```

The `io_alloc_pbuf_ring()` function takes the number of ring entries
specified in `reg->ring_entries` and computes the resulting size
`ring_size` by multiplying it by the size of the `io_uring_buf_ring`
structure [5], which is 16 bytes. Then, it requests a number of pages
from the page allocator that can fit this size via a call to
`__get_free_pages()` [6]. Note that for the maximum number of allowed
ring entries, 32768, `ring_size` is 524288 and thus the maximum
number of 4096-byte pages that can be retrieved is 128 (an order-7
allocation). The address of the first page is then stored in the
`io_buffer_list` structure, more precisely in `bl->buf_ring` [7].
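
The size arithmetic can be checked with a small userspace sketch. It
assumes 4096-byte pages and the 16-byte entry size stated above. The
names `ring_size_bytes()`, `order_for_size()` and `pages_for_order()`
are hypothetical, and `order_for_size()` only approximates the
kernel's `get_order()` for non-zero sizes.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  4096u /* assumed page size */
#define SKETCH_ENTRY_SIZE 16u   /* per-entry size stated in the text */

/* Bytes requested for a ring with `entries` entries. */
static size_t ring_size_bytes(uint32_t entries)
{
    return (size_t)entries * SKETCH_ENTRY_SIZE;
}

/* Smallest order n such that (page size << n) covers `size` bytes. */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Number of pages handed out by an allocation of the given order. */
static size_t pages_for_order(unsigned int order)
{
    return (size_t)1 << order;
}
```

For 32768 entries this gives 524288 bytes, order 7 and 128 pages; a
single-entry ring still consumes one full page. Apart from this size
computation, the function only records the result in `bl`.
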
Also, `bl->is_mapped` and
`bl->is_mmap` are set to 1.

### Unregistering Provided Buffer Rings

The handler of the `IORING_UNREGISTER_PBUF_RING` opcode for the
`io_uring_register()` system call is the
`io_unregister_pbuf_ring()` function, shown in the next listing.

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L601

int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
{
    struct io_uring_buf_reg reg;
    struct io_buffer_list *bl;

[8]

    if (copy_from_user(&reg, arg, sizeof(reg)))
        return -EFAULT;
    if (reg.resv[0] || reg.resv[1] || reg.resv[2])
        return -EINVAL;
    if (reg.flags)
        return -EINVAL;

[9]

    bl = io_buffer_get_list(ctx, reg.bgid);
    if (!bl)
        return -ENOENT;
    if (!bl->is_mapped)
        return -EINVAL;

[10]

    __io_remove_buffers(ctx, bl, -1U);
    if (bl->bgid >= BGID_ARRAY) {
        xa_erase(&ctx->io_bl_xa, bl->bgid);
        kfree(bl);
    }
    return 0;
}
```

Again, the function starts by copying the provided arguments into an
`io_uring_buf_reg` structure `reg` [8]. Then, it retrieves the
provided buffer list corresponding to the group ID specified in
`reg.bgid` and stores its address in the variable `bl` [9].
Finally,
it passes `bl` to the function `__io_remove_buffers()` [10].

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L209

static int __io_remove_buffers(struct io_ring_ctx *ctx,
                               struct io_buffer_list *bl, unsigned nbufs)
{
    unsigned i = 0;

    /* shouldn't happen */
    if (!nbufs)
        return 0;

    if (bl->is_mapped) {
        i = bl->buf_ring->tail - bl->head;
        if (bl->is_mmap) {
            struct page *page;

[11]

            page = virt_to_head_page(bl->buf_ring);

[12]

            if (put_page_testzero(page))
                free_compound_page(page);
            bl->buf_ring = NULL;
            bl->is_mmap = 0;
        } else if (bl->buf_nr_pages) {

[Truncated]
```

In case the buffer list structure has the `is_mapped` and `is_mmap`
flags set, which is the case when the buffer ring was registered with
the `IOU_PBUF_RING_MMAP` flag [7], the function reaches [11]. Then,
the `page` structure of the head page corresponding to the virtual
address of the buffer ring `bl->buf_ring` is obtained. Finally, all
the pages forming the compound page with head `page` are freed at
[12], thus returning them to the page allocator.

Note that if the provided buffer ring is set up with
`IOU_PBUF_RING_MMAP`, that is, it has been allocated by the kernel
and not the application, the userspace application is expected to
have previously `mmap()`ed this memory. Moreover, recall that since
the memory mapping was created with the `VM_PFNMAP` flag, the
reference count of the page structure was not modified during this
operation. In other words, in the code above there is no way for the
kernel to know whether the application has unmapped the memory before
freeing it via the call to `free_compound_page()`.
If this has not
happened, a use-after-free can be triggered by the application
simply by reading from or writing to this memory.

## Exploitation

The exploitation mechanism presented in this post relies on how memory
allocation works on Linux, so the reader is expected to have some
familiarity with it. As a refresher, we highlight the following
facts:

- The page allocator is in charge of managing memory pages, which are
  usually 4096 bytes. It keeps lists of free pages of order n, that
  is, memory chunks of page size multiplied by 2^n. These pages are
  served on a first-in-first-out basis.
- The slab allocator sits on top of the page (buddy) allocator and
  keeps caches of commonly used objects (dedicated caches) or
  fixed-size objects (generic caches), called slab caches, available
  for allocation in the kernel. There are several implementations of
  slab allocators, but for the purpose of this post only the SLUB
  allocator, the default in modern versions of the kernel, is
  relevant.
- Slab caches are formed by multiple slabs, which are sets of one or
  more contiguous pages of memory. When a slab cache runs out of free
  slabs, which can happen if a large number of objects of the same
  type or size are allocated and not freed during a period of time,
  the operating system allocates a new slab by requesting free pages
  from the page allocator.

One such slab cache is `filp`, which contains `file`
structures.
A `file` structure, shown in the next listing, represents
an open file.

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/include/linux/fs.h#L961

struct file {
    union {
        struct llist_node   f_llist;
        struct rcu_head     f_rcuhead;
        unsigned int        f_iocb_flags;
    };

    /*
     * Protects f_ep, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t              f_lock;
    fmode_t                 f_mode;
    atomic_long_t           f_count;
    struct mutex            f_pos_lock;
    loff_t                  f_pos;
    unsigned int            f_flags;
    struct fown_struct      f_owner;
    const struct cred       *f_cred;
    struct file_ra_state    f_ra;
    struct path             f_path;
    struct inode            *f_inode;   /* cached value */
    const struct file_operations    *f_op;

    u64                     f_version;
#ifdef CONFIG_SECURITY
    void                    *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void                    *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct hlist_head       *f_ep;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space    *f_mapping;
    errseq_t                f_wb_err;
    errseq_t                f_sb_err; /* for syncfs */
} __randomize_layout
  __attribute__((aligned(4)));  /* lest something weird decides that 2 is OK */
```

The most relevant fields for this exploit are the following:

- `f_mode`: Determines whether the file is readable or writable.
- `f_pos`: Determines the current reading or writing position.
- `f_op`: The operations associated with the file. It determines the
  functions to be executed when certain system calls such as
  `read()`, `write()`, etc., are issued on the file. For files in
  `ext4` filesystems, this is equal to the `ext4_file_operations`
  variable.

### Strategy for a Data-Only Exploit

The exploit primitive provides an attacker with read and write access
to a certain number of free pages that have been returned to the page
allocator.
By opening a file a large number of times, the attacker
can force the exhaustion of all the slabs in the `filp` cache, so
that free pages are requested from the page allocator to create a new
slab in this cache. In this case, further allocations of file
structures will happen in the pages to which the attacker has read
and write access, allowing the attacker to modify them. For
example, by modifying the `f_mode` field, the attacker can make a
file that was opened with read-only permissions writable.

This strategy was implemented to successfully exploit the following
versions of Ubuntu:

- Ubuntu 22.04 Jammy Jellyfish LTS with kernel `6.5.0-15-generic`.
- Ubuntu 22.04 Jammy Jellyfish LTS with kernel `6.5.0-17-generic`.
- Ubuntu 23.10 Mantic Minotaur with kernel `6.5.0-15-generic`.
- Ubuntu 23.10 Mantic Minotaur with kernel `6.5.0-17-generic`.

The next subsections give more details on how this strategy can be
carried out.

#### Triggering the Vulnerability

The strategy begins by triggering the vulnerability to obtain read and
write access to freed pages. This can be done by executing the
following steps:

- Making an `io_uring_setup()` system call to set up the io_uring
  instance.
- Making an `io_uring_register()` system call with opcode
  `IORING_REGISTER_PBUF_RING` and the `IOU_PBUF_RING_MMAP` flag, so
  that the kernel itself allocates the memory for the provided buffer
  ring.
- `mmap()`ing the memory of the provided buffer ring with read and
  write permissions, using the io_uring file descriptor and the
  offset `IORING_OFF_PBUF_RING`.
- Unregistering the provided buffer ring by making an
  `io_uring_register()` system call with opcode
  `IORING_UNREGISTER_PBUF_RING`.

At this point, the pages corresponding to the provided buffer ring
have been returned to the page allocator, while the attacker still
has a valid reference to them.

#### Spraying File Structures

The next step is spawning a large number of child processes, each one
opening the file `/etc/passwd` many times with read-only permissions.
This forces the allocation of corresponding file structures in the
kernel.

By opening a large number of files, the attacker can force the
exhaustion of the slabs in the `filp` cache. After that, new slabs
will be allocated by requesting free pages from the page allocator.
At some point, the pages that previously corresponded to the provided
buffer ring, and to which the attacker still has read and write
access, will be returned by the page allocator.

Hence, all of the file structures created after this point will be
allocated in the attacker-controlled memory region, which allows the
attacker to modify the structures.

Note that these child processes have to wait until they are signaled
to proceed in the last stage of the exploit, so that the files are
kept open and their corresponding structures are not freed.

#### Locating a File Structure in Memory

Although the attacker may have access to some slabs belonging to the
`filp` cache, the attacker does not know where these slabs are
located within the memory region. To identify them, however, the
attacker can search for
the `ext4_file_operations` address at the offset of the `file.f_op`
field within the file structure.
When one is found, it can be safely
assumed that it corresponds to the file structure of one instance of
the previously opened `/etc/passwd` file.

Note that even when Kernel Address Space Layout Randomization
(KASLR) is enabled, to identify the `ext4_file_operations` address in
memory it is only necessary to know the offset of this symbol with
respect to the `_text` symbol, so there is no need for a KASLR
bypass. Indeed, given an unsigned 64-bit integer `val` found in
memory at the corresponding offset, one can safely assume that it is
the address of `ext4_file_operations` if:

- `(val >> 32 & 0xffffffff) == 0xffffffff`, i.e. the 32 most
  significant bits are all 1.
- `(val & 0xfffff) == (ext4_fops_offset & 0xfffff)`, i.e. the 20 least
  significant bits of `val` and `ext4_fops_offset`, the offset of
  `ext4_file_operations` with respect to `_text`, are the same.

#### Changing File Permissions and Adding a Backdoor Account

Once a file structure corresponding to the `/etc/passwd` file is
located in the memory region accessible by the attacker, it can be
modified at will.
In particular, setting the `FMODE_WRITE` and
`FMODE_CAN_WRITE` flags in the `file.f_mode` field of the found
structure will make the `/etc/passwd` file writable when using the
corresponding file descriptor.

Moreover, by setting the `file.f_pos` field of the found file
structure to the current size of the `/etc/passwd` file, the attacker
can ensure that any data written to it is appended at the end of the
file.

To finish, the attacker can signal all the child processes spawned in
the second stage to try to write to the opened `/etc/passwd` file.
While most of these attempts will fail, as the file was opened
with read-only permissions, the one corresponding to the modified
file structure, which has write permissions enabled due to the
modification of the `file.f_mode` field, will succeed.

## The Fix

As mentioned above, a fix for this vulnerability was introduced in
the Linux kernel in commit c392cbecd8ec.

The main points of this fix are the following:

- A new field, `io_buf_list`, is added to the io_uring context
  structure. This is a list of `io_buf_free` structures, which contain
  the addresses of buffer rings allocated by the kernel that will have
  to be freed eventually.

- When the kernel allocates a provided buffer ring with
  `io_alloc_pbuf_ring()`, it stores its address in an `io_buf_free`
  structure, which is then added to the `io_buf_list` list.

- Within the `__io_remove_buffers()` function, the pages corresponding
  to `bl->buf_ring` are no longer freed.

- The pages of the provided buffer rings stored in the `io_buf_list`
  are freed only when the io_uring context is freed. This happens when
  the references to the io_uring device file drop to 0, and therefore
  when no userspace mapping to the buffer ring can exist.
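
The deferred-free idea behind the fix can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `ctx_model`, `buf_free_model`, and the `model_*` functions are hypothetical names introduced here for illustration, and `malloc()`/`free()` stand in for the kernel's page allocation. It shows that unregistering a ring no longer releases its memory, and that the release happens only at context teardown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an `io_buf_free` entry: one deferred ring. */
struct buf_free_model {
    void *mem;                    /* memory backing a kernel-allocated ring */
    struct buf_free_model *next;  /* next entry in the deferred-free list */
};

/* Hypothetical stand-in for the io_uring context structure. */
struct ctx_model {
    struct buf_free_model *io_buf_list; /* rings to release at teardown */
    size_t deferred;                    /* number of entries in the list */
};

/* Allocate a ring and record it in the list at allocation time. */
static void *model_alloc_ring(struct ctx_model *ctx, size_t size)
{
    struct buf_free_model *entry = malloc(sizeof(*entry));
    if (entry == NULL)
        return NULL;

    entry->mem = calloc(1, size);
    if (entry->mem == NULL) {
        free(entry);
        return NULL;
    }

    entry->next = ctx->io_buf_list;
    ctx->io_buf_list = entry;
    ctx->deferred++;
    return entry->mem;
}

/*
 * Unregister a ring: the registration is forgotten, but the memory is
 * deliberately NOT released, because a mapping may still point to it.
 */
static void model_unregister_ring(struct ctx_model *ctx, void **registered_ring)
{
    assert(ctx != NULL && registered_ring != NULL);
    (void)ctx;
    *registered_ring = NULL;
}

/* Context teardown: release every deferred ring; returns how many. */
static size_t model_ctx_free(struct ctx_model *ctx)
{
    size_t released = 0;

    while (ctx->io_buf_list != NULL) {
        struct buf_free_model *entry = ctx->io_buf_list;
        ctx->io_buf_list = entry->next;
        free(entry->mem);
        free(entry);
        released++;
    }
    ctx->deferred = 0;
    return released;
}
```

In this model, a pointer obtained from `model_alloc_ring()` plays the role of the userspace mapping. It stays valid after `model_unregister_ring()` and becomes invalid only after `model_ctx_free()`, which mirrors the ordering the fix enforces.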