From: Oriol Castejón <Oriol.Castejon@exodusintel.com>
To: "oss-security@lists.openwall.com" <oss-security@lists.openwall.com>
Date: Wed, 24 Apr 2024 16:46:08 +0000
Message-ID:
 <BY3PR05MB8321706B2D4FB21E18520CEC8D112@BY3PR05MB8321.namprd05.prod.outlook.com>
Subject: [oss-security] CVE-2024-0582 - Linux kernel use-after-free vulnerability in
 io_uring, writeup and exploit strategy

Hi all,

A use-after-free vulnerability in the io_uring subsystem of the Linux
kernel (CVE-2024-0582) was identified in November 2023 by Jann Horn from
Google Project Zero, see:

https://bugs.chromium.org/p/project-zero/issues/detail?id=2504

The issue was introduced by the following commit, which was included
in version 6.4 of the Linux kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c56e022c0a27

The issue was fixed in the following commit, which was included in the
stable release 6.6.5:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c392cbecd8ec

Below are the details of the vulnerability, as well as an exploitation
strategy that successfully exploited the patch gap in Ubuntu. The
contents of this message (plus some images) were originally published
in the following blog post:

https://blog.exodusintel.com/2024/03/27/mind-the-patch-gap-exploiting-an-io_uring-vulnerability-in-ubuntu/

Additionally, a brief summary of the implemented fix, which was not
included in the original blog post, is provided at the end of this
message.


## Preliminaries

The io_uring interface is an asynchronous I/O API for Linux created by
Jens Axboe and introduced in the Linux kernel version 5.1. Its goal
is to improve the performance of applications with a high number of
I/O operations. It provides interfaces similar to functions like
`read()` and `write()`, for example, but requests are satisfied in an
asynchronous manner to avoid the context switching overhead caused by
blocking system calls.

The io_uring interface has been a bountiful target for vulnerability
research; it has been disabled in ChromeOS and on production Google
servers, and restricted in Android. As such, there are many blog
posts that explain it in great detail. Some relevant references are
the following:

- [Put an io_uring on it - Exploiting the Linux Kernel](https://chomp.ie/Blog+Posts/Put+an+io_uring+on+it+-+Exploiting+the+Linux+Kernel),
  a writeup for an exploit targeting an io_uring operation that
  provides the same functionality (`IORING_OP_PROVIDE_BUFFERS`) as
  the vulnerability discussed here (`IORING_REGISTER_PBUF_RING`), and
  that also has a broad overview of this subsystem.
- [CVE-2022-29582 An io_uring vulnerability](https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/),
  where a cross-cache exploit is described. While the exploit
  described in our blog post is not strictly speaking cross-cache,
  there is some similarity between the two exploit strategies. It
  also provides an explanation of slab caches and the page allocator
  relevant to our exploit strategy.
- [Escaping the Google kCTF Container with a Data-Only Exploit](https://h0mbre.github.io/kCTF_Data_Only_Exploit/),
  where a different strategy for a data-only exploit of an io_uring
  vulnerability is described.
- [Conquering the memory through io_uring - Analysis of CVE-2023-2598](https://anatomic.rip/cve-2023-2598/),
  a writeup of a vulnerability that yields a very similar exploit
  primitive to ours. In this case, however, the exploit strategy
  relies on manipulating a structure associated with a socket,
  instead of manipulating file structures.

In the next subsections we give an overview of the io_uring interface.
We pay special attention to the Provided Buffer Ring functionality,
which is relevant to the vulnerability discussed in this post. The
reader can also check "[What is io_uring?](https://unixism.net/loti/what_is_io_uring.html)",
as well as the above references, for alternative overviews of this
subsystem.


### The io_uring Interface

The basis of io_uring is a set of two ring buffers used for
communication between user and kernel space. These are:

- The *submission queue* (SQ), which contains submission queue
  entries (SQEs), each describing a request for an I/O operation, such
  as reading from or writing to a file.
- The *completion queue* (CQ), which contains completion queue
  entries (CQEs) that correspond to SQEs that have been processed and
  completed.

This model allows a number of I/O requests to be performed
asynchronously using a single system call, whereas in a synchronous
model each request would typically correspond to a single system
call. This reduces the overhead caused by blocking system calls, thus
improving performance. Moreover, the use of shared buffers further
reduces the overhead, as no data has to be transferred between user
and kernel space.
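
The producer/consumer idea behind both queues can be sketched with a
toy ring in plain C. This is only an illustration: the structure and
function names (`toy_ring`, `toy_ring_push()`, `toy_ring_pop()`) are
hypothetical, the layout is not the one used by the kernel, and the
memory barriers that a ring shared between user and kernel space
would need are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1)

/* Toy single-producer/single-consumer ring; NOT the kernel's layout. */
struct toy_ring {
	uint32_t head;                  /* advanced by the consumer */
	uint32_t tail;                  /* advanced by the producer */
	uint64_t entries[RING_ENTRIES];
};

/* Producer side: queue one entry, or report that the ring is full. */
static bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
	if (r->tail - r->head == RING_ENTRIES)
		return false;
	r->entries[r->tail & RING_MASK] = value;
	r->tail++;
	return true;
}

/* Consumer side: dequeue one entry, or report that the ring is empty. */
static bool toy_ring_pop(struct toy_ring *r, uint64_t *value)
{
	assert(value != NULL);
	if (r->head == r->tail)
		return false;
	*value = r->entries[r->head & RING_MASK];
	r->head++;
	return true;
}
```

In this picture, the application is the producer of the SQ and the
consumer of the CQ, while the kernel plays the opposite roles.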

The io_uring API consists of three system calls:

- `io_uring_setup()`
- `io_uring_register()`
- `io_uring_enter()`

#### The `io_uring_setup()` System Call

The `io_uring_setup()` system call sets up a context for an io_uring
instance, that is, a submission queue and a completion queue, each
with at least the indicated number of entries. Its prototype is the
following:

```c
int io_uring_setup(u32 entries, struct io_uring_params *p);
```

Its arguments are:

- `entries`: It determines how many elements the SQ and CQ must have
  at the minimum.
- `p`: It can be used by the application to pass options to the
  kernel, and by the kernel to pass information to the application
  about the ring buffers.

On success, the return value of this system call is a file descriptor
that can be later used to perform operations on the io_uring instance.

#### The `io_uring_register()` System Call

The `io_uring_register()` system call allows registering resources,
such as user buffers, files, etc., for use in an io_uring instance.
Registering such resources makes the kernel map them, avoiding future
copies to and from userspace, thus improving performance. Its
prototype is the following:

```c
int io_uring_register(unsigned int fd, unsigned int opcode, void *arg,
     unsigned int nr_args);
```

Its arguments are:

- `fd`: The io_uring file descriptor returned by the
  `io_uring_setup()` system call.
- `opcode`: The specific operation to be executed. It can have certain
  values such as `IORING_REGISTER_BUFFERS`, to register user buffers,
  or `IORING_UNREGISTER_BUFFERS`, to release the previously
  registered buffers.
- `arg`: Arguments passed to the operation being executed. Their type
  depends on the specific `opcode` being passed.
- `nr_args`: Number of arguments in `arg` being passed.

On success, the return value of this system call is either zero or a
positive value, depending on the `opcode` used.

##### Provided Buffer Rings

An application might need to have different types of registered
buffers for different I/O requests. Since kernel version 5.7, to
facilitate managing these different sets of buffers, io_uring allows
the application to register a pool of buffers that are identified by
a group ID. Since kernel version 5.19, this can be done using the
`IORING_REGISTER_PBUF_RING` opcode in the `io_uring_register()`
system call.

More precisely, the application starts by allocating a set of buffers
that it wants to register. Then, it makes the
`io_uring_register()` system call with opcode
`IORING_REGISTER_PBUF_RING`, specifying a group ID with which these
buffers should be associated, a start address of the buffers, the
length of each buffer, the number of buffers, and a starting buffer
ID. This can be done for multiple sets of buffers, each one having a
different group ID.

Finally, when submitting a request, the application can use the
`IOSQE_BUFFER_SELECT` flag and provide the desired group ID to
indicate that a buffer from the corresponding provided buffer ring
should be used. When the operation has been completed, the buffer ID
of the buffer used for the operation is passed to the application via
the corresponding CQE.
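
As an illustration of that last step, the following sketch decodes
the buffer ID from the `flags` field of a CQE. It assumes the encoding
used by the io_uring UAPI header, where `IORING_CQE_F_BUFFER` (bit 0)
signals that a buffer was selected and the upper 16 bits carry the
buffer ID (`IORING_CQE_BUFFER_SHIFT` is 16). The constants are
redefined locally under hypothetical `EX_` names so that the snippet
does not depend on the installed kernel headers, and
`cqe_buffer_id()` is a helper made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the UAPI values (assumed, see lead-in). */
#define EX_CQE_F_BUFFER     (1U << 0)   /* IORING_CQE_F_BUFFER */
#define EX_CQE_BUFFER_SHIFT 16          /* IORING_CQE_BUFFER_SHIFT */

/*
 * Extracts the ID of the provided buffer used by a completed request.
 * Returns false if the CQE does not reference a provided buffer.
 */
static bool cqe_buffer_id(uint32_t cqe_flags, uint16_t *bid)
{
	assert(bid != NULL);
	if (!(cqe_flags & EX_CQE_F_BUFFER))
		return false;
	*bid = (uint16_t)(cqe_flags >> EX_CQE_BUFFER_SHIFT);
	return true;
}
```

For example, a `flags` value of `(7 << 16) | 1` would indicate that
buffer 7 of the selected group was used.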

Provided buffer rings can be unregistered via the
`io_uring_register()` system call using the
`IORING_UNREGISTER_PBUF_RING` opcode.

##### User-mapped Provided Buffer Rings

In addition to the buffers allocated by the application, since kernel
version 6.4, io_uring allows a user to delegate the allocation of
provided buffer rings to the kernel. This is done using the
`IOU_PBUF_RING_MMAP` flag passed as an argument to
`io_uring_register()`. In this case, the application does not need
to previously allocate these buffers, and therefore the start address
of the buffers does not have to be passed to the system call. Then,
after `io_uring_register()` returns, the application can `mmap()` the
buffers into userspace with the offset set as:

```c
IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)
```

where `bgid` is the corresponding group ID. These offsets, as well as
others used to `mmap()` the io_uring data, are defined in
`include/uapi/linux/io_uring.h`:

```c
/*
 * Magic offsets for the application to mmap the data it needs
 */
#define IORING_OFF_SQ_RING		0ULL
#define IORING_OFF_CQ_RING		0x8000000ULL
#define IORING_OFF_SQES			0x10000000ULL
#define IORING_OFF_PBUF_RING		0x80000000ULL
#define IORING_OFF_PBUF_SHIFT		16
#define IORING_OFF_MMAP_MASK		0xf8000000ULL
```
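
As a small worked example, the offset expression above can be wrapped
in a helper. The constants are copied from the listing under
hypothetical `EX_` names so that the snippet is self-contained, and
`pbuf_ring_mmap_offset()` is a name chosen for this sketch, not a
kernel or liburing function.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from include/uapi/linux/io_uring.h (see listing above). */
#define EX_OFF_PBUF_RING  0x80000000ULL
#define EX_OFF_PBUF_SHIFT 16

/* The base offset must not overlap the bits that carry the group ID. */
static_assert((EX_OFF_PBUF_RING & ((1ULL << EX_OFF_PBUF_SHIFT) - 1)) == 0,
	      "base offset overlaps the low offset bits");

/* Computes the mmap() offset for the provided buffer ring of group bgid. */
static uint64_t pbuf_ring_mmap_offset(uint16_t bgid)
{
	return EX_OFF_PBUF_RING | ((uint64_t)bgid << EX_OFF_PBUF_SHIFT);
}
```

With these values, group ID 0 maps at offset `0x80000000` and group
ID 1 at offset `0x80010000`.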

The function that handles such an `mmap()` call is `io_uring_mmap()`:

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/io_uring.c#L3439

static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
{
	size_t sz = vma->vm_end - vma->vm_start;
	unsigned long pfn;
	void *ptr;

	ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz);
	if (IS_ERR(ptr))
		return PTR_ERR(ptr);

	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
}
```

Note that `remap_pfn_range()` ultimately creates a mapping with the
`VM_PFNMAP` flag set, which means that the MM subsystem will treat
the base pages as raw page frame number mappings without an associated
page structure. In particular, the core kernel will not keep
reference counts of these pages, and keeping track of them is the
responsibility of the calling code (in this case, the io_uring
subsystem).


#### The `io_uring_enter()` System Call

The `io_uring_enter()` system call is used to initiate and complete
I/O using the SQ and CQ that have been previously set up via the
`io_uring_setup()` system call. Its prototype is the following:

```c
int io_uring_enter(unsigned int fd, unsigned int to_submit,
	unsigned int min_complete, unsigned int flags, sigset_t *sig);
```

Its arguments are:

- `fd`: The io_uring file descriptor returned by the
  `io_uring_setup()` system call.
- `to_submit`: Specifies the number of I/Os to submit from the SQ.
- `min_complete`: If the `IORING_ENTER_GETEVENTS` flag is set, the
  number of completions the system call waits for before returning.
- `flags`: A bitmask value that allows specifying certain options,
  such as `IORING_ENTER_GETEVENTS`, `IORING_ENTER_SQ_WAKEUP`,
  `IORING_ENTER_SQ_WAIT`, etc.
- `sig`: A pointer to a signal mask. If it is not `NULL`, the system
  call replaces the current signal mask by the one pointed to by
  `sig`, and when events become available in the CQ restores the
  original signal mask.


## Vulnerability

The vulnerability can be triggered when an application registers a
provided buffer ring with the `IOU_PBUF_RING_MMAP` flag. In this
case, the kernel allocates the memory for the provided buffer ring,
instead of it being done by the application. To access the buffers,
the application has to `mmap()` them to get a virtual mapping. If the
application later unregisters the provided buffer ring using the
`IORING_UNREGISTER_PBUF_RING` opcode, the kernel frees this memory
and returns it to the page allocator. However, it does not have any
mechanism to check whether the memory has been previously unmapped in
userspace. If this has not been done, the application has a valid
memory mapping to freed pages that can be reallocated by the kernel
for other purposes. From this point, reading or writing to these
pages will trigger a use-after-free.

The following code blocks show the affected parts of functions
relevant to this vulnerability. Points of interest in the code
snippets are marked by reference markers of the form [N]. Lines not
relevant to this vulnerability are replaced by a [Truncated] marker.
The code corresponds to the Linux kernel version 6.5.3, which is the
version used in the Ubuntu kernel `6.5.0-15-generic`.

### Registering User-mapped Provided Buffer Rings

The handler of the `IORING_REGISTER_PBUF_RING` opcode for the
`io_uring_register()` system call is the
`io_register_pbuf_ring()` function, shown in the next listing.

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L537

int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
{
	struct io_uring_buf_reg reg;
	struct io_buffer_list *bl, *free_bl = NULL;
	int ret;

[1]

	if (copy_from_user(&reg, arg, sizeof(reg)))
		return -EFAULT;

[Truncated]

	if (!is_power_of_2(reg.ring_entries))
		return -EINVAL;

[2]

	/* cannot disambiguate full vs empty due to head/tail size */
	if (reg.ring_entries >= 65536)
		return -EINVAL;

	if (unlikely(reg.bgid < BGID_ARRAY && !ctx->io_bl)) {
		int ret = io_init_bl_list(ctx);
		if (ret)
			return ret;
	}

	bl = io_buffer_get_list(ctx, reg.bgid);
	if (bl) {
		/* if mapped buffer ring OR classic exists, don't allow */
		if (bl->is_mapped || !list_empty(&bl->buf_list))
			return -EEXIST;
	} else {

[3]

		free_bl = bl = kzalloc(sizeof(*bl), GFP_KERNEL);
		if (!bl)
			return -ENOMEM;
	}

[4]

	if (!(reg.flags & IOU_PBUF_RING_MMAP))
		ret = io_pin_pbuf_ring(&reg, bl);
	else
		ret = io_alloc_pbuf_ring(&reg, bl);

[Truncated]

	return ret;
}
```

The function starts by copying the provided arguments into an
`io_uring_buf_reg` structure `reg` [1]. Then, it checks that the
desired number of entries is a power of two and is strictly less than
65536 [2]. Note that this implies that the maximum number of allowed
entries is 32768.
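
The following userspace sketch mirrors these two checks and
illustrates the comment in the listing about head and tail sizes. It
assumes that the ring's head and tail indices are 16 bits wide, as
that comment suggests; the helper names `ring_entries_valid()` and
`ring_pending()` are hypothetical, and `is_power_of_2()` is
reimplemented here rather than taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace reimplementation of the kernel's is_power_of_2() helper. */
static bool is_power_of_2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Mirrors the checks around [2]: power of two and strictly below 65536. */
static bool ring_entries_valid(uint32_t ring_entries)
{
	return is_power_of_2(ring_entries) && ring_entries < 65536;
}

/* Number of pending buffers as seen through 16-bit head/tail indices. */
static uint16_t ring_pending(uint16_t head, uint16_t tail)
{
	return (uint16_t)(tail - head);
}
```

With 65536 entries, a completely full ring would have `tail - head`
equal to 65536, which wraps to 0 in 16 bits and is indistinguishable
from an empty ring. The largest power of two that avoids this is
32768.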

Next, it checks whether a provided buffer list with the specified
group ID `reg.bgid` exists and, in case it does not, an
`io_buffer_list` structure is allocated and its address is stored in
the variable `bl` [3]. Finally, if the `IOU_PBUF_RING_MMAP` flag is
set in the provided arguments, the `io_alloc_pbuf_ring()` function is
called [4], passing in the address of the structure `reg`, which
contains the arguments passed to the system call, and the pointer to
the allocated buffer list structure `bl`.

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L519

static int io_alloc_pbuf_ring(struct io_uring_buf_reg *reg,
			      struct io_buffer_list *bl)
{
	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
	size_t ring_size;
	void *ptr;

[5]

	ring_size = reg->ring_entries * sizeof(struct io_uring_buf_ring);

[6]

	ptr = (void *) __get_free_pages(gfp, get_order(ring_size));
	if (!ptr)
		return -ENOMEM;

[7]

	bl->buf_ring = ptr;
	bl->is_mapped = 1;
	bl->is_mmap = 1;
	return 0;
}
```

The `io_alloc_pbuf_ring()` function takes the number of ring entries
specified in `reg->ring_entries` and computes the resulting size
`ring_size` by multiplying it by the size of the `io_uring_buf_ring`
structure [5], which is 16 bytes. Then, it requests a number of pages
from the page allocator that can fit this size via a call to
`__get_free_pages()` [6]. Note that for the maximum number of allowed
ring entries, 32768, `ring_size` is 524288 and thus the maximum
number of 4096-byte pages that can be retrieved is 128. The address
of the first page is then stored in the `io_buffer_list` structure,
more precisely in `bl->buf_ring` [7]. Also, `bl->is_mapped` and
`bl->is_mmap` are set to 1.
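
The arithmetic can be reproduced with a short userspace sketch. It
assumes 4096-byte pages and the 16-byte structure size stated above;
`ex_get_order()` is a simplified stand-in for the kernel's
`get_order()`, and `ex_ring_pages()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define EX_PAGE_SIZE  4096u   /* assumed page size */
#define EX_ENTRY_SIZE 16u     /* sizeof(struct io_uring_buf_ring), per the text */

/* Smallest order n such that (EX_PAGE_SIZE << n) >= size. */
static unsigned int ex_get_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while (((size_t)EX_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Number of pages backing a ring with the given number of entries. */
static size_t ex_ring_pages(size_t ring_entries)
{
	size_t ring_size = ring_entries * EX_ENTRY_SIZE;

	return (size_t)1 << ex_get_order(ring_size);
}
```

For 32768 entries, `ring_size` is 32768 * 16 = 524288 bytes, which
requires an order-7 allocation, that is, 2^7 = 128 pages.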

### Unregistering Provided Buffer Rings

The handler of the `IORING_UNREGISTER_PBUF_RING` opcode for the
`io_uring_register()` system call is the
`io_unregister_pbuf_ring()` function, shown in the next listing.

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L601

int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
{
	struct io_uring_buf_reg reg;
	struct io_buffer_list *bl;

[8]

	if (copy_from_user(&reg, arg, sizeof(reg)))
		return -EFAULT;
	if (reg.resv[0] || reg.resv[1] || reg.resv[2])
		return -EINVAL;
	if (reg.flags)
		return -EINVAL;

[9]

	bl = io_buffer_get_list(ctx, reg.bgid);
	if (!bl)
		return -ENOENT;
	if (!bl->is_mapped)
		return -EINVAL;

[10]

	__io_remove_buffers(ctx, bl, -1U);
	if (bl->bgid >= BGID_ARRAY) {
		xa_erase(&ctx->io_bl_xa, bl->bgid);
		kfree(bl);
	}
	return 0;
}
```

Again, the function starts by copying the provided arguments into an
`io_uring_buf_reg` structure `reg` [8]. Then, it retrieves the
provided buffer list corresponding to the group ID specified in
`reg.bgid` and stores its address in the variable `bl` [9]. Finally,
it passes `bl` to the function `__io_remove_buffers()` [10].

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L209

static int __io_remove_buffers(struct io_ring_ctx *ctx,
			       struct io_buffer_list *bl, unsigned nbufs)
{
	unsigned i = 0;

	/* shouldn't happen */
	if (!nbufs)
		return 0;

	if (bl->is_mapped) {
		i = bl->buf_ring->tail - bl->head;
		if (bl->is_mmap) {
			struct page *page;

[11]

			page = virt_to_head_page(bl->buf_ring);

[12]

			if (put_page_testzero(page))
				free_compound_page(page);
			bl->buf_ring = NULL;
			bl->is_mmap = 0;
		} else if (bl->buf_nr_pages) {

[Truncated]
```

If the buffer list structure has the `is_mapped` and `is_mmap`
flags set, which is the case when the buffer ring was registered with
the `IOU_PBUF_RING_MMAP` flag [7], the function reaches [11]. There,
the `page` structure of the head page corresponding to the virtual
address of the buffer ring `bl->buf_ring` is obtained. Finally, all
the pages forming the compound page with head `page` are freed at
[12], thus returning them to the page allocator.

Note that if the provided buffer ring is set up with
`IOU_PBUF_RING_MMAP`, that is, it has been allocated by the kernel
and not the application, the userspace application is expected to
have previously `mmap()`ed this memory. Moreover, recall that since
the memory mapping was created with the `VM_PFNMAP` flag, the
reference count of the page structure was not modified during this
operation. In other words, in the code above there is no way for the
kernel to know whether the application has unmapped the memory before
freeing it via the call to `free_compound_page()`. If this has not
happened, a use-after-free can be triggered by the application by
just reading or writing to this memory.
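
The missing bookkeeping can be illustrated with a toy userspace model.
This is not kernel code: `struct toy_page` and the `toy_*()` functions
are hypothetical names that only model the reference count seen by
the kernel and the number of live userspace mappings, which the
reference count does not reflect for `VM_PFNMAP` mappings.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a kernel page; only what the illustration needs. */
struct toy_page {
	int  refcount;       /* references the "kernel" knows about */
	int  user_mappings;  /* live userspace mappings, not counted above */
	bool freed;          /* true once returned to the "page allocator" */
};

/* Allocation: the buffer ring holds the only counted reference. */
static void toy_alloc(struct toy_page *p)
{
	p->refcount = 1;
	p->user_mappings = 0;
	p->freed = false;
}

/* VM_PFNMAP-style mapping: maps the page without taking a reference. */
static void toy_mmap_pfnmap(struct toy_page *p)
{
	assert(!p->freed);
	p->user_mappings++;
}

/* Userspace drops one of its mappings. */
static void toy_munmap(struct toy_page *p)
{
	assert(p->user_mappings > 0);
	p->user_mappings--;
}

/* Unregister: drops the ring's reference and frees on reaching zero. */
static void toy_unregister(struct toy_page *p)
{
	if (--p->refcount == 0)
		p->freed = true;
}

/* A dangling mapping is a live mapping to a page that has been freed. */
static bool toy_is_dangling(const struct toy_page *p)
{
	return p->freed && p->user_mappings > 0;
}
```

In this model, calling `toy_unregister()` while a mapping is still
live leaves `toy_is_dangling()` true, whereas calling `toy_munmap()`
first does not.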

## Exploitation

The exploitation mechanism presented in this post relies on how memory
allocation works on Linux, so the reader is expected to have some
familiarity with it. As a refresher, we highlight the following
facts:

- The page allocator, also known as the buddy allocator, is in charge
  of managing memory pages, which are usually 4096 bytes. It keeps
  lists of free pages of order n, that is, memory chunks of page size
  multiplied by 2^n. These pages are served on a first-in-first-out
  basis.
- The slab allocator sits on top of the page allocator and keeps
  caches of commonly used objects (dedicated caches) or fixed-size
  objects (generic caches), called slab caches, available for
  allocation in the kernel. There are several implementations of slab
  allocators, but for the purpose of this post only the SLUB
  allocator, the default in modern versions of the kernel, is
  relevant.
- Slab caches are formed by multiple slabs, which are sets of one or
  more contiguous pages of memory. When a slab cache runs out of free
  slabs, which can happen if a large number of objects of the same
  type or size are allocated and not freed during a period of time,
  the operating system allocates a new slab by requesting free pages
  from the page allocator.

One such slab cache is `filp`, which contains `file` structures. A
`file` structure, shown in the next listing, represents an open file.

```c
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/include/linux/fs.h#L961

struct file {
	union {
		struct llist_node	f_llist;
		struct rcu_head 	f_rcuhead;
		unsigned int 		f_iocb_flags;
	};

	/*
	 * Protects f_ep, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	fmode_t			f_mode;
	atomic_long_t		f_count;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	unsigned int		f_flags;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct hlist_head	*f_ep;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
	errseq_t		f_wb_err;
	errseq_t		f_sb_err; /* for syncfs */
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
```

The most relevant fields for this exploit are the following:

- `f_mode`: Determines whether the file is readable or writable.
- `f_pos`: Determines the current reading or writing position.
- `f_op`: The operations associated with the file. It determines the
  functions to be executed when certain system calls such as
  `read()`, `write()`, etc., are issued on the file. For files in
  `ext4` filesystems, this is equal to the `ext4_file_operations`
  variable.

### Strategy for a Data-Only Exploit

The exploit primitive provides an attacker with read and write access
to a certain number of free pages that have been returned to the page
allocator. By opening a file a large number of times, the attacker
can exhaust all the slabs in the `filp` cache, so that free pages are
requested from the page allocator to create a new slab in this cache.
In this case, further allocations of file structures will happen in
the pages to which the attacker has read and write access, so the
attacker is able to modify them. For example, by modifying the
`f_mode` field, the attacker can make a file that was opened with
read-only permissions writable.

This strategy was implemented to successfully exploit the following
versions of Ubuntu:

- Ubuntu 22.04 Jammy Jellyfish LTS with kernel `6.5.0-15-generic`.
- Ubuntu 22.04 Jammy Jellyfish LTS with kernel `6.5.0-17-generic`.
- Ubuntu 23.10 Mantic Minotaur with kernel `6.5.0-15-generic`.
- Ubuntu 23.10 Mantic Minotaur with kernel `6.5.0-17-generic`.

The next subsections give more details on how this strategy can be
carried out.

#### Triggering the Vulnerability

The strategy begins by triggering the vulnerability to obtain read and
write access to freed pages. This can be done by executing the
following steps:

- Making an `io_uring_setup()` system call to set up the io_uring
  instance.
- Making an `io_uring_register()` system call with opcode
  `IORING_REGISTER_PBUF_RING` and the `IOU_PBUF_RING_MMAP` flag, so
  that the kernel itself allocates the memory for the provided buffer
  ring.
- `mmap()`ing the memory of the provided buffer ring with read and
  write permissions, using the io_uring file descriptor and the
  offset `IORING_OFF_PBUF_RING`.
- Unregistering the provided buffer ring by making an
  `io_uring_register()` system call with opcode
  `IORING_UNREGISTER_PBUF_RING`.

At this point, the pages corresponding to the provided buffer ring
have been returned to the page allocator, while the attacker still
has a valid reference to them.
=0A=
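As a small illustration of the `mmap()` step above: in the upstream
UAPI header, the offset passed to `mmap()` also encodes the buffer
group ID, shifted by `IORING_OFF_PBUF_SHIFT`. The sketch below only
computes that offset. The helper name `pbuf_ring_mmap_offset` is
hypothetical, and the two constants are copied from
`include/uapi/linux/io_uring.h` as an assumption about 6.5-era
kernels; the real header should be preferred where it is available.

```c
#include <stdint.h>

/*
 * Assumed values of IORING_OFF_PBUF_RING and IORING_OFF_PBUF_SHIFT,
 * redefined locally so the sketch builds without kernel headers.
 */
#define PBUF_RING_OFF   0x80000000ULL
#define PBUF_RING_SHIFT 16

/*
 * Hypothetical helper: build the mmap() offset that selects the
 * kernel-allocated provided buffer ring of buffer group `bgid`.
 */
static uint64_t pbuf_ring_mmap_offset(uint16_t bgid)
{
    return PBUF_RING_OFF | ((uint64_t)bgid << PBUF_RING_SHIFT);
}
```

For buffer group 0 the result is `IORING_OFF_PBUF_RING` itself, which
matches the offset named in the steps above.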
#### Spraying File Structures

The next step is spawning a large number of child processes, each one
opening the file `/etc/passwd` many times with read-only permissions.
This forces the allocation of corresponding file structures in the
kernel.

By opening a large number of files, the attacker can force the
exhaustion of the slabs in the `filp` cache. After that, new slabs
will be allocated by requesting free pages from the page allocator.
At some point, the pages that previously corresponded to the provided
buffer ring, and to which the attacker still has read and write
access, will be returned by the page allocator.

Hence, file structures created after this point will be allocated in
the attacker-controlled memory region, which allows the attacker to
modify them.

Note that these child processes have to wait until they are told to
proceed in the last stage of the exploit, so that the files are kept
open and their corresponding structures are not freed.
=0A=
#### Locating a File Structure in Memory

Although the attacker may have access to some slabs belonging to the
`filp` cache, the attacker does not know where these slabs are within
the memory region. To identify them, the attacker can search for the
`ext4_file_operations` address at the offset of the `file.f_op` field
within the file structure. When one is found, it can be safely
assumed that it corresponds to the file structure of one instance of
the previously opened `/etc/passwd` file.

Note that even when Kernel Address Space Layout Randomization
(KASLR) is enabled, to identify the `ext4_file_operations` address in
memory it is only necessary to know the offset of this symbol with
respect to the `_text` symbol, so there is no need for a KASLR
bypass. Indeed, given a value `val` of an unsigned integer found in
memory at the corresponding offset, one can safely assume that it is
the address of `ext4_file_operations` if:

- `((val >> 32) & 0xffffffff) == 0xffffffff`, i.e. the 32 most
  significant bits are all 1.
- `(val & 0xfffff) == (ext4_fops_offset & 0xfffff)`, i.e. the 20 least
  significant bits of `val` and of `ext4_fops_offset` (the offset of
  `ext4_file_operations` with respect to `_text`) are the same.
=0A=
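The two conditions above can be written as a single predicate. This
is a minimal sketch: the function name `looks_like_ext4_fops` is
hypothetical, and `ext4_fops_offset` has to be taken from the symbols
of the exact kernel build being examined.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if `val` passes both checks
 * described above for a candidate `ext4_file_operations` pointer.
 */
static bool looks_like_ext4_fops(uint64_t val, uint64_t ext4_fops_offset)
{
    /* The 32 most significant bits must all be 1. */
    bool high_ok = ((val >> 32) & 0xffffffffULL) == 0xffffffffULL;

    /* The 20 least significant bits must match those of the offset. */
    bool low_ok = (val & 0xfffffULL) == (ext4_fops_offset & 0xfffffULL);

    return high_ok && low_ok;
}
```

The low-bits comparison relies on the randomized kernel base keeping
its 20 least significant bits unchanged, which is the assumption the
text makes when it says that no KASLR bypass is needed.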
#### Changing File Permissions and Adding a Backdoor Account

Once a file structure corresponding to the `/etc/passwd` file is
located in the memory region accessible by the attacker, it can be
modified at will. In particular, setting the `FMODE_WRITE` and
`FMODE_CAN_WRITE` flags in the `file.f_mode` field of the found
structure will make the `/etc/passwd` file writable when using the
corresponding file descriptor.

Moreover, by setting the `file.f_pos` field of the found file
structure to the current size of the `/etc/passwd` file, the attacker
can ensure that any data written to it is appended at the end of the
file.

To finish, the attacker can signal all the child processes spawned in
the second stage to try to write to the opened `/etc/passwd` file.
While most of these attempts will fail, as the file was opened with
read-only permissions, the one corresponding to the modified file
structure, which has write permissions enabled due to the
modification of the `file.f_mode` field, will succeed.
=0A=
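The `f_mode` change described above is plain bit arithmetic, sketched
here on an ordinary integer. The flag values are an assumption copied
from `include/linux/fs.h` as found in 6.5-era kernels and should be
checked against the target kernel; the helper name `make_writable` is
hypothetical.

```c
#include <stdint.h>

/* Assumed flag values (include/linux/fs.h, 6.5-era kernels). */
#define FMODE_READ      0x1u
#define FMODE_WRITE     0x2u
#define FMODE_CAN_WRITE 0x40000u

/*
 * Hypothetical helper: return `f_mode` with the two write-related
 * flags set and every other bit left unchanged.
 */
static uint32_t make_writable(uint32_t f_mode)
{
    return f_mode | FMODE_WRITE | FMODE_CAN_WRITE;
}
```

Using a bitwise OR preserves the remaining bits of the field, so
only the two write-related flags differ from the original value.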
=0A=
## The Fix

As mentioned above, a fix for this vulnerability was introduced in
the Linux kernel in commit c392cbecd8ec.

The main points of this fix are the following:

- A field `io_buf_list` is added to the io_uring context structure.
  This is a list of `io_buf_free` structures, which contain the
  addresses of buffer rings allocated by the kernel that will have to
  be freed eventually.

- When the kernel allocates a provided buffer ring with
  `io_alloc_pbuf_ring()`, it stores its address in an `io_buf_free`
  structure, which is then added to the `io_buf_list` list.

- Within the `__io_remove_buffers()` function, the pages corresponding
  to `bl->buf_ring` are no longer freed.

- The pages of the provided buffer rings stored in the `io_buf_list`
  are freed only when the io_uring context is freed. This happens
  when the references to the io_uring device file drop to 0, and
  therefore when no userspace mapping to the buffer ring can exist.
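
The deferred-free idea behind these points can be modelled in
userspace. The sketch below is not the kernel patch: all names
(`model_ctx`, `buf_free_entry`, `model_alloc_ring`, and so on) are
hypothetical, and `calloc()`/`free()` stand in for the page
allocator. It only shows that unregistering a ring leaves its memory
alone, and that everything is released at context teardown.

```c
#include <stddef.h>
#include <stdlib.h>

/* Models one tracked allocation (the role of `io_buf_free`). */
struct buf_free_entry {
    void *mem;
    struct buf_free_entry *next;
};

/* Models the context holding the list (the role of `io_buf_list`). */
struct model_ctx {
    struct buf_free_entry *buf_list;
};

/* Allocate a ring and record it in the context for later release. */
static void *model_alloc_ring(struct model_ctx *ctx, size_t size)
{
    struct buf_free_entry *e = malloc(sizeof(*e));
    if (!e)
        return NULL;
    e->mem = calloc(1, size);
    if (!e->mem) {
        free(e);
        return NULL;
    }
    e->next = ctx->buf_list;
    ctx->buf_list = e;
    return e->mem;
}

/* Unregistering deliberately does not free the ring memory. */
static void model_unregister_ring(struct model_ctx *ctx, void *ring)
{
    (void)ctx;
    (void)ring;
}

/* Count the rings still waiting to be released. */
static size_t model_pending(const struct model_ctx *ctx)
{
    size_t n = 0;
    for (const struct buf_free_entry *e = ctx->buf_list; e; e = e->next)
        n++;
    return n;
}

/* Context teardown: release every tracked ring, return how many. */
static size_t model_ctx_free(struct model_ctx *ctx)
{
    size_t n = 0;
    while (ctx->buf_list) {
        struct buf_free_entry *e = ctx->buf_list;
        ctx->buf_list = e->next;
        free(e->mem);
        free(e);
        n++;
    }
    return n;
}
```

In this model, a ring that has been unregistered can still be written
to safely until `model_ctx_free()` runs, which mirrors why a lingering
userspace mapping no longer refers to freed pages after the fix.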