(reposted from my comments at http://lwn.net/Articles/593918/)
I may have thought of a way to subvert SEAL_WRITE using O_DIRECT and
asynchronous I/O. I am not sure if this is a real problem or not, but
better to ask, right?
The exploit would go like this:
1) mmap() the shared memory
2) open some *other* file with O_DIRECT
3) prepare a read-type iocb for the *other* file pointing to the
mmap()ed memory
4) io_submit(), but don't wait for it to complete
5) munmap() the shared memory
6) SEAL_WRITE the shared memory
7) the "sealed" memory is overwritten by DMA from the disk drive at some
point in the future when the I/O completes
So this exploit effectively changes the contents of the supposedly
write-protected memory after SEAL_WRITE is granted.
For O_DIRECT the kernel pins the submitted pages in memory for DMA by
incrementing the page reference counts when the I/O is submitted,
allowing the pages to be modified by DMA even if they are no longer
mapped in the address space of the process. This is different from a
regular read(), which uses the CPU to copy the data and will fail if the
pages are not mapped.
I am sure there are also other direct I/O mechanisms in the kernel that
can be used to setup a DMA transfer to change the contents of unmapped
memory; the SCSI generic driver comes to mind.
I suppose SEAL_WRITE could iterate over all the pages in the file and
check to make sure no page refcount is greater than the "expected"
value, and return an error instead of granting the seal if a page is
found with an unexpected extra reference that might have been added by
e.g. get_user_pages() for direct I/O. But looking over shmem_set_seals()
in patch 2/6, it doesn't seem to do that.
Tony Battersby
Cybernetics
Hi
On Thu, Apr 10, 2014 at 11:33 PM, Tony Battersby <[email protected]> wrote:
> For O_DIRECT the kernel pins the submitted pages in memory for DMA by
> incrementing the page reference counts when the I/O is submitted,
> allowing the pages to be modified by DMA even if they are no longer
> mapped in the address space of the process. This is different from a
> regular read(), which uses the CPU to copy the data and will fail if the
> pages are not mapped.
Can you please provide an example code-path? For instance,
file_read_actor() does not pin any pages but only keeps the user-space
address and resolves it once it has data to write.
Thanks
David
On 04/10/2014 05:22 PM, David Herrmann wrote:
> Hi
>
> On Thu, Apr 10, 2014 at 11:33 PM, Tony Battersby <[email protected]> wrote:
>> For O_DIRECT the kernel pins the submitted pages in memory for DMA by
>> incrementing the page reference counts when the I/O is submitted,
>> allowing the pages to be modified by DMA even if they are no longer
>> mapped in the address space of the process. This is different from a
>> regular read(), which uses the CPU to copy the data and will fail if the
>> pages are not mapped.
>
> Can you please provide an example code-path? For instance,
> file_read_actor() does not pin any pages but only keeps the user-space
> address and resolves it once it has data to write.
This may be an issue for anything in the kernel that calls
get_user_pages and holds onto the result at any time that mmap_sem isn't
held.
I don't know exactly what does that, but RDMA comes to mind. So does
(ugh!) vmsplice, although I suspect that vmsplice doesn't write.
--Andy
Andy Lutomirski wrote:
> On 04/10/2014 05:22 PM, David Herrmann wrote:
>
>> Hi
>>
>> On Thu, Apr 10, 2014 at 11:33 PM, Tony Battersby <[email protected]> wrote:
>>
>>> For O_DIRECT the kernel pins the submitted pages in memory for DMA by
>>> incrementing the page reference counts when the I/O is submitted,
>>> allowing the pages to be modified by DMA even if they are no longer
>>> mapped in the address space of the process. This is different from a
>>> regular read(), which uses the CPU to copy the data and will fail if the
>>> pages are not mapped.
>>>
>> Can you please provide an example code-path? For instance,
>> file_read_actor() does not pin any pages but only keeps the user-space
>> address and resolves it once it has data to write.
>>
>
> This may be an issue for anything in the kernel that calls
> get_user_pages and holds onto the result at any time that mmap_sem isn't
> held.
>
>
Exactly. For O_DIRECT, that would be the call to get_user_pages_fast()
from dio_refill_pages() in fs/direct-io.c, which is ultimately called
from blkdev_direct_IO().
>From the comment for get_user_pages_fast(): "Attempt to pin user pages
in memory..."
Tony
Hi
On Fri, Apr 11, 2014 at 3:43 PM, Tony Battersby <[email protected]> wrote:
> Exactly. For O_DIRECT, that would be the call to get_user_pages_fast()
> from dio_refill_pages() in fs/direct-io.c, which is ultimately called
> from blkdev_direct_IO().
If you drop mmap_sem after pinning a page without taking a write-ref,
you break i_mmap_writable / VM_DENYWRITE. In memfd I rely on
i_mmap_writable to work, same thing is done by exec() (and the old,
now disabled, MAP_DENYWRITE).
I don't know whether I should care. I mean, everyone pinning pages and
writing to it without holding the mmap_sem has to take a write-ref for
each page or it breaks i_mmap_writable. So this seems to be a bug in
direct-IO, not in anyone relying on it, right?
Thanks
David
On 04/11/2014 02:31 PM, David Herrmann wrote:
> Hi
>
> On Fri, Apr 11, 2014 at 3:43 PM, Tony Battersby <[email protected]> wrote:
>> Exactly. For O_DIRECT, that would be the call to get_user_pages_fast()
>> from dio_refill_pages() in fs/direct-io.c, which is ultimately called
>> from blkdev_direct_IO().
>
> If you drop mmap_sem after pinning a page without taking a write-ref,
> you break i_mmap_writable / VM_DENYWRITE. In memfd I rely on
> i_mmap_writable to work, same thing is done by exec() (and the old,
> now disabled, MAP_DENYWRITE).
>
> I don't know whether I should care. I mean, everyone pinning pages and
> writing to it without holding the mmap_sem has to take a write-ref for
> each page or it breaks i_mmap_writable. So this seems to be a bug in
> direct-IO, not in anyone relying on it, right?
A quick grep of the kernel tree finds exactly zero code paths
incrementing i_mmap_writable outside of mmap and fork.
Or do you mean a different kind of write ref? What am I missing here?
--Andy
Hi
On Fri, Apr 11, 2014 at 11:36 PM, Andy Lutomirski <[email protected]> wrote:
> A quick grep of the kernel tree finds exactly zero code paths
> incrementing i_mmap_writable outside of mmap and fork.
>
> Or do you mean a different kind of write ref? What am I missing here?
Sorry, I meant i_writecount.
Thanks
David
On Fri, Apr 11, 2014 at 2:42 PM, David Herrmann <[email protected]> wrote:
> Hi
>
> On Fri, Apr 11, 2014 at 11:36 PM, Andy Lutomirski <[email protected]> wrote:
>> A quick grep of the kernel tree finds exactly zero code paths
>> incrementing i_mmap_writable outside of mmap and fork.
>>
>> Or do you mean a different kind of write ref? What am I missing here?
>
> Sorry, I meant i_writecount.
I bet this is missing from lots of places. For example, I can't find
any write_access stuff in the rdma code.
I suspect that the VM_DENYWRITE code is just generally racy.
--Andy
Hi
On Sat, Apr 12, 2014 at 12:07 AM, Andy Lutomirski <[email protected]> wrote:
> I bet this is missing from lots of places. For example, I can't find
> any write_access stuff in the rdma code.
>
> I suspect that the VM_DENYWRITE code is just generally racy.
So what does S_IMMUTABLE do to prevent such races? I somehow suspect
it's broken in that regard, too.
I really dislike pinning pages like this, but if people want to keep
it I guess I have to scan all shmem-inode pages before changing seals.
Thanks
David