Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752307Ab3ENKcN (ORCPT ); Tue, 14 May 2013 06:32:13 -0400 Received: from mail-vc0-f178.google.com ([209.85.220.178]:48396 "EHLO mail-vc0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751141Ab3ENKcM (ORCPT ); Tue, 14 May 2013 06:32:12 -0400 MIME-Version: 1.0 In-Reply-To: References: Date: Tue, 14 May 2013 18:32:11 +0800 Message-ID: Subject: Re: Yet another page fault deadlock From: Ming Lei To: Dmitry Maluka Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4272 Lines: 96 On Tue, May 14, 2013 at 2:13 AM, Dmitry Maluka wrote: > Hi, > > Sometimes we run into an interesting deadlock on mm->mmap_sem. I see it > is similar to these deadlocks: > > https://lkml.org/lkml/2005/2/22/123 > https://lkml.org/lkml/2001/9/17/105 > > in the sense that it is too triggered by page faults and is explained by > "fair" rwsem semantics (pending write lock blocks further read locks). > > First, let me describe the prerequisites. It is an embedded MIPS > platform. We have 2 custom kernel drivers (call them A and B): > > - Driver A implements hardware encryption/decryption. It acts both as a > char device driver and as an in-kernel library with an API allowing > other kernel modules to encrypt/decrypt data. Important point: driver A > uses a single mutex (call it A_mutex) to protect all its operations, > regardless of whether they are requested by user space or by another > kernel module. > > - Driver B is a block device driver implementing a transparent encrypted > storage. It uses driver A's in-kernel API for encryption during write > and decryption during read. > > We have squashfs mounted on a block device provided by driver B. And we > have a user-space process with a plenty of threads in it (call them > thread 1, 2, 3, ...). > > Now, the sequence leading to the deadlock: > > 1. Thread 1 needs to encrypt or decrypt some data. It uses char device > interface provided by driver A. Upon driver entry, it first locks A_mutex. > > 2. Thread 2 reads from a mmap'ed file on squashfs. Page fault is > generated. do_page_fault() read-locks mm->mmap_sem. Then squashfs > filemap fault handler is called, then read request is sent to driver B, > then driver B calls an API function from driver A. This function first > tries to lock A_mutex, and hangs on it. > > 3. Thread 3 does a syscall which requires mm->mmap_sem write-locked > (sometimes it is mmap, sometimes mprotect). It hangs on mm->mmap_sem. > > 4. Thread 1 proceeds with handling the request from user space from step > 1. During copy_to_user() or copy_from_user() page fault is generated. > do_page_fault() tries to read-lock mm->mmap_sem and hangs on it. If the user buffer passed to driver A is mapped against file on the block device, single thread 1 may still deadlock on the mutex A. > > This deadlock does not happen if we memset() the entire user space > buffer in thread 1 before doing the syscall. I.e. we make sure that the It can't be avoided 100% with the memset() workaround since the user buffer might be swapped out. > buffer is fully mapped before the request to driver A, preventing demand > paging during copy_to/from_user(). We are currently using it as a > workaround. > > So... I realize that in our case the deadlock is caused by our > proprietary component (driver A) whose authors were smart guys but not > farsighted enough to anticipate this scenario. Now we are considering > reworking driver A to make all copy_to/from_user() calls without A_mutex > locked. This should remove the deadlock source, AFAICS. Looks there are some similar examples, one of them is b31ca3f5df( sysfs: fix deadlock). > > However, it looks like a general internal kernel architecture problem. > The whole page fault handling procedure is done with mm->mmap_sem > read-held, and due to rwsem semantics, down_read/down_write/down_read > deadlock may happen if two threads are getting page fault and a third > thread is trying to write-lock mm->mmap_sem. So all the code performing > page fault handling procedure should be especially careful about > avoiding such deadlock. But this is a complex procedure involving > different subsystems, particularly, arbitrary block device driver. So > any block device driver should be implemented with this in mind. While > this is probably not documented anywhere. Maybe it is good to document the lock usage, but the rule isn't much complicated: if one lock may be held under mmap_sem, the lock can't be held before copy_to/from_user(), :-) Thanks, -- Ming Lei -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/