Running xfstest generic/036, we hit VM_BUG_ON() in nfs_direct_IO().
036 toggles O_DIRECT flag while IO is going on. We cannot simply remove
the VM_BUG_ON() there because we'll have a deadlock in the code path.
inode->i_mutex is taken when calling into ->direct_IO.
And nfs_file_direct_write() would grab inode->i_mutex again.
nfs_file_write() and generic_file_write_iter() checks for O_DIRECT twice,
and it creates a race window if user space is playing with O_DIRECT flag.
Fix it by calling generic_perform_write() instead, so that nfs_direct_IO()
is only invoked in swap on nfs case.
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/file.c | 41 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 40 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 2ab6f00..e98604a 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -662,6 +662,45 @@ static int nfs_need_sync_write(struct file *filp, struct inode *inode)
return 0;
}
+static ssize_t nfs_file_buffer_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
+ ssize_t result = 0;
+ size_t count = iov_iter_count(from);
+ loff_t pos = iocb->ki_pos;
+ int ret;
+
+ mutex_lock(&inode->i_mutex);
+ /* We can write back this queue in page reclaim */
+ current->backing_dev_info = mapping->backing_dev_info;
+ ret = generic_write_checks(file, &pos, &count, 0);
+ if (ret)
+ goto out;
+
+ if (!count)
+ goto out;
+
+ iov_iter_truncate(from, count);
+ ret = file_remove_suid(file);
+ if (ret)
+ goto out;
+
+ ret = file_update_time(file);
+ if (ret)
+ goto out;
+
+ result = generic_perform_write(file, from, pos);
+ if (likely(result >= 0))
+ iocb->ki_pos = pos + result;
+
+out:
+ current->backing_dev_info = NULL;
+ mutex_unlock(&inode->i_mutex);
+ return result ? result : ret;
+}
+
ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
@@ -697,7 +736,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
if (!count)
goto out;
- result = generic_file_write_iter(iocb, from);
+ result = nfs_file_buffer_write(iocb, from);
if (result > 0)
written = result;
--
1.9.1
Hi Tao,
On Mon, Jan 19, 2015 at 4:15 AM, Peng Tao <[email protected]> wrote:
> Running xfstest generic/036, we hit VM_BUG_ON() in nfs_direct_IO().
> 036 toggles O_DIRECT flag while IO is going on. We cannot simply remove
> the VM_BUG_ON() there because we'll have a deadlock in the code path.
> inode->i_mutex is taken when calling into ->direct_IO.
> And nfs_file_direct_write() would grab inode->i_mutex again.
>
> nfs_file_write() and generic_file_write_iter() checks for O_DIRECT twice,
> and it creates a race window if user space is playing with O_DIRECT flag.
> Fix it by calling generic_perform_write() instead, so that nfs_direct_IO()
> is only invoked in swap on nfs case.
>
> Suggested-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/nfs/file.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 2ab6f00..e98604a 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -662,6 +662,45 @@ static int nfs_need_sync_write(struct file *filp, struct inode *inode)
> return 0;
> }
>
> +static ssize_t nfs_file_buffer_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> + struct file *file = iocb->ki_filp;
> + struct address_space *mapping = file->f_mapping;
> + struct inode *inode = mapping->host;
> + ssize_t result = 0;
> + size_t count = iov_iter_count(from);
> + loff_t pos = iocb->ki_pos;
> + int ret;
> +
> + mutex_lock(&inode->i_mutex);
> + /* We can write back this queue in page reclaim */
> + current->backing_dev_info = mapping->backing_dev_info;
> + ret = generic_write_checks(file, &pos, &count, 0);
> + if (ret)
> + goto out;
> +
> + if (!count)
> + goto out;
> +
> + iov_iter_truncate(from, count);
> + ret = file_remove_suid(file);
> + if (ret)
> + goto out;
> +
> + ret = file_update_time(file);
> + if (ret)
> + goto out;
> +
> + result = generic_perform_write(file, from, pos);
> + if (likely(result >= 0))
> + iocb->ki_pos = pos + result;
> +
> +out:
> + current->backing_dev_info = NULL;
> + mutex_unlock(&inode->i_mutex);
> + return result ? result : ret;
> +}
> +
> ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
> {
> struct file *file = iocb->ki_filp;
> @@ -697,7 +736,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
> if (!count)
> goto out;
>
> - result = generic_file_write_iter(iocb, from);
> + result = nfs_file_buffer_write(iocb, from);
> if (result > 0)
> written = result;
>
I still haven't got an answer to my question as to why we need to do
all this if we don't actually want anything other than the swapfile
code to use nfs_direct_IO(}? The xfstest with toggling O_DIRECT at
random doesn't reflect any sane behaviour that we want to support; all
we want to do is ensure that it doesn't deadlock. Why wouldn't the
proposal that we add a line to return '0' if !IS_SWAPFILE(inode)
suffice?
Cheers
Trond
Hi Trond,
On Mon, Jan 19, 2015 at 9:17 PM, Trond Myklebust
<[email protected]> wrote:
> Hi Tao,
>
> On Mon, Jan 19, 2015 at 4:15 AM, Peng Tao <[email protected]> wrote:
>> Running xfstest generic/036, we hit VM_BUG_ON() in nfs_direct_IO().
>> 036 toggles O_DIRECT flag while IO is going on. We cannot simply remove
>> the VM_BUG_ON() there because we'll have a deadlock in the code path.
>> inode->i_mutex is taken when calling into ->direct_IO.
>> And nfs_file_direct_write() would grab inode->i_mutex again.
>>
>> nfs_file_write() and generic_file_write_iter() checks for O_DIRECT twice,
>> and it creates a race window if user space is playing with O_DIRECT flag.
>> Fix it by calling generic_perform_write() instead, so that nfs_direct_IO()
>> is only invoked in swap on nfs case.
>>
>> Suggested-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: Peng Tao <[email protected]>
>> ---
>> fs/nfs/file.c | 41 ++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 40 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index 2ab6f00..e98604a 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -662,6 +662,45 @@ static int nfs_need_sync_write(struct file *filp, struct inode *inode)
>> return 0;
>> }
>>
>> +static ssize_t nfs_file_buffer_write(struct kiocb *iocb, struct iov_iter *from)
>> +{
>> + struct file *file = iocb->ki_filp;
>> + struct address_space *mapping = file->f_mapping;
>> + struct inode *inode = mapping->host;
>> + ssize_t result = 0;
>> + size_t count = iov_iter_count(from);
>> + loff_t pos = iocb->ki_pos;
>> + int ret;
>> +
>> + mutex_lock(&inode->i_mutex);
>> + /* We can write back this queue in page reclaim */
>> + current->backing_dev_info = mapping->backing_dev_info;
>> + ret = generic_write_checks(file, &pos, &count, 0);
>> + if (ret)
>> + goto out;
>> +
>> + if (!count)
>> + goto out;
>> +
>> + iov_iter_truncate(from, count);
>> + ret = file_remove_suid(file);
>> + if (ret)
>> + goto out;
>> +
>> + ret = file_update_time(file);
>> + if (ret)
>> + goto out;
>> +
>> + result = generic_perform_write(file, from, pos);
>> + if (likely(result >= 0))
>> + iocb->ki_pos = pos + result;
>> +
>> +out:
>> + current->backing_dev_info = NULL;
>> + mutex_unlock(&inode->i_mutex);
>> + return result ? result : ret;
>> +}
>> +
>> ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
>> {
>> struct file *file = iocb->ki_filp;
>> @@ -697,7 +736,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
>> if (!count)
>> goto out;
>>
>> - result = generic_file_write_iter(iocb, from);
>> + result = nfs_file_buffer_write(iocb, from);
>> if (result > 0)
>> written = result;
>>
>
> I still haven't got an answer to my question as to why we need to do
> all this if we don't actually want anything other than the swapfile
> code to use nfs_direct_IO(}?
Sorry that I was under the impression that you were asking advice from
Al and thus skipped your proposal as Al did not response.
> The xfstest with toggling O_DIRECT at
> random doesn't reflect any sane behaviour that we want to support; all
> we want to do is ensure that it doesn't deadlock. Why wouldn't the
> proposal that we add a line to return '0' if !IS_SWAPFILE(inode)
> suffice?
>
The reason for ->direct_IO being called here is because we check
O_DIRECT flag twice in the code path and it creates a window for user
space application to flip O_DIRECT flag value. The patch fixes that
and it matches other filesystem's implementation, which is to check
O_DIRECT only once. I agree with you that we can return 0 in
nfs_direct_IO for non-swapfile case if we want to simply declare not
to support such O_DIRECT toggling behaviour. Given that the generic
layer falls back to buffer IO if ->direct_IO returns 0, it will have
same effect as this patch.
Cheers,
Tao