2009-04-14 06:15:59

by KOSAKI Motohiro

Subject: [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix


Linux Device Drivers, Third Edition, Chapter 15: Memory Mapping and DMA says

get_user_pages is a low-level memory management function, with a suitably complex
interface. It also requires that the mmap reader/writer semaphore for the address
space be obtained in read mode before the call. As a result, calls to get_user_pages
usually look something like:

down_read(&current->mm->mmap_sem);
result = get_user_pages(current, current->mm, ...);
up_read(&current->mm->mmap_sem);

The return value is the number of pages actually mapped, which could be fewer than
the number requested (but greater than zero).

but it isn't quite true. mmap_sem isn't only used for vma traversal; it also prevents a race against fork.
up_read(mmap_sem) marks the end of the critical section, IOW code after up_read() is fork-unsafe.
(access_process_vm() shows the proper get_user_pages() usage.)

Oh well, we have many wrong callers now. What is the best way to fix them?

Nick Piggin and Andrea Arcangeli proposed changing get_user_pages()'s semantics to match what callers expect;
see the "[PATCH] fork vs gup(-fast) fix" thread on linux-mm.
But Linus NACKed it.

Thus I made this patch series, which takes the caller-change approach. It is meant for discussion, to compare with Nick's approach.
I don't intend to submit it yet.

Nick, this version fixes the vmsplice and aio issues you pointed out. I hope to hear your opinion ;)



ChangeLog:
V2 -> V3
o remove early decow logic
o introduce prevent unmap logic
o fix nfs-directio
o fix aio
o fix bio (only bandaid fix)

V1 -> V2
o fix aio+dio case

TODO
o implement down_write_killable()
o fix kvm (need?)
o fix get_arg_page() (why doesn't this function use mmap_sem?)


2009-04-14 06:17:21

by KOSAKI Motohiro

Subject: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

Subject: [PATCH] mm: Don't unmap gup()ed page

Currently, the following test program fails.

forkscrewreverse-2.c
================================================
#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (40*1024*1024)
#define BUFSIZE (40*1024*1024)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static const char *filename = "file.dat";
static int fd;
static void *buffer;
#define PAGE_SIZE 4096

void
dump_buffer(char *buf, int len)
{
int i;
int last_off, last_val;

last_off = -1;
last_val = -1;

for (i = 0; i < len; i++) {
if (last_off < 0) {
last_off = i;
last_val = buf[i];
continue;
}

if (buf[i] != last_val) {
printf("%d - %d: %x\n", last_off, i - 1, last_val);
last_off = i;
last_val = buf[i];
}
}

if (last_off != len - 1)
printf("%d - %d: %x\n", last_off, i-1, last_val);
}

static void store(void)
{
int i;

if (usleep(100*1000) == -1)
perror("usleep"), exit(1);

printf("child storing\n"); fflush(stdout);
for (i = 0; i < BUFSIZE; i++)
((char *)buffer)[i] = 0xff;
printf("child storing end\n"); fflush(stdout);
_exit(0);
}

static void *writer(void *arg)
{
int i;

if (pthread_mutex_lock(&lock) == -1)
perror("pthread_mutex_lock"), exit(1);

printf("thread writing\n"); fflush(stdout);
for (i = 0; i < FILESIZE / BUFSIZE; i++) {
size_t count = BUFSIZE;
ssize_t ret;

do {
ret = write(fd, buffer, count);
if (ret == -1) {
if (errno != EINTR)
perror("write"), exit(1);
ret = 0;
}
count -= ret;
} while (count);
}
printf("thread writing done\n"); fflush(stdout);

if (pthread_mutex_unlock(&lock) == -1)
perror("pthread_mutex_lock"), exit(1);

return NULL;
}

int main(int argc, char *argv[])
{
int i;
int status;
pthread_t writer_thread;
pid_t store_proc;

posix_memalign(&buffer, PAGE_SIZE, BUFSIZE);
printf("Write buffer: %p.\n", buffer);

for (i = 0; i < BUFSIZE; i++)
((char *)buffer)[i] = 0x00;

fd = open(filename, O_RDWR|O_DIRECT);
if (fd == -1)
perror("open"), exit(1);

if (pthread_mutex_lock(&lock) == -1)
perror("pthread_mutex_lock"), exit(1);

if (pthread_create(&writer_thread, NULL, writer, NULL) == -1)
perror("pthred_create"), exit(1);

store_proc = fork();
if (store_proc == -1)
perror("fork"), exit(1);

if (pthread_mutex_unlock(&lock) == -1)
perror("pthread_mutex_lock"), exit(1);

if (!store_proc)
store();

if (usleep(50*1000) == -1)
perror("usleep"), exit(1);

printf("parent storing\n"); fflush(stdout);
for (i = 0; i < BUFSIZE; i++)
((char *)buffer)[i] = 0x11;

do {
pid_t w;
w = waitpid(store_proc, &status, WUNTRACED | WCONTINUED);
if (w == -1)
perror("waitpid"), exit(1);
} while (!WIFEXITED(status) && !WIFSIGNALED(status));

if (pthread_join(writer_thread, NULL) == -1)
perror("pthread_join"), exit(1);

close(fd);
fd = open(filename, O_RDWR|O_DIRECT);
if (fd == -1)
perror("open"), exit(1);

if (read(fd, buffer, BUFSIZE) < 0)
perror("read buffer"), exit(1);

if (memchr(buffer, 0xff, BUFSIZE) != NULL)
fprintf(stderr, " test failed !!!!!!!!!!!!!!!\n\n");

dump_buffer(buffer, BUFSIZE);

exit(0);
}
===============================================================

It fails because the following scenario happens.

CPU0 CPU1 CPU2 note
(parent) (writer thread) (child)
==============================================================================
fill 0
create writer thread
fork()
write()
| get_user_pages(read) inc page_count
|
fill 0x11 | COW break
| the parent gets a new page.
| (then the child gets the original page as writable)
|
| fill 0xff child changes the DIO-targeted page
|
v

The root cause is that reuse_swap_page() doesn't take get_user_pages()'s reference
count into account; it only considers map_count.

This patch changes reuse_swap_page() to check page_count(). Changing only reuse_swap_page()
would introduce the following side effect, so the patch also changes try_to_unmap().

CPU0 CPU1 CPU2 note
(thread1) (thread2)
=============================================================================
DIO read()
| get_user_pages(write) inc page_count
|
|
| try_to_unmap() the page is unmapped from
| the process's pte.
|
do_wp_page() | page fault and
| reuse_swap_page() returns 0,
| then COW break happens and
| the process gets a new copied page.
| The DIO read result is lost.
v


Now reuse_swap_page() behaves as it did before commit c475a8ab, and read-side
get_user_pages() and get_user_pages_fast() become fork-safe.

This patch doesn't only fix DirectIO; it also fixes the other read-side get_user_pages()
callers (e.g. futex, vmsplice, et al.)


btw, if you want write-side get_user_pages(), you must prevent fork by keeping mmap_sem
held across the whole critical section.

obviously wrong example code:

down_read(&current->mm->mmap_sem);
get_user_pages(current, current->mm, addr, 1, 1, 0, &page, NULL);
up_read(&current->mm->mmap_sem);

up_read(&current->mm->mmap_sem) marks the end of the critical section, so this code is
obviously fork-unsafe if the pages are used afterwards.
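
For comparison, a minimal sketch of what a fork-safe write-side caller would look like
(hypothetical_dma_to_page() is a made-up placeholder for the actual device-to-memory transfer):

down_read(&current->mm->mmap_sem);
ret = get_user_pages(current, current->mm, addr, 1, 1, 0, &page, NULL);
if (ret == 1) {
	/* still under mmap_sem: fork() cannot COW this page away */
	hypothetical_dma_to_page(page);
	page_cache_release(page);
}
up_read(&current->mm->mmap_sem);	/* the critical section ends only after the transfer */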


Signed-off-by: KOSAKI Motohiro <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: [email protected]
---
mm/rmap.c | 21 +++++++++++++++++++++
mm/swapfile.c | 10 +++++++++-
2 files changed, 30 insertions(+), 1 deletion(-)

Index: b/mm/swapfile.c
===================================================================
--- a/mm/swapfile.c 2009-04-11 21:38:33.000000000 +0900
+++ b/mm/swapfile.c 2009-04-11 21:38:45.000000000 +0900
@@ -533,6 +533,8 @@ static inline int page_swapcount(struct
* to it. And as a side-effect, free up its swap: because the old content
* on disk will never be read, and seeking back there to write new content
* later would only waste time away from clustering.
+ * Caller must hold pte_lock. try_to_unmap() decrements page::_mapcount
+ * and get_user_pages() increments page::_count under pte_lock.
*/
int reuse_swap_page(struct page *page)
{
@@ -547,7 +549,13 @@ int reuse_swap_page(struct page *page)
SetPageDirty(page);
}
}
- return count == 1;
+
+ /*
+ * If we can re-use the swap page _and_ the end
+ * result has only one user (the mapping), then
+ * we reuse the whole page
+ */
+ return count + page_count(page) == 2;
}

/*
Index: b/mm/rmap.c
===================================================================
--- a/mm/rmap.c 2009-04-11 21:38:33.000000000 +0900
+++ b/mm/rmap.c 2009-04-12 00:58:58.000000000 +0900
@@ -773,6 +773,27 @@ static int try_to_unmap_one(struct page
goto out;

/*
+ * Don't pull an anonymous page out from under get_user_pages.
+ * GUP carefully breaks COW and raises page count (while holding
+ * pte_lock, as we have here) to make sure that the page
+ * cannot be freed. If we unmap that page here, a user write
+ * access to the virtual address will bring back the page, but
+ * its raised count will (ironically) be taken to mean it's not
+ * an exclusive swap page, do_wp_page will replace it by a copy
+ * page, and the user never get to see the data GUP was holding
+ * the original page for.
+ *
+ * This test is also useful for when swapoff (unuse_process) has
+ * to drop page lock: its reference to the page stops existing
+ * ptes from being unmapped, so swapoff can make progress.
+ */
+ if (PageSwapCache(page) &&
+ page_count(page) != page_mapcount(page) + 2) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+
+ /*
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.

2009-04-14 06:18:24

by KOSAKI Motohiro

Subject: [RFC][PATCH v3 2/6] mm, directio: fix fork vs direct-io race (read(2) side IOW gup(write) side)

Subject: [PATCH] mm, directio: fix fork vs direct-io race


ChangeLog:
V2 -> V3
o remove early decow logic

V1 -> V2
o add dio+aio logic

===============================================

Currently, the following testcase fails.

$ dma_thread -a 512 -w 40

========== dma_thread.c =======
/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024)
#define READSIZE (1024*1024)

#define FILENAME "test_%.04d.tmp"
#define FILECOUNT 100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE 4096

#define true 1
#define false 0

typedef int bool;

bool done = false;
int workers = 2;

#define PATTERN (0xfa)

static void
usage (void)
{
fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [ -w <workers>]\n"
"\nWith no arguments, generate test files and exit.\n"
"-h Display this help and exit.\n"
"-a align read buffer to offset <alignment>.\n"
"-w number of worker threads, 2 (default) to 256,\n"
" defaults to number of cores.\n\n"

"Run first with no arguments to generate files.\n"
"Then run with -a <alignment> = 512 or 0. \n");
}

typedef struct {
pthread_t tid;
int worker_number;
int fd;
int offset;
int length;
int pattern;
unsigned char *buffer;
} worker_t;


void *worker_thread(void * arg)
{
int bytes_read;
int i,k;
worker_t *worker = (worker_t *) arg;
int offset = worker->offset;
int fd = worker->fd;
unsigned char *buffer = worker->buffer;
int pattern = worker->pattern;
int length = worker->length;

if (lseek(fd, offset, SEEK_SET) < 0) {
fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n",
offset, fd, strerror(errno));
exit(1);
}

bytes_read = read(fd, buffer, length);
if (bytes_read != length) {
fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n",
fd, bytes_read, strerror(errno));
exit(1);
}

/* Corruption check */
for (i = 0; i < length; i++) {
if (buffer[i] != pattern) {
printf("Bad data at 0x%.06x: %p, \n", i, buffer + i);
printf("Data dump starting at 0x%.06x:\n", i - 8);
printf("Expect 0x%x followed by 0x%x:\n",
pattern, PATTERN);

for (k = 0; k < 16; k++) {
printf("%02x ", buffer[i - 8 + k]);
if (k == 7) {
printf("\n");
}
}

printf("\n");
abort();
}
}

return 0;
}

void *fork_thread (void *arg)
{
pid_t pid;

while (!done) {
pid = fork();
if (pid == 0) {
exit(0);
} else if (pid < 0) {
fprintf(stderr, "Failed to fork child.\n");
exit(1);
}
waitpid(pid, NULL, 0 );
usleep(100);
}

return NULL;

}

int main(int argc, char *argv[])
{
unsigned char *buffer = NULL;
char filename[1024];
int fd;
bool dowrite = true;
pthread_t fork_tid;
int c, n, j;
worker_t *worker;
int align = 0;
int offset, rc;

workers = sysconf(_SC_NPROCESSORS_ONLN);

while ((c = getopt(argc, argv, "a:hw:")) != -1) {
switch (c) {
case 'a':
align = atoi(optarg);
if (align < 0 || align > PAGE_SIZE) {
printf("Bad alignment %d.\n", align);
exit(1);
}
dowrite = false;
break;

case 'h':
usage();
exit(0);
break;

case 'w':
workers = atoi(optarg);
if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
fprintf(stderr, "Worker count %d not between "
"%d and %d, inclusive.\n",
workers, MIN_WORKERS, MAX_WORKERS);
usage();
exit(1);
}
dowrite = false;
break;

default:
usage();
exit(1);
}
}

if (argc > 1 && (optind < argc)) {
fprintf(stderr, "Bad command line.\n");
usage();
exit(1);
}

if (dowrite) {

buffer = malloc(FILESIZE);
if (buffer == NULL) {
fprintf(stderr, "Failed to malloc write buffer.\n");
exit(1);
}

for (n = 1; n <= FILECOUNT; n++) {
sprintf(filename, FILENAME, n);
fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
if (fd < 0) {
printf("create failed(%s): %s.\n", filename, strerror(errno));
exit(1);
}
memset(buffer, n, FILESIZE);
printf("Writing file %s.\n", filename);
if (write(fd, buffer, FILESIZE) != FILESIZE) {
printf("write failed (%s)\n", filename);
}

close(fd);
fd = -1;
}

free(buffer);
buffer = NULL;

printf("done\n");
exit(0);
}

printf("Using %d workers.\n", workers);

worker = malloc(workers * sizeof(worker_t));
if (worker == NULL) {
fprintf(stderr, "Failed to malloc worker array.\n");
exit(1);
}

for (j = 0; j < workers; j++) {
worker[j].worker_number = j;
}

printf("Using alignment %d.\n", align);

posix_memalign((void *)&buffer, PAGE_SIZE, READSIZE+ align);
printf("Read buffer: %p.\n", buffer);
for (n = 1; n <= FILECOUNT; n++) {

sprintf(filename, FILENAME, n);
for (j = 0; j < workers; j++) {
if ((worker[j].fd = open(filename, O_RDONLY|O_DIRECT)) < 0) {
fprintf(stderr, "Failed to open %s: %s.\n",
filename, strerror(errno));
exit(1);
}

worker[j].pattern = n;
}

printf("Reading file %d.\n", n);

for (offset = 0; offset < FILESIZE; offset += READSIZE) {
memset(buffer, PATTERN, READSIZE + align);
for (j = 0; j < workers; j++) {
worker[j].offset = offset + j * PAGE_SIZE;
worker[j].buffer = buffer + align + j * PAGE_SIZE;
worker[j].length = PAGE_SIZE;
}
/* The final worker reads whatever is left over. */
worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

done = 0;

rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
if (rc != 0) {
fprintf(stderr, "Can't create fork thread: %s.\n",
strerror(rc));
exit(1);
}

for (j = 0; j < workers; j++) {
rc = pthread_create(&worker[j].tid,
NULL,
worker_thread,
worker + j);
if (rc != 0) {
fprintf(stderr, "Can't create worker thread %d: %s.\n",
j, strerror(rc));
exit(1);
}
}

for (j = 0; j < workers; j++) {
rc = pthread_join(worker[j].tid, NULL);
if (rc != 0) {
fprintf(stderr, "Failed to join worker thread %d: %s.\n",
j, strerror(rc));
exit(1);
}
}

/* Let the fork thread know it's ok to exit */
done = 1;

rc = pthread_join(fork_tid, NULL);
if (rc != 0) {
fprintf(stderr, "Failed to join fork thread: %s.\n",
strerror(rc));
exit(1);
}
}

/* Close the fd's for the next file. */
for (j = 0; j < workers; j++) {
close(worker[j].fd);
}
}

return 0;
}
========== dma_thread.c =======

It fails because the following scenario happens.

CPU0 CPU1 CPU2 note
(fork thread) (worker thread1) (worker thread2)
==========================================================================================
read()
| get_user_pages()
|
fork | inc map_count and wprotect
|
| read()
| | get_user_pages() COW break: CPU2 gets a copied page,
| | but CPU1 still points to the original page,
| | so the result of CPU1's transfer is lost.
v |
|
|
v


Actually, get_user_pages() (and get_user_pages_fast()) don't provide any pinning operation.
The caller must prevent fork for the whole critical section.
access_process_vm() shows the standard fork-protection pattern: it holds mmap_sem.

But mmap_sem is a very easily contended lock, and holding it across the I/O would cause a large
performance regression for DirectIO. So this patch introduces a new lock as an alternative
fork-prevention mechanism. Almost no application forks while DirectIO is in progress, so
mm_pinned_sem is almost never contended.
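
The resulting usage pattern for a gup(write)-side caller looks roughly like the sketch below.
(submit_dio() and the completion hook are illustrative placeholders; the real hooks are in the
fs/direct-io.c hunks further down.)

struct mm_struct *mm = current->mm;

atomic_inc(&mm->mm_count);			/* keep the mm alive until I/O completion */
down_read_non_owner(&mm->mm_pinned_sem);	/* block fork() while the pages stay pinned */

get_user_pages_fast(user_addr, nr_pages, 1, pages);
submit_dio(pages, nr_pages);			/* the I/O may complete in another context */

/* ... later, in the completion path (possibly another task): */
up_read_non_owner(&mm->mm_pinned_sem);
mmdrop(mm);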


Also, this patch fixes the following aio+dio testcase.

========== forkscrew.c ========
/*
* Copyright 2009, Red Hat, Inc.
*
* Author: Jeff Moyer <[email protected]>
*
* This program attempts to expose a race between O_DIRECT I/O and the fork()
* path in a multi-threaded program. In order to reliably reproduce the
* problem, it is best to perform a dd from the device under test to /dev/null
* as this makes the read I/O slow enough to orchestrate the problem.
*
* Running: ./forkscrew
*
* It is expected that a file name "data" exists in the current working
* directory, and that its contents are something other than 0x2a. A simple
* dd if=/dev/zero of=data bs=1M count=1 should be sufficient.
*/
#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <pthread.h>
#include <libaio.h>

pthread_cond_t worker_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t worker_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t fork_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t fork_mutex = PTHREAD_MUTEX_INITIALIZER;

char *buffer;
int fd;

/* pattern filled into the in-memory buffer */
#define PATTERN 0x2a // '*'

void
usage(void)
{
fprintf(stderr,
"\nUsage: forkscrew\n"
"it is expected that a file named \"data\" is the current\n"
"working directory. It should be at least 3*pagesize in size\n"
);
}

void
dump_buffer(char *buf, int len)
{
int i;
int last_off, last_val;

last_off = -1;
last_val = -1;

for (i = 0; i < len; i++) {
if (last_off < 0) {
last_off = i;
last_val = buf[i];
continue;
}

if (buf[i] != last_val) {
printf("%d - %d: %d\n", last_off, i - 1, last_val);
last_off = i;
last_val = buf[i];
}
}

if (last_off != len - 1)
printf("%d - %d: %d\n", last_off, i-1, last_val);
}

int
check_buffer(char *bufp, int len, int pattern)
{
int i;

for (i = 0; i < len; i++) {
if (bufp[i] == pattern)
return 1;
}
return 0;
}

void *
forker_thread(void *arg)
{
pthread_mutex_lock(&fork_mutex);
pthread_cond_signal(&fork_cond);
pthread_cond_wait(&fork_cond, &fork_mutex);
switch (fork()) {
case 0:
sleep(1);
printf("child dumping buffer:\n");
dump_buffer(buffer + 512, 2*getpagesize());
exit(0);
case -1:
perror("fork");
exit(1);
default:
break;
}
pthread_cond_signal(&fork_cond);
pthread_mutex_unlock(&fork_mutex);

wait(NULL);
return (void *)0;
}

void *
worker(void *arg)
{
int first = (int)arg;
char *bufp;
int pagesize = getpagesize();
int ret;
int corrupted = 0;

if (first) {
io_context_t aioctx;
struct io_event event;
struct iocb *iocb = malloc(sizeof *iocb);
if (!iocb) {
perror("malloc");
exit(1);
}
memset(&aioctx, 0, sizeof(aioctx));
ret = io_setup(1, &aioctx);
if (ret != 0) {
errno = -ret;
perror("io_setup");
exit(1);
}
bufp = buffer + 512;
io_prep_pread(iocb, fd, bufp, pagesize, 0);

/* submit the I/O */
io_submit(aioctx, 1, &iocb);

/* tell the fork thread to run */
pthread_mutex_lock(&fork_mutex);
pthread_cond_signal(&fork_cond);

/* wait for the fork to happen */
pthread_cond_wait(&fork_cond, &fork_mutex);
pthread_mutex_unlock(&fork_mutex);

/* release the other worker to issue I/O */
pthread_mutex_lock(&worker_mutex);
pthread_cond_signal(&worker_cond);
pthread_mutex_unlock(&worker_mutex);

ret = io_getevents(aioctx, 1, 1, &event, NULL);
if (ret != 1) {
errno = -ret;
perror("io_getevents");
exit(1);
}
if (event.res != pagesize) {
errno = -event.res;
perror("read error");
exit(1);
}

io_destroy(aioctx);

/* check buffer, should be corrupt */
if (check_buffer(bufp, pagesize, PATTERN)) {
printf("worker 0 failed check\n");
dump_buffer(bufp, pagesize);
corrupted = 1;
}

} else {

bufp = buffer + 512 + pagesize;

pthread_mutex_lock(&worker_mutex);
pthread_cond_signal(&worker_cond); /* tell main we're ready */
/* wait for the first I/O and the fork */
pthread_cond_wait(&worker_cond, &worker_mutex);
pthread_mutex_unlock(&worker_mutex);

/* submit overlapping I/O */
ret = read(fd, bufp, pagesize);
if (ret != pagesize) {
perror("read");
exit(1);
}
/* check buffer, should be fine */
if (check_buffer(bufp, pagesize, PATTERN)) {
printf("worker 1 failed check -- abnormal\n");
dump_buffer(bufp, pagesize);
corrupted = 1;
}
}

return (void *)corrupted;
}

int
main(int argc, char **argv)
{
pthread_t workers[2];
pthread_t forker;
int ret, rc = 0;
void *thread_ret;
int pagesize = getpagesize();

fd = open("data", O_DIRECT|O_RDONLY);
if (fd < 0) {
perror("open");
exit(1);
}

ret = posix_memalign(&buffer, pagesize, 3 * pagesize);
if (ret != 0) {
errno = ret;
perror("posix_memalign");
exit(1);
}
memset(buffer, PATTERN, 3*pagesize);

pthread_mutex_lock(&fork_mutex);
ret = pthread_create(&forker, NULL, forker_thread, NULL);
pthread_cond_wait(&fork_cond, &fork_mutex);
pthread_mutex_unlock(&fork_mutex);

pthread_mutex_lock(&worker_mutex);
ret |= pthread_create(&workers[0], NULL, worker, (void *)0);
if (ret) {
perror("pthread_create");
exit(1);
}
pthread_cond_wait(&worker_cond, &worker_mutex);
pthread_mutex_unlock(&worker_mutex);

ret = pthread_create(&workers[1], NULL, worker, (void *)1);
if (ret != 0) {
perror("pthread_create");
exit(1);
}

pthread_join(forker, NULL);
pthread_join(workers[0], &thread_ret);
if (thread_ret != 0)
rc = 1;
pthread_join(workers[1], &thread_ret);
if (thread_ret != 0)
rc = 1;

if (rc != 0) {
printf("parent dumping full buffer\n");
dump_buffer(buffer + 512, 2 * pagesize);
}

close(fd);
free(buffer);
exit(rc);
}
========== forkscrew.c ========


Signed-off-by: KOSAKI Motohiro <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Andy Grover <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/direct-io.c | 16 ++++++++++++++++
include/linux/init_task.h | 1 +
include/linux/mm_types.h | 6 ++++++
kernel/fork.c | 3 +++
4 files changed, 26 insertions(+)

Index: b/fs/direct-io.c
===================================================================
--- a/fs/direct-io.c 2009-04-13 00:24:01.000000000 +0900
+++ b/fs/direct-io.c 2009-04-13 01:36:37.000000000 +0900
@@ -131,6 +131,9 @@ struct dio {
int is_async; /* is IO async ? */
int io_error; /* IO error in completion path */
ssize_t result; /* IO result */
+
+ /* fork exclusive stuff */
+ struct mm_struct *mm;
};

/*
@@ -244,6 +247,12 @@ static int dio_complete(struct dio *dio,
/* lockdep: non-owner release */
up_read_non_owner(&dio->inode->i_alloc_sem);

+ if (dio->rw == READ) {
+ BUG_ON(!dio->mm);
+ up_read_non_owner(&dio->mm->mm_pinned_sem);
+ mmdrop(dio->mm);
+ }
+
if (ret == 0)
ret = dio->page_errors;
if (ret == 0)
@@ -942,6 +951,7 @@ direct_io_worker(int rw, struct kiocb *i
ssize_t ret = 0;
ssize_t ret2;
size_t bytes;
+ struct mm_struct *mm;

dio->inode = inode;
dio->rw = rw;
@@ -960,6 +970,12 @@ direct_io_worker(int rw, struct kiocb *i
spin_lock_init(&dio->bio_lock);
dio->refcount = 1;

+ if (rw == READ) {
+ mm = dio->mm = current->mm;
+ atomic_inc(&mm->mm_count);
+ down_read_non_owner(&mm->mm_pinned_sem);
+ }
+
/*
* In case of non-aligned buffers, we may need 2 more
* pages since we need to zero out first and last block.
Index: b/include/linux/init_task.h
===================================================================
--- a/include/linux/init_task.h 2009-04-13 00:24:01.000000000 +0900
+++ b/include/linux/init_task.h 2009-04-13 00:24:32.000000000 +0900
@@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
.page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \
.mmlist = LIST_HEAD_INIT(name.mmlist), \
.cpu_vm_mask = CPU_MASK_ALL, \
+ .mm_pinned_sem = __RWSEM_INITIALIZER(name.mm_pinned_sem), \
}

#define INIT_SIGNALS(sig) { \
Index: b/include/linux/mm_types.h
===================================================================
--- a/include/linux/mm_types.h 2009-04-13 00:24:01.000000000 +0900
+++ b/include/linux/mm_types.h 2009-04-13 00:24:32.000000000 +0900
@@ -274,6 +274,12 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+
+ /*
+ * If there is in-flight directio or a similar pinning action,
+ * COW causes memory corruption. This sem protects it by preventing fork.
+ */
+ struct rw_semaphore mm_pinned_sem;
};

/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
Index: b/kernel/fork.c
===================================================================
--- a/kernel/fork.c 2009-04-13 00:24:01.000000000 +0900
+++ b/kernel/fork.c 2009-04-13 00:24:32.000000000 +0900
@@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm
unsigned long charge;
struct mempolicy *pol;

+ down_write(&oldmm->mm_pinned_sem);
down_write(&oldmm->mmap_sem);
flush_cache_dup_mm(oldmm);
/*
@@ -368,6 +369,7 @@ out:
up_write(&mm->mmap_sem);
flush_tlb_mm(oldmm);
up_write(&oldmm->mmap_sem);
+ up_write(&oldmm->mm_pinned_sem);
return retval;
fail_nomem_policy:
kmem_cache_free(vm_area_cachep, tmp);
@@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
mm_init_owner(mm, p);
+ init_rwsem(&mm->mm_pinned_sem);

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;

2009-04-14 06:19:40

by KOSAKI Motohiro

Subject: [RFC][PATCH v3 3/6] nfs, direct-io: fix fork vs direct-io race on nfs

Subject: [PATCH] nfs, direct-io: fix fork vs direct-io race on nfs

After the fs/direct-io.c fix, the following testcase still fails when running on NFS.
That's because NFS has its own direct-io implementation.

========== dma_thread.c =======
/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024)
#define READSIZE (1024*1024)

#define FILENAME "test_%.04d.tmp"
#define FILECOUNT 100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE 4096

#define true 1
#define false 0

typedef int bool;

bool done = false;
int workers = 2;

#define PATTERN (0xfa)

static void
usage (void)
{
fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [ -w <workers>]\n"
"\nWith no arguments, generate test files and exit.\n"
"-h Display this help and exit.\n"
"-a align read buffer to offset <alignment>.\n"
"-w number of worker threads, 2 (default) to 256,\n"
" defaults to number of cores.\n\n"

"Run first with no arguments to generate files.\n"
"Then run with -a <alignment> = 512 or 0. \n");
}

typedef struct {
pthread_t tid;
int worker_number;
int fd;
int offset;
int length;
int pattern;
unsigned char *buffer;
} worker_t;


void *worker_thread(void * arg)
{
int bytes_read;
int i,k;
worker_t *worker = (worker_t *) arg;
int offset = worker->offset;
int fd = worker->fd;
unsigned char *buffer = worker->buffer;
int pattern = worker->pattern;
int length = worker->length;

if (lseek(fd, offset, SEEK_SET) < 0) {
fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n",
offset, fd, strerror(errno));
exit(1);
}

bytes_read = read(fd, buffer, length);
if (bytes_read != length) {
fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n",
fd, bytes_read, strerror(errno));
exit(1);
}

/* Corruption check */
for (i = 0; i < length; i++) {
if (buffer[i] != pattern) {
printf("Bad data at 0x%.06x: %p, \n", i, buffer + i);
printf("Data dump starting at 0x%.06x:\n", i - 8);
printf("Expect 0x%x followed by 0x%x:\n",
pattern, PATTERN);

for (k = 0; k < 16; k++) {
printf("%02x ", buffer[i - 8 + k]);
if (k == 7) {
printf("\n");
}
}

printf("\n");
abort();
}
}

return 0;
}

void *fork_thread (void *arg)
{
pid_t pid;

while (!done) {
pid = fork();
if (pid == 0) {
exit(0);
} else if (pid < 0) {
fprintf(stderr, "Failed to fork child.\n");
exit(1);
}
waitpid(pid, NULL, 0 );
usleep(100);
}

return NULL;

}

int main(int argc, char *argv[])
{
unsigned char *buffer = NULL;
char filename[1024];
int fd;
bool dowrite = true;
pthread_t fork_tid;
int c, n, j;
worker_t *worker;
int align = 0;
int offset, rc;

workers = sysconf(_SC_NPROCESSORS_ONLN);

while ((c = getopt(argc, argv, "a:hw:")) != -1) {
switch (c) {
case 'a':
align = atoi(optarg);
if (align < 0 || align > PAGE_SIZE) {
printf("Bad alignment %d.\n", align);
exit(1);
}
dowrite = false;
break;

case 'h':
usage();
exit(0);
break;

case 'w':
workers = atoi(optarg);
if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
fprintf(stderr, "Worker count %d not between "
"%d and %d, inclusive.\n",
workers, MIN_WORKERS, MAX_WORKERS);
usage();
exit(1);
}
dowrite = false;
break;

default:
usage();
exit(1);
}
}

if (argc > 1 && (optind < argc)) {
fprintf(stderr, "Bad command line.\n");
usage();
exit(1);
}

if (dowrite) {

buffer = malloc(FILESIZE);
if (buffer == NULL) {
fprintf(stderr, "Failed to malloc write buffer.\n");
exit(1);
}

for (n = 1; n <= FILECOUNT; n++) {
sprintf(filename, FILENAME, n);
fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
if (fd < 0) {
printf("create failed(%s): %s.\n", filename, strerror(errno));
exit(1);
}
memset(buffer, n, FILESIZE);
printf("Writing file %s.\n", filename);
if (write(fd, buffer, FILESIZE) != FILESIZE) {
printf("write failed (%s)\n", filename);
}

close(fd);
fd = -1;
}

free(buffer);
buffer = NULL;

printf("done\n");
exit(0);
}

printf("Using %d workers.\n", workers);

worker = malloc(workers * sizeof(worker_t));
if (worker == NULL) {
fprintf(stderr, "Failed to malloc worker array.\n");
exit(1);
}

for (j = 0; j < workers; j++) {
worker[j].worker_number = j;
}

printf("Using alignment %d.\n", align);

posix_memalign((void *)&buffer, PAGE_SIZE, READSIZE+ align);
printf("Read buffer: %p.\n", buffer);
for (n = 1; n <= FILECOUNT; n++) {

sprintf(filename, FILENAME, n);
for (j = 0; j < workers; j++) {
if ((worker[j].fd = open(filename, O_RDONLY|O_DIRECT)) < 0) {
fprintf(stderr, "Failed to open %s: %s.\n",
filename, strerror(errno));
exit(1);
}

worker[j].pattern = n;
}

printf("Reading file %d.\n", n);

for (offset = 0; offset < FILESIZE; offset += READSIZE) {
memset(buffer, PATTERN, READSIZE + align);
for (j = 0; j < workers; j++) {
worker[j].offset = offset + j * PAGE_SIZE;
worker[j].buffer = buffer + align + j * PAGE_SIZE;
worker[j].length = PAGE_SIZE;
}
/* The final worker reads whatever is left over. */
worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

done = 0;

rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
if (rc != 0) {
fprintf(stderr, "Can't create fork thread: %s.\n",
strerror(rc));
exit(1);
}

for (j = 0; j < workers; j++) {
rc = pthread_create(&worker[j].tid,
NULL,
worker_thread,
worker + j);
if (rc != 0) {
fprintf(stderr, "Can't create worker thread %d: %s.\n",
j, strerror(rc));
exit(1);
}
}

for (j = 0; j < workers; j++) {
rc = pthread_join(worker[j].tid, NULL);
if (rc != 0) {
fprintf(stderr, "Failed to join worker thread %d: %s.\n",
j, strerror(rc));
exit(1);
}
}

/* Let the fork thread know it's ok to exit */
done = 1;

rc = pthread_join(fork_tid, NULL);
if (rc != 0) {
fprintf(stderr, "Failed to join fork thread: %s.\n",
strerror(rc));
exit(1);
}
}

/* Close the fd's for the next file. */
for (j = 0; j < workers; j++) {
close(worker[j].fd);
}
}

return 0;
}
========== dma_thread.c =======


Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/nfs/direct.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

Index: b/fs/nfs/direct.c
===================================================================
--- a/fs/nfs/direct.c 2009-04-13 00:22:32.000000000 +0900
+++ b/fs/nfs/direct.c 2009-04-13 01:03:35.000000000 +0900
@@ -85,6 +85,8 @@ struct nfs_direct_req {
#define NFS_ODIRECT_DO_COMMIT (1) /* an unstable reply was received */
#define NFS_ODIRECT_RESCHED_WRITES (2) /* write verification failed */
struct nfs_writeverf verf; /* unstable write verifier */
+
+ struct mm_struct *mm; /* for fork() exclusive */
};

static void nfs_direct_write_complete(struct nfs_direct_req *dreq, struct inode *inode);
@@ -164,6 +166,7 @@ static inline struct nfs_direct_req *nfs
dreq->count = 0;
dreq->error = 0;
dreq->flags = 0;
+ dreq->mm = NULL;

return dreq;
}
@@ -216,6 +219,10 @@ static void nfs_direct_complete(struct n
res = (long) dreq->count;
aio_complete(dreq->iocb, res, 0);
}
+ if (dreq->mm) {
+ up_read_non_owner(&dreq->mm->mm_pinned_sem);
+ mmdrop(dreq->mm);
+ }
complete_all(&dreq->completion);

nfs_direct_req_release(dreq);
@@ -306,10 +313,8 @@ static ssize_t nfs_direct_read_schedule_
if (unlikely(!data))
break;

- down_read(&current->mm->mmap_sem);
- result = get_user_pages(current, current->mm, user_addr,
- data->npages, 1, 0, data->pagevec, NULL);
- up_read(&current->mm->mmap_sem);
+ result = get_user_pages_fast(user_addr, data->npages, 1,
+ data->pagevec);
if (result < 0) {
nfs_readdata_release(data);
break;
@@ -383,8 +388,12 @@ static ssize_t nfs_direct_read_schedule_
ssize_t result = -EINVAL;
size_t requested_bytes = 0;
unsigned long seg;
+ struct mm_struct *mm;

get_dreq(dreq);
+ mm = dreq->mm = current->mm;
+ atomic_inc(&mm->mm_count);
+ down_read_non_owner(&mm->mm_pinned_sem);

for (seg = 0; seg < nr_segs; seg++) {
const struct iovec *vec = &iov[seg];

2009-04-14 06:20:36

by KOSAKI Motohiro

Subject: [RFC][PATCH v3 4/6] aio: Don't inherit aio ring memory at fork

AIO folks, am I missing anything?

===============
Subject: [RFC][PATCH] aio: Don't inherit aio ring memory at fork

Currently, the mm_struct::ioctx_list member isn't copied at fork. IOW the aio context is not
inherited across fork, yet the ring memory is. That's strange.

This patch marks the ring memory MADV_DONTFORK too.
In addition, this patch has a good side effect: it also fixes the "get_user_pages() vs fork" problem for the ring.

I think "man fork" should also be changed. It only says

* The child does not inherit outstanding asynchronous I/O operations from
its parent (aio_read(3), aio_write(3)).

but aio_context_t (the return value of io_setup(2)) is also not inherited in the current implementation.
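
For reference, an application can already get this kind of protection for its own O_DIRECT
buffers from userspace. A minimal sketch (not part of this patch; it assumes a 4096-byte page
size, and the helper name is made up):

#define _GNU_SOURCE 1
#include <stdlib.h>
#include <sys/mman.h>

static void *alloc_dio_buffer(size_t len)
{
	void *buf;

	if (posix_memalign(&buf, 4096, len))
		return NULL;
	/* the child won't map this range at all, so no COW race is possible */
	if (madvise(buf, len, MADV_DONTFORK)) {
		free(buf);
		return NULL;
	}
	return buf;
}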


Cc: Jeff Moyer <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/aio.c | 8 ++++++++
1 file changed, 8 insertions(+)

Index: b/fs/aio.c
===================================================================
--- a/fs/aio.c 2009-04-12 23:33:59.000000000 +0900
+++ b/fs/aio.c 2009-04-13 02:56:05.000000000 +0900
@@ -106,6 +106,7 @@ static int aio_setup_ring(struct kioctx
unsigned nr_events = ctx->max_reqs;
unsigned long size;
int nr_pages;
+ int ret;

/* Compensate for the ring buffer's head/tail overlap entry */
nr_events += 2; /* 1 is required, 2 for good luck */
@@ -140,6 +141,13 @@ static int aio_setup_ring(struct kioctx
return -EAGAIN;
}

+ /*
+ * aio context doesn't inherit while fork. (see mm_init())
+ * Then, aio ring also mark DONTFORK.
+ */
+ ret = sys_madvise(info->mmap_base, info->mmap_size, MADV_DONTFORK);
+ BUG_ON(ret);
+
dprintk("mmap address: 0x%08lx\n", info->mmap_base);
info->nr_pages = get_user_pages(current, ctx->mm,
info->mmap_base, nr_pages,

2009-04-14 06:22:13

by KOSAKI Motohiro

Subject: [RFC][PATCH v3 5/6] don't use bio-map in read() path

Does anyone know the proper way to fix this?

=================
Subject: [Untested][RFC][PATCH] don't use bio-map in read() path

__bio_map_user_iov() uses get_user_pages_fast() incorrectly:
it has no mechanism to prevent fork.

So it should only be used for read-side gup (memory-to-device transfers).

This patch is obviously a temporary band-aid fix; we can implement a fork-safe bio_map_user()
in the future...


Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: FUJITA Tomonori <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: James Bottomley <[email protected]>
---
block/blk-map.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/block/blk-map.c
===================================================================
--- a/block/blk-map.c 2009-02-21 16:53:21.000000000 +0900
+++ b/block/blk-map.c 2009-04-12 23:36:32.000000000 +0900
@@ -55,7 +55,7 @@ static int __blk_rq_map_user(struct requ
* direct dma. else, set up kernel bounce buffers
*/
uaddr = (unsigned long) ubuf;
- if (blk_rq_aligned(q, ubuf, len) && !map_data)
+ if (blk_rq_aligned(q, ubuf, len) && !map_data && !reading)
bio = bio_map_user(q, NULL, uaddr, len, reading, gfp_mask);
else
bio = bio_copy_user(q, map_data, uaddr, len, reading, gfp_mask);
@@ -208,7 +208,7 @@ int blk_rq_map_user_iov(struct request_q
}
}

- if (unaligned || (q->dma_pad_mask & len) || map_data)
+ if (unaligned || (q->dma_pad_mask & len) || map_data || read)
bio = bio_copy_user_iov(q, map_data, iov, iov_count, read,
gfp_mask);
else

2009-04-14 06:23:41

by KOSAKI Motohiro

Subject: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c

I don't have a NET-DMA capable device. I hope to get expert review.

=========================
Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c

down_read(mmap_sem)
get_user_pages()
up_read(mmap_sem)

is fork unsafe.
fix it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Maciej Sosnowski <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Chris Leech <[email protected]>
Cc: [email protected]
---
drivers/dma/iovlock.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

Index: b/drivers/dma/iovlock.c
===================================================================
--- a/drivers/dma/iovlock.c 2009-02-21 16:53:23.000000000 +0900
+++ b/drivers/dma/iovlock.c 2009-04-13 04:46:02.000000000 +0900
@@ -94,18 +94,10 @@ struct dma_pinned_list *dma_pin_iovec_pa
pages += page_list->nr_pages;

/* pin pages down */
- down_read(&current->mm->mmap_sem);
- ret = get_user_pages(
- current,
- current->mm,
- (unsigned long) iov[i].iov_base,
- page_list->nr_pages,
- 1, /* write */
- 0, /* force */
- page_list->pages,
- NULL);
- up_read(&current->mm->mmap_sem);
-
+ down_read(&current->mm->mm_pinned_sem);
+ ret = get_user_pages_fast((unsigned long) iov[i].iov_base,
+ page_list->nr_pages, 1,
+ page_list->pages);
if (ret != page_list->nr_pages)
goto unpin;

@@ -127,6 +119,8 @@ void dma_unpin_iovec_pages(struct dma_pi
if (!pinned_list)
return;

+ up_read(&current->mm->mm_pinned_sem);
+
for (i = 0; i < pinned_list->nr_iovecs; i++) {
struct dma_page_list *page_list = &pinned_list->page_list[i];
for (j = 0; j < page_list->nr_pages; j++) {

2009-04-14 06:26:17

by KOSAKI Motohiro

Subject: Re: [RFC][PATCH v3 2/6] mm, directio: fix fork vs direct-io race (read(2) side IOW gup(write) side)


Oops, I forgot some Ccs. Resending it.

> Subject: [PATCH] mm, directio: fix fork vs direct-io race
>
>
> ChangeLog:
> V2 -> V3
> o remove early decow logic
>
> V1 -> V2
> o add dio+aio logic
>
> ===============================================
>
> Currently, the following testcase fails.
>
> $ dma_thread -a 512 -w 40
>
> ========== dma_thread.c =======
> /* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */
>
> #define _GNU_SOURCE 1
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <memory.h>
> #include <pthread.h>
> #include <getopt.h>
> #include <errno.h>
> #include <sys/types.h>
> #include <sys/wait.h>
>
> #define FILESIZE (12*1024*1024)
> #define READSIZE (1024*1024)
>
> #define FILENAME "test_%.04d.tmp"
> #define FILECOUNT 100
> #define MIN_WORKERS 2
> #define MAX_WORKERS 256
> #define PAGE_SIZE 4096
>
> #define true 1
> #define false 0
>
> typedef int bool;
>
> bool done = false;
> int workers = 2;
>
> #define PATTERN (0xfa)
>
> static void
> usage (void)
> {
> fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [ -w <workers>]\n"
> "\nWith no arguments, generate test files and exit.\n"
> "-h Display this help and exit.\n"
> "-a align read buffer to offset <alignment>.\n"
> "-w number of worker threads, 2 (default) to 256,\n"
> " defaults to number of cores.\n\n"
>
> "Run first with no arguments to generate files.\n"
> "Then run with -a <alignment> = 512 or 0. \n");
> }
>
> typedef struct {
> pthread_t tid;
> int worker_number;
> int fd;
> int offset;
> int length;
> int pattern;
> unsigned char *buffer;
> } worker_t;
>
>
> void *worker_thread(void * arg)
> {
> int bytes_read;
> int i,k;
> worker_t *worker = (worker_t *) arg;
> int offset = worker->offset;
> int fd = worker->fd;
> unsigned char *buffer = worker->buffer;
> int pattern = worker->pattern;
> int length = worker->length;
>
> if (lseek(fd, offset, SEEK_SET) < 0) {
> fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n",
> offset, fd, strerror(errno));
> exit(1);
> }
>
> bytes_read = read(fd, buffer, length);
> if (bytes_read != length) {
> fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n",
> fd, bytes_read, strerror(errno));
> exit(1);
> }
>
> /* Corruption check */
> for (i = 0; i < length; i++) {
> if (buffer[i] != pattern) {
> printf("Bad data at 0x%.06x: %p, \n", i, buffer + i);
> printf("Data dump starting at 0x%.06x:\n", i - 8);
> printf("Expect 0x%x followed by 0x%x:\n",
> pattern, PATTERN);
>
> for (k = 0; k < 16; k++) {
> printf("%02x ", buffer[i - 8 + k]);
> if (k == 7) {
> printf("\n");
> }
> }
>
> printf("\n");
> abort();
> }
> }
>
> return 0;
> }
>
> void *fork_thread (void *arg)
> {
> pid_t pid;
>
> while (!done) {
> pid = fork();
> if (pid == 0) {
> exit(0);
> } else if (pid < 0) {
> fprintf(stderr, "Failed to fork child.\n");
> exit(1);
> }
> waitpid(pid, NULL, 0 );
> usleep(100);
> }
>
> return NULL;
>
> }
>
> int main(int argc, char *argv[])
> {
> unsigned char *buffer = NULL;
> char filename[1024];
> int fd;
> bool dowrite = true;
> pthread_t fork_tid;
> int c, n, j;
> worker_t *worker;
> int align = 0;
> int offset, rc;
>
> workers = sysconf(_SC_NPROCESSORS_ONLN);
>
> while ((c = getopt(argc, argv, "a:hw:")) != -1) {
> switch (c) {
> case 'a':
> align = atoi(optarg);
> if (align < 0 || align > PAGE_SIZE) {
> printf("Bad alignment %d.\n", align);
> exit(1);
> }
> dowrite = false;
> break;
>
> case 'h':
> usage();
> exit(0);
> break;
>
> case 'w':
> workers = atoi(optarg);
> if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
> fprintf(stderr, "Worker count %d not between "
> "%d and %d, inclusive.\n",
> workers, MIN_WORKERS, MAX_WORKERS);
> usage();
> exit(1);
> }
> dowrite = false;
> break;
>
> default:
> usage();
> exit(1);
> }
> }
>
> if (argc > 1 && (optind < argc)) {
> fprintf(stderr, "Bad command line.\n");
> usage();
> exit(1);
> }
>
> if (dowrite) {
>
> buffer = malloc(FILESIZE);
> if (buffer == NULL) {
> fprintf(stderr, "Failed to malloc write buffer.\n");
> exit(1);
> }
>
> for (n = 1; n <= FILECOUNT; n++) {
> sprintf(filename, FILENAME, n);
> fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
> if (fd < 0) {
> printf("create failed(%s): %s.\n", filename, strerror(errno));
> exit(1);
> }
> memset(buffer, n, FILESIZE);
> printf("Writing file %s.\n", filename);
> if (write(fd, buffer, FILESIZE) != FILESIZE) {
> printf("write failed (%s)\n", filename);
> }
>
> close(fd);
> fd = -1;
> }
>
> free(buffer);
> buffer = NULL;
>
> printf("done\n");
> exit(0);
> }
>
> printf("Using %d workers.\n", workers);
>
> worker = malloc(workers * sizeof(worker_t));
> if (worker == NULL) {
> fprintf(stderr, "Failed to malloc worker array.\n");
> exit(1);
> }
>
> for (j = 0; j < workers; j++) {
> worker[j].worker_number = j;
> }
>
> printf("Using alignment %d.\n", align);
>
> posix_memalign((void *)&buffer, PAGE_SIZE, READSIZE+ align);
> printf("Read buffer: %p.\n", buffer);
> for (n = 1; n <= FILECOUNT; n++) {
>
> sprintf(filename, FILENAME, n);
> for (j = 0; j < workers; j++) {
> if ((worker[j].fd = open(filename, O_RDONLY|O_DIRECT)) < 0) {
> fprintf(stderr, "Failed to open %s: %s.\n",
> filename, strerror(errno));
> exit(1);
> }
>
> worker[j].pattern = n;
> }
>
> printf("Reading file %d.\n", n);
>
> for (offset = 0; offset < FILESIZE; offset += READSIZE) {
> memset(buffer, PATTERN, READSIZE + align);
> for (j = 0; j < workers; j++) {
> worker[j].offset = offset + j * PAGE_SIZE;
> worker[j].buffer = buffer + align + j * PAGE_SIZE;
> worker[j].length = PAGE_SIZE;
> }
> /* The final worker reads whatever is left over. */
> worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);
>
> done = 0;
>
> rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
> if (rc != 0) {
> fprintf(stderr, "Can't create fork thread: %s.\n",
> strerror(rc));
> exit(1);
> }
>
> for (j = 0; j < workers; j++) {
> rc = pthread_create(&worker[j].tid,
> NULL,
> worker_thread,
> worker + j);
> if (rc != 0) {
> fprintf(stderr, "Can't create worker thread %d: %s.\n",
> j, strerror(rc));
> exit(1);
> }
> }
>
> for (j = 0; j < workers; j++) {
> rc = pthread_join(worker[j].tid, NULL);
> if (rc != 0) {
> fprintf(stderr, "Failed to join worker thread %d: %s.\n",
> j, strerror(rc));
> exit(1);
> }
> }
>
> /* Let the fork thread know it's ok to exit */
> done = 1;
>
> rc = pthread_join(fork_tid, NULL);
> if (rc != 0) {
> fprintf(stderr, "Failed to join fork thread: %s.\n",
> strerror(rc));
> exit(1);
> }
> }
>
> /* Close the fd's for the next file. */
> for (j = 0; j < workers; j++) {
> close(worker[j].fd);
> }
> }
>
> return 0;
> }
> ========== dma_thread.c =======
>
> It fails because the following scenario happens.
>
> CPU0 CPU1 CPU2 note
> (fork thread) (worker thread1) (worker thread2)
> ==========================================================================================
> read()
> | get_user_pages()
> |
> fork | inc map_count and wprotect
> |
> | read()
> | | get_user_pages() COW break: CPU2 gets a copied page,
> | | but CPU1 still points to the original page,
> | | so the result of CPU1's transfer is lost.
> v |
> |
> |
> v
>
>
> Actually, get_user_pages() (and get_user_pages_fast()) don't provide any pinning operation.
> The caller must prevent fork for the whole critical section.
> access_process_vm() shows the standard fork-protection pattern: it holds mmap_sem.
>
> But mmap_sem is a very easily contended lock, and holding it across the I/O would cause a large
> performance regression for DirectIO. So this patch introduces a new lock as an alternative
> fork-prevention mechanism. Almost no application forks while DirectIO is in progress, so
> mm_pinned_sem is almost never contended.
>
>
> Also, this patch fixes the following aio+dio testcase.
>
> ========== forkscrew.c ========
> /*
> * Copyright 2009, Red Hat, Inc.
> *
> * Author: Jeff Moyer <[email protected]>
> *
> * This program attempts to expose a race between O_DIRECT I/O and the fork()
> * path in a multi-threaded program. In order to reliably reproduce the
> * problem, it is best to perform a dd from the device under test to /dev/null
> * as this makes the read I/O slow enough to orchestrate the problem.
> *
> * Running: ./forkscrew
> *
> * It is expected that a file name "data" exists in the current working
> * directory, and that its contents are something other than 0x2a. A simple
> * dd if=/dev/zero of=data bs=1M count=1 should be sufficient.
> */
> #define _GNU_SOURCE 1
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <sys/types.h>
> #include <sys/wait.h>
>
> #include <pthread.h>
> #include <libaio.h>
>
> pthread_cond_t worker_cond = PTHREAD_COND_INITIALIZER;
> pthread_mutex_t worker_mutex = PTHREAD_MUTEX_INITIALIZER;
> pthread_cond_t fork_cond = PTHREAD_COND_INITIALIZER;
> pthread_mutex_t fork_mutex = PTHREAD_MUTEX_INITIALIZER;
>
> char *buffer;
> int fd;
>
> /* pattern filled into the in-memory buffer */
> #define PATTERN 0x2a // '*'
>
> void
> usage(void)
> {
> fprintf(stderr,
> "\nUsage: forkscrew\n"
> "it is expected that a file named \"data\" is the current\n"
> "working directory. It should be at least 3*pagesize in size\n"
> );
> }
>
> void
> dump_buffer(char *buf, int len)
> {
> int i;
> int last_off, last_val;
>
> last_off = -1;
> last_val = -1;
>
> for (i = 0; i < len; i++) {
> if (last_off < 0) {
> last_off = i;
> last_val = buf[i];
> continue;
> }
>
> if (buf[i] != last_val) {
> printf("%d - %d: %d\n", last_off, i - 1, last_val);
> last_off = i;
> last_val = buf[i];
> }
> }
>
> if (last_off != len - 1)
> printf("%d - %d: %d\n", last_off, i-1, last_val);
> }
>
> int
> check_buffer(char *bufp, int len, int pattern)
> {
> int i;
>
> for (i = 0; i < len; i++) {
> if (bufp[i] == pattern)
> return 1;
> }
> return 0;
> }
>
> void *
> forker_thread(void *arg)
> {
> pthread_mutex_lock(&fork_mutex);
> pthread_cond_signal(&fork_cond);
> pthread_cond_wait(&fork_cond, &fork_mutex);
> switch (fork()) {
> case 0:
> sleep(1);
> printf("child dumping buffer:\n");
> dump_buffer(buffer + 512, 2*getpagesize());
> exit(0);
> case -1:
> perror("fork");
> exit(1);
> default:
> break;
> }
> pthread_cond_signal(&fork_cond);
> pthread_mutex_unlock(&fork_mutex);
>
> wait(NULL);
> return (void *)0;
> }
>
> void *
> worker(void *arg)
> {
> int first = (int)arg;
> char *bufp;
> int pagesize = getpagesize();
> int ret;
> int corrupted = 0;
>
> if (first) {
> io_context_t aioctx;
> struct io_event event;
> struct iocb *iocb = malloc(sizeof *iocb);
> if (!iocb) {
> perror("malloc");
> exit(1);
> }
> memset(&aioctx, 0, sizeof(aioctx));
> ret = io_setup(1, &aioctx);
> if (ret != 0) {
> errno = -ret;
> perror("io_setup");
> exit(1);
> }
> bufp = buffer + 512;
> io_prep_pread(iocb, fd, bufp, pagesize, 0);
>
> /* submit the I/O */
> io_submit(aioctx, 1, &iocb);
>
> /* tell the fork thread to run */
> pthread_mutex_lock(&fork_mutex);
> pthread_cond_signal(&fork_cond);
>
> /* wait for the fork to happen */
> pthread_cond_wait(&fork_cond, &fork_mutex);
> pthread_mutex_unlock(&fork_mutex);
>
> /* release the other worker to issue I/O */
> pthread_mutex_lock(&worker_mutex);
> pthread_cond_signal(&worker_cond);
> pthread_mutex_unlock(&worker_mutex);
>
> ret = io_getevents(aioctx, 1, 1, &event, NULL);
> if (ret != 1) {
> errno = -ret;
> perror("io_getevents");
> exit(1);
> }
> if (event.res != pagesize) {
> errno = -event.res;
> perror("read error");
> exit(1);
> }
>
> io_destroy(aioctx);
>
> /* check buffer, should be corrupt */
> if (check_buffer(bufp, pagesize, PATTERN)) {
> printf("worker 0 failed check\n");
> dump_buffer(bufp, pagesize);
> corrupted = 1;
> }
>
> } else {
>
> bufp = buffer + 512 + pagesize;
>
> pthread_mutex_lock(&worker_mutex);
> pthread_cond_signal(&worker_cond); /* tell main we're ready */
> /* wait for the first I/O and the fork */
> pthread_cond_wait(&worker_cond, &worker_mutex);
> pthread_mutex_unlock(&worker_mutex);
>
> /* submit overlapping I/O */
> ret = read(fd, bufp, pagesize);
> if (ret != pagesize) {
> perror("read");
> exit(1);
> }
> /* check buffer, should be fine */
> if (check_buffer(bufp, pagesize, PATTERN)) {
> printf("worker 1 failed check -- abnormal\n");
> dump_buffer(bufp, pagesize);
> corrupted = 1;
> }
> }
>
> return (void *)corrupted;
> }
>
> int
> main(int argc, char **argv)
> {
> pthread_t workers[2];
> pthread_t forker;
> int ret, rc = 0;
> void *thread_ret;
> int pagesize = getpagesize();
>
> fd = open("data", O_DIRECT|O_RDONLY);
> if (fd < 0) {
> perror("open");
> exit(1);
> }
>
> ret = posix_memalign(&buffer, pagesize, 3 * pagesize);
> if (ret != 0) {
> errno = ret;
> perror("posix_memalign");
> exit(1);
> }
> memset(buffer, PATTERN, 3*pagesize);
>
> pthread_mutex_lock(&fork_mutex);
> ret = pthread_create(&forker, NULL, forker_thread, NULL);
> pthread_cond_wait(&fork_cond, &fork_mutex);
> pthread_mutex_unlock(&fork_mutex);
>
> pthread_mutex_lock(&worker_mutex);
> ret |= pthread_create(&workers[0], NULL, worker, (void *)0);
> if (ret) {
> perror("pthread_create");
> exit(1);
> }
> pthread_cond_wait(&worker_cond, &worker_mutex);
> pthread_mutex_unlock(&worker_mutex);
>
> ret = pthread_create(&workers[1], NULL, worker, (void *)1);
> if (ret != 0) {
> perror("pthread_create");
> exit(1);
> }
>
> pthread_join(forker, NULL);
> pthread_join(workers[0], &thread_ret);
> if (thread_ret != 0)
> rc = 1;
> pthread_join(workers[1], &thread_ret);
> if (thread_ret != 0)
> rc = 1;
>
> if (rc != 0) {
> printf("parent dumping full buffer\n");
> dump_buffer(buffer + 512, 2 * pagesize);
> }
>
> close(fd);
> free(buffer);
> exit(rc);
> }
> ========== forkscrew.c ========
>
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Zach Brown <[email protected]>
> Cc: Andy Grover <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> fs/direct-io.c | 16 ++++++++++++++++
> include/linux/init_task.h | 1 +
> include/linux/mm_types.h | 6 ++++++
> kernel/fork.c | 3 +++
> 4 files changed, 26 insertions(+)
>
> Index: b/fs/direct-io.c
> ===================================================================
> --- a/fs/direct-io.c 2009-04-13 00:24:01.000000000 +0900
> +++ b/fs/direct-io.c 2009-04-13 01:36:37.000000000 +0900
> @@ -131,6 +131,9 @@ struct dio {
> int is_async; /* is IO async ? */
> int io_error; /* IO error in completion path */
> ssize_t result; /* IO result */
> +
> + /* fork exclusive stuff */
> + struct mm_struct *mm;
> };
>
> /*
> @@ -244,6 +247,12 @@ static int dio_complete(struct dio *dio,
> /* lockdep: non-owner release */
> up_read_non_owner(&dio->inode->i_alloc_sem);
>
> + if (dio->rw == READ) {
> + BUG_ON(!dio->mm);
> + up_read_non_owner(&dio->mm->mm_pinned_sem);
> + mmdrop(dio->mm);
> + }
> +
> if (ret == 0)
> ret = dio->page_errors;
> if (ret == 0)
> @@ -942,6 +951,7 @@ direct_io_worker(int rw, struct kiocb *i
> ssize_t ret = 0;
> ssize_t ret2;
> size_t bytes;
> + struct mm_struct *mm;
>
> dio->inode = inode;
> dio->rw = rw;
> @@ -960,6 +970,12 @@ direct_io_worker(int rw, struct kiocb *i
> spin_lock_init(&dio->bio_lock);
> dio->refcount = 1;
>
> + if (rw == READ) {
> + mm = dio->mm = current->mm;
> + atomic_inc(&mm->mm_count);
> + down_read_non_owner(&mm->mm_pinned_sem);
> + }
> +
> /*
> * In case of non-aligned buffers, we may need 2 more
> * pages since we need to zero out first and last block.
> Index: b/include/linux/init_task.h
> ===================================================================
> --- a/include/linux/init_task.h 2009-04-13 00:24:01.000000000 +0900
> +++ b/include/linux/init_task.h 2009-04-13 00:24:32.000000000 +0900
> @@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
> .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \
> .mmlist = LIST_HEAD_INIT(name.mmlist), \
> .cpu_vm_mask = CPU_MASK_ALL, \
> + .mm_pinned_sem = __RWSEM_INITIALIZER(name.mm_pinned_sem), \
> }
>
> #define INIT_SIGNALS(sig) { \
> Index: b/include/linux/mm_types.h
> ===================================================================
> --- a/include/linux/mm_types.h 2009-04-13 00:24:01.000000000 +0900
> +++ b/include/linux/mm_types.h 2009-04-13 00:24:32.000000000 +0900
> @@ -274,6 +274,12 @@ struct mm_struct {
> #ifdef CONFIG_MMU_NOTIFIER
> struct mmu_notifier_mm *mmu_notifier_mm;
> #endif
> +
> + /*
> + * If there is in-flight directio or a similar pinning action,
> + * COW causes memory corruption. This sem protects it by preventing fork.
> + */
> + struct rw_semaphore mm_pinned_sem;
> };
>
> /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
> Index: b/kernel/fork.c
> ===================================================================
> --- a/kernel/fork.c 2009-04-13 00:24:01.000000000 +0900
> +++ b/kernel/fork.c 2009-04-13 00:24:32.000000000 +0900
> @@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm
> unsigned long charge;
> struct mempolicy *pol;
>
> + down_write(&oldmm->mm_pinned_sem);
> down_write(&oldmm->mmap_sem);
> flush_cache_dup_mm(oldmm);
> /*
> @@ -368,6 +369,7 @@ out:
> up_write(&mm->mmap_sem);
> flush_tlb_mm(oldmm);
> up_write(&oldmm->mmap_sem);
> + up_write(&oldmm->mm_pinned_sem);
> return retval;
> fail_nomem_policy:
> kmem_cache_free(vm_area_cachep, tmp);
> @@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct
> mm->free_area_cache = TASK_UNMAPPED_BASE;
> mm->cached_hole_size = ~0UL;
> mm_init_owner(mm, p);
> + init_rwsem(&mm->mm_pinned_sem);
>
> if (likely(!mm_alloc_pgd(mm))) {
> mm->def_flags = 0;
>
>


2009-04-14 06:56:43

by Nick Piggin

Subject: Re: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c

On Tuesday 14 April 2009 16:23:13 KOSAKI Motohiro wrote:
> I don't have a usable NET-DMA device. I hope to get expert review.
>
> =========================
> Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
>
> down_read(mmap_sem)
> get_user_pages()
> up_read(mmap_sem)
>
> is fork unsafe.
> fix it.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Maciej Sosnowski <[email protected]>
> Cc: David S. Miller <[email protected]>
> Cc: Chris Leech <[email protected]>
> Cc: [email protected]
> ---
> drivers/dma/iovlock.c | 18 ++++++------------
> 1 file changed, 6 insertions(+), 12 deletions(-)
>
> Index: b/drivers/dma/iovlock.c
> ===================================================================
> --- a/drivers/dma/iovlock.c 2009-02-21 16:53:23.000000000 +0900
> +++ b/drivers/dma/iovlock.c 2009-04-13 04:46:02.000000000 +0900
> @@ -94,18 +94,10 @@ struct dma_pinned_list *dma_pin_iovec_pa
> pages += page_list->nr_pages;
>
> /* pin pages down */
> - down_read(&current->mm->mmap_sem);
> - ret = get_user_pages(
> - current,
> - current->mm,
> - (unsigned long) iov[i].iov_base,
> - page_list->nr_pages,
> - 1, /* write */
> - 0, /* force */
> - page_list->pages,
> - NULL);
> - up_read(&current->mm->mmap_sem);
> -
> + down_read(&current->mm->mm_pinned_sem);
> + ret = get_user_pages_fast((unsigned long) iov[i].iov_base,
> + page_list->nr_pages, 1,
> + page_list->pages);

I would perhaps not fold gup_fast conversions into the same patch as
the fix.

2009-04-14 06:58:25

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c

> On Tuesday 14 April 2009 16:23:13 KOSAKI Motohiro wrote:
> > I don't have a usable NET-DMA device. I hope to get expert review.
> >
> > =========================
> > Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
> >
> > down_read(mmap_sem)
> > get_user_pages()
> > up_read(mmap_sem)
> >
> > is fork unsafe.
> > fix it.
> >
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > Cc: Maciej Sosnowski <[email protected]>
> > Cc: David S. Miller <[email protected]>
> > Cc: Chris Leech <[email protected]>
> > Cc: [email protected]
> > ---
> > drivers/dma/iovlock.c | 18 ++++++------------
> > 1 file changed, 6 insertions(+), 12 deletions(-)
> >
> > Index: b/drivers/dma/iovlock.c
> > ===================================================================
> > --- a/drivers/dma/iovlock.c 2009-02-21 16:53:23.000000000 +0900
> > +++ b/drivers/dma/iovlock.c 2009-04-13 04:46:02.000000000 +0900
> > @@ -94,18 +94,10 @@ struct dma_pinned_list *dma_pin_iovec_pa
> > pages += page_list->nr_pages;
> >
> > /* pin pages down */
> > - down_read(&current->mm->mmap_sem);
> > - ret = get_user_pages(
> > - current,
> > - current->mm,
> > - (unsigned long) iov[i].iov_base,
> > - page_list->nr_pages,
> > - 1, /* write */
> > - 0, /* force */
> > - page_list->pages,
> > - NULL);
> > - up_read(&current->mm->mmap_sem);
> > -
> > + down_read(&current->mm->mm_pinned_sem);
> > + ret = get_user_pages_fast((unsigned long) iov[i].iov_base,
> > + page_list->nr_pages, 1,
> > + page_list->pages);
>
> I would perhaps not fold gup_fast conversions into the same patch as
> the fix.

OK. I'll fix.

2009-04-14 08:42:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix

On Tuesday 14 April 2009 16:15:43 KOSAKI Motohiro wrote:
>
> Linux Device Drivers, Third Edition, Chapter 15: Memory Mapping and DMA says
>
> get_user_pages is a low-level memory management function, with a suitably complex
> interface. It also requires that the mmap reader/writer semaphore for the address
> space be obtained in read mode before the call. As a result, calls to get_user_pages
> usually look something like:
>
> down_read(&current->mm->mmap_sem);
> result = get_user_pages(current, current->mm, ...);
> up_read(&current->mm->mmap_sem);
>
> The return value is the number of pages actually mapped, which could be fewer than
> the number requested (but greater than zero).
>
> but, it isn't true. mmap_sem isn't only used for vma traversal, but also prevent vs-fork race.
> up_read(mmap_sem) mean end of critical section, IOW after up_read() code is fork unsafe.
> (access_process_vm() explain proper get_user_pages() usage)
>
> Oh well, We have many wrong caller now. What is the best fix method?
What indeed...

> Nick Piggin and Andrea Arcangeli proposed to change get_user_pages() semantics as caller expected.
> see "[PATCH] fork vs gup(-fast) fix" thead in linux-mm
> but Linus NACKed it.
>
> Thus I made caller change approach patch series. it is made for discuss to compare Nick's approach.
> I don't hope submit it yet.
>
> Nick, This version fixed vmsplice and aio issue (you pointed). I hope to hear your opiniton ;)

I don't see how it fixes vmsplice? vmsplice can get_user_pages pages from one
process's address space and put them into a pipe, and they are released by
another process after consuming the pages I think. So it's fairly hard to hold
a lock over this.

I guess apart from the vmsplice issue (unless I missed a clever fix), I guess
this *does* work. I can't see any races... I'd really still like to hear a good
reason why my proposed patch is so obviously crap.

Reasons proposed so far:
"No locking" (I think this is a good thing; no *bugs* have been pointed out)
"Too many page flags" (but it only uses 1 anon page flag, only fs pagecache
has a flags shortage so we can easily overload a pagecache flag)
"Diffstat too large" (seems comparable when you factor in the fixes to callers,
but has the advantage of being contained within VM subsystem)
"Horrible code" (I still don't see it. Of course the code will be nicer if we
don't fix the issue _at all_, but I don't see this is so much worse than having
to fix callers.)

FWIW, I have attached my patch again (with simple function-movement hunks
moved into another patch so it is easier to see real impact of this patch).

> ChangeLog:
> V2 -> V3
> o remove early decow logic
> o introduce prevent unmap logic
> o fix nfs-directio
> o fix aio
> o fix bio (only bandaid fix)
>
> V1 -> V2
> o fix aio+dio case
>
> TODO
> o implement down_write_killable()
> o fix kvm (need?)
> o fix get_arg_page() (Why this function don't use mmap_sem?)

Probably someone was shooting for a crazy optimisation because no other
threads should be able to access this mm yet :)

Anyway, this is my proposal. It has the advantage that it fixes every
caller in the tree.

---
 arch/powerpc/mm/gup.c      |    2
 arch/x86/mm/gup.c          |    3
 include/linux/mm.h         |    2
 include/linux/page-flags.h |    5 +
 kernel/fork.c              |    2
 mm/memory.c                |  178 +++++++++++++++++++++++++++++++++++++++------
 6 files changed, 167 insertions(+), 25 deletions(-)

2009-04-14 09:19:30

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix

Hi

> > but, it isn't true. mmap_sem isn't only used for vma traversal, but also prevent vs-fork race.
> > up_read(mmap_sem) mean end of critical section, IOW after up_read() code is fork unsafe.
> > (access_process_vm() explain proper get_user_pages() usage)
> >
> > Oh well, We have many wrong caller now. What is the best fix method?
>
> What indeed...

Yes. Changing the semantics vs. changing the many callers is one of the most important
points in this discussion.
I hope everyone involved reaches the same conclusion.



> > Nick, This version fixed vmsplice and aio issue (you pointed). I hope to hear your opiniton ;)
>
> I don't see how it fixes vmsplice? vmsplice can get_user_pages pages from one
> process's address space and put them into a pipe, and they are released by
> another process after consuming the pages I think. So it's fairly hard to hold
> a lock over this.

I realize my explanation was poor.

Firstly, pipe_to_user() via vmsplice_to_user() uses copy_to_user(), so we don't need to
care about the receive side.
Secondly, get_iovec_page_array() via vmsplice_to_pipe() uses gup(read),
so we only need to prevent the page from being changed.

I changed reuse_swap_page() in [1/6]. Then, if any process touches the page while
it hasn't been consumed from the pipe yet, COW is broken and the toucher gets a copied page.
So nobody can change the original page.

Thus, this patch series also fixes the vmsplice issue, I think.
Am I missing anything?
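
A minimal sketch of that argument, paraphrasing the [1/6] change rather than quoting it;
page_still_exclusive() is an illustrative name, not a function in the series:

/*
 * "count" is the mapping/swap reference count computed by reuse_swap_page().
 * A gup(read) pin raises page_count() by one, the sum stops being 2, and
 * do_wp_page() copies instead of reusing the page -- so a later writer gets
 * a private copy and the pinned original stays unchanged.
 */
static int page_still_exclusive(struct page *page, int count)
{
	return count + page_count(page) == 2;
}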

> I guess apart from the vmsplice issue (unless I missed a clever fix), I guess
> this *does* work. I can't see any races... I'd really still like to hear a good
> reason why my proposed patch is so obviously crap.
>
> Reasons proposed so far:
> "No locking" (I think this is a good thing; no *bugs* have been pointed out)
> "Too many page flags" (but it only uses 1 anon page flag, only fs pagecache
> has a flags shortage so we can easily overload a pagecache flag)
> "Diffstat too large" (seems comparable when you factor in the fixes to callers,
> but has the advantage of being contained within VM subsystem)
> "Horrible code" (I still don't see it. Of course the code will be nicer if we
> don't fix the issue _at all_, but I don't see this is so much worse than having
> to fix callers.)

Honestly, I don't dislike your approach,
but I really hope to fix this bug. If someone NAKs your patch, I'll seek another way.



> FWIW, I have attached my patch again (with simple function-movement hunks
> moved into another patch so it is easier to see real impact of this patch).

OK. I'll try to test your patch too.


- kosaki

>
>
> > ChangeLog:
> > V2 -> V3
> > o remove early decow logic
> > o introduce prevent unmap logic
> > o fix nfs-directio
> > o fix aio
> > o fix bio (only bandaid fix)
> >
> > V1 -> V2
> > o fix aio+dio case
> >
> > TODO
> > o implement down_write_killable()
> > o fix kvm (need?)
> > o fix get_arg_page() (Why this function don't use mmap_sem?)
>
> Probably someone was shooting for a crazy optimisation because no other
> threads should be able to access this mm yet :)
>
> Anyway, this is my proposal. It has the advantage that it fixes every
> caller in the tree.
>
> ---
> arch/powerpc/mm/gup.c | 2
> arch/x86/mm/gup.c | 3
> include/linux/mm.h | 2
> include/linux/page-flags.h | 5 +
> kernel/fork.c | 2
> mm/memory.c | 178 +++++++++++++++++++++++++++++++++++++++------
> 6 files changed, 167 insertions(+), 25 deletions(-)
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -791,7 +791,7 @@ int walk_page_range(unsigned long addr,
> void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
> unsigned long end, unsigned long floor, unsigned long ceiling);
> int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
> - struct vm_area_struct *vma);
> + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
> void unmap_mapping_range(struct address_space *mapping,
> loff_t const holebegin, loff_t const holelen, int even_cows);
> int follow_phys(struct vm_area_struct *vma, unsigned long address,
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -601,12 +601,103 @@ void zap_pte(struct mm_struct *mm, struc
> }
> }
> /*
> + * breaks COW of child pte that has been marked COW by fork().
> + * Must be called with the child's ptl held and pte mapped.
> + * Returns 0 on success with ptl held and pte mapped.
> + * -ENOMEM on OOM failure, or -EAGAIN if something changed under us.
> + * ptl dropped and pte unmapped on error cases.
> + */
> +static noinline int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd,
> + spinlock_t *ptl, struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + pte_t pte = *ptep;
> + struct page *page, *new_page;
> + int ret;
> +
> + BUG_ON(!pte_present(pte));
> + BUG_ON(pte_write(pte));
> +
> + page = vm_normal_page(vma, address, pte);
> + BUG_ON(!page);
> + BUG_ON(!PageAnon(page));
> + BUG_ON(!PageDontCOW(page));
> +
> + /* The following code comes from do_wp_page */
> + page_cache_get(page);
> + pte_unmap_unlock(pte, ptl);
> +
> + if (unlikely(anon_vma_prepare(vma)))
> + goto oom;
> + VM_BUG_ON(page == ZERO_PAGE(0));
> + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> + if (!new_page)
> + goto oom;
> + /*
> + * Don't let another task, with possibly unlocked vma,
> + * keep the mlocked page.
> + */
> + if (vma->vm_flags & VM_LOCKED) {
> + lock_page(page); /* for LRU manipulation */
> + clear_page_mlock(page);
> + unlock_page(page);
> + }
> + cow_user_page(new_page, page, address, vma);
> + __SetPageUptodate(new_page);
> +
> + if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
> + goto oom_free_new;
> +
> + /*
> + * Re-check the pte - we dropped the lock
> + */
> + ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> + if (pte_same(*ptep, pte)) {
> + pte_t entry;
> +
> + flush_cache_page(vma, address, pte_pfn(pte));
> + entry = mk_pte(new_page, vma->vm_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + /*
> + * Clear the pte entry and flush it first, before updating the
> + * pte with the new entry. This will avoid a race condition
> + * seen in the presence of one thread doing SMC and another
> + * thread doing COW.
> + */
> + ptep_clear_flush_notify(vma, address, ptep);
> + page_add_new_anon_rmap(new_page, vma, address);
> + set_pte_at(mm, address, ptep, entry);
> +
> + /* See comment in do_wp_page */
> + page_remove_rmap(page);
> + page_cache_release(page);
> + ret = 0;
> + } else {
> + if (!pte_none(*ptep))
> + zap_pte(mm, vma, address, ptep);
> + pte_unmap_unlock(pte, ptl);
> + mem_cgroup_uncharge_page(new_page);
> + page_cache_release(new_page);
> + ret = -EAGAIN;
> + }
> + page_cache_release(page);
> +
> + return ret;
> +
> +oom_free_new:
> + page_cache_release(new_page);
> +oom:
> + page_cache_release(page);
> + return -ENOMEM;
> +}
> +
> +/*
> * copy one vm_area from one task to the other. Assumes the page tables
> * already present in the new task to be cleared in the whole range
> * covered by this vma.
> */
>
> -static inline void
> +static inline int
> copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
> unsigned long addr, int *rss)
> @@ -614,6 +705,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
> unsigned long vm_flags = vma->vm_flags;
> pte_t pte = *src_pte;
> struct page *page;
> + int ret = 0;
>
> /* pte contains position in swap or file, so copy. */
> if (unlikely(!pte_present(pte))) {
> @@ -665,20 +757,26 @@ copy_one_pte(struct mm_struct *dst_mm, s
> get_page(page);
> page_dup_rmap(page, vma, addr);
> rss[!!PageAnon(page)]++;
> + if (unlikely(PageDontCOW(page)) && PageAnon(page))
> + ret = 1;
> }
>
> out_set_pte:
> set_pte_at(dst_mm, addr, dst_pte, pte);
> +
> + return ret;
> }
>
> static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> + pmd_t *dst_pmd, pmd_t *src_pmd,
> + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> unsigned long addr, unsigned long end)
> {
> pte_t *src_pte, *dst_pte;
> spinlock_t *src_ptl, *dst_ptl;
> int progress = 0;
> int rss[2];
> + int decow;
>
> again:
> rss[1] = rss[0] = 0;
> @@ -705,7 +803,10 @@ again:
> progress++;
> continue;
> }
> - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
> + decow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> + src_vma, addr, rss);
> + if (unlikely(decow))
> + goto decow;
> progress += 8;
> } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>
> @@ -714,14 +815,31 @@ again:
> pte_unmap_nested(src_pte - 1);
> add_mm_rss(dst_mm, rss[0], rss[1]);
> pte_unmap_unlock(dst_pte - 1, dst_ptl);
> +next:
> cond_resched();
> if (addr != end)
> goto again;
> return 0;
> +
> +decow:
> + arch_leave_lazy_mmu_mode();
> + spin_unlock(src_ptl);
> + pte_unmap_nested(src_pte);
> + add_mm_rss(dst_mm, rss[0], rss[1]);
> + decow = decow_one_pte(dst_mm, dst_pte, dst_pmd, dst_ptl, dst_vma, addr);
> + if (decow == -ENOMEM)
> + return -ENOMEM;
> + if (decow == -EAGAIN)
> + goto again;
> + pte_unmap_unlock(dst_pte, dst_ptl);
> + cond_resched();
> + addr += PAGE_SIZE;
> + goto next;
> }
>
> static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> - pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
> + pud_t *dst_pud, pud_t *src_pud,
> + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> unsigned long addr, unsigned long end)
> {
> pmd_t *src_pmd, *dst_pmd;
> @@ -736,14 +854,15 @@ static inline int copy_pmd_range(struct
> if (pmd_none_or_clear_bad(src_pmd))
> continue;
> if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> - vma, addr, next))
> + dst_vma, src_vma, addr, next))
> return -ENOMEM;
> } while (dst_pmd++, src_pmd++, addr = next, addr != end);
> return 0;
> }
>
> static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> - pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
> + pgd_t *dst_pgd, pgd_t *src_pgd,
> + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> unsigned long addr, unsigned long end)
> {
> pud_t *src_pud, *dst_pud;
> @@ -758,19 +877,19 @@ static inline int copy_pud_range(struct
> if (pud_none_or_clear_bad(src_pud))
> continue;
> if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
> - vma, addr, next))
> + dst_vma, src_vma, addr, next))
> return -ENOMEM;
> } while (dst_pud++, src_pud++, addr = next, addr != end);
> return 0;
> }
>
> int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> - struct vm_area_struct *vma)
> + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> {
> pgd_t *src_pgd, *dst_pgd;
> unsigned long next;
> - unsigned long addr = vma->vm_start;
> - unsigned long end = vma->vm_end;
> + unsigned long addr = src_vma->vm_start;
> + unsigned long end = src_vma->vm_end;
> int ret;
>
> /*
> @@ -779,20 +898,20 @@ int copy_page_range(struct mm_struct *ds
> * readonly mappings. The tradeoff is that copy_page_range is more
> * efficient than faulting.
> */
> - if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
> - if (!vma->anon_vma)
> + if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
> + if (!src_vma->anon_vma)
> return 0;
> }
>
> - if (is_vm_hugetlb_page(vma))
> - return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> + if (is_vm_hugetlb_page(src_vma))
> + return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
>
> - if (unlikely(is_pfn_mapping(vma))) {
> + if (unlikely(is_pfn_mapping(src_vma))) {
> /*
> * We do not free on error cases below as remove_vma
> * gets called on error from higher level routine
> */
> - ret = track_pfn_vma_copy(vma);
> + ret = track_pfn_vma_copy(src_vma);
> if (ret)
> return ret;
> }
> @@ -803,7 +922,7 @@ int copy_page_range(struct mm_struct *ds
> * parent mm. And a permission downgrade will only happen if
> * is_cow_mapping() returns true.
> */
> - if (is_cow_mapping(vma->vm_flags))
> + if (is_cow_mapping(src_vma->vm_flags))
> mmu_notifier_invalidate_range_start(src_mm, addr, end);
>
> ret = 0;
> @@ -814,15 +933,16 @@ int copy_page_range(struct mm_struct *ds
> if (pgd_none_or_clear_bad(src_pgd))
> continue;
> if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
> - vma, addr, next))) {
> + dst_vma, src_vma, addr, next))) {
> ret = -ENOMEM;
> break;
> }
> } while (dst_pgd++, src_pgd++, addr = next, addr != end);
>
> - if (is_cow_mapping(vma->vm_flags))
> + if (is_cow_mapping(src_vma->vm_flags))
> mmu_notifier_invalidate_range_end(src_mm,
> - vma->vm_start, end);
> + src_vma->vm_start, end);
> +
> return ret;
> }
>
> @@ -1272,8 +1392,6 @@ static inline int use_zero_page(struct v
> return !vma->vm_ops || !vma->vm_ops->fault;
> }
>
> -
> -
> int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int flags,
> struct page **pages, struct vm_area_struct **vmas)
> @@ -1298,6 +1416,7 @@ int __get_user_pages(struct task_struct
> do {
> struct vm_area_struct *vma;
> unsigned int foll_flags;
> + int decow;
>
> vma = find_extend_vma(mm, start);
> if (!vma && in_gate_area(tsk, start)) {
> @@ -1352,6 +1471,14 @@ int __get_user_pages(struct task_struct
> continue;
> }
>
> + /*
> + * Except in special cases where the caller will not read to or
> + * write from these pages, we must break COW for any pages
> + * returned from get_user_pages, so that our caller does not
> + * subsequently end up with the pages of a parent or child
> + * process after a COW takes place.
> + */
> + decow = (pages && is_cow_mapping(vma->vm_flags));
> foll_flags = FOLL_TOUCH;
> if (pages)
> foll_flags |= FOLL_GET;
> @@ -1372,7 +1499,7 @@ int __get_user_pages(struct task_struct
> fatal_signal_pending(current)))
> return i ? i : -ERESTARTSYS;
>
> - if (write)
> + if (write || decow)
> foll_flags |= FOLL_WRITE;
>
> cond_resched();
> @@ -1415,6 +1542,9 @@ int __get_user_pages(struct task_struct
> if (pages) {
> pages[i] = page;
>
> + if (decow && !PageDontCOW(page) &&
> + PageAnon(page))
> + SetPageDontCOW(page);
> flush_anon_page(vma, page, start);
> flush_dcache_page(page);
> }
> @@ -1966,6 +2096,8 @@ static int do_wp_page(struct mm_struct *
> }
> reuse = reuse_swap_page(old_page);
> unlock_page(old_page);
> + VM_BUG_ON(PageDontCOW(old_page) && !reuse);
> +
> } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> (VM_WRITE|VM_SHARED))) {
> /*
> Index: linux-2.6/arch/x86/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/gup.c
> +++ linux-2.6/arch/x86/mm/gup.c
> @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t
> struct page *page;
>
> if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
> +failed:
> pte_unmap(ptep);
> return 0;
> }
> VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> page = pte_page(pte);
> + if (PageAnon(page) && unlikely(!PageDontCOW(page)))
> + goto failed;
> get_page(page);
> pages[*nr] = page;
> (*nr)++;
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -106,6 +106,9 @@ enum pageflags {
> #endif
> __NR_PAGEFLAGS,
>
> + /* Anonymous pages */
> + PG_dontcow = PG_owner_priv_1, /* do not WP for COW optimisation */
> +
> /* Filesystems */
> PG_checked = PG_owner_priv_1,
>
> @@ -225,6 +228,8 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
> */
> TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
> __PAGEFLAG(Buddy, buddy)
> +__PAGEFLAG(DontCOW, dontcow)
> +SETPAGEFLAG(DontCOW, dontcow)
> PAGEFLAG(MappedToDisk, mappedtodisk)
>
> /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c
> +++ linux-2.6/kernel/fork.c
> @@ -359,7 +359,7 @@ static int dup_mmap(struct mm_struct *mm
> rb_parent = &tmp->vm_rb;
>
> mm->map_count++;
> - retval = copy_page_range(mm, oldmm, mpnt);
> + retval = copy_page_range(mm, oldmm, tmp, mpnt);
>
> if (tmp->vm_ops && tmp->vm_ops->open)
> tmp->vm_ops->open(tmp);
> Index: linux-2.6/arch/powerpc/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/gup.c
> +++ linux-2.6/arch/powerpc/mm/gup.c
> @@ -41,6 +41,8 @@ static noinline int gup_pte_range(pmd_t
> return 0;
> VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> page = pte_page(pte);
> + if (PageAnon(page) && unlikely(!PageDontCOW(page)))
> + return 0;
> if (!page_cache_get_speculative(page))
> return 0;
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {


2009-04-14 09:26:16

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Tuesday 14 April 2009 16:16:52 KOSAKI Motohiro wrote:

> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Sugessted-by: Linus Torvalds <[email protected]>

"Suggested-by:" ;)

> Cc: Hugh Dickins <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: [email protected]
> ---
> mm/rmap.c | 21 +++++++++++++++++++++
> mm/swapfile.c | 10 +++++++++-
> 2 files changed, 30 insertions(+), 1 deletion(-)
>
> Index: b/mm/swapfile.c
> ===================================================================
> --- a/mm/swapfile.c 2009-04-11 21:38:33.000000000 +0900
> +++ b/mm/swapfile.c 2009-04-11 21:38:45.000000000 +0900
> @@ -533,6 +533,8 @@ static inline int page_swapcount(struct
> * to it. And as a side-effect, free up its swap: because the old content
> * on disk will never be read, and seeking back there to write new content
> * later would only waste time away from clustering.
> + * Caller must hold pte_lock. try_to_unmap() decrement page::_mapcount
> + * and get_user_pages() increment page::_count under pte_lock.
> */
> int reuse_swap_page(struct page *page)
> {
> @@ -547,7 +549,13 @@ int reuse_swap_page(struct page *page)
> SetPageDirty(page);
> }
> }
> - return count == 1;
> +
> + /*
> + * If we can re-use the swap page _and_ the end
> + * result has only one user (the mapping), then
> + * we reuse the whole page
> + */
> + return count + page_count(page) == 2;
> }

I guess this patch does work to close the read-side race, but I slightly don't
like using page_count for things like this. page_count can be temporarily
raised for reasons other than access through their user mapping. Swapcache,
page reclaim, LRU pagevecs, concurrent do_wp_page, etc.


> /*
> Index: b/mm/rmap.c
> ===================================================================
> --- a/mm/rmap.c 2009-04-11 21:38:33.000000000 +0900
> +++ b/mm/rmap.c 2009-04-12 00:58:58.000000000 +0900
> @@ -773,6 +773,27 @@ static int try_to_unmap_one(struct page
> goto out;
>
> /*
> + * Don't pull an anonymous page out from under get_user_pages.
> + * GUP carefully breaks COW and raises page count (while holding
> + * pte_lock, as we have here) to make sure that the page
> + * cannot be freed. If we unmap that page here, a user write
> + * access to the virtual address will bring back the page, but
> + * its raised count will (ironically) be taken to mean it's not
> + * an exclusive swap page, do_wp_page will replace it by a copy
> + * page, and the user never get to see the data GUP was holding
> + * the original page for.
> + *
> + * This test is also useful for when swapoff (unuse_process) has
> + * to drop page lock: its reference to the page stops existing
> + * ptes from being unmapped, so swapoff can make progress.
> + */
> + if (PageSwapCache(page) &&
> + page_count(page) != page_mapcount(page) + 2) {
> + ret = SWAP_FAIL;
> + goto out_unmap;
> + }

I guess it does add another constraint to the VM, ie. not allowed to
unmap an anonymous page with elevated refcount. Maybe not a big deal
now, but I think it is enough that it should be noted. If you squint,
this could actually be more complex/intrusive to the wider VM than my
copy on fork (which is basically exactly like a manual do_wp_page at
fork time).

And.... I don't think this is safe against a concurrent gup_fast()
(which helps my point).

2009-04-14 09:37:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/6] IO pinning(get_user_pages()) vs fork race fix

On Tuesday 14 April 2009 19:19:10 KOSAKI Motohiro wrote:

> > I don't see how it fixes vmsplice? vmsplice can get_user_pages pages from one
> > process's address space and put them into a pipe, and they are released by
> > another process after consuming the pages I think. So it's fairly hard to hold
> > a lock over this.
>
> I realize my explanation was poor.
>
> Firstly, pipe_to_user() via vmsplice_to_user() uses copy_to_user(), so we don't need to
> care about the receive side.
> Secondly, get_iovec_page_array() via vmsplice_to_pipe() uses gup(read),
> so we only need to prevent the page from being changed.
>
> I changed reuse_swap_page() in [1/6]. Then, if any process touches the page while
> it hasn't been consumed from the pipe yet, COW is broken and the toucher gets a copied page.
> So nobody can change the original page.
>
> Thus, this patch series also fixes the vmsplice issue, I think.
> Am I missing anything?

Ah thanks, I see now. No I don't think you're missing anything.


> > I guess apart from the vmsplice issue (unless I missed a clever fix), I guess
> > this *does* work. I can't see any races... I'd really still like to hear a good
> > reason why my proposed patch is so obviously crap.
> >
> > Reasons proposed so far:
> > "No locking" (I think this is a good thing; no *bugs* have been pointed out)
> > "Too many page flags" (but it only uses 1 anon page flag, only fs pagecache
> > has a flags shortage so we can easily overload a pagecache flag)
> > "Diffstat too large" (seems comparable when you factor in the fixes to callers,
> > but has the advantage of being contained within VM subsystem)
> > "Horrible code" (I still don't see it. Of course the code will be nicer if we
> > don't fix the issue _at all_, but I don't see this is so much worse than having
> > to fix callers.)
>
> Honestly, I don't dislike your approach,
> but I really hope to fix this bug. If someone NAKs your patch, I'll seek another way.

Yes, I appreciate you looking at alternatives, and you haven't been strongly
arguing against my patch. So this comment was not aimed at you :)


> > FWIW, I have attached my patch again (with simple function-movement hunks
> > moved into another patch so it is easier to see real impact of this patch).
>
> OK. I'll try to test your patch too.

Well I split it out and it requires another patch to move functions around
(eg. zap_pte from fremap.c into memory.c). I just attached it here to
illustrate the core of my fix. If you would like to run any real tests, let
me know and I could send a proper rollup.

2009-04-14 12:03:00

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

> On Tuesday 14 April 2009 16:16:52 KOSAKI Motohiro wrote:
>
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> Sugessted-by: Linus Torvalds <[email protected]>
>
> "Suggested-by:" ;)

Agghh, thanks.


>> @@ -547,7 +549,13 @@ int reuse_swap_page(struct page *page)
>>                       SetPageDirty(page);
>>               }
>>       }
>> -     return count == 1;
>> +
>> +     /*
>> +      * If we can re-use the swap page _and_ the end
>> +      * result has only one user (the mapping), then
>> +      * we reuse the whole page
>> +      */
>> +     return count + page_count(page) == 2;
>>  }
>
> I guess this patch does work to close the read-side race, but I slightly don't
> like using page_count for things like this. page_count can be temporarily
> raised for reasons other than access through their user mapping. Swapcache,
> page reclaim, LRU pagevecs, concurrent do_wp_page, etc.

Yes, that's a trade-off.
Your early decow can also misjudge and make an unnecessary copy.



>>       /*
>> +      * Don't pull an anonymous page out from under get_user_pages.
>> +      * GUP carefully breaks COW and raises page count (while holding
>> +      * pte_lock, as we have here) to make sure that the page
>> +      * cannot be freed.  If we unmap that page here, a user write
>> +      * access to the virtual address will bring back the page, but
>> +      * its raised count will (ironically) be taken to mean it's not
>> +      * an exclusive swap page, do_wp_page will replace it by a copy
>> +      * page, and the user never get to see the data GUP was holding
>> +      * the original page for.
>> +      *
>> +      * This test is also useful for when swapoff (unuse_process) has
>> +      * to drop page lock: its reference to the page stops existing
>> +      * ptes from being unmapped, so swapoff can make progress.
>> +      */
>> +     if (PageSwapCache(page) &&
>> +         page_count(page) != page_mapcount(page) + 2) {
>> +             ret = SWAP_FAIL;
>> +             goto out_unmap;
>> +     }
>
> I guess it does add another constraint to the VM, ie. not allowed to
> unmap an anonymous page with elevated refcount. Maybe not a big deal
> now, but I think it is enough that it should be noted. If you squint,
> this could actually be more complex/intrusive to the wider VM than my
> copy on fork (which is basically exactly like a manual do_wp_page at
> fork time).

I agree this code affects kernel activity widely.
But actually, in the past the kernel behaved the same way, so almost all
core code is safe with respect to this kind of page_count checking.

But yes, we need to worry that newer code won't work with this code...


> And.... I don't think this is safe against a concurrent gup_fast()
> (which helps my point).

Could you please explain in more detail?

2009-04-14 12:25:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Tuesday 14 April 2009 22:02:47 KOSAKI Motohiro wrote:
> > On Tuesday 14 April 2009 16:16:52 KOSAKI Motohiro wrote:
> >> Signed-off-by: KOSAKI Motohiro <[email protected]>

> >> @@ -547,7 +549,13 @@ int reuse_swap_page(struct page *page)
> >> SetPageDirty(page);
> >> }
> >> }
> >> - return count == 1;
> >> +
> >> + /*
> >> + * If we can re-use the swap page _and_ the end
> >> + * result has only one user (the mapping), then
> >> + * we reuse the whole page
> >> + */
> >> + return count + page_count(page) == 2;
> >> }
> >
> > I guess this patch does work to close the read-side race, but I slightly don't
> > like using page_count for things like this. page_count can be temporarily
> > raised for reasons other than access through their user mapping. Swapcache,
> > page reclaim, LRU pagevecs, concurrent do_wp_page, etc.
>
> Yes, that's a trade-off.
> Your early decow can also misjudge and make an unnecessary copy.

Yes indeed it can. Although it would only ever do so in case of pages
that have had get_user_pages run against them previously, and not from
random interactions from any other parts of the kernel.

I would be interested, using an anon vma field as you say for keeping
a gup count... it could potentially be used to avoid the extra copy.
But hmm, I don't have much time to go down that path so long as the
basic concept of my proposal is in question.


> >> /*
> >> + * Don't pull an anonymous page out from under get_user_pages.
> >> + * GUP carefully breaks COW and raises page count (while holding
> >> + * pte_lock, as we have here) to make sure that the page
> >> + * cannot be freed. If we unmap that page here, a user write
> >> + * access to the virtual address will bring back the page, but
> >> + * its raised count will (ironically) be taken to mean it's not
> >> + * an exclusive swap page, do_wp_page will replace it by a copy
> >> + * page, and the user never get to see the data GUP was holding
> >> + * the original page for.
> >> + *
> >> + * This test is also useful for when swapoff (unuse_process) has
> >> + * to drop page lock: its reference to the page stops existing
> >> + * ptes from being unmapped, so swapoff can make progress.
> >> + */
> >> + if (PageSwapCache(page) &&
> >> + page_count(page) != page_mapcount(page) + 2) {
> >> + ret = SWAP_FAIL;
> >> + goto out_unmap;
> >> + }
> >
> > I guess it does add another constraint to the VM, ie. not allowed to
> > unmap an anonymous page with elevated refcount. Maybe not a big deal
> > now, but I think it is enough that it should be noted. If you squint,
> > this could actually be more complex/intrusive to the wider VM than my
> > copy on fork (which is basically exactly like a manual do_wp_page at
> > fork time).
>
> I agree this code affects kernel activity widely.
> But actually, in the past the kernel behaved the same way, so almost all
> core code is safe with respect to this kind of page_count checking.
>
> But yes, we need to worry that newer code won't work with this code...
>
>
> > And.... I don't think this is safe against a concurrent gup_fast()
> > (which helps my point).
>
> Could you please explain in more detail?
>

+ if (PageSwapCache(page) &&
+ page_count(page) != page_mapcount(page) + 2) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }

Now if another thread does a get_user_pages_fast after it passes this
check, it can take a gup reference to the page which is now about to
be unmapped. Then after it is unmapped, if a wp fault is caused on the
page, then it will not be reused and thus you lose data as explained
in your big comment.
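
Spelled out as an interleaving (illustrative, not text from the thread):

  try_to_unmap_one()                        second thread: get_user_pages_fast()
  ------------------                        -------------------------------------
  page_count == page_mapcount + 2,
  so the new check passes
                                            get_page(page)   (no pte_lock taken)
  ptep_clear_flush(): the pte is unmapped
  even though the page is now gup-pinned
  ...
  later write fault on that address:
  page_count is elevated, so do_wp_page()
  refuses to reuse the page and installs a
  copy -- the data the gup user places in
  the original page is never seen by the
  faulting task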

2009-04-14 13:40:14

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

>> >> @@ -547,7 +549,13 @@ int reuse_swap_page(struct page *page)
>> >>                       SetPageDirty(page);
>> >>               }
>> >>       }
>> >> -     return count == 1;
>> >> +
>> >> +     /*
>> >> +      * If we can re-use the swap page _and_ the end
>> >> +      * result has only one user (the mapping), then
>> >> +      * we reuse the whole page
>> >> +      */
>> >> +     return count + page_count(page) == 2;
>> >>  }
>> >
>> > I guess this patch does work to close the read-side race, but I slightly don't
>> > like using page_count for things like this. page_count can be temporarily
>> > raised for reasons other than access through their user mapping. Swapcache,
>> > page reclaim, LRU pagevecs, concurrent do_wp_page, etc.
>>
>> Yes, that's trade-off.
>> your early decow also can misjudge and make unnecessary copy.
>
> Yes indeed it can. Although it would only ever do so in case of pages
> that have had get_user_pages run against them previously, and not from
> random interactions from any other parts of the kernel.

Agreed.

> I would be interested, using an anon vma field as you say for keeping
> a gup count... it could potentially be used to avoid the extra copy.
> But hmm, I don't have much time to go down that path so long as the
> basic concept of my proposal is in question.

OK, I'll try to make it. Thanks.

> +     if (PageSwapCache(page) &&
> +         page_count(page) != page_mapcount(page) + 2) {
> +             ret = SWAP_FAIL;
> +             goto out_unmap;
> +     }
>
> Now if another thread does a get_user_pages_fast after it passes this
> check, it can take a gup reference to the page which is now about to
> be unmapped. Then after it is unmapped, if a wp fault is caused on the
> page, then it will not be reused and thus you lose data as explained
> in your big comment.

Grrr, I lose. I had mistakenly assumed get_user_pages_fast() also grabs the pte_lock.
I must think about it again.

I guess you dislike get_user_page_fast() grab pte_lock too, right?

2009-04-14 13:42:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 4/6] aio: Don't inherit aio ring memory at fork

On Tue, Apr 14, 2009 at 03:20:20PM +0900, KOSAKI Motohiro wrote:
> In addition, this patch has a good side effect: it also fixes the "get_user_pages() vs fork" problem.

Yes, patches like 3/6, 4/6, and 6/6 are the side effect of not fixing
the core race in gup and spreading the new rwsem around the gup users,
instead of sticking to a page-granular PG_flag touched at the same
time atomic_inc runs on page->_count.

2009-04-14 14:15:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Tue, Apr 14, 2009 at 10:39:54PM +0900, KOSAKI Motohiro wrote:
> I guess you dislike get_user_page_fast() grab pte_lock too, right?

If get_user_page_fast is vetoed to run a set_bit on the already cache
hot and exclusive struct page, I doubt taking a potentially cache
cold, mm-wide or pmd-wide pte_lock is ok.

2009-04-14 14:27:15

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Wednesday 15 April 2009 00:12:09 Andrea Arcangeli wrote:
> On Tue, Apr 14, 2009 at 10:39:54PM +0900, KOSAKI Motohiro wrote:
> > I guess you dislike get_user_page_fast() grab pte_lock too, right?
>
> If get_user_page_fast is vetoed to run a set_bit on the already cache
> hot and exclusive struct page, I doubt taking a potentially cache
> cold, mm-wide or pmd-wide pte_lock is ok.

Yes, I'd *really* rather not. I actually implemented gup_fast in
response to a problem reported with a DB2 workload hitting the ptl
(and not the more obvious mmap_sem, although certainly they had
some gain from removing that cacheline as well).

gup_fast iirc is worth nearly 10% on a 4 socket x86 system with
DB2. That's the same order of magnitude as the speedups quoted
to justify the addition of hugepages, or O_DIRECT itself.

Andrea: I didn't veto that set_bit change of yours as such. I just
noted there could be more atomic operations. Actually I would
welcome more comparison between our two approaches, but they seem
to be stuck with Linus refusing (I think) to copy the page at
fork() time :(

2009-04-14 14:35:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Wed, Apr 15, 2009 at 12:26:34AM +1000, Nick Piggin wrote:
> Andrea: I didn't veto that set_bit change of yours as such. I just

I know you didn't ;)

> noted there could be more atomic operations. Actually I would
> welcome more comparison between our two approaches, but they seem

Agreed that a comparison would be welcome; it'd be nice to measure it on the
enterprise workloads that showed the gup_fast gain in the first place.

2009-04-14 14:40:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Tue, Apr 14, 2009 at 03:16:52PM +0900, KOSAKI Motohiro wrote:
> + if (PageSwapCache(page) &&
> + page_count(page) != page_mapcount(page) + 2) {
> + ret = SWAP_FAIL;
> + goto out_unmap;
> + }
> +

Besides the race pointed out by Nick, this also would break KVM
swapping with mmu notifier. mmu_notifier_invalidate_page must be
invoked before reading page_count for this to work. However the
invalidate has to be moved below the
mlock/ptep_clear_flush_young_notify, no point to get rid of sptes if
any of the spte or the pte is still young.

2009-04-14 14:42:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Wednesday 15 April 2009 00:32:52 Andrea Arcangeli wrote:
> On Wed, Apr 15, 2009 at 12:26:34AM +1000, Nick Piggin wrote:
> > Andrea: I didn't veto that set_bit change of yours as such. I just
>
> I know you didn't ;)
>
> > noted there could be more atomic operations. Actually I would
> > welcome more comparison between our two approaches, but they seem
>
> Agreed that a comparison would be welcome; it'd be nice to measure it on the
> enterprise workloads that showed the gup_fast gain in the first place.

I think we should be able to ask IBM to run some tests, provided
they still have machines available to do so. Although I don't want
to waste their time so we need to have something that has got past
initial code review and has a chance of being merged.

If we get that far, then I can ask them to run tests definitely.

2009-04-14 15:23:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Wed, Apr 15, 2009 at 12:42:14AM +1000, Nick Piggin wrote:
> to waste their time [..]

Me neither.

> If we get that far, then I can ask them to run tests definitely.

Agreed!

2009-04-14 16:03:39

by Jeff Moyer

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 4/6] aio: Don't inherit aio ring memory at fork

KOSAKI Motohiro <[email protected]> writes:

> AIO folks, Am I missing anything?
>
> ===============
> Subject: [RFC][PATCH] aio: Don't inherit aio ring memory at fork
>
> Currently, the mm_struct::ioctx_list member isn't copied at fork. IOW, the aio context is not inherited at fork,
> but the ring memory is. That's strange.
>
> This patch marks the ring memory DONTFORK too.

Well, given that clearly nobody relies on io contexts being copied to
the child, I think it's okay to make this change. I think the current
behaviour violates the principal of least surprise, but I'm having a
hard time getting upset about that. ;)

> In addition, this patch has a good side effect: it also fixes the
> "get_user_pages() vs fork" problem.

Hmm, I don't follow you, here. As I understand it, the get_user_pages
vs. fork problem has to do with the pages used for the actual I/O, not
the pages used to store the completion data. So, could you elaborate a
bit on what you mean by the above statement?

> I think "man fork" also sould be changed. it only say
>
> * The child does not inherit outstanding asynchronous I/O operations from
> its parent (aio_read(3), aio_write(3)).
> but aio_context_t (return value of io_setup(2)) also don't inherit in current implementaion.

I can certainly make that change, as I have other changes I need to push
to Michael, anyway.

Cheers,
Jeff

2009-04-14 16:48:21

by Jeff Moyer

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/6] mm, directio: fix fork vs direct-io race (read(2) side IOW gup(write) side)

KOSAKI Motohiro <[email protected]> writes:

> Oops, I forgot some cc. resend it.
>
>> Subject: [PATCH] mm, directio: fix fork vs direct-io race
>>
>>
>> ChangeLog:
>> V2 -> V3
>> o remove early decow logic
>>
>> V1 -> V2
>> o add dio+aio logic
>>

[snip test programs]

>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> Sugessted-by: Linus Torvalds <[email protected]>
>> Cc: Hugh Dickins <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Andrea Arcangeli <[email protected]>
>> Cc: Jeff Moyer <[email protected]>
>> Cc: Zach Brown <[email protected]>
>> Cc: Andy Grover <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> fs/direct-io.c | 16 ++++++++++++++++
>> include/linux/init_task.h | 1 +
>> include/linux/mm_types.h | 6 ++++++
>> kernel/fork.c | 3 +++
>> 4 files changed, 26 insertions(+)
>>
>> Index: b/fs/direct-io.c
>> ===================================================================
>> --- a/fs/direct-io.c 2009-04-13 00:24:01.000000000 +0900
>> +++ b/fs/direct-io.c 2009-04-13 01:36:37.000000000 +0900
>> @@ -131,6 +131,9 @@ struct dio {
>> int is_async; /* is IO async ? */
>> int io_error; /* IO error in completion path */
>> ssize_t result; /* IO result */
>> +
>> + /* fork exclusive stuff */
>> + struct mm_struct *mm;
>> };
>>
>> /*
>> @@ -244,6 +247,12 @@ static int dio_complete(struct dio *dio,
>> /* lockdep: non-owner release */
>> up_read_non_owner(&dio->inode->i_alloc_sem);
>>
>> + if (dio->rw == READ) {
>> + BUG_ON(!dio->mm);
>> + up_read_non_owner(&dio->mm->mm_pinned_sem);
>> + mmdrop(dio->mm);
>> + }
>> +
>> if (ret == 0)
>> ret = dio->page_errors;
>> if (ret == 0)
>> @@ -942,6 +951,7 @@ direct_io_worker(int rw, struct kiocb *i
>> ssize_t ret = 0;
>> ssize_t ret2;
>> size_t bytes;
>> + struct mm_struct *mm;
>>
>> dio->inode = inode;
>> dio->rw = rw;
>> @@ -960,6 +970,12 @@ direct_io_worker(int rw, struct kiocb *i
>> spin_lock_init(&dio->bio_lock);
>> dio->refcount = 1;
>>
>> + if (rw == READ) {
>> + mm = dio->mm = current->mm;
>> + atomic_inc(&mm->mm_count);
>> + down_read_non_owner(&mm->mm_pinned_sem);
>> + }
>> +

So, if you're continuously submitting async read I/O, you will starve
out the fork() call indefinitely. I agree that you want to allow
multiple O_DIRECT I/Os to go on at once, but I'm not sure this is the
right way forward.

I have to weigh in and say I much prefer the patches posted by Nick and
Andrea. They were much more contained and had negligible performance
impact.

Have you done any performance measurements on this patch series?

Cheers,
Jeff

2009-04-14 16:49:50

by Jeff Moyer

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 3/6] nfs, direct-io: fix fork vs direct-io race on nfs

KOSAKI Motohiro <[email protected]> writes:

> Subject: [PATCH] nfs, direct-io: fix fork vs direct-io race on nfs
>
> After fs/diorect-io.c fix, following testcase still fail on nfs running.
> it's because nfs has own specific diorct-io implementation.

Same issue as 2/6.

Cheers,
Jeff

2009-04-14 17:54:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/6] mm, directio: fix fork vs direct-io race (read(2) side IOW gup(write) side)

On Tue, Apr 14, 2009 at 12:45:41PM -0400, Jeff Moyer wrote:
> So, if you're continuously submitting async read I/O, you will starve
> out the fork() call indefinitely. I agree that you want to allow

IIRC rwsem is good enough to stop the down_read when a down_write is
blocked; otherwise a page-fault flood in threads would also starve any
mmap or similar call. Still, with this approach fork will start to hang
indefinitely waiting for I/O, making it an I/O-bound call and not a
CPU-bound call anymore, which may severely impact the interactiveness
of applications.

As long as fork is useful in the first place to provide memory
protection between code with different memory-corruption trust levels
(otherwise nobody should use fork at all, and vfork [or better, spawn]
should become the only option), then fork from a thread pool is also
reasonable. Either fork is totally useless as a whole (which I wouldn't
argue too much about), or, if you agree fork makes any sense, it also
makes sense when intermixed with clone(CLONE_VM), and it should
hopefully stay CPU-bound like CLONE_VM.

2009-04-14 18:12:31

by Jeff Moyer

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/6] mm, directio: fix fork vs direct-io race (read(2) side IOW gup(write) side)

Andrea Arcangeli <[email protected]> writes:

> On Tue, Apr 14, 2009 at 12:45:41PM -0400, Jeff Moyer wrote:
>> So, if you're continuously submitting async read I/O, you will starve
>> out the fork() call indefinitely. I agree that you want to allow
>
> IIRC rwsem good enough to stop the down_read when a down_write is
> blocked. Otherwise page fault flood in threads would also starve any
> mmap or similar call. Still with this approach fork will start to hang

Really? I don't actually see that in the code, have I missed it?

Cheers,
Jeff

2009-04-14 19:50:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/6] mm, directio: fix fork vs direct-io race (read(2) side IOW gup(write) side)

On Tue, Apr 14, 2009 at 02:10:08PM -0400, Jeff Moyer wrote:
> Really? I don't actually see that in the code, have I missed it?

Checking the spinlock version: when any writer is waiting,
sem->wait_list won't be empty and down_read will wait too. The wakeup
is FIFO, with __rwsem_do_wake doing a wake-one if the first entry in
the queue is a down_write. So it looks OK to me. The asm version should
have equivalent logic.
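
For readers who want the property in isolation, here is a toy userspace
model of the behaviour described above (the names toy_rwsem, toy_down_read
and so on are invented for illustration; this is a sketch, not the kernel's
rwsem implementation): once a writer is queued, later readers block behind
it instead of piling on, so a blocked down_write() cannot be starved by a
stream of down_read() calls.

#include <pthread.h>
#include <stdbool.h>

struct toy_rwsem {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	int  active_readers;	/* readers currently holding the lock */
	bool writer_active;	/* a writer currently holds the lock */
	int  writers_waiting;	/* models a non-empty sem->wait_list */
};

#define TOY_RWSEM_INITIALIZER \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false, 0 }

static void toy_down_read(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->lock);
	/* Readers are admitted only while no writer holds or waits. */
	while (s->writer_active || s->writers_waiting > 0)
		pthread_cond_wait(&s->cond, &s->lock);
	s->active_readers++;
	pthread_mutex_unlock(&s->lock);
}

static void toy_up_read(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->lock);
	if (--s->active_readers == 0)
		pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->lock);
}

static void toy_down_write(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->lock);
	s->writers_waiting++;	/* from now on, new readers block */
	while (s->writer_active || s->active_readers > 0)
		pthread_cond_wait(&s->cond, &s->lock);
	s->writers_waiting--;
	s->writer_active = true;
	pthread_mutex_unlock(&s->lock);
}

static void toy_up_write(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->lock);
	s->writer_active = false;
	pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->lock);
}

In this model a continuous stream of readers (the I/O submitters) has to
drain once a writer (fork taking the semaphore for writing) queues up,
which is the starvation point; whether fork then becoming I/O-bound is
acceptable is the separate concern raised earlier in the thread.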

2009-04-15 00:56:50

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 4/6] aio: Don't inherit aio ring memory at fork

Hi!

> KOSAKI Motohiro <[email protected]> writes:
>
> > AIO folks, Am I missing anything?
> >
> > ===============
> > Subject: [RFC][PATCH] aio: Don't inherit aio ring memory at fork
> >
> > Currently, mm_struct::ioctx_list member isn't copyed at fork. IOW aio context don't inherit at fork.
> > but only ring memory inherited. that's strange.
> >
> > This patch mark DONTFORK to ring-memory too.
>
> Well, given that clearly nobody relies on io contexts being copied to
> the child, I think it's okay to make this change. I think the current
> behaviour violates the principal of least surprise, but I'm having a
> hard time getting upset about that. ;)

ok.
So, Can I get your Acked-by?

> > In addition, This patch has good side effect. it also fix
> > "get_user_pages() vs fork" problem.
>
> Hmm, I don't follow you, here. As I understand it, the get_user_pages
> vs. fork problem has to do with the pages used for the actual I/O, not
> the pages used to store the completion data. So, could you elaborate a
> bit on what you mean by the above statement?

No.

The problem is that get_user_pages() only increments page_count, but the
VM page-fault logic doesn't care about page_count (it only looks at
page::_mapcount). So fork plus a page fault can change the
virtual-to-physical relationship even while get_user_pages() is holding
the page.

drawback: worst-case aio scenario
-----------------------------------------------------------------------
io_setup() and gup               inc page_count

fork                             inc mapcount
                                 and make the pte write-protected

write to ring from userland(*)   page fault and COW break:
                                 the parent process gets a copied page and
                                 the child keeps ownership of the original page

kmap and memcpy from kernel      changes only the child's page
                                 (i.e. the parent never sees the data)

(*) Does this actually happen?

MADV_DONTFORK, or down_read(mmap_sem), or down_read(mm_pinned_sem), or a
copy-at-fork mechanism (= the Nick/Andrea patches) would solve it.
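
To make the first of those mitigations concrete, here is a minimal
userspace sketch (not part of this patch series; the buffer size and the
names are arbitrary) that marks a pinned buffer MADV_DONTFORK before
forking, so the child never maps the page and no COW break can hand the
original page away from the parent:

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
	size_t len = 4096;
	pid_t pid;

	/* An anonymous buffer that some kernel path would pin with gup. */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		perror("mmap"), exit(1);

	/* Keep this VMA out of any child: fork() simply does not copy it,
	 * so the pte is never write-protected and no COW break can move
	 * the page out from under a pinned I/O. */
	if (madvise(buf, len, MADV_DONTFORK) == -1)
		perror("madvise"), exit(1);

	memset(buf, 0xaa, len);

	pid = fork();
	if (pid == -1)
		perror("fork"), exit(1);
	if (pid == 0)
		_exit(0);	/* this range is simply unmapped here */
	waitpid(pid, NULL, 0);

	printf("parent still sees 0x%02x\n", (unsigned char)buf[0]);
	return 0;
}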



> > I think "man fork" also sould be changed. it only say
> >
> > * The child does not inherit outstanding asynchronous I/O operations from
> > its parent (aio_read(3), aio_write(3)).
> > but aio_context_t (return value of io_setup(2)) also don't inherit in current implementaion.
>
> I can certainly make that change, as I have other changes I need to push
> to Michael, anyway.

thanks.


2009-04-15 02:46:18

by Jeff Moyer

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 4/6] aio: Don't inherit aio ring memory at fork

KOSAKI Motohiro <[email protected]> writes:

> Hi!
>
>> KOSAKI Motohiro <[email protected]> writes:
>>
>> > AIO folks, Am I missing anything?
>> >
>> > ===============
>> > Subject: [RFC][PATCH] aio: Don't inherit aio ring memory at fork
>> >
>> > Currently, mm_struct::ioctx_list member isn't copyed at fork. IOW aio context don't inherit at fork.
>> > but only ring memory inherited. that's strange.
>> >
>> > This patch mark DONTFORK to ring-memory too.
>>
>> Well, given that clearly nobody relies on io contexts being copied to
>> the child, I think it's okay to make this change. I think the current
>> behaviour violates the principal of least surprise, but I'm having a
>> hard time getting upset about that. ;)
>
> ok.
> So, Can I get your Acked-by?

I have more comments below.

>> > In addition, This patch has good side effect. it also fix
>> > "get_user_pages() vs fork" problem.
>>
>> Hmm, I don't follow you, here. As I understand it, the get_user_pages
>> vs. fork problem has to do with the pages used for the actual I/O, not
>> the pages used to store the completion data. So, could you elaborate a
>> bit on what you mean by the above statement?
>
> No.
>
> The problem is, get_user_pages() increment page_count only.
> but VM page-fault logic don't care page_count. (it only care page::_mapcount)
> Then, fork and pagefault can change virtual-physical relationship although
> get_user_pages() is called.
>
> drawback worst aio scenario here
> -----------------------------------------------------------------------
> io_setup() and gup inc page_count
>
> fork inc mapcount
> and make write-protect to pte
>
> write ring from userland(*) page fault and
> COW break.
> parent process get copyed page and
> child get original page owner-ship.
>
> kmap and memcpy from kernel change child page. (it mean data lost)
>
> (*) Is this happend?

I guess it's possible, but I don't know of any programs that do this.

> MADV_DONTFORK or down_read(mmap_sem) or down_read(mm_pinned_sem)
> or copy-at-fork mecanism(=Nick/Andrea patch) solve it.

OK, thanks for the explanation.

+ /*
+ * aio context doesn't inherit while fork. (see mm_init())
+ * Then, aio ring also mark DONTFORK.
+ */

Would you mind if I did some word-smithing on that comment? Something
like:
/*
* The io_context is not inherited by the child after fork()
* (see mm_init). Therefore, it makes little sense for the
* completion ring to be inherited.
*/

+ ret = sys_madvise(info->mmap_base, info->mmap_size, MADV_DONTFORK);
+ BUG_ON(ret);
+

It appears there's no other way to set the VM_DONTCOPY flag, so I guess
calling sys_madvise is fine. I'm not sure I agree with the BUG_ON(ret),
however, as EAGAIN may be feasible.

So, fix that up and you can add my reviewed-by. I think you should push
this patch independent of the other patches in this series.
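
For what that might look like, a hedged sketch of the non-fatal handling
suggested above (hypothetical; this is not the posted patch, and the
warning text is made up):

	/* Instead of BUG_ON(ret): -EAGAIN from sys_madvise() is not a
	 * programming error, so warn and fall back to the old behaviour
	 * of letting the ring be inherited across fork(). */
	ret = sys_madvise(info->mmap_base, info->mmap_size, MADV_DONTFORK);
	if (ret)
		printk(KERN_WARNING
		       "aio: failed to mark ring MADV_DONTFORK (%ld)\n",
		       (long)ret);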

>> > I think "man fork" also sould be changed. it only say
>> >
>> > * The child does not inherit outstanding asynchronous I/O operations from
>> > its parent (aio_read(3), aio_write(3)).
>> > but aio_context_t (return value of io_setup(2)) also don't inherit in current implementaion.
>>
>> I can certainly make that change, as I have other changes I need to push
>> to Michael, anyway.
>
> thanks.

No problem. As you know, I've already sent a patch for this.

Cheers,
Jeff

2009-04-15 03:00:35

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 4/6] aio: Don't inherit aio ring memory at fork

Hi

> > drawback worst aio scenario here
> > -----------------------------------------------------------------------
> > io_setup() and gup inc page_count
> >
> > fork inc mapcount
> > and make write-protect to pte
> >
> > write ring from userland(*) page fault and
> > COW break.
> > parent process get copyed page and
> > child get original page owner-ship.
> >
> > kmap and memcpy from kernel change child page. (it mean data lost)
> >
> > (*) Is this happend?
>
> I guess it's possible, but I don't know of any programs that do this.

Yup, I also think this doesn't happen in the real world.

>
> > MADV_DONTFORK or down_read(mmap_sem) or down_read(mm_pinned_sem)
> > or copy-at-fork mecanism(=Nick/Andrea patch) solve it.
>
> OK, thanks for the explanation.
>
> + /*
> + * aio context doesn't inherit while fork. (see mm_init())
> + * Then, aio ring also mark DONTFORK.
> + */
>
> Would you mind if I did some word-smithing on that comment? Something
> like:
> /*
> * The io_context is not inherited by the child after fork()
> * (see mm_init). Therefore, it makes little sense for the
> * completion ring to be inherited.
> */
>
> + ret = sys_madvise(info->mmap_base, info->mmap_size, MADV_DONTFORK);
> + BUG_ON(ret);
> +
>
> It appears there's no other way to set the VM_DONTCOPY flag, so I guess
> calling sys_madvise is fine. I'm not sure I agree with the BUG_ON(ret),
> however, as EAGAIN may be feasible.
>
> So, fix that up and you can add my reviewed-by. I think you should push
> this patch independent of the other patches in this series.

Done :)



2009-04-15 08:06:21

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

Hi

> On Wednesday 15 April 2009 00:32:52 Andrea Arcangeli wrote:
> > On Wed, Apr 15, 2009 at 12:26:34AM +1000, Nick Piggin wrote:
> > > Andrea: I didn't veto that set_bit change of yours as such. I just
> >
> > I know you didn't ;)
> >
> > > noted there could be more atomic operations. Actually I would
> > > welcome more comparison between our two approaches, but they seem
> >
> > Agree about the welcome of comparison, it'd be nice to measure it the
> > enterprise workloads that showed the gup_fast gain in the first place.
>
> I think we should be able to ask IBM to run some tests, provided
> they still have machines available to do so. Although I don't want
> to waste their time so we need to have something that has got past
> initial code review and has a chance of being merged.
>
> If we get that far, then I can ask them to run tests definitely.

Oh, that seems like a very attractive idea.
Nick, I'd like to help get your patch rolled up; it would make for a
good comparison, I think. Is there anything I can do?

Also, I changed my patch.
How about this? I added a simple double check,

because both do_wp_page and try_to_unmap_one grab the ptl; the
page-fault routine therefore can't change the pte while try_to_unmap is
nuking it.



---
mm/rmap.c | 30 ++++++++++++++++++++++++------
mm/swapfile.c | 3 ++-
2 files changed, 26 insertions(+), 7 deletions(-)

Index: b/mm/swapfile.c
===================================================================
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -547,7 +547,8 @@ int reuse_swap_page(struct page *page)
SetPageDirty(page);
}
}
- return count == 1;
+
+ return count + page_count(page) == 2;
}

/*
Index: b/mm/rmap.c
===================================================================
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -772,12 +772,18 @@ static int try_to_unmap_one(struct page
if (!pte)
goto out;

- /*
- * If the page is mlock()d, we cannot swap it out.
- * If it's recently referenced (perhaps page_referenced
- * skipped over this mm) then we should reactivate it.
- */
if (!migration) {
+ if (PageSwapCache(page) &&
+ page_count(page) != page_mapcount(page) + 2) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+
+ /*
+ * If the page is mlock()d, we cannot swap it out.
+ * If it's recently referenced (perhaps page_referenced
+ * skipped over this mm) then we should reactivate it.
+ */
if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
@@ -790,7 +796,19 @@ static int try_to_unmap_one(struct page

/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ if (!migration) {
+ /* re-check */
+ if (PageSwapCache(page) &&
+ page_count(page) != page_mapcount(page) + 2) {
+ /* We lose race against get_user_pages_fast() */
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+ }
+ mmu_notifier_invalidate_page(vma->vm_mm, address);

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))

2009-04-15 08:23:03

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Wednesday 15 April 2009 18:05:54 KOSAKI Motohiro wrote:
> Hi
>
> > On Wednesday 15 April 2009 00:32:52 Andrea Arcangeli wrote:
> > > On Wed, Apr 15, 2009 at 12:26:34AM +1000, Nick Piggin wrote:
> > > > Andrea: I didn't veto that set_bit change of yours as such. I just
> > >
> > > I know you didn't ;)
> > >
> > > > noted there could be more atomic operations. Actually I would
> > > > welcome more comparison between our two approaches, but they seem
> > >
> > > Agree about the welcome of comparison, it'd be nice to measure it the
> > > enterprise workloads that showed the gup_fast gain in the first place.
> >
> > I think we should be able to ask IBM to run some tests, provided
> > they still have machines available to do so. Although I don't want
> > to waste their time so we need to have something that has got past
> > initial code review and has a chance of being merged.
> >
> > If we get that far, then I can ask them to run tests definitely.
>
> Oh, it seem very charming idea.
> Nick, I hope to help your patch's rollup. It makes good comparision, I think.
> Is there my doable thing?

Well, I guess review and testing. There are few possibilities for
reducing the cases where we have to de-cow (or increasing the
cases where we can WP-on-fork), which I'd like to experiment with,
but I don't know how much it will help...


> And, I changed my patch.
> How about this? I added simple twice check.
>
> because, both do_wp_page and try_to_unmap_one grab ptl. then,
> page-fault routine can't change pte while try_to_unmap nuke pte.

Hmm,

> @@ -790,7 +796,19 @@ static int try_to_unmap_one(struct page
>
> /* Nuke the page table entry. */
> flush_cache_page(vma, address, page_to_pfn(page));
> - pteval = ptep_clear_flush_notify(vma, address, pte);
> + pteval = ptep_clear_flush(vma, address, pte);
> +
> + if (!migration) {
> + /* re-check */
> + if (PageSwapCache(page) &&
> + page_count(page) != page_mapcount(page) + 2) {
> + /* We lose race against get_user_pages_fast() */
> + set_pte_at(mm, address, pte, pteval);
> + ret = SWAP_FAIL;
> + goto out_unmap;
> + }
> + }
> + mmu_notifier_invalidate_page(vma->vm_mm, address);

Hmm, in the case of powerpc-style gup_fast where the arch
does not send IPIs to flush TLBs, either the speculative
reference there should find the pte cleared, or the page_count
check here should find the speculative reference.

In the case of CPUs that do send IPIs and have x86-style
gup_fast, the TLB flush should ensure all gup_fast()s that
could have seen the pte will complete before we check
page_count.

Yes I think it might work.
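
For context, a simplified sketch of the x86-style gup_fast() window being
discussed (illustrative only, paraphrased rather than the actual arch/x86
code, and assuming the surrounding gup_pte_range()-style declarations for
ptep, addr, end, write and pages): the walk runs with interrupts disabled,
so an IPI-based ptep_clear_flush() cannot complete while a racing
gup_fast() sits between reading the pte and taking its page reference, and
the later page_count() re-check therefore sees any reference that raced
with the clear.

	unsigned long flags;
	int nr = 0;

	local_irq_save(flags);			/* holds off the flush IPI */
	do {
		pte_t pte = *ptep;		/* snapshot the pte */
		struct page *page;

		if (!pte_present(pte) || (write && !pte_write(pte)))
			break;			/* punt to the slow gup path */

		page = pte_page(pte);
		get_page(page);			/* the ref the re-check sees */
		pages[nr++] = page;
	} while (ptep++, addr += PAGE_SIZE, addr != end);
	local_irq_restore(flags);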

2009-04-15 08:48:32

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c


> > I would perhaps not fold gup_fast conversions into the same patch as
> > the fix.
>
> OK. I'll fix.

Done.



===================================
Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c

down_read(mmap_sem)
get_user_pages()
up_read(mmap_sem)

is fork unsafe.
fix it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Maciej Sosnowski <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Chris Leech <[email protected]>
Cc: [email protected]
---
drivers/dma/iovlock.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/drivers/dma/iovlock.c
===================================================================
--- a/drivers/dma/iovlock.c 2009-04-13 22:58:36.000000000 +0900
+++ b/drivers/dma/iovlock.c 2009-04-14 20:27:16.000000000 +0900
@@ -104,8 +104,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
0, /* force */
page_list->pages,
NULL);
- up_read(&current->mm->mmap_sem);
-
if (ret != page_list->nr_pages)
goto unpin;

@@ -127,6 +125,8 @@ void dma_unpin_iovec_pages(struct dma_pi
if (!pinned_list)
return;

+ up_read(&current->mm->mmap_sem);
+
for (i = 0; i < pinned_list->nr_iovecs; i++) {
struct dma_page_list *page_list = &pinned_list->page_list[i];
for (j = 0; j < page_list->nr_pages; j++) {

2009-04-15 09:22:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Wednesday 15 April 2009 18:22:32 Nick Piggin wrote:
> On Wednesday 15 April 2009 18:05:54 KOSAKI Motohiro wrote:
> > Hi
> >
> > > On Wednesday 15 April 2009 00:32:52 Andrea Arcangeli wrote:
> > > > On Wed, Apr 15, 2009 at 12:26:34AM +1000, Nick Piggin wrote:
> > > > > Andrea: I didn't veto that set_bit change of yours as such. I just
> > > >
> > > > I know you didn't ;)
> > > >
> > > > > noted there could be more atomic operations. Actually I would
> > > > > welcome more comparison between our two approaches, but they seem
> > > >
> > > > Agree about the welcome of comparison, it'd be nice to measure it the
> > > > enterprise workloads that showed the gup_fast gain in the first place.
> > >
> > > I think we should be able to ask IBM to run some tests, provided
> > > they still have machines available to do so. Although I don't want
> > > to waste their time so we need to have something that has got past
> > > initial code review and has a chance of being merged.
> > >
> > > If we get that far, then I can ask them to run tests definitely.
> >
> > Oh, it seem very charming idea.
> > Nick, I hope to help your patch's rollup. It makes good comparision, I think.
> > Is there my doable thing?
>
> Well, I guess review and testing. There are few possibilities for
> reducing the cases where we have to de-cow (or increasing the
> cases where we can WP-on-fork), which I'd like to experiment with,
> but I don't know how much it will help...

Oh, I forgot, it needs to be updated for hugepages as well, I'd say.

2009-04-15 10:50:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Wed, Apr 15, 2009 at 05:05:54PM +0900, KOSAKI Motohiro wrote:
> - /*
> - * If the page is mlock()d, we cannot swap it out.
> - * If it's recently referenced (perhaps page_referenced
> - * skipped over this mm) then we should reactivate it.
> - */
> if (!migration) {
> + if (PageSwapCache(page) &&
> + page_count(page) != page_mapcount(page) + 2) {
> + ret = SWAP_FAIL;
> + goto out_unmap;
> + }
> +
> + /*
> + * If the page is mlock()d, we cannot swap it out.
> + * If it's recently referenced (perhaps page_referenced
> + * skipped over this mm) then we should reactivate it.
> + */
> if (vma->vm_flags & VM_LOCKED) {
> ret = SWAP_MLOCK;
> goto out_unmap;
> @@ -790,7 +796,19 @@ static int try_to_unmap_one(struct page
>
> /* Nuke the page table entry. */
> flush_cache_page(vma, address, page_to_pfn(page));
> - pteval = ptep_clear_flush_notify(vma, address, pte);
> + pteval = ptep_clear_flush(vma, address, pte);
> +
> + if (!migration) {
> + /* re-check */
> + if (PageSwapCache(page) &&
> + page_count(page) != page_mapcount(page) + 2) {
> + /* We lose race against get_user_pages_fast() */
> + set_pte_at(mm, address, pte, pteval);
> + ret = SWAP_FAIL;
> + goto out_unmap;
> + }
> + }
> + mmu_notifier_invalidate_page(vma->vm_mm, address);

With regard to mmu notifier, this is the opposite of the right
ordering. One mmu_notifier_invalidate_page must run _before_ the first
check. The ptep_clear_flush_notify will then stay and there's no need
of a further mmu_notifier_invalidate_page after the second check.

2009-04-15 11:39:21

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

>> + if (!migration) {
>> + /* re-check */
>> + if (PageSwapCache(page) &&
>> + page_count(page) != page_mapcount(page) + 2) {
>> + /* We lose race against get_user_pages_fast() */
>> + set_pte_at(mm, address, pte, pteval);
>> + ret = SWAP_FAIL;
>> + goto out_unmap;
>> + }
>> + }
>> + mmu_notifier_invalidate_page(vma->vm_mm, address);
>
> With regard to mmu notifier, this is the opposite of the right
> ordering. One mmu_notifier_invalidate_page must run _before_ the first
> check. The ptep_clear_flush_notify will then stay and there's no need
> of a further mmu_notifier_invalidate_page after the second check.

OK, but I have one question.

Can we assume the mmu_notifier is only used by KVM now?
If not, we need to make a new notifier.

2009-04-15 11:43:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

On Wed, Apr 15, 2009 at 08:39:04PM +0900, KOSAKI Motohiro wrote:
> >> + if (!migration) {
> >> + /* re-check */
> >> + if (PageSwapCache(page) &&
> >> + page_count(page) != page_mapcount(page) + 2) {
> >> + /* We lose race against get_user_pages_fast() */
> >> + set_pte_at(mm, address, pte, pteval);
> >> + ret = SWAP_FAIL;
> >> + goto out_unmap;
> >> + }
> >> + }
> >> + mmu_notifier_invalidate_page(vma->vm_mm, address);
> >
> > With regard to mmu notifier, this is the opposite of the right
> > ordering. One mmu_notifier_invalidate_page must run _before_ the first
> > check. The ptep_clear_flush_notify will then stay and there's no need
> > of a further mmu_notifier_invalidate_page after the second check.
>
> OK. but I have one question.
>
> Can we assume mmu_notifier is only used by kvm now?
> if not, we need to make new notifier.

KVM is not fundamentally different from other users in this respect, so
I don't see why we need a new notifier. If it works for others it'll
work for KVM, and the other way around is true too.

mmu notifier users may or may not take a page pin. KVM does. GRU
doesn't. XPMEM does. All of them release any pin after
mmu_notifier_invalidate_page. All that matters is to run
mmu_notifier_invalidate_page _after_ the ptep_clear_young_notify, so
that we don't nuke secondary mappings on the pages unless we really go
on to nuke the pte.
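
Spelled out as a sketch of the relevant part of try_to_unmap_one() (an
interpretation of the ordering asked for above, not a posted patch):

	/* Ordering sketch:
	 * 1) the young test decides whether the pte will be nuked at all;
	 * 2) only then are secondary-MMU mappings invalidated, which also
	 *    makes pin-holding users such as KVM drop their page pin;
	 * 3) only after that is page_count() compared with the mapcount,
	 *    so a remaining elevated count really means a gup() pin. */
	if (!migration) {
		if (ptep_clear_flush_young_notify(vma, address, pte)) {
			ret = SWAP_FAIL;	/* recently referenced */
			goto out_unmap;
		}
		mmu_notifier_invalidate_page(vma->vm_mm, address);
		if (PageAnon(page) &&
		    page_count(page) != page_mapcount(page) + 2) {
			ret = SWAP_FAIL;	/* pinned by get_user_pages */
			goto out_unmap;
		}
	}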

2009-04-15 11:53:23

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

>> Can we assume mmu_notifier is only used by kvm now?
>> if not, we need to make new notifier.
>
> KVM is no fundamentally different from other users in this respect, so
> I don't see why need a new notifier. If it works for others it'll work
> for KVM and the other way around is true too.
>
> mmu notifier users can or cannot take a page pin. KVM does. GRU
> doesn't. XPMEM does. All of them releases any pin after
> mmu_notifier_invalidate_page. All that is important is to run
> mmu_notifier_invalidate_page _after_ the ptep_clear_young_notify, so
> that we don't nuke secondary mappings on the pages unless we really go
> to nuke the pte.

Thank you for the kind explanation. I understand it now :)

2009-04-17 15:09:47

by Sosnowski, Maciej

[permalink] [raw]
Subject: RE: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c

KOSAKI Motohiro wrote:
>>> I would perhaps not fold gup_fast conversions into the same patch as
>>> the fix.
>>
>> OK. I'll fix.
>
> Done.
>
>
>
> ===================================
> Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
>
> down_read(mmap_sem)
> get_user_pages()
> up_read(mmap_sem)
>
> is fork unsafe.
> fix it.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Maciej Sosnowski <[email protected]>
> Cc: David S. Miller <[email protected]>
> Cc: Chris Leech <[email protected]>
> Cc: [email protected]
> ---
> drivers/dma/iovlock.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: b/drivers/dma/iovlock.c
> ===================================================================
> --- a/drivers/dma/iovlock.c 2009-04-13 22:58:36.000000000 +0900
> +++ b/drivers/dma/iovlock.c 2009-04-14 20:27:16.000000000 +0900
> @@ -104,8 +104,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
> 0, /* force */
> page_list->pages,
> NULL);
> - up_read(&current->mm->mmap_sem);
> -
> if (ret != page_list->nr_pages)
> goto unpin;
>
> @@ -127,6 +125,8 @@ void dma_unpin_iovec_pages(struct dma_pi
> if (!pinned_list)
> return;
>
> + up_read(&current->mm->mmap_sem);
> +
> for (i = 0; i < pinned_list->nr_iovecs; i++) {
> struct dma_page_list *page_list = &pinned_list->page_list[i];
> for (j = 0; j < page_list->nr_pages; j++) {

I have tried it with net_dma and here is what I've got.

Regards,
Maciej
---

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.30-rc2 #2
-------------------------------------------------------
iperf/10555 is trying to acquire lock:
(sk_lock-AF_INET){+.+.+.}, at: [<ffffffff80450991>] sk_wait_data+0x90/0xc5

but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff8043d0f6>] dma_pin_iovec_pages+0x122/0x1a0

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++++}:
[<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (sk_lock-AF_INET){+.+.+.}:
[<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

1 lock held by iperf/10555:
#0: (&mm->mmap_sem){++++++}, at: [<ffffffff8043d0f6>] dma_pin_iovec_pages+0x122/0x1a0

stack backtrace:
Pid: 10555, comm: iperf Tainted: G W 2.6.30-rc2 #2
Call Trace:
[<ffffffff8025b1dc>] ? print_circular_bug_tail+0xc0/0xc9
[<ffffffff8025aa2b>] ? print_circular_bug_header+0xc8/0xcf
[<ffffffff8025b984>] ? validate_chain+0x67d/0xc7c
[<ffffffff8025c6e6>] ? __lock_acquire+0x763/0x7ec
[<ffffffff8025c835>] ? lock_acquire+0xc6/0xea
[<ffffffff80450991>] ? sk_wait_data+0x90/0xc5
[<ffffffff8044ff01>] ? lock_sock_nested+0xee/0x100
[<ffffffff80450991>] ? sk_wait_data+0x90/0xc5
[<ffffffff80259b07>] ? mark_held_locks+0x43/0x5b
[<ffffffff802403ee>] ? local_bh_enable_ip+0xc4/0xc7
[<ffffffff80259c3c>] ? trace_hardirqs_on_caller+0x11d/0x148
[<ffffffff80450991>] ? sk_wait_data+0x90/0xc5
[<ffffffff8024ebb3>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff80488ad3>] ? tcp_recvmsg+0x3bf/0xa21
[<ffffffff8044f601>] ? sock_common_recvmsg+0x30/0x45
[<ffffffff8044d847>] ? sock_recvmsg+0xf0/0x10f
[<ffffffff8024ebb3>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff8025c704>] ? __lock_acquire+0x781/0x7ec
[<ffffffff802b553c>] ? fget_light+0xd5/0xdf
[<ffffffff802b54b0>] ? fget_light+0x49/0xdf
[<ffffffff8044e91b>] ? sys_recvfrom+0xbc/0x119
[<ffffffff80259c3c>] ? trace_hardirqs_on_caller+0x11d/0x148
[<ffffffff804e1a7f>] ? _spin_unlock_irq+0x24/0x27
[<ffffffff802359ab>] ? finish_task_switch+0x7a/0xe4
[<ffffffff80235967>] ? finish_task_switch+0x36/0xe4
[<ffffffff804df188>] ? thread_return+0x3e/0x97
[<ffffffff802718f7>] ? audit_syscall_entry+0x192/0x1bd
[<ffffffff8020b96b>] ? system_call_fastpath+0x16/0x1b

2009-04-19 12:40:07

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 1/6] mm: Don't unmap gup()ed page

> >> Can we assume mmu_notifier is only used by kvm now?
> >> if not, we need to make new notifier.
> >
> > KVM is no fundamentally different from other users in this respect, so
> > I don't see why need a new notifier. If it works for others it'll work
> > for KVM and the other way around is true too.
> >
> > mmu notifier users can or cannot take a page pin. KVM does. GRU
> > doesn't. XPMEM does. All of them releases any pin after
> > mmu_notifier_invalidate_page. All that is important is to run
> > mmu_notifier_invalidate_page _after_ the ptep_clear_young_notify, so
> > that we don't nuke secondary mappings on the pages unless we really go
> > to nuke the pte.
>
> Thank you kindful explain. I understand it :)

How about this?

---
mm/rmap.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
mm/swapfile.c | 3 ++-
2 files changed, 45 insertions(+), 8 deletions(-)

Index: b/mm/swapfile.c
===================================================================
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -547,7 +547,8 @@ int reuse_swap_page(struct page *page)
SetPageDirty(page);
}
}
- return count == 1;
+
+ return count + page_count(page) == 2;
}

/*
Index: b/mm/rmap.c
===================================================================
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -772,12 +772,34 @@ static int try_to_unmap_one(struct page
if (!pte)
goto out;

- /*
- * If the page is mlock()d, we cannot swap it out.
- * If it's recently referenced (perhaps page_referenced
- * skipped over this mm) then we should reactivate it.
- */
+
+ /* Unpin the page from any long-term pinning subsystem (e.g. kvm). */
+ mmu_notifier_invalidate_page(vma->vm_mm, address);
+
if (!migration) {
+ /*
+ * Don't pull an anonymous page out from under get_user_pages.
+ * get_user_pages_fast() silently raises the page count without taking
+ * any lock, thus we need to check twice: here and _after_ nuking the pte.
+ *
+ * If we nuke the pte of a pinned page, do_wp_page() will replace it
+ * with a copy page, and the user never gets to see the data GUP was
+ * holding the original page for.
+ *
+ * note:
+ * page_mapcount() + 2 means pte + swapcache + us
+ */
+ if (PageAnon(page) &&
+ (page_count(page) != page_mapcount(page) + 2)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+
+ /*
+ * If the page is mlock()d, we cannot swap it out.
+ * If it's recently referenced (perhaps page_referenced
+ * skipped over this mm) then we should reactivate it.
+ */
if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
@@ -786,11 +808,25 @@ static int try_to_unmap_one(struct page
ret = SWAP_FAIL;
goto out_unmap;
}
- }
+ }

/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ if (!migration) {
+ if (PageAnon(page) &&
+ page_count(page) != page_mapcount(page) + 2) {
+ /*
+ * We lose the race against get_user_pages_fast().
+ * set the same pte and give up unmapping.
+ */
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+ }
+

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))


2009-04-19 12:41:24

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c

> KOSAKI Motohiro wrote:
> >>> I would perhaps not fold gup_fast conversions into the same patch as
> >>> the fix.
> >>
> >> OK. I'll fix.
> >
> > Done.
> >
> >
> >
> > ===================================
> > Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
> >
> > down_read(mmap_sem)
> > get_user_pages()
> > up_read(mmap_sem)
> >
> > is fork unsafe.
> > fix it.
> >
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > Cc: Maciej Sosnowski <[email protected]>
> > Cc: David S. Miller <[email protected]>
> > Cc: Chris Leech <[email protected]>
> > Cc: [email protected]
> > ---
> > drivers/dma/iovlock.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > Index: b/drivers/dma/iovlock.c
> > ===================================================================
> > --- a/drivers/dma/iovlock.c 2009-04-13 22:58:36.000000000 +0900
> > +++ b/drivers/dma/iovlock.c 2009-04-14 20:27:16.000000000 +0900
> > @@ -104,8 +104,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
> > 0, /* force */
> > page_list->pages,
> > NULL);
> > - up_read(&current->mm->mmap_sem);
> > -
> > if (ret != page_list->nr_pages)
> > goto unpin;
> >
> > @@ -127,6 +125,8 @@ void dma_unpin_iovec_pages(struct dma_pi
> > if (!pinned_list)
> > return;
> >
> > + up_read(&current->mm->mmap_sem);
> > +
> > for (i = 0; i < pinned_list->nr_iovecs; i++) {
> > struct dma_page_list *page_list = &pinned_list->page_list[i];
> > for (j = 0; j < page_list->nr_pages; j++) {
>
> I have tried it with net_dma and here is what I've got.

Thanks.
Instead, How about this?


============================================
Subject: [Untested][RFC][PATCH v3] fix wrong get_user_pages usage in iovlock.c

down_read(mmap_sem)
get_user_pages()
up_read(mmap_sem)

is fork unsafe.
mmap_sem shouldn't be released until dma_unpin_iovec_pages() is called.


Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Maciej Sosnowski <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Chris Leech <[email protected]>
Cc: [email protected]
---
drivers/dma/iovlock.c | 5 ++---
net/ipv4/tcp.c | 9 +++++++++
2 files changed, 11 insertions(+), 3 deletions(-)

Index: b/drivers/dma/iovlock.c
===================================================================
--- a/drivers/dma/iovlock.c 2009-04-19 17:27:25.000000000 +0900
+++ b/drivers/dma/iovlock.c 2009-04-19 17:29:42.000000000 +0900
@@ -45,6 +45,8 @@ static int num_pages_spanned(struct iove
* We are allocating a single chunk of memory, and then carving it up into
* 3 sections, the latter 2 whose size depends on the number of iovecs and the
* total number of pages, respectively.
+ *
+ * Caller must hold mm->mmap_sem
*/
struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
{
@@ -94,7 +96,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
pages += page_list->nr_pages;

/* pin pages down */
- down_read(&current->mm->mmap_sem);
ret = get_user_pages(
current,
current->mm,
@@ -104,8 +105,6 @@ struct dma_pinned_list *dma_pin_iovec_pa
0, /* force */
page_list->pages,
NULL);
- up_read(&current->mm->mmap_sem);
-
if (ret != page_list->nr_pages)
goto unpin;

Index: b/net/ipv4/tcp.c
===================================================================
--- a/net/ipv4/tcp.c 2009-04-19 17:27:25.000000000 +0900
+++ b/net/ipv4/tcp.c 2009-04-19 18:09:42.000000000 +0900
@@ -1322,6 +1322,9 @@ int tcp_recvmsg(struct kiocb *iocb, stru
int copied_early = 0;
struct sk_buff *skb;

+#ifdef CONFIG_NET_DMA
+ down_read(&current->mm->mmap_sem);
+#endif
lock_sock(sk);

TCP_CHECK_TIMER(sk);
@@ -1688,11 +1691,17 @@ skip_copy:

TCP_CHECK_TIMER(sk);
release_sock(sk);
+#ifdef CONFIG_NET_DMA
+ up_read(&current->mm->mmap_sem);
+#endif
return copied;

out:
TCP_CHECK_TIMER(sk);
release_sock(sk);
+#ifdef CONFIG_NET_DMA
+ up_read(&current->mm->mmap_sem);
+#endif
return err;

recv_urg:

2009-04-23 12:49:23

by Sosnowski, Maciej

[permalink] [raw]
Subject: RE: [RFC][PATCH v3 6/6] fix wrong get_user_pages usage in iovlock.c

KOSAKI Motohiro wrote:
>> KOSAKI Motohiro wrote:
>>>>> I would perhaps not fold gup_fast conversions into the same patch as
>>>>> the fix.
>>>>
>>>> OK. I'll fix.
>>>
>>> Done.
>>>
>>>
>>>
>>> ===================================
>>> Subject: [Untested][RFC][PATCH] fix wrong get_user_pages usage in iovlock.c
>>>
>>> down_read(mmap_sem)
>>> get_user_pages()
>>> up_read(mmap_sem)
>>>
>>> is fork unsafe.
>>> fix it.
>>>
>>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>>> Cc: Maciej Sosnowski <[email protected]>
>>> Cc: David S. Miller <[email protected]>
>>> Cc: Chris Leech <[email protected]>
>>> Cc: [email protected]
>>> ---
>>> drivers/dma/iovlock.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> Index: b/drivers/dma/iovlock.c
>>> ===================================================================
>>> --- a/drivers/dma/iovlock.c 2009-04-13 22:58:36.000000000 +0900
>>> +++ b/drivers/dma/iovlock.c 2009-04-14 20:27:16.000000000 +0900
>>> @@ -104,8 +104,6 @@ struct dma_pinned_list *dma_pin_iovec_pa 0, /* force */
>>> page_list->pages,
>>> NULL);
>>> - up_read(&current->mm->mmap_sem);
>>> -
>>> if (ret != page_list->nr_pages)
>>> goto unpin;
>>>
>>> @@ -127,6 +125,8 @@ void dma_unpin_iovec_pages(struct dma_pi if (!pinned_list)
>>> return;
>>>
>>> + up_read(&current->mm->mmap_sem);
>>> +
>>> for (i = 0; i < pinned_list->nr_iovecs; i++) {
>>> struct dma_page_list *page_list = &pinned_list->page_list[i];
>>> for (j = 0; j < page_list->nr_pages; j++) {
>>
>> I have tried it with net_dma and here is what I've got.
>
> Thanks.
> Instead, How about this?
>

Unfortunately, it still does not look good.

Regards,
Maciej

=============================================
[ INFO: possible recursive locking detected ]
2.6.30-rc2 #14
---------------------------------------------
iperf/9932 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff804e3d5e>] do_page_fault+0x170/0


but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff80488722>] tcp_recvmsg+0x3a/0xa7


other info that might help us debug this:
2 locks held by iperf/9932:
#0: (&mm->mmap_sem){++++++}, at: [<ffffffff80488722>] tcp_recvmsg+0x3a

#1: (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff80450965>] sk_wait_data+0


stack backtrace:
Pid: 9932, comm: iperf Tainted: G W 2.6.30-rc2 #14
Call Trace:
[<ffffffff8025b861>] ? validate_chain+0x55a/0xc7c
[<ffffffff8025c6e6>] ? __lock_acquire+0x763/0x7ec
[<ffffffff8025c835>] ? lock_acquire+0xc6/0xea
[<ffffffff804e3d5e>] ? do_page_fault+0x170/0x29d
[<ffffffff804e0693>] ? down_read+0x46/0x77
[<ffffffff804e3d5e>] ? do_page_fault+0x170/0x29d
[<ffffffff804e3d5e>] ? do_page_fault+0x170/0x29d
[<ffffffff804e1ebf>] ? page_fault+0x1f/0x30
[<ffffffff803580ed>] ? copy_user_generic_string+0x2d/0x40
[<ffffffff804562cc>] ? memcpy_toiovec+0x36/0x66
[<ffffffff804569eb>] ? skb_copy_datagram_iovec+0x133/0x1f0
[<ffffffff80490199>] ? tcp_rcv_established+0x297/0x71a
[<ffffffff804953f8>] ? tcp_v4_do_rcv+0x2c/0x1d5
[<ffffffff8024ebb3>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff80486239>] ? tcp_prequeue_process+0x6b/0x7e
[<ffffffff80488b31>] ? tcp_recvmsg+0x449/0xa70
[<ffffffff8025c704>] ? __lock_acquire+0x781/0x7ec
[<ffffffff8044f5d5>] ? sock_common_recvmsg+0x30/0x45
[<ffffffff8044d81b>] ? sock_recvmsg+0xf0/0x10f
[<ffffffff80259c3c>] ? trace_hardirqs_on_caller+0x11d/0x148
[<ffffffff8024ebb3>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff8020c43c>] ? restore_args+0x0/0x30
[<ffffffff802b553c>] ? fget_light+0xd5/0xdf
[<ffffffff802b54b0>] ? fget_light+0x49/0xdf
[<ffffffff8044e8ef>] ? sys_recvfrom+0xbc/0x119
[<ffffffff802331cd>] ? try_to_wake_up+0x2ae/0x2c0
[<ffffffff802718f7>] ? audit_syscall_entry+0x192/0x1bd
[<ffffffff8020b96b>] ? system_call_fastpath+0x16/0x1b