Received: by 2002:ac0:a5a6:0:0:0:0:0 with SMTP id m35-v6csp4602123imm; Mon, 17 Sep 2018 17:36:49 -0700 (PDT) X-Google-Smtp-Source: ANB0VdazLkump9iccF1MXuADgRFU0tQ+pzSSPf43ZsYHY8tw58Nq9RLuIyEg8xSFGOQ2SJCFDxIE X-Received: by 2002:a62:9992:: with SMTP id t18-v6mr27999228pfk.239.1537231009457; Mon, 17 Sep 2018 17:36:49 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1537231009; cv=none; d=google.com; s=arc-20160816; b=LHJ7JE/sZqMr2aslLAnd5snTaA2dxkm2lcsAD1Ev+KnxRJrBPftPj0rcdWOtFjrJLw ZO+ohcmlWAXo1QY4BE7UNEok3WRzPdrv9KVarCvnEX4kwAYOR1e5f9eKi3D/MsMJavvp fmv3GMNdXQwVcwu5hhJ9Wu0uUM1NPkDY/OaFi4e9osXLiK1M+YwYnBNcDxbrIE/8lq/G B0jp2KmjZI4ADKhGS1UhG6YTmwNrF9R/TmupgO0tAiIeWFv9x0Hl1pgn+D6/0/HbLDag 4ju1YyAT0VJo1Ux+gjuJNgudeMoKc+YGa3Y881edKfzTNTRspxfLACYrcLITvsom2gPc s1Hw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=mUtWOCV1C2awiRXd+UxyeRrsNmzPXvRYcnzv1Q1VY/k=; b=bX3mulWO2Ler2xwZTUSpg3ds9JxFsJrteXjajGf/w+LSqnuhc8MfwzXdCgGXYSIhGY z6QzFKK/Dw9j0OBTpiQG0HoNufHai2gF5GQaMpmvblarqQ42doLDA8i/5b9M7YI8Q+kr Dowbq80cmpAS5lrp1SS8nr91TRz16Jh6YFmNqUqQT4AbUskCM1WIhQrfd86YgfanUusI 9n3qXKoe4RiF5niqRJscbJaFa+YmGpNMc4OYmqS7ivjo1+Ofo0kO3P9N03gooel9TfQg kze10Rbq5HjQldeQPk54HBTxQvleY6T2DR+XuhByWdbPlsUcSuIAV+2oJtbOMdlF1Pyf yXhA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=qW39uMci; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id s136-v6si17808554pfs.255.2018.09.17.17.36.34; Mon, 17 Sep 2018 17:36:49 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=qW39uMci; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728842AbeIRGGC (ORCPT + 99 others); Tue, 18 Sep 2018 02:06:02 -0400 Received: from mail-ot1-f68.google.com ([209.85.210.68]:40857 "EHLO mail-ot1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726275AbeIRGGC (ORCPT ); Tue, 18 Sep 2018 02:06:02 -0400 Received: by mail-ot1-f68.google.com with SMTP id v10-v6so186438otk.7 for ; Mon, 17 Sep 2018 17:36:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=mUtWOCV1C2awiRXd+UxyeRrsNmzPXvRYcnzv1Q1VY/k=; b=qW39uMciOui1BO1hpnD4OegBR94DW5gcJT+nlyEEf8w/g4WKfa3TJVdCli+cV9rTmG LSzFlMJweOcSFHd63sW7nZjt2hUStZNQoDyORNpghl6phxI1nUk2h22gTPSbmy/Qb6wf S2hmeGijla//AUswvwMLeuXUdz4vT6+GflVp4cESohpWGAHvvRV+Ug8HuZihgixopvdO jD53B8O58T9a3fwkUi+C+nMPgrpWDIEoednlrJLjLH7H8MWpsk+CARey8EoS78Lp/ICp YXENMT8eEREmF1QTwUCVZuvT8dR9wjYMA1JQB8qLpAOn2k9TzuhbFVGLOGlX4MS9RDNK P/wg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=mUtWOCV1C2awiRXd+UxyeRrsNmzPXvRYcnzv1Q1VY/k=; b=i+OPoZkvHWMlSxlQcvORQKnvBZ+kQPcmWAkHqPuvFfxP2lSk//vXe1czZgzyyN2sP4 ppdYwFYvSuZsAKaq4GKxuqlM9aIOeICUFTE84evcUfBoU8i4umR/KarAa78HOxnnybvq warEDAFS+zRjLZoBLIzkdbsDYF5uQKOGbYyBrf5uuIoBim7Ges0axGXxxbwqRXYt/4mc G86vE2b7p+BlCD1kvsTgmtMr//9iIS7cUZFAieHwdpTn35FnV6NhaTD6O5FlALvfV8b7 MSaDwhqW43NWAylPeN6JA46tvKpqI2WI9W83xi9aGfQ/CBL1SIS0q5cahXyLoygg7YGm UggA== X-Gm-Message-State: APzg51D2U3flyS3c2ry0eJv3Dye/pQHZ2Bfu3bx7S1HpQfDs32v9/Hsv J+W2WInkH6ptRg3T8vSpS+HpNI/lYn3Dj0t5Xmfxhg== X-Received: by 2002:a9d:4ea:: with SMTP id 97-v6mr13230281otm.99.1537230970134; Mon, 17 Sep 2018 17:36:10 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Jann Horn Date: Tue, 18 Sep 2018 02:35:43 +0200 Message-ID: Subject: Re: [BUG] mm: direct I/O (using GUP) can write to COW anonymous pages To: Hugh Dickins Cc: Dan Williams , Andrew Morton , Michal Hocko , Rik van Riel , Andrea Arcangeli , Konstantin Khlebnikov , sqazi@google.com, "Michael S. Tsirkin" , jack@suse.cz, kernel list , Linux-MM Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Sep 18, 2018 at 2:05 AM Hugh Dickins wrote: > > Hi Jann, > > On Mon, 17 Sep 2018, Jann Horn wrote: > > > [I'm not sure who the best people to ask about this are, I hope the > > recipient list resembles something reasonable...] > > > > I have noticed that the dup_mmap() logic on fork() doesn't handle > > pages with active direct I/O properly: dup_mmap() seems to assume that > > making the PTE referencing a page readonly will always prevent future > > writes to the page, but if the kernel has acquired a direct reference > > to the page before (e.g. via get_user_pages_fast()), writes can still > > happen that way. > > > > The worst-case effect of this - as far as I can tell - is that when a > > multithreaded process forks while one thread is in the middle of > > sys_read() on a file that uses direct I/O with get_user_pages_fast(), > > the read data can become visible in the child while the parent's > > buffer stays uninitialized if the parent writes to a relevant page > > post-fork before either the I/O completes or the child writes to it. > > Yes: you're understandably more worried by the one seeing the other's > data; Actually, I was mostly just trying to find a scenario in which the parent doesn't get the data it's asking for, and this is the simplest I could come up with. :) I was also vaguely worried about whether some other part of the mm subsystem might assume that COW pages are immutable, but I haven't found anything like that so far, so that might've been unwarranted paranoia. > we've tended in the past to be more worried about the one getting > corruption, and the other not seeing the data it asked for (and usually > in the context of RDMA, rather than filesystem direct I/O). > > I've added some Cc's: I might be misremembering, but I think both > Andrea and Konstantin have offered approaches to this in the past, > and I believe Salman is taking a look at it currently. > > But my own interest ended when Michael added MADV_DONTFORK: beyond > that, we've rated it a "Patient: It hurts when I do this. Doctor: > Don't do that then" - more complexity and overhead to solve, than > we have had appetite to get into. Makes sense, I guess. I wonder whether there's a concise way to express this in the fork.2 manpage, or something like that. Maybe I'll take a stab at writing something. The biggest issue I see with documenting this edgecase is that, as an application developer, if you don't know whether some file might be coming from a FUSE filesystem that has opted out of using the disk cache, the "don't do that" essentially becomes "don't read() into heap buffers while fork()ing in another thread", since with FUSE, direct I/O can happen even if you don't open files as O_DIRECT as long as the filesystem requests direct I/O, and get_user_pages_fast() will AFAIU be used for non-page-aligned buffers, meaning that an adjacent heap memory access could trigger CoW page duplication. But then, FUSE filesystems that opt out of the disk cache are probably so rare that it's not a concern in practice... > But not a shiningly satisfactory > situation, I'll agree. > > Hugh > > > > > Reproducer code: > > > > =3D=3D=3D=3D=3D=3D START hello.c =3D=3D=3D=3D=3D=3D > > #define FUSE_USE_VERSION 26 > > > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > > > static const char *hello_path =3D "/hello"; > > > > static int hello_getattr(const char *path, struct stat *stbuf) > > { > > int res =3D 0; > > memset(stbuf, 0, sizeof(struct stat)); > > if (strcmp(path, "/") =3D=3D 0) { > > stbuf->st_mode =3D S_IFDIR | 0755; > > stbuf->st_nlink =3D 2; > > } else if (strcmp(path, hello_path) =3D=3D 0) { > > stbuf->st_mode =3D S_IFREG | 0666; > > stbuf->st_nlink =3D 1; > > stbuf->st_size =3D 0x1000; > > stbuf->st_blocks =3D 0; > > } else > > res =3D -ENOENT; > > return res; > > } > > > > static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t > > filler, off_t offset, struct fuse_file_info *fi) { > > filler(buf, ".", NULL, 0); > > filler(buf, "..", NULL, 0); > > filler(buf, hello_path + 1, NULL, 0); > > return 0; > > } > > > > static int hello_open(const char *path, struct fuse_file_info *fi) { > > return 0; > > } > > > > static int hello_read(const char *path, char *buf, size_t size, off_t > > offset, struct fuse_file_info *fi) { > > sleep(3); > > size_t len =3D 0x1000; > > if (offset < len) { > > if (offset + size > len) > > size =3D len - offset; > > memset(buf, 0, size); > > } else > > size =3D 0; > > return size; > > } > > > > static int hello_write(const char *path, const char *buf, size_t size, > > off_t offset, struct fuse_file_info *fi) { > > while(1) pause(); > > } > > > > static struct fuse_operations hello_oper =3D { > > .getattr =3D hello_getattr, > > .readdir =3D hello_readdir, > > .open =3D hello_open, > > .read =3D hello_read, > > .write =3D hello_write, > > }; > > > > int main(int argc, char *argv[]) { > > return fuse_main(argc, argv, &hello_oper, NULL); > > } > > =3D=3D=3D=3D=3D=3D END hello.c =3D=3D=3D=3D=3D=3D > > > > =3D=3D=3D=3D=3D=3D START simple_mmap.c =3D=3D=3D=3D=3D=3D > > #define _GNU_SOURCE > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > > > __attribute__((aligned(0x1000))) char data_buffer_[0x10000]; > > #define data_buffer (data_buffer_ + 0x8000) > > > > void *fuse_thread(void *dummy) { > > /* step 2: start direct I/O on data_buffer */ > > int fuse_fd =3D open("mount/hello", O_RDWR); > > if (fuse_fd =3D=3D -1) > > err(1, "unable to open FUSE fd"); > > printf("char in parent (before): %hhd\n", data_buffer[0]); > > int res =3D read(fuse_fd, data_buffer, 0x1000); > > /* step 6: read completes, show post-read state */ > > printf("fuse read result: %d\n", res); > > printf("char in parent (after): %hhd\n", data_buffer[0]); > > } > > > > int main(void) { > > /* step 1: make data_buffer dirty */ > > data_buffer[0] =3D 1; > > > > pthread_t thread; > > if (pthread_create(&thread, NULL, fuse_thread, NULL)) > > errx(1, "pthread_create"); > > > > sleep(1); > > /* step 3: fork a child */ > > pid_t child =3D fork(); > > if (child =3D=3D -1) > > err(1, "fork"); > > if (child =3D=3D 0) { > > prctl(PR_SET_PDEATHSIG, SIGKILL); > > sleep(1); > > > > /* step 5: show pre-read state in the child */ > > printf("char in child (before): %hhd\n", data_buffer[0]= ); > > sleep(3); > > /* step 7: read is complete, show post-read state in ch= ild */ > > printf("char in child (after): %hhd\n", data_buffer[0])= ; > > return 0; > > } > > > > /* step 4: de-CoW data_buffer in the parent */ > > data_buffer[0x800] =3D 2; > > > > int status; > > if (wait(&status) !=3D child) > > err(1, "wait"); > > } > > =3D=3D=3D=3D=3D=3D END simple_mmap.c =3D=3D=3D=3D=3D=3D > > > > Repro steps: > > > > In one terminal: > > $ mkdir mount > > $ gcc -o hello hello.c -Wall -std=3Dgnu99 `pkg-config fuse --cflags --l= ibs` > > hello.c: In function =E2=80=98hello_write=E2=80=99: > > hello.c:67:1: warning: no return statement in function returning > > non-void [-Wreturn-type] > > } > > ^ > > $ ./hello -d -o direct_io mount > > FUSE library version: 2.9.7 > > [...] > > > > In a second terminal: > > $ gcc -pthread -o simple_mmap simple_mmap.c > > $ ./simple_mmap > > char in parent (before): 1 > > char in child (before): 1 > > fuse read result: 4096 > > char in parent (after): 1 > > char in child (after): 0 > > > > I have tested that this still works on 4.19.0-rc3+. > > > > As far as I can tell, the fix would be to immediately copy pages with > > `refcount - mapcount > N` in dup_mmap(), or something like that? > >