From: Ulrich Drepper
Date: Tue, 26 Apr 2016 09:01:26 -0400
Subject: cpuset cleanup race
To: Li Zefan, cgroups@vger.kernel.org, Linux Kernel Mailing List

I came across a problem with code which uses a cpuset CG and tries to
be responsible and clean up after itself. The code attached at the
bottom illustrates the problem. It's only long because it has no
dependencies aside from the basic runtime and should work on all
machines. You need to run it with privileges high enough to create a
CG.

The code is really simple:

- a (new) CG in cpuset is created
- one of the cores of the root cpuset is selected
- the thread (and therefore entire process) is switched to the cpuset
- a thread is created which does nothing but terminate immediately
- the parent waits for the thread
- then the parent removes itself from the cpuset
- finally the parent tries to remove the created cpuset

The last part is where things go wrong. Usually* the rmdir() call
made to remove the cpuset fails because the cpuset is still busy. The
program prints the members of the cpuset CG: it's the child thread.

* I wrote "usually" because slowing down the parent code will help.
I.e., there is a race. Successful slowdowns I found:

- compile with -fsanitize=address (seems already enough)
- very short wait, e.g., 1ns (you can see this by starting the
  program with the parameter "wait")

You might want to compile the code with optimization. It is a race,
after all.

The pthread_join() call made by the parent won't return until the
kernel signals through the futex set up at clone() time that the
thread has terminated. From the perspective of the userlevel code the
thread is gone. But not all bookkeeping related to the terminated
thread seems to have been finished at that point.

I didn't look at the code but I can imagine that the futex
notification happens as soon as all observable aspects of the thread
are gone. This is of course good to not delay the waiter. Hopefully
the cgroup bookkeeping can also be moved before the notification.

I tested it with a recent kernel (4.5.0-0.rc7) but I doubt it's a
recent issue.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#define _GNU_SOURCE	/* asprintf, CPU_ISSET, sched_getaffinity */
#include <errno.h>
#include <error.h>
#include <mntent.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

/* Thread function: terminate immediately.  */
static void *tf(void *p)
{
  return NULL;
}

int main(int argc, char *argv[])
{
  const char *csname = argc == 1 ? "test" : argv[1];

  /* Find the mount point of the cpuset hierarchy.  */
  struct mntent *me;
  FILE *fp = setmntent(_PATH_MOUNTED, "r");
  if (fp == NULL)
    error(1, errno, "cannot read mounted filesystem information");
  while ((me = getmntent(fp)) != NULL) {
    if (strcmp(me->mnt_type, "cgroup") == 0
        && hasmntopt(me, "cpuset") != NULL)
      break;
  }
  if (me == NULL)
    error(1, 0, "cpuset filesystem not mounted");
  endmntent(fp);

  /* Create the new cpuset.  */
  char *cshier = NULL;
  asprintf(&cshier, "%s/%s", me->mnt_dir, csname);
  if (mkdir(cshier, 0777) == 0)
    printf("new cpuset control group: %s\n", cshier);
  else if (errno != EEXIST)
    error(1, errno, "cannot create cpuset group %s", cshier);

  /* Copy cpuset.mems from the root cpuset.  */
  char *csrootmems;
  asprintf(&csrootmems, "%s/cpuset.mems", me->mnt_dir);
  fp = fopen(csrootmems, "r");
  if (fp == NULL)
    error(1, errno, "cannot read /cpuset.mems");
  char *val = NULL;
  size_t vallen = 0;
  ssize_t n = getline(&val, &vallen, fp);
  fclose(fp);
  free(csrootmems);

  char *testmems;
  asprintf(&testmems, "%s/cpuset.mems", cshier);
  fp = fopen(testmems, "w");
  if (fp == NULL)
    error(1, errno, "cannot write /%s/cpuset.mems", csname);
  fwrite(val, n, 1, fp);
  fclose(fp);
  free(testmems);
  free(val);

  /* Select the first core this process is allowed to run on.  */
  cpu_set_t cs;
  int first = 0;
  sched_getaffinity(0, sizeof(cs), &cs);
  while (! CPU_ISSET(first, &cs))
    ++first;

  char *testcpus;
  asprintf(&testcpus, "%s/cpuset.cpus", cshier);
  fp = fopen(testcpus, "w");
  if (fp == NULL)
    error(1, errno, "cannot write /%s/cpuset.cpus", csname);
  fprintf(fp, "%d", first);
  fclose(fp);
  free(testcpus);

  /* Move this process into the new cpuset.  */
  char *testtasks;
  asprintf(&testtasks, "%s/tasks", cshier);
  fp = fopen(testtasks, "w");
  if (fp == NULL)
    error(1, errno, "cannot write /%s/tasks", csname);
  fprintf(fp, "%d", (int) getpid());
  fclose(fp);

  /* Create a thread and wait for it to terminate.  */
  pthread_t th;
  pthread_create(&th, NULL, tf, NULL);
  pthread_join(th, NULL);

  /* Move this process back into the root cpuset.  */
  char *roottasks;
  asprintf(&roottasks, "%s/tasks", me->mnt_dir);
  fp = fopen(roottasks, "w");
  if (fp == NULL)
    error(1, errno, "cannot write /tasks");
  fprintf(fp, "%d", (int) getpid());
  fclose(fp);
  free(roottasks);

  /* Optional short delay which hides the race.  */
  if (strcmp(csname, "wait") == 0) {
    struct timespec s = { 0, 1 };
    nanosleep(&s, NULL);
  }

  /* Remove the cpuset; on failure print the tasks still in it.  */
  if (rmdir(cshier) != 0) {
    int rmdir_errno = errno;	/* the calls below may clobber errno */
    printf("PID = %ld\nremaining = ", (long) getpid());
    fp = fopen(testtasks, "r");
    if (fp != NULL) {
      char *line = NULL;
      size_t linelen = 0;
      while ((n = getline(&line, &linelen, fp)) > 0)
        fputs(line, stdout);
      fclose(fp);
      free(line);
    }
    error(1, rmdir_errno, "couldn't remove cpuset %s", cshier);
  }

  free(cshier);
  free(testtasks);
  return 0;
}
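
Until the ordering is changed in the kernel, the observation above that
a very short wait is enough suggests a userspace stopgap: retry rmdir()
while it fails with EBUSY. What follows is only a minimal sketch under
that assumption. The helper rmdir_retry() and its parameters are
hypothetical and not part of the test program, and the retry count and
delay are arbitrary choices.

```c
#include <errno.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not part of the test program): call rmdir() up
   to MAX_TRIES times, sleeping DELAY_NS nanoseconds (must be less than
   one second) after each attempt that fails with EBUSY.  Returns 0 on
   success, -1 with errno set otherwise.  */
int rmdir_retry(const char *path, int max_tries, long delay_ns)
{
  for (int i = 0; i < max_tries; ++i) {
    if (rmdir(path) == 0)
      return 0;
    if (errno != EBUSY)
      return -1;		/* a different error: do not retry */
    struct timespec ts = { 0, delay_ns };
    nanosleep(&ts, NULL);
  }
  errno = EBUSY;		/* still busy after all attempts */
  return -1;
}
```

In the test program this would stand in for the plain rmdir(cshier)
call, e.g. rmdir_retry(cshier, 10, 1000). It only papers over the
race; it does not explain or fix it.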