Hello, Linus.
A lot updates for cgroup.
* The biggest one is cgroup's conversion to kernfs. cgroup took after
the long abandoned vfs-entangled sysfs implementation and made it
even more convoluted over time. cgroup's internal objects were
fused with vfs objects which also brought in vfs locking and object
lifetime rules. Naturally, there are places where vfs rules don't
fit and nasty hacks, such as credential switching or lock dance
interleaving inode mutex and cgroup_mutex with object serial number
comparison thrown in to decide whether the operation is actually
necessary, needed to be employed.
After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding of
several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.
* Various simplifications including dropping of module support, easier
cgroup name/path handling, simplified cgroup file type handling and
task_cg_lists optimization.
* Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are associated
with it, which BTW was what the dummy hierarchy was for anyway.
* Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.
There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch.
Pulling in this branch creates four conflicts - two detected during
merge and two silent.
C1. e61734c55c24 ("cgroup: remove cgroup->name") dropped a bunch of
vars from mem_cgroup_print_oom_info and 08088cb9ac0a ("memcg:
change oom_info_lock to mutex") changed one of them. Can be
combined by dropping what's dropped in the former and updating
oom_info_lock according to the latter.
void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
<<<<<<< HEAD
/*
* protects memcg_name and makes sure that parallel ooms do not
* interleave
*/
static DEFINE_MUTEX(oom_info_lock);
struct cgroup *task_cgrp;
struct cgroup *mem_cgrp;
static char memcg_name[PATH_MAX];
int ret;
=======
/* oom_info_lock ensures that parallel ooms do not interleave */
static DEFINE_SPINLOCK(oom_info_lock);
>>>>>>> 1ec41830e087cda1f62dda4182c2b62811eb0ffc
struct mem_cgroup *iter;
unsigned int i;
******* RESOLUTION *******
void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
/* oom_info_lock ensures that parallel ooms do not interleave */
static DEFINE_MUTEX(oom_info_lock);
struct mem_cgroup *iter;
unsigned int i;
C2. b8dadcb58d54 ("cpuset: use rcu_read_lock() to protect task_cs()")
replaced task_lock() with rcu_read_lock() and 99afb0fd5f05
("cpuset: fix a race condition in
__cpuset_node_allowed_softwall()") relocated a line into the
locked region from after. The resolution is straight-forward.
rcu_read_lock();
cs = nearest_hardwall_ancestor(task_cs(current));
<<<<<<< HEAD
allowed = node_isset(node, cs->mems_allowed);
task_unlock(current);
=======
rcu_read_unlock();
>>>>>>> 1ec41830e087cda1f62dda4182c2b62811eb0ffc
mutex_unlock(&callback_mutex);
return allowed;
******* RESOLUTION *******
rcu_read_lock();
cs = nearest_hardwall_ancestor(task_cs(current));
allowed = node_isset(node, cs->mems_allowed);
rcu_read_unlock();
mutex_unlock(&callback_mutex);
return allowed;
C3. fed95bab8d29 ("sysfs: fix namespace refcnt leak") added an
argument to kernfs_mount() and was routed through driver-core-next
after cgroup pulled the tree. cgroup's usage needs to be updated
accordingly.
C4. 3eb59ec64fc7 ("cgroup: fix a failure path in create_css()") routed
through cgroup/for-3.14-fixes added new usage of ss->subsys_id
which was renamed to ss->id in the devel branch. The new usage
needs to be updated accordingly.
The following patch resolves C3 and C4.
index 432bdec..fede3d3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1603,7 +1603,7 @@ out_unlock:
if (ret)
return ERR_PTR(ret);
- dentry = kernfs_mount(fs_type, flags, root->kf_root);
+ dentry = kernfs_mount(fs_type, flags, root->kf_root, NULL);
if (IS_ERR(dentry))
cgroup_put(&root->cgrp);
return dentry;
@@ -3654,7 +3654,7 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
return 0;
err_clear_dir:
- cgroup_clear_dir(css->cgroup, 1 << css->ss->subsys_id);
+ cgroup_clear_dir(css->cgroup, 1 << css->ss->id);
err_free_percpu_ref:
percpu_ref_cancel_init(&css->refcnt);
err_free_css:
Due to the intermediate merges, git-request-pull's diffstat didn't
match the changes made by pulling in the tree. diffstat is generated
by comparing pre and post merge trees. The test merge is available in
test-merge-3.15 branch.
The following changes since commit 532de3fc72adc2a6525c4d53c07bf81e1732083d:
cgroup: update cgroup_enable_task_cg_lists() to grab siglock (2014-02-18 18:23:18 -0500)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.15
for you to fetch changes up to 1ec41830e087cda1f62dda4182c2b62811eb0ffc:
cgroup: remove useless argument from cgroup_exit() (2014-03-29 09:15:54 -0400)
Thanks.
----------------------------------------------------------------
Fengguang Wu (1):
cgroup: fix coccinelle warnings
Li Zefan (7):
cgroup: fix locking in cgroupstats_build()
cgroup: fix memory leak in cgroup_mount()
cgroup: deal with dummp_top in cgroup_name() and cgroup_path()
cgroup: add a validation check to cgroup_add_cftyps()
cpuset: use rcu_read_lock() to protect task_cs()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: remove useless argument from cgroup_exit()
Monam Agarwal (1):
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
Paul Gortmaker (1):
sparc: fix implicit include of slab.h in leon_pci_grpci2.c
Stephen Rothwell (1):
sun4M: add include of slab.h for kzalloc
Tejun Heo (67):
cgroup: make CONFIG_CGROUP_NET_PRIO bool and drop unnecessary init_netclassid_cgroup()
cgroup: drop module support
cgroup: clean up cgroup_subsys names and initialization
cgroup: rename cgroup_subsys->subsys_id to ->id
cgroup: update locking in cgroup_show_options()
cgroup: remove cgroup_root_mutex
Merge branch 'for-3.14-fixes' into for-3.15
Merge branch 'driver-core-next' into cgroup/for-3.15
Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15
cgroup: improve css_from_dir() into css_tryget_from_dir()
cgroup: introduce cgroup_tree_mutex
cgroup: release cgroup_mutex over file removals
cgroup: restructure locking and error handling in cgroup_mount()
cgroup: factor out cgroup_setup_root() from cgroup_mount()
cgroup: update cgroup name handling
cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()
cgroup: update the meaning of cftype->max_write_len
cgroup: introduce cgroup_init/exit_cftypes()
cgroup: introduce cgroup_ino()
cgroup: misc preps for kernfs conversion
cgroup: relocate functions in preparation of kernfs conversion
cgroup: convert to kernfs
cgroup: warn if "xattr" is specified with "sane_behavior"
cgroup: relocate cgroup_rm_cftypes()
cgroup: remove cftype_set
cgroup: simplify dynamic cftype addition and removal
cgroup: make cgroup hold onto its kernfs_node
cgroup: remove cgroup->name
cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t
cgroup: remove cgroupfs_root->refcnt
cgroup: disallow xattr, release_agent and name if sane_behavior
cgroup: drop CGRP_ROOT_SUBSYS_BOUND
cgroup: enable task_cg_lists on the first cgroup mount
cgroup: relocate cgroup_enable_task_cg_lists()
cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count()
cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks()
cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
cpuset: use css_task_iter_start/next/end() instead of css_scan_tasks()
cgroup: remove css_scan_tasks()
cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit()
cgroup: move css_set_rwsem locking outside of cgroup_task_migrate()
cgroup: drop @skip_css from cgroup_taskset_for_each()
cpuset: don't use cgroup_taskset_cur_css()
cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size()
cgroup: cosmetic updates to cgroup_attach_task()
cgroup: unexport functions
Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15
cgroup: add css_set->mg_tasks
cgroup: use css_set->mg_tasks to track target tasks during migration
cgroup: separate out cset_group_from_root() from task_cgroup_from_root()
cgroup: split process / task migration into four steps
cgroup: update how a newly forked task gets associated with css_set
cgroup: drop task_lock() protection around task->cgroups
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup_freezer: document freezer_fork() subtleties
cgroup: relocate setting of CGRP_DEAD
cgroup: reorganize cgroup bootstrapping
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: rename cgroup_dummy_root and related names
cgroup: drop const from @buffer of cftype->write_string()
cgroup: make cgrp_dfl_root mountable
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: fix cgroup_taskset walking order
cgroup: break kernfs active_ref protection in cgroup directory operations
arch/sparc/kernel/leon_pci_grpci2.c | 1
arch/sparc/kernel/sun4m_irq.c | 2
block/blk-cgroup.c | 11
block/blk-cgroup.h | 14
block/blk-throttle.c | 8
block/cfq-iosched.c | 7
fs/bio.c | 2
fs/kernfs/dir.c | 1
include/linux/cgroup.h | 275 +-
include/linux/cgroup_subsys.h | 30
include/linux/hugetlb_cgroup.h | 2
include/linux/memcontrol.h | 2
include/net/cls_cgroup.h | 2
include/net/netprio_cgroup.h | 17
init/Kconfig | 1
kernel/cgroup.c | 3721 ++++++++++++++----------------------
kernel/cgroup_freezer.c | 40
kernel/cpuset.c | 262 --
kernel/events/core.c | 25
kernel/exit.c | 2
kernel/fork.c | 5
kernel/sched/core.c | 10
kernel/sched/cpuacct.c | 6
kernel/sched/debug.c | 3
mm/hugetlb_cgroup.c | 11
mm/memcontrol.c | 110 -
mm/memory-failure.c | 8
net/Kconfig | 2
net/core/netclassid_cgroup.c | 15
net/core/netprio_cgroup.c | 41
net/ipv4/tcp_memcontrol.c | 4
security/device_cgroup.c | 12
32 files changed, 1913 insertions(+), 2739 deletions(-)
--
tejun
On Thu, Apr 3, 2014 at 9:49 AM, Tejun Heo <[email protected]> wrote:
>
> C3. fed95bab8d29 ("sysfs: fix namespace refcnt leak") added an
> argument to kernfs_mount() and was routed through driver-core-next
> after cgroup pulled the tree. cgroup's usage needs to be updated
> accordingly.
This one I find very annoying. There's exactly *one* user of that
interface (back in fed95bab8d29 there were none), and it doesn't want
that dummy argument.
WTF? Why did the driver-core-next development add that argument at
all, rather than just pass in NULL directly? Do we expect to grow more
users of kernfs_mount?
And the "bool *new_sb_created" argument really makes *zero* sense to
kernfs_mount(). It was added to fix a namespace refcount leak, BUT
kernfs_mount() DOES NOT TAKE A NAMESPACE PARAMETER!
So quite frankly, that kernfs_mount() calling convention change looks
completely f*cked up.
I think I'm going to resolve that conflict differently from your
suggested one - I'm going to just remove the broken stupid
"new_sb_created" argument from the kernfs_mount() helper function.
(It obviously stays around for kernfs_mount_ns() that *does* take a
namespace pointer).
People, holler if you disagree, I'll hold off pushing out my
resolution for a while anyway (do all the normal build tests and
reboot into it).
Linus
On Thu, Apr 3, 2014 at 11:11 AM, Linus Torvalds
<[email protected]> wrote:
>
> And the "bool *new_sb_created" argument really makes *zero* sense to
> kernfs_mount(). It was added to fix a namespace refcount leak, BUT
> kernfs_mount() DOES NOT TAKE A NAMESPACE PARAMETER!
Let me clarify that: the only reason I can see for why you'd care
about whether a new sb has been created is because the new sb needs a
namespace refcount.
Now, *if* there is some other reason to care, then the
"new_sb_created" argument may make sense even for kernfs_mount(). I
just don't see it. But if I'm wrong about that, then my alternate
resolution is wrong. However, as far as I can tell, anything else
should be properly refcounted by the mounting logic, and the "ns"
parameter is the only thing that needs to be handled by the caller
because it has that stupid opaque pointer that doesn't know its own
type.
Could that perhaps be fixed? If kernfs_mount_ns() could just do the
proper "grab" on the "void *ns" directly, all of the stupid
"new_sb_created" crud could just go away, even for kernfs_mount_ns().
Linus
[ Extending the participants list a bit ]
On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <[email protected]> wrote:
> On the road so sending from phone. Iirc the param is necessary to
> distinguishe when a new sb is created so that it can be put properly later.
> I think cgroup is leaking super ref now and li was planning to send a fix
> once things are merged.
So as far as I can tell, cgroup is fine, because the superblock itself
is properly refcounted by the mounting code. It's the magic hidden
"namespace" pointer that isn't properly refcounted, because that
stupid sucker is typeless and depends on the namespace kind (so to
drop a namespace pointer you need to pass in not just the pointer, but
also the right "KOBJ_NS_TYPE_xxx" type, which right now is always
KOBJ_NS_TYPE_NET).
Quite frankly, the whole "let's make the 'ns' pointer be a 'const void
*' and rely on people passing the right type along" is some seriously
crazy sh*t. Couldn't we force it to be a pointer to a union where the
first member is the type, so that
(a) we don't use "const void *" that casts silently a bit too well to
*anything* (the "const" helps a bit, thankfully)
(b) we don't have to pass both pointer and type along, we can just
pass the pointer, and the type can be looked up from it?
Hmm? It *looks* like the only thing that generates namespace pointers
right now is "net_namespace()", which just returns a "struct net *".
Could we just add that "type" as the first member of that struct, and
make the rule be (for any future namespaces) that the first member has
to be an "int type" field?
I might well have missed some other source that generates namespace
pointers, but it really looks like right now all those namespace()
things end up calling ->class->namespace(), and the only class that
then implements a namespace pointer function is networking and that
"net_namespace()". And I don't *think* that the networking code would
mind having that "int type" at the head of the "struct net" it uses.
The reason I'd like to have the type accessible from the pointer is
that then we *could* use that type to index into the namespace ops,
and grab reference counts etc. Instead of having that "const void *ns"
pointer that you can't do anything about, and have to expect that the
caller (that knows the type of it) will do the right thing wrt
refcounting etc.
Is there something I'm missing?
Linus
Hello,
On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
> [ Extending the participants list a bit ]
>
> On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <[email protected]> wrote:
> > On the road so sending from phone. Iirc the param is necessary to
> > distinguishe when a new sb is created so that it can be put properly later.
> > I think cgroup is leaking super ref now and li was planning to send a fix
> > once things are merged.
>
> So as far as I can tell, cgroup is fine, because the superblock itself
> is properly refcounted by the mounting code. It's the magic hidden
Ah, I remembered the other way around. We could leak cgroup_root
reference, not the other way around. cgroup_mount() can be called
multiple times for the same sb and we inc cgroup_root's ref each time
but cgroup_kill_sb() only happens when the sb is released, so if we do
the following,
# mkdir cpuset cpuset1
# mount -t cgroup -o cpuset cgroup cpuset
# mount -t cgroup -o cpuset cgroup cpuset1
# umount cpuset
# umount cpuset1
The cgroup_root should be destroyed but it isn't, I think. We'd need
to bump cgroup_root's refcnt only when a new sb is created. It's
kinda ugly. Hmmm...
As for using specific type for ns tag, yeah, that'd be better
regardless of this. The opaqueness is a bit extreme now.
Thanks.
--
tejun
On Thu, Apr 3, 2014 at 12:43 PM, Tejun Heo <[email protected]> wrote:
>
> Ah, I remembered the other way around. We could leak cgroup_root
> reference, not the other way around. cgroup_mount() can be called
> multiple times for the same sb and we inc cgroup_root's ref each time
> but cgroup_kill_sb() only happens when the sb is released, so if we do
> the following,
Oh, Christ, I see what you are talking about.
That interface is all kinds of crazy.
> The cgroup_root should be destroyed but it isn't, I think. We'd need
> to bump cgroup_root's refcnt only when a new sb is created. It's
> kinda ugly. Hmmm...
Ok, so I guess we can use that "new_sb_created" thing, and I'll redo
my merge resolution to reflect that. I do find this incredibly ugly.
Linus
Tejun Heo <[email protected]> writes:
> Hello,
>
> On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
>> [ Extending the participants list a bit ]
>>
> As for using specific type for ns tag, yeah, that'd be better
> regardless of this. The opaqueness is a bit extreme now.
(The opaqueness has alwasy been silly but it was the compromise
required to get the code merged. Sigh)
With respect to cleaning up the namespace opaqueue pointers while
working on the sysctls I came up with a design that is quite a bit
more maintainable.
The basic idea in terms of kernfs is to have an array (sized 1 for now)
in the superblock of pointers to root kernfs_nodes. Then you have a
nslink node type that holds the index of the kernfs_node pointer it is
associated with. When following a nslink (which is unconditional
because they would not be visible to userspace) the appropriate root
kernfs_node is selected from the superblock and traversed to the same
path as the nslink.
The major benefit is that except for special cases in mounting (to
populate the extra root), and following the nslink there is no extra
logic to worry about or deal with.
Since that winds up using ordinary kernfs_nodes, which are already
reference counted in a well understood way we should be able to
completely remove passive refcount to the network namespace.
Since it has been clear what is needed for 3.15 and for resolving the
merge conflict I am not suggesting we do anything this merge window but
since we are talking about how to clean up warts in the existing design,
I figured I should mention it.
Eric
On 2014/4/4 3:43, Tejun Heo wrote:
> Hello,
>
> On Thu, Apr 03, 2014 at 12:01:23PM -0700, Linus Torvalds wrote:
>> [ Extending the participants list a bit ]
>>
>> On Thu, Apr 3, 2014 at 11:34 AM, Tejun Heo <[email protected]> wrote:
>>> On the road so sending from phone. Iirc the param is necessary to
>>> distinguishe when a new sb is created so that it can be put properly later.
>>> I think cgroup is leaking super ref now and li was planning to send a fix
>>> once things are merged.
>>
>> So as far as I can tell, cgroup is fine, because the superblock itself
>> is properly refcounted by the mounting code. It's the magic hidden
>
> Ah, I remembered the other way around. We could leak cgroup_root
> reference, not the other way around. cgroup_mount() can be called
> multiple times for the same sb and we inc cgroup_root's ref each time
> but cgroup_kill_sb() only happens when the sb is released, so if we do
> the following,
>
> # mkdir cpuset cpuset1
> # mount -t cgroup -o cpuset cgroup cpuset
> # mount -t cgroup -o cpuset cgroup cpuset1
> # umount cpuset
> # umount cpuset1
>
> The cgroup_root should be destroyed but it isn't, I think. We'd need
> to bump cgroup_root's refcnt only when a new sb is created.
Yeah, I had been waiting for the kernfs change to be merged into mainline,
so I can fix this cgroup refcnting, and here is the fix.
====================
[PATCH] cgroup: fix top cgroup refcnt leak
As mount() and kill_sb() is not a one-to-one match, If we mount the same
cgroupfs in serveral mount points, and then umount all of them, kill_sb()
will be called only once.
Try:
# mount -t cgroup -o cpuacct xxx /cgroup
# mount -t cgroup -o cpuacct xxx /cgroup2
# cat /proc/cgroups | grep cpuacct
cpuacct 2 1 1
# umount /cgroup
# umount /cgroup2
# cat /proc/cgroups | grep cpuacct
cpuacct 2 1 1
You'll see cgroupfs will never be freed.
Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2189462..48bd9c9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1487,6 +1487,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
struct cgroup_sb_opts opts;
struct dentry *dentry;
int ret;
+ bool new_sb;
/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1603,8 +1604,8 @@ out_unlock:
if (ret)
return ERR_PTR(ret);
- dentry = kernfs_mount(fs_type, flags, root->kf_root, NULL);
- if (IS_ERR(dentry))
+ dentry = kernfs_mount(fs_type, flags, root->kf_root, &new_sb);
+ if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
return dentry;
}
--
1.8.0.2
On Thu, Apr 03, 2014 at 01:02:45PM -0700, Linus Torvalds wrote:
> > The cgroup_root should be destroyed but it isn't, I think. We'd need
> > to bump cgroup_root's refcnt only when a new sb is created. It's
> > kinda ugly. Hmmm...
>
> Ok, so I guess we can use that "new_sb_created" thing, and I'll redo
> my merge resolution to reflect that. I do find this incredibly ugly.
I apparently missed the issue and designed the interface without
considering this ugliness. Maybe kernfs could be made to wrap rather
than providing mount/kill_sb() functions and hide details about sb or
we can simply add fstype->umount() so that there's symmetry; however,
the problem is kernfs-specific and other kernfs users would have
single static backing store and won't need to care about this, so, for
now, I think what it's a ugly but acceptable compromise.
Thanks.
--
tejun
On Fri, Apr 04, 2014 at 05:14:41PM +0800, Li Zefan wrote:
> [PATCH] cgroup: fix top cgroup refcnt leak
>
> As mount() and kill_sb() is not a one-to-one match, If we mount the same
> cgroupfs in serveral mount points, and then umount all of them, kill_sb()
> will be called only once.
>
> Try:
> # mount -t cgroup -o cpuacct xxx /cgroup
> # mount -t cgroup -o cpuacct xxx /cgroup2
> # cat /proc/cgroups | grep cpuacct
> cpuacct 2 1 1
> # umount /cgroup
> # umount /cgroup2
> # cat /proc/cgroups | grep cpuacct
> cpuacct 2 1 1
>
> You'll see cgroupfs will never be freed.
>
> Signed-off-by: Li Zefan <[email protected]>
Applied to cgroup/for-3.15-fixes.
Thanks.
--
tejun
On Thu, Apr 3, 2014 at 9:49 AM, Tejun Heo <[email protected]> wrote:
>
> A lot updates for cgroup [...]
Btw, just a heads up to let you know: I'm in the middle of bisecting
why my machine no longer reboots cleanly, and while I'm a few boots
away from pinpointing it exactly, it has now drilled down to the point
that the only remaining commits are from your cgroup pull that I
pulled yesterday.
Maybe I screwed something up in the bisection and it's not 100%
reliable, but I don't think so.
Will report back soon when I've narrowed it down more.
Linus
On Fri, Apr 4, 2014 at 6:06 PM, Linus Torvalds
<[email protected]> wrote:
>
> Will report back soon when I've narrowed it down more.
Ok, that may end up being harder than it looked. Commit
a755180bab81c038a6989d7ab746c702f1b3ec03 doesn't boot for me at all,
so apparently there are some really broken points in that whole
development chain.
I'll try to pick kernels away from that, but that makes bisection much
less effective.
Linus
On Fri, Apr 4, 2014 at 6:11 PM, Linus Torvalds
<[email protected]> wrote:
>
> I'll try to pick kernels away from that, but that makes bisection much
> less effective.
Ok, I think I'm away from the broken region. The problem seems to be in between
good: 8e30e2b8ba0e ("cgroup: restructure locking and error handling in
cgroup_mount()")
bad: 6f30558f37bf ("cgroup: make cgroup hold onto its kernfs_node")
and let's see what bisect says about the rest.. Does that give you any ideas?
Linus
On 2014.04.04 at 18:11 -0700, Linus Torvalds wrote:
> On Fri, Apr 4, 2014 at 6:06 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Will report back soon when I've narrowed it down more.
>
> Ok, that may end up being harder than it looked. Commit
> a755180bab81c038a6989d7ab746c702f1b3ec03 doesn't boot for me at all,
> so apparently there are some really broken points in that whole
> development chain.
FWIW I've seen similar boot failures, that started during this merge
window, on my machine. I first thought they might be linker related so I
opened a bug on: https://sourceware.org/bugzilla/show_bug.cgi?id=16788
But if I remove e.g. a single printk the bug goes away...
Unfortunately the same thing happens if I enable more debugging options,
so I haven't tracked the issue down yet.
--
Markus
Applied to cgroup/for-3.15. Will soon send pull request for this one
and the cgroup_root refcnt fix from Li.
Thanks.
------ 8< ------
>From 49957f8e2a43035a97d05bddefa394492a969c0d Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Mon, 7 Apr 2014 16:44:47 -0400
While converting cgroup to kernfs, 2bd59d48ebfb ("cgroup: convert to
kernfs") accidentally dropped the logic which makes newly created
cgroup dirs and files owned by the current uid / gid. This broke
cases where cgroup subtree management is delegated to !root as the sub
manager wouldn't be able to create more than single level of hierarchy
or put tasks into child cgroups it created.
Among other things, this breaks user session management in systemd and
one of the symptoms was 90s hang during shutdown. User session
systemd running as the user creates a sub-service to initiate shutdown
and tries to put kill(1) into it but fails because cgroup.procs is
owned by root. This leads to 90s hang during shutdown.
Implement cgroup_kn_set_ugid() which sets a kn's uid and gid to those
of the caller and use it from file and dir creation paths.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Linus Torvalds <[email protected]>
---
kernel/cgroup.c | 27 ++++++++++++++++++++++++++-
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0dfc732..9fcdaa7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2346,11 +2346,26 @@ static int cgroup_rename(struct kernfs_node *kn, struct kernfs_node *new_parent,
return ret;
}
+/* set uid and gid of cgroup dirs and files to that of the creator */
+static int cgroup_kn_set_ugid(struct kernfs_node *kn)
+{
+ struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID,
+ .ia_uid = current_fsuid(),
+ .ia_gid = current_fsgid(), };
+
+ if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) &&
+ gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID))
+ return 0;
+
+ return kernfs_setattr(kn, &iattr);
+}
+
static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
{
char name[CGROUP_FILE_NAME_MAX];
struct kernfs_node *kn;
struct lock_class_key *key = NULL;
+ int ret;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
key = &cft->lockdep_key;
@@ -2358,7 +2373,13 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
cgroup_file_mode(cft), 0, cft->kf_ops, cft,
NULL, false, key);
- return PTR_ERR_OR_ZERO(kn);
+ if (IS_ERR(kn))
+ return PTR_ERR(kn);
+
+ ret = cgroup_kn_set_ugid(kn);
+ if (ret)
+ kernfs_remove(kn);
+ return ret;
}
/**
@@ -3753,6 +3774,10 @@ static long cgroup_create(struct cgroup *parent, const char *name,
*/
idr_replace(&root->cgroup_idr, cgrp, cgrp->id);
+ err = cgroup_kn_set_ugid(kn);
+ if (err)
+ goto err_destroy;
+
err = cgroup_addrm_files(cgrp, cgroup_base_files, true);
if (err)
goto err_destroy;
--
1.9.0