In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics"), it reconstructed the code to reduce the branch miss rate.
Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
z->node would not be equal to preferred_zone->node. That seems to be
incorrect.
Here is what I catch, dumpstack() is triggered when z->node ==
preferred_zone->node and z->node != numa_node_id()
z=5,prefer=5,local=4, flag&OTHER_NODE=0
[c000000cdcaef440] [c0000000002e88cc] cache_grow_begin+0xcc/0x500
[c000000cdcaef6f0] [c0000000002ecb44] do_tune_cpucache+0x64/0x100
[c000000cdcaef750] [c0000000002ecc7c] enable_cpucache+0x9c/0x180
[c000000cdcaef7d0] [c0000000002ed01c] __kmem_cache_create+0x1ec/00x2c0
[c000000cdcaef820] [c000000000291c98] create_cache+0xb8/0x240
[c000000cdcaef890] [c000000000291fa8] kmem_cache_create+0x188/0x2290
[c000000cdcaef950] [d000000011dc5c70] ext4_mb_init+0x3c0/0x5e0 [eext4]
[c000000cdcaef9f0] [d000000011daaedc] ext4_fill_super+0x266c/0x33390 [ext4]
[c000000cdcaefb30] [c000000000328b8c] mount_bdev+0x22c/0x260
[c000000cdcaefbd0] [d000000011da1fa8] ext4_mount+0x48/0x60 [ext4]
[c000000cdcaefc10] [c00000000032a11c] mount_fs+0x8c/0x230
[c000000cdcaefcb0] [c000000000351f98] vfs_kern_mount+0x78/0x180
[c000000cdcaefd00] [c000000000356d68] do_mount+0x258/0xea0
[c000000cdcaefde0] [c000000000357da0] SyS_mount+0xa0/0x110
[c000000cdcaefe30] [c00000000000bd84] system_call+0x38/0xe0
Before this patch, the numa_miss and numa_foreign looked very odd:
linux:~ # numastat
node0 node1 node2 node3 node4 node5 node6
numa_hit 42216 0 0 0 96755 0 0
numa_miss 1 718 711 726 860 712 719
numa_foreign 1 718 711 726 860 712 719
interleave_hit 631 638 632 641 621 633 636
local_node 42216 0 0 0 96755 0 0
other_node 0 0 0 0 0 0 0
After this patch
linux:~ # numastat
node0 node1 node2 node3 node4 node5 node6
numa_hit 177891 718 711 726 60302 712 719
numa_miss 0 196944 237222 253424 0 36265 0
numa_foreign 723855 0 0 0 0 0 0
interleave_hit 631 638 632 641 621 633 636
local_node 177891 0 0 0 59444 0 0
other_node 0 718 711 726 858 712 719
Jia He (1):
mm, page_alloc: fix incorrect zone_statistics data
mm/page_alloc.c | 3 +++
1 file changed, 3 insertions(+)
--
2.5.5
In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics"), it reconstructed codes to reduce the branch miss rate.
Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
z->node would not be equal to preferred_zone->node. That seems to be
incorrect.
Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics")
Signed-off-by: Jia He <[email protected]>
---
mm/page_alloc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440..474757e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2568,6 +2568,9 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
if (z->node == local_nid) {
__inc_zone_state(z, NUMA_HIT);
__inc_zone_state(z, local_stat);
+ } else if (z->node == preferred_zone->node) {
+ __inc_zone_state(z, NUMA_HIT);
+ __inc_zone_state(z, NUMA_OTHER);
} else {
__inc_zone_state(z, NUMA_MISS);
__inc_zone_state(preferred_zone, NUMA_FOREIGN);
--
2.5.5
On Mon 12-12-16 13:59:07, Jia He wrote:
> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> z->node would not be equal to preferred_zone->node. That seems to be
> incorrect.
I am sorry but I have hard time following the changelog. It is clear
that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
but it is not really clear when such thing happens. You are adding
preferred_zone->node check. preferred_zone is the first zone in the
requested zonelist. So for the most allocations it is a node from the
local node. But if something request an explicit numa node (without
__GFP_OTHER_NODE which would be the majority I suspect) then we could
indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
referenced patch indeed caused an unintended change of accounting AFAIU.
If this is correct then it should be a part of the changelog. I also
cannot say I would like the fix. First of all I am not sure
__GFP_OTHER_NODE is a good idea at all. How is an explicit usage of the
flag any different from an explicit __alloc_pages_node(non_local_nid)?
In both cases we ask for an allocation on a remote node and successful
allocation is a NUMA_HIT and NUMA_OTHER.
That being said, why cannot we simply do the following? As a bonus, we
can get rid of a barely used __GFP_OTHER_NODE. Also the number of
branches will stay same.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 429855be6ec9..f035d5c8b864 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order)
* Update NUMA hit/miss statistics
*
* Must be called with interrupts disabled.
- *
- * When __GFP_OTHER_NODE is set assume the node of the preferred
- * zone is the local node. This is useful for daemons who allocate
- * memory on behalf of other processes.
*/
static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
gfp_t flags)
{
#ifdef CONFIG_NUMA
- int local_nid = numa_node_id();
- enum zone_stat_item local_stat = NUMA_LOCAL;
-
- if (unlikely(flags & __GFP_OTHER_NODE)) {
- local_stat = NUMA_OTHER;
- local_nid = preferred_zone->node;
- }
+ if (z->node == preferred_zone->node) {
+ enum zone_stat_item local_stat = NUMA_LOCAL;
- if (z->node == local_nid) {
__inc_zone_state(z, NUMA_HIT);
+ if (z->node != numa_node_id())
+ local_stat = NUMA_OTHER;
__inc_zone_state(z, local_stat);
} else {
__inc_zone_state(z, NUMA_MISS);
--
Michal Hocko
SUSE Labs
On Mon, Dec 12, 2016 at 01:59:07PM +0800, Jia He wrote:
> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> z->node would not be equal to preferred_zone->node. That seems to be
> incorrect.
>
> Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> zone_statistics")
>
> Signed-off-by: Jia He <[email protected]>
This is slightly curious. It appear it would only occur if a process was
running on a node that was outside the memory policy. Can you confirm
that is the case?
If so, your patch is a a semantic curiousity because it's actually
impossible for a NUMA allocation to be local and the definition of "HIT"
is fuzzy enough to be useless.
I won't object to the patch but it makes me trust "hit" even less than I
already do for any analysis.
Note that after this mail that I'll be unavailable by mail until early
new years.
--
Mel Gorman
SUSE Labs
On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> On Mon 12-12-16 13:59:07, Jia He wrote:
> > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > z->node would not be equal to preferred_zone->node. That seems to be
> > incorrect.
>
> I am sorry but I have hard time following the changelog. It is clear
> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> but it is not really clear when such thing happens. You are adding
> preferred_zone->node check. preferred_zone is the first zone in the
> requested zonelist. So for the most allocations it is a node from the
> local node. But if something request an explicit numa node (without
> __GFP_OTHER_NODE which would be the majority I suspect) then we could
> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> referenced patch indeed caused an unintended change of accounting AFAIU.
>
This is a similar concern to what I had. If the preferred zone, which is
the first valid usable zone, is not a "hit" for the statistics then I
don't know what "hit" is meant to mean.
--
Mel Gorman
SUSE Labs
On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > z->node would not be equal to preferred_zone->node. That seems to be
> > > incorrect.
> >
> > I am sorry but I have hard time following the changelog. It is clear
> > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > but it is not really clear when such thing happens. You are adding
> > preferred_zone->node check. preferred_zone is the first zone in the
> > requested zonelist. So for the most allocations it is a node from the
> > local node. But if something request an explicit numa node (without
> > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > referenced patch indeed caused an unintended change of accounting AFAIU.
> >
>
> This is a similar concern to what I had. If the preferred zone, which is
> the first valid usable zone, is not a "hit" for the statistics then I
> don't know what "hit" is meant to mean.
But the first valid usable zone is defined based on the requested numa
node. Unless the requested node is memoryless then we should have a hit,
no?
--
Michal Hocko
SUSE Labs
On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > z->node would not be equal to preferred_zone->node. That seems to be
> > > > incorrect.
> > >
> > > I am sorry but I have hard time following the changelog. It is clear
> > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > but it is not really clear when such thing happens. You are adding
> > > preferred_zone->node check. preferred_zone is the first zone in the
> > > requested zonelist. So for the most allocations it is a node from the
> > > local node. But if something request an explicit numa node (without
> > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > >
> >
> > This is a similar concern to what I had. If the preferred zone, which is
> > the first valid usable zone, is not a "hit" for the statistics then I
> > don't know what "hit" is meant to mean.
>
> But the first valid usable zone is defined based on the requested numa
> node. Unless the requested node is memoryless then we should have a hit,
> no?
>
Should be. If the local node is memoryless then there would be a difference
between hit and whether it's local or not but that to me is a little
useless. A local vs remote page allocated has a specific meaning and
consequence. It's hard to see how hit can be meaningfully interpreted if
there are memoryless nodes. I don't have a strong objection to the patch
so I didn't nak it, I'm just not convinced it matters.
--
Mel Gorman
SUSE Labs
On Tue 20-12-16 14:28:45, Mel Gorman wrote:
> On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> > On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > > z->node would not be equal to preferred_zone->node. That seems to be
> > > > > incorrect.
> > > >
> > > > I am sorry but I have hard time following the changelog. It is clear
> > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > > but it is not really clear when such thing happens. You are adding
> > > > preferred_zone->node check. preferred_zone is the first zone in the
> > > > requested zonelist. So for the most allocations it is a node from the
> > > > local node. But if something request an explicit numa node (without
> > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > >
> > >
> > > This is a similar concern to what I had. If the preferred zone, which is
> > > the first valid usable zone, is not a "hit" for the statistics then I
> > > don't know what "hit" is meant to mean.
> >
> > But the first valid usable zone is defined based on the requested numa
> > node. Unless the requested node is memoryless then we should have a hit,
> > no?
> >
>
> Should be. If the local node is memoryless then there would be a difference
> between hit and whether it's local or not but that to me is a little
> useless. A local vs remote page allocated has a specific meaning and
> consequence. It's hard to see how hit can be meaningfully interpreted if
> there are memoryless nodes. I don't have a strong objection to the patch
> so I didn't nak it, I'm just not convinced it matters.
So what do you think about
http://lkml.kernel.org/r/[email protected]
I think that we should get rid of __GFP_OTHER_NODE thingy. It is just
one off thing and the gfp space it rather precious.
--
Michal Hocko
SUSE Labs
On Tue, Dec 20, 2016 at 02:28:45PM +0000, Mel Gorman wrote:
> On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> > On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > > z->node would not be equal to preferred_zone->node. That seems to be
> > > > > incorrect.
> > > >
> > > > I am sorry but I have hard time following the changelog. It is clear
> > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > > but it is not really clear when such thing happens. You are adding
> > > > preferred_zone->node check. preferred_zone is the first zone in the
> > > > requested zonelist. So for the most allocations it is a node from the
> > > > local node. But if something request an explicit numa node (without
> > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > >
> > >
> > > This is a similar concern to what I had. If the preferred zone, which is
> > > the first valid usable zone, is not a "hit" for the statistics then I
> > > don't know what "hit" is meant to mean.
> >
> > But the first valid usable zone is defined based on the requested numa
> > node. Unless the requested node is memoryless then we should have a hit,
> > no?
> >
>
> Should be. If the local node is memoryless then there would be a difference
> between hit and whether it's local or not but that to me is a little
> useless. A local vs remote page allocated has a specific meaning and
> consequence. It's hard to see how hit can be meaningfully interpreted if
> there are memoryless nodes. I don't have a strong objection to the patch
> so I didn't nak it, I'm just not convinced it matters.
>
That said, it's also hard to interpret "hit" when memory policies
forbid a memory node but allows the CPUs to be used. Local and remote
can be interpreted regardless of machine, "hit" has meaning depending on
the configuration and interpreting it is weird. It had some value when
zone_reclaim_mode was always enabled because a low hit rate would imply
zone_reclaim_mode was broken but that's a corner case. I would prefer to
get rid of it because interpreting it is such a mess if there wasn't tools
already collecting the data.
--
Mel Gorman
SUSE Labs
On 12/20/2016 03:35 PM, Michal Hocko wrote:
> On Tue 20-12-16 14:28:45, Mel Gorman wrote:
>> On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
>>> On Tue 20-12-16 13:10:40, Mel Gorman wrote:
>>>> On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
>>>>> On Mon 12-12-16 13:59:07, Jia He wrote:
>>>>>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>>>>>> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
>>>>>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>>>>>> z->node would not be equal to preferred_zone->node. That seems to be
>>>>>> incorrect.
>>>>>
>>>>> I am sorry but I have hard time following the changelog. It is clear
>>>>> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
>>>>> but it is not really clear when such thing happens. You are adding
>>>>> preferred_zone->node check. preferred_zone is the first zone in the
>>>>> requested zonelist. So for the most allocations it is a node from the
>>>>> local node. But if something request an explicit numa node (without
>>>>> __GFP_OTHER_NODE which would be the majority I suspect) then we could
>>>>> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
>>>>> referenced patch indeed caused an unintended change of accounting AFAIU.
>>>>>
>>>>
>>>> This is a similar concern to what I had. If the preferred zone, which is
>>>> the first valid usable zone, is not a "hit" for the statistics then I
>>>> don't know what "hit" is meant to mean.
>>>
>>> But the first valid usable zone is defined based on the requested numa
>>> node. Unless the requested node is memoryless then we should have a hit,
>>> no?
>>>
>>
>> Should be. If the local node is memoryless then there would be a difference
>> between hit and whether it's local or not but that to me is a little
>> useless. A local vs remote page allocated has a specific meaning and
>> consequence. It's hard to see how hit can be meaningfully interpreted if
>> there are memoryless nodes. I don't have a strong objection to the patch
>> so I didn't nak it, I'm just not convinced it matters.
>
> So what do you think about
> http://lkml.kernel.org/r/[email protected]
>
> I think that we should get rid of __GFP_OTHER_NODE thingy. It is just
> one off thing and the gfp space it rather precious.
Let's CC Andi who introduced it by commit 78afd5612deb8.
Personally I agree that the reasoning provided by that commit does not
justify the troubles. We already have the HIT and MISS counters to
record if we allocated on the node we explicitly wanted/preferred (local
or remote). The LOCAL and OTHER should thus be true local/remote
statistics, why fake them in some rare cases such as khugepaged? The
only other case is do_huge_pmd_wp_page_fallback() where it perhaps makes
even less sense. No others were added in 5 years so I think the flag
really didn't catch on, let's get rid of it.
On Tue, Dec 20, 2016 at 03:35:02PM +0100, Michal Hocko wrote:
> On Tue 20-12-16 14:28:45, Mel Gorman wrote:
> > On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> > > On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > > > z->node would not be equal to preferred_zone->node. That seems to be
> > > > > > incorrect.
> > > > >
> > > > > I am sorry but I have hard time following the changelog. It is clear
> > > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > > > but it is not really clear when such thing happens. You are adding
> > > > > preferred_zone->node check. preferred_zone is the first zone in the
> > > > > requested zonelist. So for the most allocations it is a node from the
> > > > > local node. But if something request an explicit numa node (without
> > > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > > >
> > > >
> > > > This is a similar concern to what I had. If the preferred zone, which is
> > > > the first valid usable zone, is not a "hit" for the statistics then I
> > > > don't know what "hit" is meant to mean.
> > >
> > > But the first valid usable zone is defined based on the requested numa
> > > node. Unless the requested node is memoryless then we should have a hit,
> > > no?
> > >
> >
> > Should be. If the local node is memoryless then there would be a difference
> > between hit and whether it's local or not but that to me is a little
> > useless. A local vs remote page allocated has a specific meaning and
> > consequence. It's hard to see how hit can be meaningfully interpreted if
> > there are memoryless nodes. I don't have a strong objection to the patch
> > so I didn't nak it, I'm just not convinced it matters.
>
> So what do you think about
> http://lkml.kernel.org/r/[email protected]
>
This doesn't appear to resolve for me and I've 30 minutes left before
being offline for 4 days so didn't go digging.
> I think that we should get rid of __GFP_OTHER_NODE thingy. It is just
> one off thing and the gfp space it rather precious.
>
However, broadly speaking, I'd be ok with getting rid of
__GFP_OTHER_NODE altogether and making it truely only about local vs
remote hits because those are the ones that matter in terms of
performance. If a user has memoryless nodes or policies that allow local
CPUs but forbid local memory and they need to debug an issue, they're
going to need tracepoints anyway. Hit/miss/other is not sufficient for
most interesting problems involving local or remote memory usage.
--
Mel Gorman
SUSE Labs
On 12/20/2016 10:18 AM, Michal Hocko wrote:
> On Mon 12-12-16 13:59:07, Jia He wrote:
>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>> z->node would not be equal to preferred_zone->node. That seems to be
>> incorrect.
>
> I am sorry but I have hard time following the changelog. It is clear
> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> but it is not really clear when such thing happens. You are adding
> preferred_zone->node check. preferred_zone is the first zone in the
> requested zonelist. So for the most allocations it is a node from the
> local node. But if something request an explicit numa node (without
> __GFP_OTHER_NODE which would be the majority I suspect) then we could
> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> referenced patch indeed caused an unintended change of accounting AFAIU.
Since I could not quite wrap my head around all possible combinations,
I've got lazy and employed cbmc, inspired by [1] (test file attached if
anyone wants to play with it) and tried to test whether both commit
b9f00e147f27, and Jia He's patch are equivalent to pre-b9f00e147f27. The
tool found counter examples for both cases (there might be more
scenarios, but it only reports the first one it hits). Note that
decoding the output is not exactly trivial...
[1] http://paulmck.livejournal.com/38997.html
==============
Counterexample for commit b9f00e147f27 vs pre-b9f00e147f27:
numa_node_id() == 1
preferred_zone on node 0
allocated from zone on node 1
without __GFP_OTHER_NODE
pre-b9f00e147f27:
allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL
preferred zone (node 0) increased NUMA_FOREIGN
(that looks correct to me)
b9f00e147f27:
allocated zone (node 1) got NUMA_HIT and NUMA_LOCAL
there's no NUMA_FOREIGN
(that's wrong wrt HIT and FOREIGN)
==============
Counterexample for Jia He's patch vs pre-b9f00e147f27:
numa_node_id() == 0
preferred_zone on node 0
allocated from zone on node 1
without __GFP_OTHER_NODE
pre-b9f00e147f27:
allocated zone (node 1) increased NUMA_MISS and NUMA_OTHER
preferred zone (node 0) increased NUMA_FOREIGN
(that looks correct to me)
Jia He's patch:
allocated zone (node 1) increased NUMA_MISS
preferred zone (node 0) increased NUMA_FOREIGN
(it's missing NUMA_OTHER)
On 20/12/2016 5:18 PM, Michal Hocko wrote:
> On Mon 12-12-16 13:59:07, Jia He wrote:
>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>> z->node would not be equal to preferred_zone->node. That seems to be
>> incorrect.
> I am sorry but I have hard time following the changelog. It is clear
> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> but it is not really clear when such thing happens. You are adding
> preferred_zone->node check. preferred_zone is the first zone in the
> requested zonelist. So for the most allocations it is a node from the
> local node. But if something request an explicit numa node (without
> __GFP_OTHER_NODE which would be the majority I suspect) then we could
> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> referenced patch indeed caused an unintended change of accounting AFAIU.
>
> If this is correct then it should be a part of the changelog. I also
> cannot say I would like the fix. First of all I am not sure
> __GFP_OTHER_NODE is a good idea at all. How is an explicit usage of the
> flag any different from an explicit __alloc_pages_node(non_local_nid)?
> In both cases we ask for an allocation on a remote node and successful
> allocation is a NUMA_HIT and NUMA_OTHER.
>
> That being said, why cannot we simply do the following? As a bonus, we
> can get rid of a barely used __GFP_OTHER_NODE. Also the number of
> branches will stay same.
Yes, I agree maybe we can get rid of __GFP_OTHER_NODE if no objections
Seems currently it is only used for hugepage and statistics
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 429855be6ec9..f035d5c8b864 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order)
> * Update NUMA hit/miss statistics
> *
> * Must be called with interrupts disabled.
> - *
> - * When __GFP_OTHER_NODE is set assume the node of the preferred
> - * zone is the local node. This is useful for daemons who allocate
> - * memory on behalf of other processes.
> */
> static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
> gfp_t flags)
> {
> #ifdef CONFIG_NUMA
> - int local_nid = numa_node_id();
> - enum zone_stat_item local_stat = NUMA_LOCAL;
> -
> - if (unlikely(flags & __GFP_OTHER_NODE)) {
> - local_stat = NUMA_OTHER;
> - local_nid = preferred_zone->node;
> - }
> + if (z->node == preferred_zone->node) {
> + enum zone_stat_item local_stat = NUMA_LOCAL;
>
> - if (z->node == local_nid) {
> __inc_zone_state(z, NUMA_HIT);
> + if (z->node != numa_node_id())
> + local_stat = NUMA_OTHER;
> __inc_zone_state(z, local_stat);
> } else {
> __inc_zone_state(z, NUMA_MISS);
I thought the logic here is different
Here is the zone_statistics() before introducing __GFP_OTHER_NODE:
if (z->zone_pgdat == preferred_zone->zone_pgdat) {
__inc_zone_state(z, NUMA_HIT);
} else {
__inc_zone_state(z, NUMA_MISS);
__inc_zone_state(preferred_zone, NUMA_FOREIGN);
}
if (z->node == numa_node_id())
__inc_zone_state(z, NUMA_LOCAL);
else
__inc_zone_state(z, NUMA_OTHER);
B.R.
Jia
On 20/12/2016 8:31 PM, Mel Gorman wrote:
> On Mon, Dec 12, 2016 at 01:59:07PM +0800, Jia He wrote:
>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>> z->node would not be equal to preferred_zone->node. That seems to be
>> incorrect.
>>
>> Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>> zone_statistics")
>>
>> Signed-off-by: Jia He <[email protected]>
> This is slightly curious. It appear it would only occur if a process was
> running on a node that was outside the memory policy. Can you confirm
> that is the case?
Yes, here is what I caught:
dumpstack() is triggered when z->node(5) == preferred_zone->node(5) and z->node != numa_node_id(4)
without flag GET_OTHER_NODE
It is not a rare case. I saw hundreds of such cases when kernel boots up
[c000000cdcaef440] [c0000000002e88cc] cache_grow_begin+0xcc/0x500
[c000000cdcaef6f0] [c0000000002ecb44] do_tune_cpucache+0x64/0x100
[c000000cdcaef750] [c0000000002ecc7c] enable_cpucache+0x9c/0x180
[c000000cdcaef7d0] [c0000000002ed01c] __kmem_cache_create+0x1ec/00x2c0
[c000000cdcaef820] [c000000000291c98] create_cache+0xb8/0x240
[c000000cdcaef890] [c000000000291fa8] kmem_cache_create+0x188/0x2290
[c000000cdcaef950] [d000000011dc5c70] ext4_mb_init+0x3c0/0x5e0 [eext4]
[c000000cdcaef9f0] [d000000011daaedc] ext4_fill_super+0x266c/0x33390 [ext4]
[c000000cdcaefb30] [c000000000328b8c] mount_bdev+0x22c/0x260
[c000000cdcaefbd0] [d000000011da1fa8] ext4_mount+0x48/0x60 [ext4]
[c000000cdcaefc10] [c00000000032a11c] mount_fs+0x8c/0x230
[c000000cdcaefcb0] [c000000000351f98] vfs_kern_mount+0x78/0x180
[c000000cdcaefd00] [c000000000356d68] do_mount+0x258/0xea0
[c000000cdcaefde0] [c000000000357da0] SyS_mount+0xa0/0x110
[c000000cdcaefe30] [c00000000000bd84] system_call+0x38/0xe0
> If so, your patch is a a semantic curiousity because it's actually
> impossible for a NUMA allocation to be local and the definition of "HIT"
> is fuzzy enough to be useless.
>
> I won't object to the patch but it makes me trust "hit" even less than I
> already do for any analysis.
>
> Note that after this mail that I'll be unavailable by mail until early
> new years.
>
On Tue 20-12-16 14:54:35, Mel Gorman wrote:
> On Tue, Dec 20, 2016 at 03:35:02PM +0100, Michal Hocko wrote:
> > On Tue 20-12-16 14:28:45, Mel Gorman wrote:
> > > On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> > > > On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > > > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > > > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > > > > z->node would not be equal to preferred_zone->node. That seems to be
> > > > > > > incorrect.
> > > > > >
> > > > > > I am sorry but I have hard time following the changelog. It is clear
> > > > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > > > > but it is not really clear when such thing happens. You are adding
> > > > > > preferred_zone->node check. preferred_zone is the first zone in the
> > > > > > requested zonelist. So for the most allocations it is a node from the
> > > > > > local node. But if something request an explicit numa node (without
> > > > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > > > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > > > >
> > > > >
> > > > > This is a similar concern to what I had. If the preferred zone, which is
> > > > > the first valid usable zone, is not a "hit" for the statistics then I
> > > > > don't know what "hit" is meant to mean.
> > > >
> > > > But the first valid usable zone is defined based on the requested numa
> > > > node. Unless the requested node is memoryless then we should have a hit,
> > > > no?
> > > >
> > >
> > > Should be. If the local node is memoryless then there would be a difference
> > > between hit and whether it's local or not but that to me is a little
> > > useless. A local vs remote page allocated has a specific meaning and
> > > consequence. It's hard to see how hit can be meaningfully interpreted if
> > > there are memoryless nodes. I don't have a strong objection to the patch
> > > so I didn't nak it, I'm just not convinced it matters.
> >
> > So what do you think about
> > http://lkml.kernel.org/r/[email protected]
> >
>
> This doesn't appear to resolve for me and I've 30 minutes left before
> being offline for 4 days so didn't go digging.
OK, it seems that it didn't go to the lkml so it didn't get to the
archive indexed by the message id. I will send the two patches as a
reply to this email for reference.
--
Michal Hocko
SUSE Labs