Message-ID: <8aa21721-6951-80d1-ae20-db4198b743ff@linux.ibm.com>
Date: Mon, 27 Feb 2023 11:41:10 +0530
From: Sourabh Jain
Subject: Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
To: Eric DeVolder, Thomas Gleixner, linux-kernel@vger.kernel.org,
    x86@kernel.org, kexec@lists.infradead.org, ebiederm@xmission.com,
    dyoung@redhat.com, bhe@redhat.com, vgoyal@redhat.com
Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
    hpa@zytor.com, nramas@linux.microsoft.com, thomas.lendacky@amd.com,
    robh@kernel.org, efault@gmx.de, rppt@kernel.org, david@redhat.com,
    konrad.wilk@oracle.com, boris.ostrovsky@oracle.com
In-Reply-To: <4eeda077-c578-791b-8e48-e418744a7615@oracle.com>
References: <20230131224236.122805-1-eric.devolder@oracle.com>
    <20230131224236.122805-6-eric.devolder@oracle.com>
    <87sffpzkle.ffs@tglx> <87h6vw2rwf.ffs@tglx>
    <7580421a-648a-2c4b-3c33-82e7622d9585@oracle.com>
    <24034f33-739b-e5f5-40c0-8d5abeb1ad89@oracle.com>
    <18c57fd0-2ad0-361a-9a53-ac49c372f021@linux.ibm.com>
    <4eeda077-c578-791b-8e48-e418744a7615@oracle.com>
List-ID: linux-kernel@vger.kernel.org

On 25/02/23 01:46, Eric DeVolder wrote:
>
>
> On 2/24/23 02:34, Sourabh Jain wrote:
>>
>> On 24/02/23 02:04, Eric DeVolder wrote:
>>>
>>>
>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>
>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>> Hello Eric,
>>>>>>
>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>> Eric!
>>>>>>>>
>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>
>>>>>>>>> So my latest solution is to introduce two new CPUHP states,
>>>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE for onlining and
>>>>>>>>> CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>>>>
>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after
>>>>>>>>> CPUHP_BRINGUP_CPU. My attempts at locating this state failed when
>>>>>>>>> inside the STARTING section, so I located this just inside the
>>>>>>>>> ONLINE section. The crash hotplug handler is registered on this
>>>>>>>>> state as the callback for the .startup method.
>>>>>>>>>
>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before
>>>>>>>>> CPUHP_TEARDOWN_CPU, and I placed it at the end of the PREPARE
>>>>>>>>> section. This crash hotplug handler is also registered on this
>>>>>>>>> state as the callback for the .teardown method.
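
(For context: registering a handler on the two states described above would
look roughly like the sketch below. Apart from the state names and the
KEXEC_CRASH_HP_* actions quoted elsewhere in this thread, the handler and
init-function names are illustrative placeholders, not the actual patch code.)

/* Illustrative sketch only, not the patch itself. */
static int crash_cpuhp_online(unsigned int cpu)
{
	crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
	return 0;
}

static int crash_cpuhp_offline(unsigned int cpu)
{
	crash_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
	return 0;
}

static int __init crash_hotplug_init(void)
{
	/* Crash handler as the .startup callback of the online state. */
	cpuhp_setup_state_nocalls(CPUHP_AP_ELFCOREHDR_ONLINE,
				  "crash/elfcorehdr:online",
				  crash_cpuhp_online, NULL);
	/* Crash handler as the .teardown callback of the offline state. */
	cpuhp_setup_state_nocalls(CPUHP_BP_ELFCOREHDR_OFFLINE,
				  "crash/elfcorehdr:offline",
				  NULL, crash_cpuhp_offline);
	return 0;
}
subsys_initcall(crash_hotplug_init);
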
>>>>>>>>
>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>
>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>> {
>>>>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>
>>>>>>>>     return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>>>>> }
>>>>>>>>
>>>>>>>> and use this to query the actual state at crash time. That
>>>>>>>> spares all those callback heuristics.
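
(A sketch of how such a helper would be consumed; note that it would have to
live near kernel/cpu.c, since the cpuhp_state percpu variable is private to
that file. The loop shape below is illustrative only, not code from this
series:)

	for_each_present_cpu(cpu) {
		if (!cpu_is_alive(cpu))
			continue;
		/* fill/emit the PT_NOTE for this cpu's crash notes */
	}
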
>>>>>>>>
>>>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr,
>>>>>>>>> vmcoreinfo, makedumpfile and (the consumer of it all) the userspace
>>>>>>>>> crash utility, in order to understand the impact of moving from
>>>>>>>>> for_each_present_cpu() to for_each_online_cpu().
>>>>>>>>
>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>          tglx
>>>>>>>>
>>>>>>>
>>>>>>> Thomas,
>>>>>>> I've investigated the passing of crash notes through the vmcore.
>>>>>>> What I've learned is that:
>>>>>>>
>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its
>>>>>>>   job) does not care what the contents of cpu PT_NOTEs are, but it
>>>>>>>   does coalesce them together.
>>>>>>>
>>>>>>> - makedumpfile will count the number of cpu PT_NOTEs in order to
>>>>>>>   determine its nr_cpus variable, which is reported in a header, but
>>>>>>>   otherwise unused (except for the sadump method).
>>>>>>>
>>>>>>> - the crash utility, for the purposes of determining the cpus, does
>>>>>>>   not appear to reference the elfcorehdr PT_NOTEs. Instead it locates
>>>>>>>   the various cpu_[possible|present|online]_mask and computes nr_cpus
>>>>>>>   from that, and also of course which are online. In addition, when
>>>>>>>   crash does reference the cpu PT_NOTE, to get its prstatus, it does
>>>>>>>   so by using a percpu technique directly in the vmcore image memory,
>>>>>>>   not via the ELF structure. Said differently, it appears to me that
>>>>>>>   the crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather
>>>>>>>   it obtains them via kernel cpumasks and the memory within the
>>>>>>>   vmcore.
>>>>>>>
>>>>>>> With this understanding, I did some testing. Perhaps the most telling
>>>>>>> test was that I changed the number of cpu PT_NOTEs emitted in
>>>>>>> crash_prepare_elf64_headers() to just 1, hot plugged some cpus, then
>>>>>>> also took a few offline sparsely via chcpu, then generated a vmcore.
>>>>>>> The crash utility had no problem loading the vmcore, it reported the
>>>>>>> proper number of cpus and the number offline (despite only one cpu
>>>>>>> PT_NOTE), and changing to a different cpu via 'set -c 30' produced a
>>>>>>> completely valid backtrace.
>>>>>>>
>>>>>>> My takeaway is that the crash utility does not rely upon ELF cpu
>>>>>>> PT_NOTEs; it obtains the cpu information directly from kernel data
>>>>>>> structures. Perhaps at one time crash relied upon the ELF
>>>>>>> information, but no more. (Perhaps there are other crash dump
>>>>>>> analyzers that might rely on the ELF info?)
>>>>>>>
>>>>>>> So, all this to say that I see no need to change
>>>>>>> crash_prepare_elf64_headers(). There is no compelling reason to move
>>>>>>> away from for_each_present_cpu(), or to modify the list for
>>>>>>> online/offline.
>>>>>>>
>>>>>>> Which then leaves the topic of the cpuhp state on which to register.
>>>>>>> Perhaps reverting back to the use of CPUHP_BP_PREPARE_DYN is the
>>>>>>> right answer. There does not appear to be a compelling need to
>>>>>>> accurately track whether the cpu went online/offline for the purposes
>>>>>>> of creating the elfcorehdr, as ultimately the crash utility pulls
>>>>>>> that from kernel data structures, not the elfcorehdr.
>>>>>>>
>>>>>>> I think this is what Sourabh has known and has been advocating for:
>>>>>>> an optimization path that allows not regenerating the elfcorehdr on
>>>>>>> cpu changes (because all the percpu structs are already laid out). I
>>>>>>> do think it best to leave that as an arch choice.
>>>>>>
>>>>>> Since things are clear on how the PT_NOTEs are consumed in the kdump
>>>>>> kernel [fs/proc/vmcore.c], makedumpfile, and the crash tool, I need
>>>>>> your opinion on this:
>>>>>>
>>>>>> Do we really need to regenerate the elfcorehdr for CPU hotplug events?
>>>>>> If yes, can you please list the elfcorehdr components that change due
>>>>>> to CPU hotplug.
>>>>>
>>>>> Due to the use of for_each_present_cpu(), it is possible for the number
>>>>> of cpu PT_NOTEs to fluctuate as cpus are un/plugged. Onlining/offlining
>>>>> of cpus does not impact the number of cpu PT_NOTEs (as the cpus are
>>>>> still present).
>>>>>
>>>>>> From what I understood, crash notes are prepared for possible CPUs as
>>>>>> the system boots and could be used to create a PT_NOTE section for
>>>>>> each possible CPU while generating the elfcorehdr during the kdump
>>>>>> kernel load.
>>>>>>
>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU
>>>>>> there is no need to regenerate it for CPU hotplug events. Or do we?
>>>>>
>>>>> For onlining/offlining of cpus, there is no need to regenerate the
>>>>> elfcorehdr. However, for actual hot un/plug of cpus, the answer is yes,
>>>>> due to for_each_present_cpu(). The caveat here of course is that if the
>>>>> crash utility is the only coredump analyzer of concern, then it doesn't
>>>>> care about these cpu PT_NOTEs and there would be no need to re-generate
>>>>> them.
>>>>>
>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into
>>>>> mainstream, impacts any of this.
>>>>>
>>>>> Perhaps the one item that might help here is to distinguish between
>>>>> actual hot un/plug of cpus, versus onlining/offlining. At the moment, I
>>>>> can not distinguish between a hot plug event and an online event (and
>>>>> unplug/offline). If those were distinguishable, then we could
>>>>> regenerate only on un/plug events.
>>>>>
>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>
>>>> Yes, because once the elfcorehdr is built with possible CPUs we don't
>>>> have to worry about the hot[un]plug case.
>>>>
>>>> Here is my view on how things should be handled if a core-dump analyzer
>>>> is dependent on elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>
>>>> A PT_NOTE in the elfcorehdr holds the address of the corresponding
>>>> crash notes (the kernel has one crash note per CPU for every possible
>>>> CPU). Though the crash notes are allocated during boot, they are
>>>> populated when the system is on the crash path.
>>>>
>>>> This is how crash notes are populated on PowerPC, and I expect it would
>>>> be something similar on other architectures too.
>>>>
>>>> The crashing CPU sends an IPI to every other online CPU with a callback
>>>> function that updates the crash notes of that specific CPU. Once the
>>>> IPI completes, the crashing CPU updates its own crash note and proceeds
>>>> further.
>>>>
>>>> The crash notes of CPUs remain uninitialized if the CPUs were offline
>>>> or hot unplugged at the time of the system crash. The core-dump
>>>> analyzer should be able to identify [un]initialized crash notes and
>>>> display the information accordingly.
>>>>
>>>> Thoughts?
>>>>
>>>> - Sourabh
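
(To make that detection idea concrete: the crash notes are percpu buffers
that are zero-filled when allocated, so a note that was never written on the
crash path stays all zeros. A minimal check an analyzer could apply might
look like the sketch below; this is not actual crash or makedumpfile code,
and the function name is made up:)

#include <stdbool.h>
#include <stddef.h>

/* Treat an all-zero note buffer as "cpu was offline or unplugged". */
static bool crash_note_was_written(const unsigned char *note, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (note[i])
			return true;
	return false;
}
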
>>>
>>> I've been examining what it would mean to move to
>>> for_each_possible_cpu() in crash_prepare_elf64_headers(). I think it
>>> means:
>>>
>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in
>>>   crash_prepare_elf64_headers().
>>> - For the kexec_load() syscall path, rewriting the incoming/supplied
>>>   elfcorehdr immediately on the load with the elfcorehdr generated by
>>>   crash_prepare_elf64_headers().
>>> - Eliminating/removing the cpuhp machinery for handling crash hotplug
>>>   events.
>>
>> If for_each_present_cpu is replaced with for_each_possible_cpu I still
>> need the cpuhp machinery to update the FDT kexec segment for the CPU hot
>> add case.
>
> Ah, ok, that's important! So the cpuhp callbacks are still needed.
>
>>> This would then set up PT_NOTEs for all possible cpus, which should in
>>> theory accommodate crash analyzers that rely on ELF PT_NOTEs for
>>> crash_notes.
>>>
>>> If staying with for_each_present_cpu() is ultimately decided, then I
>>> think the cpuhp machinery should be left in place and each arch could
>>> decide how to handle crash cpu hotplug events. The overhead for doing
>>> this is very minimal, and the events are likely very infrequent.
>>
>> I agree. Some architectures may need the cpuhp machinery to update kexec
>> segment[s] other than the elfcorehdr. For example, the FDT on PowerPC.
>>
>> - Sourabh Jain
>
> OK, I was thinking that the desire was to eliminate the cpuhp callbacks.
> In reality, the desire is to change to for_each_possible_cpu(). Given
> that the kernel creates crash_notes for all possible cpus upon kernel
> boot, there seems to be no reason not to do this?
>
> HOWEVER...
>
> It's not clear to me that this particular change needs to be part of this
> series. Its inclusion would facilitate PPC support, but doesn't "solve"
> anything in general. In fact it causes kexec_load and kexec_file_load to
> deviate (kexec_load via userspace kexec does the equivalent of
> for_each_present_cpu(), whereas with this change kexec_file_load would do
> for_each_possible_cpu(); after a hot plug event both would do
> for_each_possible_cpu()). And if this change were to arrive as part of
> Sourabh's PPC support, then it does not appear to impact x86 (not sure
> about other arches). And the 'crash' dump analyzer doesn't care either
> way.
>
> Including this change would enable an optimization path (for x86 at
> least) that short-circuits cpu hotplug changes in the arch crash handler,
> for example:
>
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index aca3f1817674..0883f6b11de4 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
>      unsigned long mem, memsz;
>      unsigned long elfsz = 0;
>
> +    if (image->file_mode && (
> +        image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
> +        image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
> +        return;
> +
>      /*
>       * Create the new elfcorehdr reflecting the changes to CPU and/or
>       * memory resources.
>
> I'm not sure that is compelling given the infrequent nature of cpu
> hotplug events.

It certainly closes/reduces the window where kdump is not active due to a
kexec segment update.

> In my mind I still have a question about the kexec_load() path. The
> userspace kexec can not do the equivalent of for_each_possible_cpu(). It
> can obtain the max possible cpus from /sys/devices/system/cpu/possible,
> but for those cpus not present the /sys/devices/system/cpu/cpuXX entry is
> not available, and so the crash_notes entries are not available. My
> attempts to expose all cpuXX led to odd behavior requiring changes in
> ACPI and arch code that looked untenable.
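
(For reference, the userspace side of that collection looks roughly like the
sketch below; the function name is made up and error handling is trimmed.
The sysfs path is the one discussed above, and the file simply does not
exist for possible-but-not-present cpus:)

#include <stdio.h>

/* Read the physical address of a cpu's crash_notes buffer from sysfs. */
static int read_crash_notes_addr(int cpu, unsigned long long *addr)
{
	char path[64];
	FILE *f;
	int ret;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/crash_notes", cpu);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* cpu not present: no cpuXX directory */
	ret = (fscanf(f, "%llx", addr) == 1) ? 0 : -1;
	fclose(f);
	return ret;
}
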
> There seem to be these options available for the kexec_load() path:
>
> - Immediately rewrite the elfcorehdr upon load via a call to
>   crash_prepare_elf64_headers(). I've made this work with the following,
>   as proof of concept:

Yes, regenerating/patching the elfcorehdr could be an option for the
kexec_load syscall.

> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index cb8e6e6f983c..4eb201270f97 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry, unsigned long
>      kimage_free(image);
>  out_unlock:
>      kexec_unlock();
> +    if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
> +        if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) {
> +            crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE, KEXEC_CRASH_HP_INVALID_CPU);
> +        }
> +    }
>      return ret;
>  }
>
> - Another option is to spend the time to determine whether exposing all
>   cpuXX entries is a viable solution; I have no idea what the impacts to
>   userspace would be of possible-but-not-yet-present cpuXX entries. It
>   might also mean requiring a 'present' entry available within the cpuXX.
>
> - Another option is to simply let the hot plug events rewrite the
>   elfcorehdr on demand. This is what I've originally put forth, but I'm
>   not sure how this impacts PPC given the for_each_possible_cpu() change.

Given that /sys/devices/system/cpu/cpuXX is not present for
possible-but-not-yet-present CPUs, I am wondering: do we even have crash
notes for possible CPUs on x86?

> The concern is that today, both kexec_load and kexec_file_load mirror
> each other with respect to for_each_present_cpu(); that is, userspace
> kexec is able to generate the elfcorehdr the same as kexec_file_load
> would, for cpus. But by changing to for_each_possible_cpu(), the two
> would deviate.

Thanks,
Sourabh Jain