We already make sure to allocate percpu data only after we verified that
the module we're loading hasn't already been loaded and isn't
concurrently getting loaded -- that it's unique.
On big systems (> 400 CPUs and many devices) with KASAN enabled, we're now
facing a similar issue with the module vmap space.
With KASAN_INLINE enabled (resulting in large module sizes), plenty
of devices that udev wants to probe, and plenty (> 400) of CPUs that can
carry out that probing concurrently, we can actually run out of module
vmap space and trigger vmap allocation errors:
[ 165.818200] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.836622] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[ 165.837461] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[ 165.840573] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841059] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841428] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841819] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.842123] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.843359] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.844894] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.847028] CPU: 253 PID: 4995 Comm: systemd-udevd Not tainted 5.19.0 #2
[ 165.935689] Hardware name: Lenovo ThinkSystem SR950 -[7X12ABC1WW]-/-[7X12ABC1WW]-, BIOS -[PSE130O-1.81]- 05/20/2020
[ 165.947343] Call Trace:
[ 165.950075] <TASK>
[ 165.952425] dump_stack_lvl+0x57/0x81
[ 165.956532] warn_alloc.cold+0x95/0x18a
[ 165.960836] ? zone_watermark_ok_safe+0x240/0x240
[ 165.966100] ? slab_free_freelist_hook+0x11d/0x1d0
[ 165.971461] ? __get_vm_area_node+0x2af/0x360
[ 165.976341] ? __get_vm_area_node+0x2af/0x360
[ 165.981219] __vmalloc_node_range+0x291/0x560
[ 165.986087] ? __mutex_unlock_slowpath+0x161/0x5e0
[ 165.991447] ? move_module+0x4c/0x630
[ 165.995547] ? vfree_atomic+0xa0/0xa0
[ 165.999647] ? move_module+0x4c/0x630
[ 166.003741] module_alloc+0xe7/0x170
[ 166.007747] ? move_module+0x4c/0x630
[ 166.011840] move_module+0x4c/0x630
[ 166.015751] layout_and_allocate+0x32c/0x560
[ 166.020519] load_module+0x8e0/0x25c0
[ 166.024623] ? layout_and_allocate+0x560/0x560
[ 166.029586] ? kernel_read_file+0x286/0x6b0
[ 166.034269] ? __x64_sys_fspick+0x290/0x290
[ 166.038946] ? userfaultfd_unmap_prep+0x430/0x430
[ 166.044203] ? lock_downgrade+0x130/0x130
[ 166.048698] ? __do_sys_finit_module+0x11a/0x1c0
[ 166.053854] __do_sys_finit_module+0x11a/0x1c0
[ 166.058818] ? __ia32_sys_init_module+0xa0/0xa0
[ 166.063882] ? __seccomp_filter+0x92/0x930
[ 166.068494] do_syscall_64+0x59/0x90
[ 166.072492] ? do_syscall_64+0x69/0x90
[ 166.076679] ? do_syscall_64+0x69/0x90
[ 166.080864] ? do_syscall_64+0x69/0x90
[ 166.085047] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 166.090984] ? lockdep_hardirqs_on+0x79/0x100
[ 166.095855] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Interestingly, when reducing the number of CPUs (nosmt), it works as
expected.
The underlying issue is that we first allocate memory (including module
vmap space) in layout_and_allocate(), and then verify whether the module
is unique in add_unformed_module(). So we end up allocating module vmap
space even though we might not need it -- which is a problem when modules
are big and we can have a lot of concurrent probing of the same set of
modules as on the big system at hand.
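To get a feel for the scale, here is a back-of-the-envelope sketch in C. It assumes, purely for illustration, that every one of 400 concurrent loaders requests the 2498560-byte allocation seen in the log before any of them reaches the uniqueness check. The helper names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: total bytes of module vmap space requested when
 * `loaders` tasks each allocate `module_bytes` before the uniqueness check.
 */
unsigned long long worst_case_vmap_bytes(unsigned int loaders,
					 unsigned long long module_bytes)
{
	/* Guard against overflow of the multiplication below. */
	assert(loaders == 0 || module_bytes <= ULLONG_MAX / loaders);
	return (unsigned long long)loaders * module_bytes;
}

/* Convert bytes to whole MiB, rounding down. */
unsigned long long bytes_to_mib(unsigned long long bytes)
{
	return bytes >> 20;
}
```

With these assumed inputs, worst_case_vmap_bytes(400, 2498560) is 999424000 bytes, roughly 953 MiB, requested for a single module name of which at most one copy can ever be used. How that compares to the available module area depends on the architecture and configuration.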
Unfortunately, we cannot simply add the module earlier, because
move_module() -- which allocates the module vmap space -- essentially
brings the module to life from a temporary one. Adding the temporary one
and replacing it is also sub-optimal (because replacing it would require
synchronizing against RCU) and feels kind of dangerous, given that we
end up copying it.
So instead, add a second list (pending_load_infos) that tracks the modules
(via their load_info) that are unique and are still getting loaded
("pending"), but haven't made it to the actual module list yet. This
shouldn't have a notable runtime overhead when concurrently loading
modules: the new list is expected to usually either be empty or contain
very few entries for a short time.
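The idea of reserving the name first can be sketched in userspace C as follows. This is a simplified model under stated assumptions, not the kernel implementation: fixed-size tables stand in for the modules list and pending_load_infos, a pthread mutex stands in for module_mutex, and a duplicate request fails immediately instead of waiting on module_wq. All function names here are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 16
#define NAME_LEN  56

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static char pending[MAX_NAMES][NAME_LEN]; /* models pending_load_infos */
static char loaded[MAX_NAMES][NAME_LEN];  /* models the modules list */

static bool table_has(char table[][NAME_LEN], const char *name)
{
	for (size_t i = 0; i < MAX_NAMES; i++)
		if (table[i][0] && !strcmp(table[i], name))
			return true;
	return false;
}

static bool table_add(char table[][NAME_LEN], const char *name)
{
	for (size_t i = 0; i < MAX_NAMES; i++) {
		if (!table[i][0]) {
			strcpy(table[i], name); /* length checked by caller */
			return true;
		}
	}
	return false;
}

static void table_del(char table[][NAME_LEN], const char *name)
{
	for (size_t i = 0; i < MAX_NAMES; i++)
		if (table[i][0] && !strcmp(table[i], name))
			table[i][0] = '\0';
}

/* Reserve a name before any expensive allocation. 0 on success,
 * -1 if the name is already pending or loaded, -2 if the table is full. */
int reserve_name(const char *name)
{
	int ret = 0;

	assert(name && name[0] && strlen(name) < NAME_LEN);
	pthread_mutex_lock(&registry_lock);
	if (table_has(loaded, name) || table_has(pending, name))
		ret = -1;
	else if (!table_add(pending, name))
		ret = -2;
	pthread_mutex_unlock(&registry_lock);
	return ret;
}

/* Move a reserved name to the loaded table; 0 on success, -1 otherwise. */
int publish_name(const char *name)
{
	int ret = -1;

	pthread_mutex_lock(&registry_lock);
	if (table_has(pending, name) && table_add(loaded, name)) {
		table_del(pending, name);
		ret = 0;
	}
	pthread_mutex_unlock(&registry_lock);
	return ret;
}

/* Drop a reservation after a failed load so others may retry. */
void abort_name(const char *name)
{
	pthread_mutex_lock(&registry_lock);
	table_del(pending, name);
	pthread_mutex_unlock(&registry_lock);
}
```

In this model, a loader would call reserve_name() first, perform the large allocation only on success, and then call either publish_name() or abort_name(). Concurrent requests for the same name are turned away before they allocate anything, which is the property the patch is after.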
Thanks to Uladzislau for his help to verify that it's not actually a
vmap code issue.
Reported-by: Lin Liu <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
kernel/module/internal.h | 2 +
kernel/module/main.c | 122 +++++++++++++++++++++++++++------------
2 files changed, 88 insertions(+), 36 deletions(-)
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 680d980a4fb2..9d5cc9b1d56a 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -76,6 +76,8 @@ struct load_info {
struct {
unsigned int sym, str, mod, vers, info, pcpu;
} index;
+
+ struct list_head next;
};
enum mod_license {
diff --git a/kernel/module/main.c b/kernel/module/main.c
index a4e4d84b6f4e..b473228136eb 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -65,10 +65,20 @@
* 2) module_use links,
* 3) mod_tree.addr_min/mod_tree.addr_max.
* (delete and add uses RCU list operations).
+ *
+ * 4) List of pending load infos
*/
DEFINE_MUTEX(module_mutex);
LIST_HEAD(modules);
+/*
+ * Modules (via load_info) that are currently being loaded but cannot be added
+ * to the module list yet are kept in a separate list. This list, combined with
+ * the module list makes sure that modules are unique: a module name has to be
+ * unique across both lists, protected by the module_mutex.
+ */
+LIST_HEAD(pending_load_infos);
+
/* Work queue for freeing init sections in success case */
static void do_free_init(struct work_struct *w);
static DECLARE_WORK(init_free_wq, do_free_init);
@@ -762,7 +772,7 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints));
free_module(mod);
- /* someone could wait for the module in add_unformed_module() */
+ /* someone could wait for the module in add_pending_load_info() */
wake_up_all(&module_wq);
return 0;
out:
@@ -2374,6 +2384,16 @@ static int post_relocation(struct module *mod, const struct load_info *info)
return module_finalize(info->hdr, info->sechdrs, mod);
}
+static bool __is_pending_load_info_name(const char *name)
+{
+ struct load_info *info;
+
+ list_for_each_entry(info, &pending_load_infos, next)
+ if (!strcmp(info->name, name))
+ return true;
+ return false;
+}
+
/* Is this module of this name done loading? No locks held. */
static bool finished_loading(const char *name)
{
@@ -2388,7 +2408,11 @@ static bool finished_loading(const char *name)
sched_annotate_sleep();
mutex_lock(&module_mutex);
mod = find_module_all(name, strlen(name), true);
- ret = !mod || mod->state == MODULE_STATE_LIVE;
+ if (!mod)
+ /* It might still be in the early process of loading. */
+ ret = !__is_pending_load_info_name(name);
+ else
+ ret = mod->state == MODULE_STATE_LIVE;
mutex_unlock(&module_mutex);
return ret;
@@ -2552,43 +2576,58 @@ static int may_init_module(void)
return 0;
}
-/*
- * We try to place it in the list now to make sure it's unique before
- * we dedicate too many resources. In particular, temporary percpu
- * memory exhaustion.
- */
-static int add_unformed_module(struct module *mod)
+static int add_pending_load_info(struct load_info *info)
{
+ struct module *mod;
int err;
- struct module *old;
-
- mod->state = MODULE_STATE_UNFORMED;
-again:
- mutex_lock(&module_mutex);
- old = find_module_all(mod->name, strlen(mod->name), true);
- if (old != NULL) {
- if (old->state != MODULE_STATE_LIVE) {
- /* Wait in case it fails to load. */
+ while (true) {
+ mutex_lock(&module_mutex);
+ mod = find_module_all(info->name, strlen(info->name), true);
+ if (!mod && !__is_pending_load_info_name(info->name))
+ break;
+ if (mod && mod->state == MODULE_STATE_LIVE) {
mutex_unlock(&module_mutex);
- err = wait_event_interruptible(module_wq,
- finished_loading(mod->name));
- if (err)
- goto out_unlocked;
- goto again;
+ return -EEXIST;
}
- err = -EEXIST;
- goto out;
+
+ /*
+ * The module is in some phase of getting loaded/unloaded;
+ * wait and retry.
+ */
+ mutex_unlock(&module_mutex);
+ err = wait_event_interruptible(module_wq,
+ finished_loading(info->name));
+ if (err)
+ return err;
}
+
+ INIT_LIST_HEAD(&info->next);
+ list_add(&info->next, &pending_load_infos);
+ mutex_unlock(&module_mutex);
+ return 0;
+}
+
+static void remove_pending_load_info(struct load_info *info)
+{
+ mutex_lock(&module_mutex);
+ list_del(&info->next);
+ /* someone could wait for the module name in finished_loading(). */
+ wake_up_all(&module_wq);
+ mutex_unlock(&module_mutex);
+}
+
+static void add_unformed_module(struct load_info *info, struct module *mod)
+{
+ mod->state = MODULE_STATE_UNFORMED;
+
+ mutex_lock(&module_mutex);
mod_update_bounds(mod);
list_add_rcu(&mod->list, &modules);
+ /* The module is on the module list now. */
+ list_del(&info->next);
mod_tree_insert(mod);
- err = 0;
-
-out:
mutex_unlock(&module_mutex);
-out_unlocked:
- return err;
}
static int complete_formation(struct module *mod, struct load_info *info)
@@ -2720,12 +2759,24 @@ static int load_module(struct load_info *info, const char __user *uargs,
goto free_copy;
}
- err = rewrite_section_headers(info, flags);
+ /*
+ * We make sure the module name is unique before we dedicate too many
+ * resources. In particular, avoid temporary percpu memory and module
+ * vmap space exhaustion.
+ */
+ err = add_pending_load_info(info);
if (err)
goto free_copy;
+ err = rewrite_section_headers(info, flags);
+ if (err) {
+ remove_pending_load_info(info);
+ goto free_copy;
+ }
+
/* Check module struct version now, before we try to use module. */
if (!check_modstruct_version(info, info->mod)) {
+ remove_pending_load_info(info);
err = -ENOEXEC;
goto free_copy;
}
@@ -2739,10 +2790,11 @@ static int load_module(struct load_info *info, const char __user *uargs,
audit_log_kern_module(mod->name);
- /* Reserve our place in the list. */
- err = add_unformed_module(mod);
- if (err)
- goto free_module;
+ /*
+ * Add the module to the module list as unformed. This will remove the
+ * load_info from the pending load_info list.
+ */
+ add_unformed_module(info, mod);
#ifdef CONFIG_MODULE_SIG
mod->sig_ok = info->sig_ok;
@@ -2754,7 +2806,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
}
#endif
- /* To avoid stressing percpu allocator, do this once we're unique. */
err = percpu_modalloc(mod, info);
if (err)
goto unlink_mod;
@@ -2890,7 +2941,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Wait for RCU-sched synchronizing before releasing mod->list. */
synchronize_rcu();
mutex_unlock(&module_mutex);
- free_module:
/* Free lock-classes; relies on the preceding sync_rcu() */
lockdep_free_key_range(mod->data_layout.base, mod->data_layout.size);
--
2.37.3
Hi David,
I love your patch! Perhaps something to improve:
[auto build test WARNING on mcgrof/modules-next]
[also build test WARNING on linus/master v6.0 next-20221014]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/David-Hildenbrand/kernel-module-allocate-module-vmap-space-after-making-sure-the-module-is-unique/20221014-020756
base: https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git modules-next
config: x86_64-randconfig-s022
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/8b7bfecba8ad77c19c0c857314df6f8e675f6f61
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review David-Hildenbrand/kernel-module-allocate-module-vmap-space-after-making-sure-the-module-is-unique/20221014-020756
git checkout 8b7bfecba8ad77c19c0c857314df6f8e675f6f61
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/module/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
sparse warnings: (new ones prefixed by >>)
>> kernel/module/main.c:80:1: sparse: sparse: symbol 'pending_load_infos' was not declared. Should it be static?
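One common way to address this kind of sparse warning, assuming the list is only referenced from kernel/module/main.c (as appears to be the case in the diff above), is to give the symbol internal linkage. The sketch below uses minimal userspace stand-ins for the kernel's list macros so that it compiles on its own; the helper function is hypothetical. Whether to use static or to add a declaration to internal.h is the author's call.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userspace stand-ins for the kernel's <linux/list.h> definitions. */
struct list_head {
	struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

/* 'static' keeps the symbol file-local, which is what sparse asks about. */
static LIST_HEAD(pending_load_infos);

/* Hypothetical helper: true if no load_info is pending. */
bool pending_load_infos_empty(void)
{
	const struct list_head *h = &pending_load_infos;

	/* In a consistent list, next and prev point at the head together. */
	assert((h->next == h) == (h->prev == h));
	return h->next == h;
}
```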
--
0-DAY CI Kernel Test Service
https://01.org/lkp
Hi,
On Thu, 13 Oct 2022, David Hildenbrand wrote:
> We already make sure to allocate percpu data only after we verified that
> the module we're loading hasn't already been loaded and isn't
> concurrently getting loaded -- that it's unique.
>
> [...]
this seems to be related to what
https://lore.kernel.org/all/[email protected]/
tries to solve. Just your symptoms are different. Does the patch set fix
your issue too?
Regards
Miroslav
On 14.10.22 08:09, Miroslav Benes wrote:
> Hi,
>
> On Thu, 13 Oct 2022, David Hildenbrand wrote:
>
>> We already make sure to allocate percpu data only after we verified that
>> the module we're loading hasn't already been loaded and isn't
>> concurrently getting loaded -- that it's unique.
>>
>> [...]
>
> this seems to be related to what
> https://lore.kernel.org/all/[email protected]/
> tries to solve. Just your symptoms are different. Does the patch set fix
> your issue too?
Hi Miroslav,
the underlying approach with a load_info list is similar (which is nice
to see), so I assume it will similarly fix the issue.
I'm not sure if merging the requests (adding the refcount logic and the
-EBUSY change) is really required/wanted, though. Looks like some of
these changes might have been factored out into separate patches.
Not my call to make. I'll give the set a churn on the machine where I
can reproduce the issue.
--
Thanks,
David / dhildenb