From adfa22dd89640042c0df2d8906781ea74da9166c Mon Sep 17 00:00:00 2001
From: Harry Yoo <harry.yoo@oracle.com>
Date: Mon, 18 Aug 2025 11:02:04 +0900
Subject: mm: move page table sync declarations to linux/pgtable.h

During our internal testing, we started observing intermittent boot
failures when the machine uses 4-level paging and has a large amount of
persistent memory:

BUG: unable to handle page fault for address: ffffe70000000034
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
RIP: 0010:__init_single_page+0x9/0x6d
Call Trace:
 <TASK>
  __init_zone_device_page+0x17/0x5d
  memmap_init_zone_device+0x154/0x1bb
  pagemap_range+0x2e0/0x40f
  memremap_pages+0x10b/0x2f0
  devm_memremap_pages+0x1e/0x60
  dev_dax_probe+0xce/0x2ec [device_dax]
  dax_bus_probe+0x6d/0xc9
  [... snip ...]
 </TASK>

It turns out that the kernel panics while initializing vmemmap (struct
page array) when the vmemmap region spans two PGD entries, because the new
PGD entry is only installed in init_mm.pgd, but not in the page tables of
other tasks.

And looking at __populate_section_memmap():

  if (vmemmap_can_optimize(altmap, pgmap))
          // does not sync top level page tables
          r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
  else
          // syncs top level page tables on x86
          r = vmemmap_populate(start, end, nid, altmap);

In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
synchronizes the top level page table (see commit 9b861528a801 ("x86-64,
mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so
that all tasks in the system can see the new vmemmap area.

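Here, "synchronizing" means copying the newly installed top-level entries
from init_mm into every other page table in the system. On x86-64 this is
what sync_global_pgds() does; a simplified sketch of the idea (locking and
the 4-level p4d variant elided):

  /* Sketch: propagate new kernel top-level entries to all page tables. */
  static void sync_global_pgds_sketch(unsigned long start, unsigned long end)
  {
          unsigned long addr;

          for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
                  pgd_t *pgd_ref = pgd_offset_k(addr);    /* init_mm's entry */
                  struct page *page;

                  if (pgd_none(*pgd_ref))
                          continue;

                  /* copy the entry into every task's page table */
                  list_for_each_entry(page, &pgd_list, lru) {
                          pgd_t *pgd = (pgd_t *)page_address(page) + pgd_index(addr);

                          if (pgd_none(*pgd))
                                  set_pgd(pgd, *pgd_ref);
                  }
          }
  }
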
However, when vmemmap_can_optimize() returns true, the optimized path
skips synchronization of top-level page tables. This is because
vmemmap_populate_compound_pages() is implemented in core MM code, which
does not handle synchronization of the top-level page tables. Instead,
the core MM has historically relied on each architecture to perform this
synchronization manually.

We're not the first party to encounter a crash caused by unsynchronized
top-level page tables: earlier this year, Gwan-gyeong Mun attempted to
address the issue [1] [2] after hitting a kernel panic when x86 code
accessed the vmemmap area before the corresponding top-level entries were
synced. At that time, the issue was believed to be triggered only when
struct page was enlarged for debugging purposes, and the patch did not
get further updates.

It turns out that the current approach of relying on each arch to handle
the page table sync manually is fragile because 1) it's easy to forget to
sync the top level page table, and 2) it's also easy to overlook that the
kernel should not access the vmemmap and direct mapping areas before the
sync.

To address this, Dave Hansen suggested [3] [4] introducing
{pgd,p4d}_populate_kernel() for updating the kernel portion of the page
tables, allowing each architecture to explicitly perform synchronization
when installing top-level entries. With this approach, we no longer need
to worry about missing the sync step, reducing the risk of future
regressions.

The new interface reuses the existing ARCH_PAGE_TABLE_SYNC_MASK,
PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by
vmalloc and ioremap to synchronize page tables.

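For example, x86-64 already wires this facility up roughly as follows
(simplified; the mask depends on whether 5-level paging is enabled, since
that decides which level is the top):

  /* arch/x86/include/asm/pgtable_64_types.h (simplified) */
  #define ARCH_PAGE_TABLE_SYNC_MASK \
          (pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)

  /* arch/x86/mm/init_64.c */
  void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
  {
          sync_global_pgds(start, end);
  }
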
pgd_populate_kernel() looks like this:

  static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd,
                                         p4d_t *p4d)
  {
          pgd_populate(&init_mm, pgd, p4d);
          if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
                  arch_sync_kernel_mappings(addr, addr);
  }

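The p4d-level helper is expected to mirror this one level down; a sketch,
assuming it follows the same pattern as the pgd variant above:

  static inline void p4d_populate_kernel(unsigned long addr, p4d_t *p4d,
                                         pud_t *pud)
  {
          p4d_populate(&init_mm, p4d, pud);
          if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_P4D_MODIFIED)
                  arch_sync_kernel_mappings(addr, addr);
  }
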
It is worth noting that vmalloc() and apply_to_range() already
synchronize page tables carefully by calling p*d_alloc_track() and
arch_sync_kernel_mappings(), and thus they are not affected by this patch
series.

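That pattern has roughly the following shape (a condensed sketch of the
allocation paths in mm/vmalloc.c and mm/memory.c, not the verbatim code):

  /* Sketch: track modified levels, then sync once at the end. */
  static int populate_range_sketch(unsigned long start, unsigned long end)
  {
          pgtbl_mod_mask mask = 0;
          pgd_t *pgd = pgd_offset_k(start);
          p4d_t *p4d;

          /* p4d_alloc_track() sets PGTBL_PGD_MODIFIED in mask if it has
           * to install a new top-level entry */
          p4d = p4d_alloc_track(&init_mm, pgd, start, &mask);
          if (!p4d)
                  return -ENOMEM;

          /* ... allocate and populate the lower levels similarly ... */

          if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
                  arch_sync_kernel_mappings(start, end);
          return 0;
  }
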
This series was hugely inspired by Dave Hansen's suggestion and hence
added Suggested-by: Dave Hansen.

Cc stable because the lack of this series opens the door to intermittent
boot failures.

This patch (of 3):

Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to
linux/pgtable.h so that they can be used outside of vmalloc and ioremap.

Link: https://lkml.kernel.org/r/20250818020206.4517-1-harry.yoo@oracle.com
Link: https://lkml.kernel.org/r/20250818020206.4517-2-harry.yoo@oracle.com
Link: https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com [1]
Link: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com [2]
Link: https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com [3]
Link: https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com [4]
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: bibo mao <maobibo@loongson.cn>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/pgtable.h | 17 +++++++++++++++++
 include/linux/vmalloc.h | 16 ----------------
 2 files changed, 17 insertions(+), 16 deletions(-)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1329,6 +1329,23 @@ static inline void ptep_modify_prot_comm
 	__ptep_modify_prot_commit(vma, addr, ptep, pte);
 }
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+/*
+ * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
+ * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
+ * needs to be called.
+ */
+#ifndef ARCH_PAGE_TABLE_SYNC_MASK
+#define ARCH_PAGE_TABLE_SYNC_MASK 0
+#endif
+
+/*
+ * There is no default implementation for arch_sync_kernel_mappings(). It is
+ * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK
+ * is 0.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
+
 #endif /* CONFIG_MMU */
 
 /*
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -220,22 +220,6 @@ int vmap_pages_range(unsigned long addr,
 		     struct page **pages, unsigned int page_shift);
 
 /*
- * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
- * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
- * needs to be called.
- */
-#ifndef ARCH_PAGE_TABLE_SYNC_MASK
-#define ARCH_PAGE_TABLE_SYNC_MASK 0
-#endif
-
-/*
- * There is no default implementation for arch_sync_kernel_mappings(). It is
- * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK
- * is 0.
- */
-void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
-
-/*
  * Lowlevel-APIs (not for driver use!)
  */