From 5d833062139d290adb8b62c093b654a01a353448 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Thu, 12 Feb 2015 14:58:16 -0800
Subject: mm: numa: do not dereference pmd outside of the lock during NUMA
 hinting fault

Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information.  Very broadly speaking
it would mark PTEs as not present and use another bit to distinguish
between NUMA hinting faults and other types of faults.  It was
universally loved by everybody and caused no problems whatsoever.  That
last sentence might be a lie.

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections.  I did alter and add parts of it but I consider them
relatively minor contributions.  At their suggestion, acked-bys are in
there but I've no problem converting them to Signed-off-by if requested.

AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
for that.  I tested trinity under kvm-tool and passed and ran a few other
basic tests.  At the time of writing, only the short-lived tests have
completed but testing of V2 indicated that long-term testing had no
surprises.  In most cases I'm leaving out detail as it's not that
interesting.

specjbb single JVM: There was negligible performance difference in the
	benchmark itself for short runs.  However, system activity is
	higher and interrupts are much higher over time -- possibly TLB
	flushes.  Migrations are also higher.  Overall, this is more
	overhead but considering the problems faced with the old approach
	I think we just have to suck it up and find another way of
	reducing the overhead.

specjbb multi JVM: Negligible performance difference to the actual
	benchmark but like the single JVM case, the system overhead is
	noticeably higher.  Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
	reasonably concluded is that it's different but not necessarily
	better or worse.

autonumabench
                                      3.18.0-rc5            3.18.0-rc5
                                  mmotm-20141119         protnone-v3r3
User    NUMA01             32380.24 (  0.00%)    21642.92 ( 33.16%)
User    NUMA01_THEADLOCAL  22481.02 (  0.00%)    22283.22 (  0.88%)
User    NUMA02              3137.00 (  0.00%)     3116.54 (  0.65%)
User    NUMA02_SMT          1614.03 (  0.00%)     1543.53 (  4.37%)
System  NUMA01               322.97 (  0.00%)     1465.89 (-353.88%)
System  NUMA01_THEADLOCAL     91.87 (  0.00%)       49.32 ( 46.32%)
System  NUMA02                37.83 (  0.00%)       14.61 ( 61.38%)
System  NUMA02_SMT             7.36 (  0.00%)        7.45 ( -1.22%)
Elapsed NUMA01               716.63 (  0.00%)      599.29 ( 16.37%)
Elapsed NUMA01_THEADLOCAL    553.98 (  0.00%)      539.94 (  2.53%)
Elapsed NUMA02                83.85 (  0.00%)       83.04 (  0.97%)
Elapsed NUMA02_SMT            86.57 (  0.00%)       79.15 (  8.57%)
CPU     NUMA01              4563.00 (  0.00%)     3855.00 ( 15.52%)
CPU     NUMA01_THEADLOCAL   4074.00 (  0.00%)     4136.00 ( -1.52%)
CPU     NUMA02              3785.00 (  0.00%)     3770.00 (  0.40%)
CPU     NUMA02_SMT          1872.00 (  0.00%)     1959.00 ( -4.65%)

System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters.
On the other workloads that are sensible on this machine, system CPU
usage is great.
Overall time to complete the benchmark is comparable

                              3.18.0-rc5       3.18.0-rc5
                          mmotm-20141119    protnone-v3r3
User                         59612.50         48586.44
System                         460.22          1537.45
Elapsed                       1442.20          1304.29

NUMA alloc hit                5075182          5743353
NUMA alloc miss                     0                0
NUMA interleave hit                 0                0
NUMA alloc local              5075174          5743339
NUMA base PTE updates       637061448        443106883
NUMA huge PMD updates         1243434           864747
NUMA page range updates    1273699656        885857347
NUMA hint faults              1658116          1214277
NUMA hint local faults         959487           754113
NUMA hint local percent            57               62
NUMA pages migrated           5467056         61676398

The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I see that the massive spike in migration activity was
during NUMA01.  This correlates with high system CPU usage and could be
simply down to bad luck but any modifications that affect that workload
would be related to scan rates and migrations, not the protection
mechanism.  For all other workloads, migration activity was comparable.

Overall, headline performance figures are comparable but the overhead is
higher, mostly in interrupts.  To some extent, higher overhead from this
approach was anticipated but not to this degree.  It's going to be
necessary to reduce this again with a separate series in the future.
It's still worth going ahead with this series though as it's likely to
avoid constant headaches with Xen and is probably easier to maintain.

This patch (of 10):

A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes.  The check is race-prone because the pmd
is dereferenced outside of the page lock and while the race is tiny,
it'll be larger if the PMD is cleared while marking PMDs for hinting
fault.  This patch closes the race.

Signed-off-by: Mel Gorman
Cc: Aneesh Kumar K.V
Cc: Benjamin Herrenschmidt
Cc: Dave Jones
Cc: Hugh Dickins
Cc: Ingo Molnar
Cc: Kirill Shutemov
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Rik van Riel
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/huge_memory.c | 3 ++-
 mm/migrate.c     | 6 ------
 2 files changed, 2 insertions(+), 7 deletions(-)

(limited to 'mm')

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cb7be110cad3..c6921362c5fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1272,8 +1272,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * check_same as the page may no longer be mapped.
 	 */
 	if (unlikely(pmd_trans_migrating(*pmdp))) {
+		page = pmd_page(*pmdp);
 		spin_unlock(ptl);
-		wait_migrate_huge_page(vma->anon_vma, pmdp);
+		wait_on_page_locked(page);
 		goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index f98067e5d353..5e8f03a8de2a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1654,12 +1654,6 @@ bool pmd_trans_migrating(pmd_t pmd)
 	return PageLocked(page);
 }
 
-void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
-{
-	struct page *page = pmd_page(*pmd);
-	wait_on_page_locked(page);
-}
-
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node.  Caller is expected to have an elevated reference count on
--
cgit v1.2.3


From 8a0516ed8b90c95ffa1363b420caa37418149f21 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Thu, 12 Feb 2015 14:58:22 -0800
Subject: mm: convert p[te|md]_numa users to p[te|md]_protnone_numa

Convert existing users of pte_numa and friends to the new helper.  Note
that the kernel is broken after this patch is applied until the other
page table modifiers are also altered.  This patch layout is to make
review easier.
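To make the conversion concrete, here is a minimal stand-alone C model of
what the new helper asks of an entry: instead of testing a dedicated NUMA
bit, it asks whether the entry is present but carries no access
permissions.  The flag names and values below are illustrative stand-ins,
not the kernel's architecture-specific definitions.

    /* Toy user-space model of the protnone check; flags are illustrative. */
    #include <stdbool.h>
    #include <stdio.h>

    #define _PAGE_PRESENT 0x1UL
    #define _PAGE_RW      0x2UL
    #define _PAGE_USER    0x4UL

    typedef struct { unsigned long val; } pte_t;

    /* Present but with all access permissions stripped => NUMA hinting entry. */
    static bool pte_protnone(pte_t pte)
    {
            return (pte.val & _PAGE_PRESENT) &&
                   !(pte.val & (_PAGE_RW | _PAGE_USER));
    }

    int main(void)
    {
            pte_t hinting = { .val = _PAGE_PRESENT };
            pte_t normal  = { .val = _PAGE_PRESENT | _PAGE_USER | _PAGE_RW };

            printf("hinting entry looks protnone: %d\n", pte_protnone(hinting)); /* 1 */
            printf("normal entry looks protnone:  %d\n", pte_protnone(normal));  /* 0 */
            return 0;
    }

The fault path then only has to distinguish a hinting fault from a real
PROT_NONE mapping by looking at the VMA, which is what the later patch in
the series adds a check for.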
Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar Acked-by: Benjamin Herrenschmidt Tested-by: Sasha Levin Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- arch/powerpc/mm/fault.c | 5 ----- arch/powerpc/mm/pgtable.c | 11 ++++++++--- arch/powerpc/mm/pgtable_64.c | 3 ++- arch/x86/mm/gup.c | 4 ++-- include/uapi/linux/mempolicy.h | 2 +- mm/gup.c | 10 +++++----- mm/huge_memory.c | 16 ++++++++-------- mm/memory.c | 4 ++-- mm/mprotect.c | 38 ++++++++++--------------------------- mm/pgtable-generic.c | 2 +- 11 files changed, 40 insertions(+), 57 deletions(-) (limited to 'mm') diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 510bdfbc4073..625407e4d3b0 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -212,7 +212,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, /* Look up the Linux PTE for the backing page */ pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte) && !pte_numa(pte)) { + if (pte_present(pte) && !pte_protnone(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 6154b0a2b063..f38327b95f76 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -398,8 +398,6 @@ good_area: * processors use the same I/D cache coherency mechanism * as embedded. */ - if (error_code & DSISR_PROTFAULT) - goto bad_area; #endif /* CONFIG_PPC_STD_MMU */ /* @@ -423,9 +421,6 @@ good_area: flags |= FAULT_FLAG_WRITE; /* a read */ } else { - /* protection fault */ - if (error_code & 0x08000000) - goto bad_area; if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; } diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index c90e602677c9..83dfcb55ffef 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -172,9 +172,14 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma, void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { -#ifdef CONFIG_DEBUG_VM - WARN_ON(pte_val(*ptep) & _PAGE_PRESENT); -#endif + /* + * When handling numa faults, we already have the pte marked + * _PAGE_PRESENT, but we can be sure that it is not in hpte. + * Hence we can use set_pte_at for them. + */ + VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); + /* Note: mm->context.id might not yet have been assigned as * this context might not have been activated yet when this * is called. 
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 4fe5f64cc179..91bb8836825a 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -718,7 +718,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { #ifdef CONFIG_DEBUG_VM - WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT); + WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); assert_spin_locked(&mm->page_table_lock); WARN_ON(!pmd_trans_huge(pmd)); #endif diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index 89df70e0caa6..81bf3d2af3eb 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, struct page *page; /* Similar to the PMD case, NUMA hinting must take slow path */ - if (pte_numa(pte)) { + if (pte_protnone(pte)) { pte_unmap(ptep); return 0; } @@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, * slowpath for accounting purposes and so that they * can be serialised against THP migration. */ - if (pmd_numa(pmd)) + if (pmd_protnone(pmd)) return 0; if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) return 0; diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 0d11c3dcd3a1..9cd8b21dddbe 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -67,7 +67,7 @@ enum mpol_rebind_step { #define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */ #define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */ #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ -#define MPOL_F_MORON (1 << 4) /* Migrate On pte_numa Reference On Node */ +#define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */ #endif /* _UAPI_LINUX_MEMPOLICY_H */ diff --git a/mm/gup.c b/mm/gup.c index c2da1163986a..51bf0b06ca7b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -64,7 +64,7 @@ retry: migration_entry_wait(mm, pmd, address); goto retry; } - if ((flags & FOLL_NUMA) && pte_numa(pte)) + if ((flags & FOLL_NUMA) && pte_protnone(pte)) goto no_page; if ((flags & FOLL_WRITE) && !pte_write(pte)) { pte_unmap_unlock(ptep, ptl); @@ -184,7 +184,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } - if ((flags & FOLL_NUMA) && pmd_numa(*pmd)) + if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) return no_page_table(vma, flags); if (pmd_trans_huge(*pmd)) { if (flags & FOLL_SPLIT) { @@ -906,10 +906,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, /* * Similar to the PMD case below, NUMA hinting must take slow - * path + * path using the pte_protnone check. */ if (!pte_present(pte) || pte_special(pte) || - pte_numa(pte) || (write && !pte_write(pte))) + pte_protnone(pte) || (write && !pte_write(pte))) goto pte_unmap; VM_BUG_ON(!pfn_valid(pte_pfn(pte))); @@ -1104,7 +1104,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, * slowpath for accounting purposes and so that they * can be serialised against THP migration. 
*/ - if (pmd_numa(pmd)) + if (pmd_protnone(pmd)) return 0; if (!gup_huge_pmd(pmd, pmdp, addr, next, write, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c6921362c5fc..915941c45169 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1211,7 +1211,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, return ERR_PTR(-EFAULT); /* Full NUMA hinting faults to serialise migration in fault paths */ - if ((flags & FOLL_NUMA) && pmd_numa(*pmd)) + if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) goto out; page = pmd_page(*pmd); @@ -1342,7 +1342,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, /* * Migrate the THP to the requested node, returns with page unlocked - * and pmd_numa cleared. + * and access rights restored. */ spin_unlock(ptl); migrated = migrate_misplaced_transhuge_page(mm, vma, @@ -1357,7 +1357,7 @@ clear_pmdnuma: BUG_ON(!PageLocked(page)); pmd = pmd_mknonnuma(pmd); set_pmd_at(mm, haddr, pmdp, pmd); - VM_BUG_ON(pmd_numa(*pmdp)); + VM_BUG_ON(pmd_protnone(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); out_unlock: @@ -1483,7 +1483,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, ret = 1; if (!prot_numa) { entry = pmdp_get_and_clear_notify(mm, addr, pmd); - if (pmd_numa(entry)) + if (pmd_protnone(entry)) entry = pmd_mknonnuma(entry); entry = pmd_modify(entry, newprot); ret = HPAGE_PMD_NR; @@ -1499,7 +1499,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, * local vs remote hits on the zero page. */ if (!is_huge_zero_page(page) && - !pmd_numa(*pmd)) { + !pmd_protnone(*pmd)) { pmdp_set_numa(mm, addr, pmd); ret = HPAGE_PMD_NR; } @@ -1767,9 +1767,9 @@ static int __split_huge_page_map(struct page *page, pte_t *pte, entry; BUG_ON(PageCompound(page+i)); /* - * Note that pmd_numa is not transferred deliberately - * to avoid any possibility that pte_numa leaks to - * a PROT_NONE VMA by accident. + * Note that NUMA hinting access restrictions are not + * transferred to avoid any possibility of altering + * permissions across VMAs. */ entry = mk_pte(page + i, vma->vm_page_prot); entry = maybe_mkwrite(pte_mkdirty(entry), vma); diff --git a/mm/memory.c b/mm/memory.c index bbe6a73a899d..92e6a6299e86 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3124,7 +3124,7 @@ static int handle_pte_fault(struct mm_struct *mm, pte, pmd, flags, entry); } - if (pte_numa(entry)) + if (pte_protnone(entry)) return do_numa_page(mm, vma, address, entry, pte, pmd); ptl = pte_lockptr(mm, pmd); @@ -3202,7 +3202,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (pmd_trans_splitting(orig_pmd)) return 0; - if (pmd_numa(orig_pmd)) + if (pmd_protnone(orig_pmd)) return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd); diff --git a/mm/mprotect.c b/mm/mprotect.c index 33121662f08b..44ffa698484d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -75,36 +75,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, oldpte = *pte; if (pte_present(oldpte)) { pte_t ptent; - bool updated = false; - if (!prot_numa) { - ptent = ptep_modify_prot_start(mm, addr, pte); - if (pte_numa(ptent)) - ptent = pte_mknonnuma(ptent); - ptent = pte_modify(ptent, newprot); - /* - * Avoid taking write faults for pages we - * know to be dirty. 
- */ - if (dirty_accountable && pte_dirty(ptent) && - (pte_soft_dirty(ptent) || - !(vma->vm_flags & VM_SOFTDIRTY))) - ptent = pte_mkwrite(ptent); - ptep_modify_prot_commit(mm, addr, pte, ptent); - updated = true; - } else { - struct page *page; - - page = vm_normal_page(vma, addr, oldpte); - if (page && !PageKsm(page)) { - if (!pte_numa(oldpte)) { - ptep_set_numa(mm, addr, pte); - updated = true; - } - } + ptent = ptep_modify_prot_start(mm, addr, pte); + ptent = pte_modify(ptent, newprot); + + /* Avoid taking write faults for known dirty pages */ + if (dirty_accountable && pte_dirty(ptent) && + (pte_soft_dirty(ptent) || + !(vma->vm_flags & VM_SOFTDIRTY))) { + ptent = pte_mkwrite(ptent); } - if (updated) - pages++; + ptep_modify_prot_commit(mm, addr, pte, ptent); + pages++; } else if (IS_ENABLED(CONFIG_MIGRATION)) { swp_entry_t entry = pte_to_swp_entry(oldpte); diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index dfb79e028ecb..4b8ad760dde3 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -193,7 +193,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { pmd_t entry = *pmdp; - if (pmd_numa(entry)) + if (pmd_protnone(entry)) entry = pmd_mknonnuma(entry); set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry)); flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); -- cgit v1.2.3 From 4d9424669946532be754a6e116618dcb58430cb4 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:28 -0800 Subject: mm: convert p[te|md]_mknonnuma and remaining page table manipulations With PROT_NONE, the traditional page table manipulation functions are sufficient. [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()] [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS] Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar Tested-by: Sasha Levin Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Paul Mackerras Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm/include/asm/pgtable-3level.h | 5 ++++- include/linux/huge_mm.h | 3 +-- mm/huge_memory.c | 33 +++++++-------------------------- mm/memory.c | 10 ++++++---- mm/mempolicy.c | 2 +- mm/migrate.c | 2 +- mm/mprotect.c | 2 +- mm/pgtable-generic.c | 2 -- 8 files changed, 21 insertions(+), 38 deletions(-) (limited to 'mm') diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 18dbc82f85e5..423a5ac09d3a 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -257,7 +257,10 @@ PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF); #define mk_pmd(page,prot) pfn_pmd(page_to_pfn(page),prot) /* represent a notpresent pmd by zero, this is used by pmdp_invalidate */ -#define pmd_mknotpresent(pmd) (__pmd(0)) +static inline pmd_t pmd_mknotpresent(pmd_t pmd) +{ + return __pmd(0); +} static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) { diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f10b20f05159..062bd252e994 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, - int prot_numa); + unsigned long addr, pgprot_t newprot); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git 
a/mm/huge_memory.c b/mm/huge_memory.c index 915941c45169..cb9b3e847dac 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1355,9 +1355,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; clear_pmdnuma: BUG_ON(!PageLocked(page)); - pmd = pmd_mknonnuma(pmd); + pmd = pmd_modify(pmd, vma->vm_page_prot); set_pmd_at(mm, haddr, pmdp, pmd); - VM_BUG_ON(pmd_protnone(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); out_unlock: @@ -1472,7 +1471,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, int prot_numa) + unsigned long addr, pgprot_t newprot) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1481,29 +1480,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; ret = 1; - if (!prot_numa) { - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - if (pmd_protnone(entry)) - entry = pmd_mknonnuma(entry); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); - } else { - struct page *page = pmd_page(*pmd); - - /* - * Do not trap faults against the zero page. The - * read-only data is likely to be read-cached on the - * local CPU cache and it is less useful to know about - * local vs remote hits on the zero page. - */ - if (!is_huge_zero_page(page) && - !pmd_protnone(*pmd)) { - pmdp_set_numa(mm, addr, pmd); - ret = HPAGE_PMD_NR; - } - } + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); spin_unlock(ptl); } diff --git a/mm/memory.c b/mm/memory.c index 92e6a6299e86..d7921760cf79 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3018,9 +3018,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * validation through pte_unmap_same(). It's of NUMA type but * the pfn may be screwed if the read is non atomic. * - * ptep_modify_prot_start is not called as this is clearing - * the _PAGE_NUMA bit and it is not really expected that there - * would be concurrent hardware modifications to the PTE. + * We can safely just do a "set_pte_at()", because the old + * page table entry is not accessible, so there would be no + * concurrent hardware modifications to the PTE. 
*/ ptl = pte_lockptr(mm, pmd); spin_lock(ptl); @@ -3029,7 +3029,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; } - pte = pte_mknonnuma(pte); + /* Make it present again */ + pte = pte_modify(pte, vma->vm_page_prot); + pte = pte_mkyoung(pte); set_pte_at(mm, addr, ptep, pte); update_mmu_cache(vma, addr, ptep); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index f1bd23803576..c75f4dcec808 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -569,7 +569,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma, { int nr_updated; - nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1); + nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1); if (nr_updated) count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated); diff --git a/mm/migrate.c b/mm/migrate.c index 5e8f03a8de2a..85e042686031 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1847,7 +1847,7 @@ out_fail: out_dropref: ptl = pmd_lock(mm, pmd); if (pmd_same(*pmd, entry)) { - entry = pmd_mknonnuma(entry); + entry = pmd_modify(entry, vma->vm_page_prot); set_pmd_at(mm, mmun_start, pmd, entry); update_mmu_cache_pmd(vma, address, &entry); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 44ffa698484d..76824d73380d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -142,7 +142,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot, prot_numa); + newprot); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 4b8ad760dde3..c25f94b33811 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -193,8 +193,6 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { pmd_t entry = *pmdp; - if (pmd_protnone(entry)) - entry = pmd_mknonnuma(entry); set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry)); flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); } -- cgit v1.2.3 From e944fd67b625c02bda4a78ddf85e413c5e401474 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:35 -0800 Subject: mm: numa: do not trap faults on the huge zero page Faults on the huge zero page are pointless and there is a BUG_ON to catch them during fault time. This patch reintroduces a check that avoids marking the zero page PAGE_NONE. 
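The filter being reintroduced is conceptually simple: during a prot_numa
protection change, skip entries whose backing page can never yield useful
placement information.  A stand-alone sketch of that decision follows;
the predicate name is a hypothetical stand-in for the real
is_huge_zero_pmd()/PageKsm() checks shown in the hunks below.

    /* Toy model of the scan-time filter; not kernel code. */
    static int should_trap_hinting_fault(int prot_numa, int backs_zero_or_ksm_page)
    {
            if (prot_numa && backs_zero_or_ksm_page)
                    return 0;       /* a fault here says nothing about locality */
            return 1;               /* otherwise make the entry PROT_NONE */
    }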
Signed-off-by: Mel Gorman Cc: Aneesh Kumar K.V Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 3 ++- mm/huge_memory.c | 13 ++++++++++++- mm/memory.c | 1 - mm/mprotect.c | 14 +++++++++++++- 4 files changed, 27 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 062bd252e994..f10b20f05159 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot); + unsigned long addr, pgprot_t newprot, + int prot_numa); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cb9b3e847dac..8e791a3db6b6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1471,7 +1471,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot) + unsigned long addr, pgprot_t newprot, int prot_numa) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1479,6 +1479,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + + /* + * Avoid trapping faults against the zero page. The read-only + * data is likely to be read-cached on the local CPU and + * local/remote hits to the zero page are not interesting. + */ + if (prot_numa && is_huge_zero_pmd(*pmd)) { + spin_unlock(ptl); + return 0; + } + ret = 1; entry = pmdp_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); diff --git a/mm/memory.c b/mm/memory.c index d7921760cf79..bf244f56b05a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3040,7 +3040,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_unmap_unlock(ptep, ptl); return 0; } - BUG_ON(is_zero_pfn(page_to_pfn(page))); /* * Avoid grouping on DSO/COW pages in specific and RO pages diff --git a/mm/mprotect.c b/mm/mprotect.c index 76824d73380d..dd599fc235c2 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -76,6 +76,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (pte_present(oldpte)) { pte_t ptent; + /* + * Avoid trapping faults against the zero or KSM + * pages. See similar comment in change_huge_pmd. + */ + if (prot_numa) { + struct page *page; + + page = vm_normal_page(vma, addr, oldpte); + if (!page || PageKsm(page)) + continue; + } + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); @@ -142,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot); + newprot, prot_numa); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { -- cgit v1.2.3 From c0e7cad9f2390087b53e26e7b98958d8793ee02d Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:41 -0800 Subject: mm: numa: add paranoid check around pte_protnone_numa pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are complete. 
Treating a real PROT_NONE PTE as a NUMA hinting fault is going to result
in strangeness so add a check for it.  BUG_ON looks like overkill but if
this is hit then it's a serious bug that could result in corruption so do
not even try recovering.  It would have been more comprehensive to check
VMA flags in pte_protnone_numa but it would have made the API ugly just
for a debugging check.

Signed-off-by: Mel Gorman
Cc: Aneesh Kumar K.V
Cc: Benjamin Herrenschmidt
Cc: Dave Jones
Cc: Hugh Dickins
Cc: Ingo Molnar
Cc: Kirill Shutemov
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Rik van Riel
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/huge_memory.c | 3 +++
 mm/memory.c      | 3 +++
 2 files changed, 6 insertions(+)

(limited to 'mm')

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e791a3db6b6..8e07342b52c0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1262,6 +1262,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool migrated = false;
 	int flags = 0;
 
+	/* A PROT_NONE fault should not end up here */
+	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
 	ptl = pmd_lock(mm, pmdp);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index bf244f56b05a..f7886ab036e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3013,6 +3013,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool migrated = false;
 	int flags = 0;
 
+	/* A PROT_NONE fault should not end up here */
+	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
--
cgit v1.2.3


From 10c1045f28e86ac90589a188f0be2d7a4347efdf Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Thu, 12 Feb 2015 14:58:44 -0800
Subject: mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting
 entries

If a PTE or PMD is already marked NUMA when scanning to mark entries for
NUMA hinting then it is not necessary to update the entry and incur a TLB
flush penalty.  Avoid the overhead where possible.
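In sketch form, the idea is to test the existing entry before going
through the clear/modify/set sequence that forces the flush.  The snippet
below reuses the toy pte_t model from the earlier sketch and is only an
illustration of the ordering, not the kernel's change_pte_range().

    /* Toy model: only touch (and "flush") entries that actually change. */
    static int nr_updates;      /* stands in for the TLB flush cost */

    static void scan_entry(pte_t *pte, int prot_numa)
    {
            if (prot_numa && pte_protnone(*pte))
                    return;     /* already trapping hinting faults, nothing to do */

            pte->val &= ~(_PAGE_RW | _PAGE_USER);   /* strip permissions */
            nr_updates++;
    }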
Signed-off-by: Mel Gorman Cc: Aneesh Kumar K.V Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 14 ++++++++------ mm/mprotect.c | 4 ++++ 2 files changed, 12 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8e07342b52c0..fc00c8cb5a82 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1493,12 +1493,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, return 0; } - ret = 1; - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); + if (!prot_numa || !pmd_protnone(*pmd)) { + ret = 1; + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); + } spin_unlock(ptl); } diff --git a/mm/mprotect.c b/mm/mprotect.c index dd599fc235c2..44727811bf4c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, page = vm_normal_page(vma, addr, oldpte); if (!page || PageKsm(page)) continue; + + /* Avoid TLB flush if possible */ + if (pte_protnone(oldpte)) + continue; } ptent = ptep_modify_prot_start(mm, addr, pte); -- cgit v1.2.3 From 503c358cf1925853195ee39ec437e51138bbb7df Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:58:47 -0800 Subject: list_lru: introduce list_lru_shrink_{count,walk} Kmem accounting of memcg is unusable now, because it lacks slab shrinker support. That means when we hit the limit we will get ENOMEM w/o any chance to recover. What we should do then is to call shrink_slab, which would reclaim old inode/dentry caches from this cgroup. This is what this patch set is intended to do. Basically, it does two things. First, it introduces the notion of per-memcg slab shrinker. A shrinker that wants to reclaim objects per cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be passed the memory cgroup to scan from in shrink_control->memcg. For such shrinkers shrink_slab iterates over the whole cgroup subtree under the target cgroup and calls the shrinker for each kmem-active memory cgroup. Secondly, this patch set makes the list_lru structure per-memcg. It's done transparently to list_lru users - everything they have to do is to tell list_lru_init that they want memcg-aware list_lru. Then the list_lru will automatically distribute objects among per-memcg lists basing on which cgroup the object is accounted to. This way to make FS shrinkers (icache, dcache) memcg-aware we only need to make them use memcg-aware list_lru, and this is what this patch set does. As before, this patch set only enables per-memcg kmem reclaim when the pressure goes from memory.limit, not from memory.kmem.limit. Handling memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and it is still unclear whether we will have this knob in the unified hierarchy. This patch (of 9): NUMA aware slab shrinkers use the list_lru structure to distribute objects coming from different NUMA nodes to different lists. 
Whenever such a shrinker needs to count or scan objects from a particular node, it issues commands like this: count = list_lru_count_node(lru, sc->nid); freed = list_lru_walk_node(lru, sc->nid, isolate_func, isolate_arg, &sc->nr_to_scan); where sc is an instance of the shrink_control structure passed to it from vmscan. To simplify this, let's add special list_lru functions to be used by shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which consolidate the nid and nr_to_scan arguments in the shrink_control structure. This will also allow us to avoid patching shrinkers that use list_lru when we make shrink_slab() per-memcg - all we will have to do is extend the shrink_control structure to include the target memcg and make list_lru_shrink_{count,walk} handle this appropriately. Signed-off-by: Vladimir Davydov Suggested-by: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/dcache.c | 14 ++++++-------- fs/gfs2/quota.c | 6 +++--- fs/inode.c | 7 +++---- fs/internal.h | 7 +++---- fs/super.c | 24 +++++++++++------------- fs/xfs/xfs_buf.c | 7 +++---- fs/xfs/xfs_qm.c | 7 +++---- include/linux/list_lru.h | 16 ++++++++++++++++ mm/workingset.c | 6 +++--- 9 files changed, 51 insertions(+), 43 deletions(-) (limited to 'mm') diff --git a/fs/dcache.c b/fs/dcache.c index e368d4f412f9..56c5da89f58a 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -930,24 +930,22 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) /** * prune_dcache_sb - shrink the dcache * @sb: superblock - * @nr_to_scan : number of entries to try to free - * @nid: which node to scan for freeable entities + * @sc: shrink control, passed to list_lru_shrink_walk() * - * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is - * done when we need more memory an called from the superblock shrinker + * Attempt to shrink the superblock dcache LRU by @sc->nr_to_scan entries. This + * is done when we need more memory and called from the superblock shrinker * function. * * This function may fail to free any resources if all the dentries are in * use. 
*/ -long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan, - int nid) +long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc) { LIST_HEAD(dispose); long freed; - freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate, - &dispose, &nr_to_scan); + freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc, + dentry_lru_isolate, &dispose); shrink_dentry_list(&dispose); return freed; } diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c index 3e193cb36996..c15d6b216d0b 100644 --- a/fs/gfs2/quota.c +++ b/fs/gfs2/quota.c @@ -171,8 +171,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink, if (!(sc->gfp_mask & __GFP_FS)) return SHRINK_STOP; - freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate, - &dispose, &sc->nr_to_scan); + freed = list_lru_shrink_walk(&gfs2_qd_lru, sc, + gfs2_qd_isolate, &dispose); gfs2_qd_dispose(&dispose); @@ -182,7 +182,7 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink, static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink, struct shrink_control *sc) { - return vfs_pressure_ratio(list_lru_count_node(&gfs2_qd_lru, sc->nid)); + return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc)); } struct shrinker gfs2_qd_shrinker = { diff --git a/fs/inode.c b/fs/inode.c index 3a53b1da3fb8..524a32c2b0c6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -751,14 +751,13 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) * to trim from the LRU. Inodes to be freed are moved to a temporary list and * then are freed outside inode_lock by dispose_list(). */ -long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan, - int nid) +long prune_icache_sb(struct super_block *sb, struct shrink_control *sc) { LIST_HEAD(freeable); long freed; - freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate, - &freeable, &nr_to_scan); + freed = list_lru_shrink_walk(&sb->s_inode_lru, sc, + inode_lru_isolate, &freeable); dispose_list(&freeable); return freed; } diff --git a/fs/internal.h b/fs/internal.h index e9a61fe67575..d92c346a793d 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -14,6 +14,7 @@ struct file_system_type; struct linux_binprm; struct path; struct mount; +struct shrink_control; /* * block_dev.c @@ -111,8 +112,7 @@ extern int open_check_o_direct(struct file *f); * inode.c */ extern spinlock_t inode_sb_list_lock; -extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan, - int nid); +extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); extern void inode_add_lru(struct inode *inode); /* @@ -129,8 +129,7 @@ extern int invalidate_inodes(struct super_block *, bool); */ extern struct dentry *__d_alloc(struct super_block *, const struct qstr *); extern int d_set_mounted(struct dentry *dentry); -extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan, - int nid); +extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc); /* * read_write.c diff --git a/fs/super.c b/fs/super.c index eae088f6aaae..4554ac257647 100644 --- a/fs/super.c +++ b/fs/super.c @@ -77,8 +77,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink, if (sb->s_op->nr_cached_objects) fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid); - inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid); - dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid); + inodes = list_lru_shrink_count(&sb->s_inode_lru, sc); + dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc); total_objects = 
dentries + inodes + fs_objects + 1; if (!total_objects) total_objects = 1; @@ -86,20 +86,20 @@ static unsigned long super_cache_scan(struct shrinker *shrink, /* proportion the scan between the caches */ dentries = mult_frac(sc->nr_to_scan, dentries, total_objects); inodes = mult_frac(sc->nr_to_scan, inodes, total_objects); + fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects); /* * prune the dcache first as the icache is pinned by it, then * prune the icache, followed by the filesystem specific caches */ - freed = prune_dcache_sb(sb, dentries, sc->nid); - freed += prune_icache_sb(sb, inodes, sc->nid); + sc->nr_to_scan = dentries; + freed = prune_dcache_sb(sb, sc); + sc->nr_to_scan = inodes; + freed += prune_icache_sb(sb, sc); - if (fs_objects) { - fs_objects = mult_frac(sc->nr_to_scan, fs_objects, - total_objects); + if (fs_objects) freed += sb->s_op->free_cached_objects(sb, fs_objects, sc->nid); - } drop_super(sb); return freed; @@ -118,17 +118,15 @@ static unsigned long super_cache_count(struct shrinker *shrink, * scalability bottleneck. The counts could get updated * between super_cache_count and super_cache_scan anyway. * Call to super_cache_count with shrinker_rwsem held - * ensures the safety of call to list_lru_count_node() and + * ensures the safety of call to list_lru_shrink_count() and * s_op->nr_cached_objects(). */ if (sb->s_op && sb->s_op->nr_cached_objects) total_objects = sb->s_op->nr_cached_objects(sb, sc->nid); - total_objects += list_lru_count_node(&sb->s_dentry_lru, - sc->nid); - total_objects += list_lru_count_node(&sb->s_inode_lru, - sc->nid); + total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc); + total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc); total_objects = vfs_pressure_ratio(total_objects); return total_objects; diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index bb502a391792..15c9d224c721 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1583,10 +1583,9 @@ xfs_buftarg_shrink_scan( struct xfs_buftarg, bt_shrinker); LIST_HEAD(dispose); unsigned long freed; - unsigned long nr_to_scan = sc->nr_to_scan; - freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate, - &dispose, &nr_to_scan); + freed = list_lru_shrink_walk(&btp->bt_lru, sc, + xfs_buftarg_isolate, &dispose); while (!list_empty(&dispose)) { struct xfs_buf *bp; @@ -1605,7 +1604,7 @@ xfs_buftarg_shrink_count( { struct xfs_buftarg *btp = container_of(shrink, struct xfs_buftarg, bt_shrinker); - return list_lru_count_node(&btp->bt_lru, sc->nid); + return list_lru_shrink_count(&btp->bt_lru, sc); } void diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index 3e8186279541..4f4b1274e144 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -523,7 +523,6 @@ xfs_qm_shrink_scan( struct xfs_qm_isolate isol; unsigned long freed; int error; - unsigned long nr_to_scan = sc->nr_to_scan; if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT)) return 0; @@ -531,8 +530,8 @@ xfs_qm_shrink_scan( INIT_LIST_HEAD(&isol.buffers); INIT_LIST_HEAD(&isol.dispose); - freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol, - &nr_to_scan); + freed = list_lru_shrink_walk(&qi->qi_lru, sc, + xfs_qm_dquot_isolate, &isol); error = xfs_buf_delwri_submit(&isol.buffers); if (error) @@ -557,7 +556,7 @@ xfs_qm_shrink_count( struct xfs_quotainfo *qi = container_of(shrink, struct xfs_quotainfo, qi_shrinker); - return list_lru_count_node(&qi->qi_lru, sc->nid); + return list_lru_shrink_count(&qi->qi_lru, sc); } /* diff --git a/include/linux/list_lru.h 
b/include/linux/list_lru.h index f3434533fbf8..f500a2e39b13 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -9,6 +9,7 @@ #include #include +#include /* list_lru_walk_cb has to always return one of those */ enum lru_status { @@ -81,6 +82,13 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item); * Callers that want such a guarantee need to provide an outer lock. */ unsigned long list_lru_count_node(struct list_lru *lru, int nid); + +static inline unsigned long list_lru_shrink_count(struct list_lru *lru, + struct shrink_control *sc) +{ + return list_lru_count_node(lru, sc->nid); +} + static inline unsigned long list_lru_count(struct list_lru *lru) { long count = 0; @@ -119,6 +127,14 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk); +static inline unsigned long +list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc, + list_lru_walk_cb isolate, void *cb_arg) +{ + return list_lru_walk_node(lru, sc->nid, isolate, cb_arg, + &sc->nr_to_scan); +} + static inline unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate, void *cb_arg, unsigned long nr_to_walk) diff --git a/mm/workingset.c b/mm/workingset.c index f7216fa7da27..d4fa7fb10a52 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -275,7 +275,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, /* list_lru lock nests inside IRQ-safe mapping->tree_lock */ local_irq_disable(); - shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid); + shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc); local_irq_enable(); pages = node_present_pages(sc->nid); @@ -376,8 +376,8 @@ static unsigned long scan_shadow_nodes(struct shrinker *shrinker, /* list_lru lock nests inside IRQ-safe mapping->tree_lock */ local_irq_disable(); - ret = list_lru_walk_node(&workingset_shadow_nodes, sc->nid, - shadow_lru_isolate, NULL, &sc->nr_to_scan); + ret = list_lru_shrink_walk(&workingset_shadow_nodes, sc, + shadow_lru_isolate, NULL); local_irq_enable(); return ret; } -- cgit v1.2.3 From cb731d6c62bbc2f890b08ea3d0386d5dad887326 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:58:54 -0800 Subject: vmscan: per memory cgroup slab shrinkers This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag set, it will be called per memory cgroup. The memory cgroup to scan objects from is passed in shrink_control->memcg. If the memory cgroup is NULL, a memcg aware shrinker is supposed to scan objects from the global list. Unaware shrinkers are only called on global pressure with memcg=NULL. 
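For orientation, here is a sketch of how a cache would opt in to the new
flag, using the shrink_control fields and the list_lru_shrink_{count,walk}
helpers added earlier in the series.  my_lru, my_isolate and the function
names are hypothetical; this illustrates the interface rather than any
code in the patch.

    static struct list_lru my_lru;              /* hypothetical per-cache LRU */
    /* my_isolate: a list_lru_walk_cb that moves reclaimable items to a dispose list */

    static unsigned long my_count(struct shrinker *s, struct shrink_control *sc)
    {
            /* sc->nid selects the node; sc->memcg carries per-cgroup pressure */
            return list_lru_shrink_count(&my_lru, sc);
    }

    static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc)
    {
            return list_lru_shrink_walk(&my_lru, sc, my_isolate, NULL);
    }

    static struct shrinker my_shrinker = {
            .count_objects  = my_count,
            .scan_objects   = my_scan,
            .seeks          = DEFAULT_SEEKS,
            .flags          = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
    };

The shrinker is registered with register_shrinker(&my_shrinker); under
memcg pressure, shrink_slab() then calls it once per kmem-active memory
cgroup with that cgroup in sc->memcg.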
Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/drop_caches.c | 14 -------- include/linux/memcontrol.h | 7 ++++ include/linux/mm.h | 5 ++- include/linux/shrinker.h | 6 +++- mm/memcontrol.c | 2 +- mm/memory-failure.c | 11 ++---- mm/vmscan.c | 85 ++++++++++++++++++++++++++++++++++------------ 7 files changed, 80 insertions(+), 50 deletions(-) (limited to 'mm') diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 2bc2c87f35e7..5718cb9f7273 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -37,20 +37,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) iput(toput_inode); } -static void drop_slab(void) -{ - int nr_objects; - - do { - int nid; - - nr_objects = 0; - for_each_online_node(nid) - nr_objects += shrink_node_slabs(GFP_KERNEL, nid, - 1000, 1000); - } while (nr_objects > 10); -} - int drop_caches_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6cfd934c7c9b..54992fe0959f 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -413,6 +413,8 @@ static inline bool memcg_kmem_enabled(void) return static_key_false(&memcg_kmem_enabled_key); } +bool memcg_kmem_is_active(struct mem_cgroup *memcg); + /* * In general, we'll do everything in our power to not incur in any overhead * for non-memcg users for the kmem functions. Not even a function call, if we @@ -542,6 +544,11 @@ static inline bool memcg_kmem_enabled(void) return false; } +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg) +{ + return false; +} + static inline bool memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) { diff --git a/include/linux/mm.h b/include/linux/mm.h index a4d24f3c5430..af4ff88a11e0 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2168,9 +2168,8 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); #endif -unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, - unsigned long nr_scanned, - unsigned long nr_eligible); +void drop_slab(void); +void drop_slab_node(int nid); #ifndef CONFIG_MMU #define randomize_va_space 0 diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index f4aee75f00b1..4fcacd915d45 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -20,6 +20,9 @@ struct shrink_control { /* current node being shrunk (for NUMA aware shrinkers) */ int nid; + + /* current memcg being shrunk (for memcg aware shrinkers) */ + struct mem_cgroup *memcg; }; #define SHRINK_STOP (~0UL) @@ -61,7 +64,8 @@ struct shrinker { #define DEFAULT_SEEKS 2 /* A good number if you don't know better. 
*/ /* Flags */ -#define SHRINKER_NUMA_AWARE (1 << 0) +#define SHRINKER_NUMA_AWARE (1 << 0) +#define SHRINKER_MEMCG_AWARE (1 << 1) extern int register_shrinker(struct shrinker *); extern void unregister_shrinker(struct shrinker *); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 095c1f96fbec..3c2a1a8286ac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -352,7 +352,7 @@ struct mem_cgroup { }; #ifdef CONFIG_MEMCG_KMEM -static bool memcg_kmem_is_active(struct mem_cgroup *memcg) +bool memcg_kmem_is_active(struct mem_cgroup *memcg) { return memcg->kmemcg_id >= 0; } diff --git a/mm/memory-failure.c b/mm/memory-failure.c index feb803bf3443..1a735fad2a13 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -242,15 +242,8 @@ void shake_page(struct page *p, int access) * Only call shrink_node_slabs here (which would also shrink * other caches) if access is not potentially fatal. */ - if (access) { - int nr; - int nid = page_to_nid(p); - do { - nr = shrink_node_slabs(GFP_KERNEL, nid, 1000, 1000); - if (page_count(p) == 1) - break; - } while (nr > 10); - } + if (access) + drop_slab_node(page_to_nid(p)); } EXPORT_SYMBOL_GPL(shake_page); diff --git a/mm/vmscan.c b/mm/vmscan.c index 8e645ee52045..803886b8e353 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -232,10 +232,10 @@ EXPORT_SYMBOL(unregister_shrinker); #define SHRINK_BATCH 128 -static unsigned long shrink_slabs(struct shrink_control *shrinkctl, - struct shrinker *shrinker, - unsigned long nr_scanned, - unsigned long nr_eligible) +static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, + struct shrinker *shrinker, + unsigned long nr_scanned, + unsigned long nr_eligible) { unsigned long freed = 0; unsigned long long delta; @@ -344,9 +344,10 @@ static unsigned long shrink_slabs(struct shrink_control *shrinkctl, } /** - * shrink_node_slabs - shrink slab caches of a given node + * shrink_slab - shrink slab caches * @gfp_mask: allocation context * @nid: node whose slab caches to target + * @memcg: memory cgroup whose slab caches to target * @nr_scanned: pressure numerator * @nr_eligible: pressure denominator * @@ -355,6 +356,12 @@ static unsigned long shrink_slabs(struct shrink_control *shrinkctl, * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set, * unaware shrinkers will receive a node id of 0 instead. * + * @memcg specifies the memory cgroup to target. If it is not NULL, + * only shrinkers with SHRINKER_MEMCG_AWARE set will be called to scan + * objects from the memory cgroup specified. Otherwise all shrinkers + * are called, and memcg aware shrinkers are supposed to scan the + * global list then. + * * @nr_scanned and @nr_eligible form a ratio that indicate how much of * the available objects should be scanned. Page reclaim for example * passes the number of pages scanned and the number of pages on the @@ -365,13 +372,17 @@ static unsigned long shrink_slabs(struct shrink_control *shrinkctl, * * Returns the number of reclaimed slab objects. 
*/ -unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, - unsigned long nr_scanned, - unsigned long nr_eligible) +static unsigned long shrink_slab(gfp_t gfp_mask, int nid, + struct mem_cgroup *memcg, + unsigned long nr_scanned, + unsigned long nr_eligible) { struct shrinker *shrinker; unsigned long freed = 0; + if (memcg && !memcg_kmem_is_active(memcg)) + return 0; + if (nr_scanned == 0) nr_scanned = SWAP_CLUSTER_MAX; @@ -390,12 +401,16 @@ unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, struct shrink_control sc = { .gfp_mask = gfp_mask, .nid = nid, + .memcg = memcg, }; + if (memcg && !(shrinker->flags & SHRINKER_MEMCG_AWARE)) + continue; + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) sc.nid = 0; - freed += shrink_slabs(&sc, shrinker, nr_scanned, nr_eligible); + freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible); } up_read(&shrinker_rwsem); @@ -404,6 +419,29 @@ out: return freed; } +void drop_slab_node(int nid) +{ + unsigned long freed; + + do { + struct mem_cgroup *memcg = NULL; + + freed = 0; + do { + freed += shrink_slab(GFP_KERNEL, nid, memcg, + 1000, 1000); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); + } while (freed > 10); +} + +void drop_slab(void) +{ + int nid; + + for_each_online_node(nid) + drop_slab_node(nid); +} + static inline int is_page_cache_freeable(struct page *page) { /* @@ -2276,6 +2314,7 @@ static inline bool should_continue_reclaim(struct zone *zone, static bool shrink_zone(struct zone *zone, struct scan_control *sc, bool is_classzone) { + struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long nr_reclaimed, nr_scanned; bool reclaimable = false; @@ -2294,6 +2333,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, memcg = mem_cgroup_iter(root, NULL, &reclaim); do { unsigned long lru_pages; + unsigned long scanned; struct lruvec *lruvec; int swappiness; @@ -2305,10 +2345,16 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, lruvec = mem_cgroup_zone_lruvec(zone, memcg); swappiness = mem_cgroup_swappiness(memcg); + scanned = sc->nr_scanned; shrink_lruvec(lruvec, swappiness, sc, &lru_pages); zone_lru_pages += lru_pages; + if (memcg && is_classzone) + shrink_slab(sc->gfp_mask, zone_to_nid(zone), + memcg, sc->nr_scanned - scanned, + lru_pages); + /* * Direct reclaim and kswapd have to scan all memory * cgroups to fulfill the overall scan target for the @@ -2330,19 +2376,14 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, * Shrink the slab caches in the same proportion that * the eligible LRU pages were scanned. 
*/ - if (global_reclaim(sc) && is_classzone) { - struct reclaim_state *reclaim_state; - - shrink_node_slabs(sc->gfp_mask, zone_to_nid(zone), - sc->nr_scanned - nr_scanned, - zone_lru_pages); - - reclaim_state = current->reclaim_state; - if (reclaim_state) { - sc->nr_reclaimed += - reclaim_state->reclaimed_slab; - reclaim_state->reclaimed_slab = 0; - } + if (global_reclaim(sc) && is_classzone) + shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL, + sc->nr_scanned - nr_scanned, + zone_lru_pages); + + if (reclaim_state) { + sc->nr_reclaimed += reclaim_state->reclaimed_slab; + reclaim_state->reclaimed_slab = 0; } vmpressure(sc->gfp_mask, sc->target_mem_cgroup, -- cgit v1.2.3 From dbcf73e26cd0b3d66e6db65ab595e664a55e58ff Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:58:57 -0800 Subject: memcg: rename some cache id related variables memcg_limited_groups_array_size, which defines the size of memcg_caches arrays, sounds rather cumbersome. Also it doesn't point anyhow that it's related to kmem/caches stuff. So let's rename it to memcg_nr_cache_ids. It's concise and points us directly to memcg_cache_id. Also, rename kmem_limited_groups to memcg_cache_ida. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 4 ++-- mm/memcontrol.c | 19 +++++++++---------- mm/slab_common.c | 4 ++-- 3 files changed, 13 insertions(+), 14 deletions(-) (limited to 'mm') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 54992fe0959f..2607c91230af 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -398,7 +398,7 @@ static inline void sock_release_memcg(struct sock *sk) #ifdef CONFIG_MEMCG_KMEM extern struct static_key memcg_kmem_enabled_key; -extern int memcg_limited_groups_array_size; +extern int memcg_nr_cache_ids; /* * Helper macro to loop through all memcg-specific caches. Callers must still @@ -406,7 +406,7 @@ extern int memcg_limited_groups_array_size; * the slab_mutex must be held when looping through those caches */ #define for_each_memcg_cache_index(_idx) \ - for ((_idx) = 0; (_idx) < memcg_limited_groups_array_size; (_idx)++) + for ((_idx) = 0; (_idx) < memcg_nr_cache_ids; (_idx)++) static inline bool memcg_kmem_enabled(void) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3c2a1a8286ac..8608fa543b84 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -538,12 +538,11 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) * memcgs, and none but the 200th is kmem-limited, we'd have to have a * 200 entry array for that. * - * The current size of the caches array is stored in - * memcg_limited_groups_array_size. It will double each time we have to - * increase it. + * The current size of the caches array is stored in memcg_nr_cache_ids. It + * will double each time we have to increase it. 
*/ -static DEFINE_IDA(kmem_limited_groups); -int memcg_limited_groups_array_size; +static DEFINE_IDA(memcg_cache_ida); +int memcg_nr_cache_ids; /* * MIN_SIZE is different than 1, because we would like to avoid going through @@ -2538,12 +2537,12 @@ static int memcg_alloc_cache_id(void) int id, size; int err; - id = ida_simple_get(&kmem_limited_groups, + id = ida_simple_get(&memcg_cache_ida, 0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL); if (id < 0) return id; - if (id < memcg_limited_groups_array_size) + if (id < memcg_nr_cache_ids) return id; /* @@ -2559,7 +2558,7 @@ static int memcg_alloc_cache_id(void) err = memcg_update_all_caches(size); if (err) { - ida_simple_remove(&kmem_limited_groups, id); + ida_simple_remove(&memcg_cache_ida, id); return err; } return id; @@ -2567,7 +2566,7 @@ static int memcg_alloc_cache_id(void) static void memcg_free_cache_id(int id) { - ida_simple_remove(&kmem_limited_groups, id); + ida_simple_remove(&memcg_cache_ida, id); } /* @@ -2577,7 +2576,7 @@ static void memcg_free_cache_id(int id) */ void memcg_update_array_size(int num) { - memcg_limited_groups_array_size = num; + memcg_nr_cache_ids = num; } struct memcg_kmem_cache_create_work { diff --git a/mm/slab_common.c b/mm/slab_common.c index 6e1e4cf65836..f8899eedab68 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -116,7 +116,7 @@ static int memcg_alloc_cache_params(struct mem_cgroup *memcg, if (!memcg) { size = offsetof(struct memcg_cache_params, memcg_caches); - size += memcg_limited_groups_array_size * sizeof(void *); + size += memcg_nr_cache_ids * sizeof(void *); } else size = sizeof(struct memcg_cache_params); @@ -154,7 +154,7 @@ static int memcg_update_cache_params(struct kmem_cache *s, int num_memcgs) cur_params = s->memcg_params; memcpy(new_params->memcg_caches, cur_params->memcg_caches, - memcg_limited_groups_array_size * sizeof(void *)); + memcg_nr_cache_ids * sizeof(void *)); new_params->is_root_cache = true; -- cgit v1.2.3 From 05257a1a3dcc196c197714b5c9a8dd35b7f6aefc Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:01 -0800 Subject: memcg: add rwsem to synchronize against memcg_caches arrays relocation We need a stable value of memcg_nr_cache_ids in kmem_cache_create() (memcg_alloc_cache_params() wants it for root caches), where we only hold the slab_mutex and no memcg-related locks. As a result, we have to update memcg_nr_cache_ids under the slab_mutex, which we can only take on the slab's side (see memcg_update_array_size). This looks awkward and will become even worse when per-memcg list_lru is introduced, which also wants stable access to memcg_nr_cache_ids. To get rid of this dependency between the memcg_nr_cache_ids and the slab_mutex, this patch introduces a special rwsem. The rwsem is held for writing during memcg_caches arrays relocation and memcg_nr_cache_ids updates. Therefore one can take it for reading to get a stable access to memcg_caches arrays and/or memcg_nr_cache_ids. Currently the semaphore is taken for reading only from kmem_cache_create, right before taking the slab_mutex, so right now there's no much point in using rwsem instead of mutex. However, once list_lru is made per-memcg it will allow list_lru initializations to proceed concurrently. 
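The intended read-side pattern is small.  A sketch of it follows; the
surrounding caller is hypothetical, but the helpers and
memcg_nr_cache_ids are the ones introduced here, and kmem_cache_create()
uses exactly this bracketing in the patch below.

    /* Pin memcg_nr_cache_ids while sizing a memcg_caches array (sketch). */
    memcg_get_cache_ids();              /* memcg_nr_cache_ids cannot grow here */
    size = memcg_nr_cache_ids * sizeof(void *);
    /* ... allocate or walk a per-cache memcg_caches array of that many slots ... */
    memcg_put_cache_ids();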
Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 12 ++++++++++-- mm/memcontrol.c | 29 +++++++++++++++++++---------- mm/slab_common.c | 9 ++++----- 3 files changed, 33 insertions(+), 17 deletions(-) (limited to 'mm') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 2607c91230af..dbc4baa3619c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -399,6 +399,8 @@ static inline void sock_release_memcg(struct sock *sk) extern struct static_key memcg_kmem_enabled_key; extern int memcg_nr_cache_ids; +extern void memcg_get_cache_ids(void); +extern void memcg_put_cache_ids(void); /* * Helper macro to loop through all memcg-specific caches. Callers must still @@ -434,8 +436,6 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order); int memcg_cache_id(struct mem_cgroup *memcg); -void memcg_update_array_size(int num_groups); - struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); void __memcg_kmem_put_cache(struct kmem_cache *cachep); @@ -569,6 +569,14 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg) return -1; } +static inline void memcg_get_cache_ids(void) +{ +} + +static inline void memcg_put_cache_ids(void) +{ +} + static inline struct kmem_cache * memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8608fa543b84..6706e5fa5ac0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -544,6 +544,19 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) static DEFINE_IDA(memcg_cache_ida); int memcg_nr_cache_ids; +/* Protects memcg_nr_cache_ids */ +static DECLARE_RWSEM(memcg_cache_ids_sem); + +void memcg_get_cache_ids(void) +{ + down_read(&memcg_cache_ids_sem); +} + +void memcg_put_cache_ids(void) +{ + up_read(&memcg_cache_ids_sem); +} + /* * MIN_SIZE is different than 1, because we would like to avoid going through * the alloc/free process all the time. In a small machine, 4 kmem-limited @@ -2549,6 +2562,7 @@ static int memcg_alloc_cache_id(void) * There's no space for the new id in memcg_caches arrays, * so we have to grow them. */ + down_write(&memcg_cache_ids_sem); size = 2 * (id + 1); if (size < MEMCG_CACHES_MIN_SIZE) @@ -2557,6 +2571,11 @@ static int memcg_alloc_cache_id(void) size = MEMCG_CACHES_MAX_SIZE; err = memcg_update_all_caches(size); + if (!err) + memcg_nr_cache_ids = size; + + up_write(&memcg_cache_ids_sem); + if (err) { ida_simple_remove(&memcg_cache_ida, id); return err; @@ -2569,16 +2588,6 @@ static void memcg_free_cache_id(int id) ida_simple_remove(&memcg_cache_ida, id); } -/* - * We should update the current array size iff all caches updates succeed. This - * can only be done from the slab side. The slab mutex needs to be held when - * calling this. 
- */ -void memcg_update_array_size(int num) -{ - memcg_nr_cache_ids = num; -} - struct memcg_kmem_cache_create_work { struct mem_cgroup *memcg; struct kmem_cache *cachep; diff --git a/mm/slab_common.c b/mm/slab_common.c index f8899eedab68..23f5fcde6043 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -169,8 +169,8 @@ int memcg_update_all_caches(int num_memcgs) { struct kmem_cache *s; int ret = 0; - mutex_lock(&slab_mutex); + mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) { if (!is_root_cache(s)) continue; @@ -181,11 +181,8 @@ int memcg_update_all_caches(int num_memcgs) * up to this point in an updated state. */ if (ret) - goto out; + break; } - - memcg_update_array_size(num_memcgs); -out: mutex_unlock(&slab_mutex); return ret; } @@ -369,6 +366,7 @@ kmem_cache_create(const char *name, size_t size, size_t align, get_online_cpus(); get_online_mems(); + memcg_get_cache_ids(); mutex_lock(&slab_mutex); @@ -407,6 +405,7 @@ kmem_cache_create(const char *name, size_t size, size_t align, out_unlock: mutex_unlock(&slab_mutex); + memcg_put_cache_ids(); put_online_mems(); put_online_cpus(); -- cgit v1.2.3 From ff0b67ef5b1687692bc1fd3ce4bc3d1ff83587c7 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:04 -0800 Subject: list_lru: get rid of ->active_nodes The active_nodes mask allows us to skip empty nodes when walking over list_lru items from all nodes in list_lru_count/walk. However, these functions are never called from hot paths, so it doesn't seem we need such kind of optimization there. OTOH, removing the mask will make it easier to make list_lru per-memcg. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/list_lru.h | 5 ++--- mm/list_lru.c | 10 +++------- 2 files changed, 5 insertions(+), 10 deletions(-) (limited to 'mm') diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index f500a2e39b13..53c1d6b78270 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -31,7 +31,6 @@ struct list_lru_node { struct list_lru { struct list_lru_node *node; - nodemask_t active_nodes; }; void list_lru_destroy(struct list_lru *lru); @@ -94,7 +93,7 @@ static inline unsigned long list_lru_count(struct list_lru *lru) long count = 0; int nid; - for_each_node_mask(nid, lru->active_nodes) + for_each_node_state(nid, N_NORMAL_MEMORY) count += list_lru_count_node(lru, nid); return count; @@ -142,7 +141,7 @@ list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate, long isolated = 0; int nid; - for_each_node_mask(nid, lru->active_nodes) { + for_each_node_state(nid, N_NORMAL_MEMORY) { isolated += list_lru_walk_node(lru, nid, isolate, cb_arg, &nr_to_walk); if (nr_to_walk <= 0) diff --git a/mm/list_lru.c b/mm/list_lru.c index f1a0db194173..07e198c77888 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -19,8 +19,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) WARN_ON_ONCE(nlru->nr_items < 0); if (list_empty(item)) { list_add_tail(item, &nlru->list); - if (nlru->nr_items++ == 0) - node_set(nid, lru->active_nodes); + nlru->nr_items++; spin_unlock(&nlru->lock); return true; } @@ -37,8 +36,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) spin_lock(&nlru->lock); if (!list_empty(item)) { list_del_init(item); - if (--nlru->nr_items == 0) - node_clear(nid, 
lru->active_nodes); + nlru->nr_items--; WARN_ON_ONCE(nlru->nr_items < 0); spin_unlock(&nlru->lock); return true; @@ -90,8 +88,7 @@ restart: case LRU_REMOVED_RETRY: assert_spin_locked(&nlru->lock); case LRU_REMOVED: - if (--nlru->nr_items == 0) - node_clear(nid, lru->active_nodes); + nlru->nr_items--; WARN_ON_ONCE(nlru->nr_items < 0); isolated++; /* @@ -133,7 +130,6 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key) if (!lru->node) return -ENOMEM; - nodes_clear(lru->active_nodes); for (i = 0; i < nr_node_ids; i++) { spin_lock_init(&lru->node[i].lock); if (key) -- cgit v1.2.3 From c0a5b560938a0f2fd2fbf66ddc446c7c2b41383a Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:07 -0800 Subject: list_lru: organize all list_lrus to list To make list_lru memcg aware, we need all list_lrus to be kept on a list protected by a mutex, so that we can sleep while walking over the list. Therefore after this change list_lru_destroy may sleep. Fortunately, there is only one user that calls it from an atomic context - put_super - and we can easily fix that by calling list_lru_destroy before put_super in deactivate_locked_super; the lrus are no longer needed by that time. Another point that should be noted is that list_lru_destroy is allowed to be called on an uninitialized zeroed-out object, in which case it is a no-op. Before this patch this was guaranteed by kfree, but now we need an explicit check there. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/super.c | 8 ++++++++ include/linux/list_lru.h | 3 +++ mm/list_lru.c | 34 ++++++++++++++++++++++++++++++++++ 3 files changed, 45 insertions(+) (limited to 'mm') diff --git a/fs/super.c b/fs/super.c index a2b735a42e74..b027849d92d2 100644 --- a/fs/super.c +++ b/fs/super.c @@ -282,6 +282,14 @@ void deactivate_locked_super(struct super_block *s) unregister_shrinker(&s->s_shrink); fs->kill_sb(s); + /* + * Since list_lru_destroy() may sleep, we cannot call it from + * put_super(), where we hold the sb_lock. Therefore we destroy + * the lru lists right now.
+ */ + list_lru_destroy(&s->s_dentry_lru); + list_lru_destroy(&s->s_inode_lru); + put_filesystem(fs); put_super(s); } else { diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index 53c1d6b78270..ee9486ac0621 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -31,6 +31,9 @@ struct list_lru_node { struct list_lru { struct list_lru_node *node; +#ifdef CONFIG_MEMCG_KMEM + struct list_head list; +#endif }; void list_lru_destroy(struct list_lru *lru); diff --git a/mm/list_lru.c b/mm/list_lru.c index 07e198c77888..a9021cb3ccde 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -9,6 +9,34 @@ #include #include #include +#include + +#ifdef CONFIG_MEMCG_KMEM +static LIST_HEAD(list_lrus); +static DEFINE_MUTEX(list_lrus_mutex); + +static void list_lru_register(struct list_lru *lru) +{ + mutex_lock(&list_lrus_mutex); + list_add(&lru->list, &list_lrus); + mutex_unlock(&list_lrus_mutex); +} + +static void list_lru_unregister(struct list_lru *lru) +{ + mutex_lock(&list_lrus_mutex); + list_del(&lru->list); + mutex_unlock(&list_lrus_mutex); +} +#else +static void list_lru_register(struct list_lru *lru) +{ +} + +static void list_lru_unregister(struct list_lru *lru) +{ +} +#endif /* CONFIG_MEMCG_KMEM */ bool list_lru_add(struct list_lru *lru, struct list_head *item) { @@ -137,12 +165,18 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key) INIT_LIST_HEAD(&lru->node[i].list); lru->node[i].nr_items = 0; } + list_lru_register(lru); return 0; } EXPORT_SYMBOL_GPL(list_lru_init_key); void list_lru_destroy(struct list_lru *lru) { + /* Already destroyed or not yet initialized? */ + if (!lru->node) + return; + list_lru_unregister(lru); kfree(lru->node); + lru->node = NULL; } EXPORT_SYMBOL_GPL(list_lru_destroy); -- cgit v1.2.3 From 60d3fd32a7a9da4c8c93a9f89cfda22a0b4c65ce Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:10 -0800 Subject: list_lru: introduce per-memcg lists There are several FS shrinkers, including super_block::s_shrink, that keep reclaimable objects in the list_lru structure. Hence to turn them to memcg-aware shrinkers, it is enough to make list_lru per-memcg. This patch does the trick. It adds an array of lru lists to the list_lru_node structure (per-node part of the list_lru), one for each kmem-active memcg, and dispatches every item addition or removal to the list corresponding to the memcg which the item is accounted to. So now the list_lru structure is not just per node, but per node and per memcg. Not all list_lrus need this feature, so this patch also adds a new method, list_lru_init_memcg, which initializes a list_lru as memcg aware. Otherwise (i.e. if initialized with old list_lru_init), the list_lru won't have per memcg lists. Just like per memcg caches arrays, the arrays of per-memcg lists are indexed by memcg_cache_id, so we must grow them whenever memcg_nr_cache_ids is increased. So we introduce a callback, memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id space is full. The locking is implemented in a manner similar to lruvecs, i.e. we have one lock per node that protects all lists (both global and per cgroup) on the node. 
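As a reading aid, the routing logic this patch adds can be condensed as follows. This is a paraphrase of list_lru_from_kmem() and list_lru_from_memcg_idx() from the diff below, not new functionality; pick_list() is just an illustrative name.

	/* Condensed restatement of how an item is routed to its list. */
	static struct list_lru_one *pick_list(struct list_lru *lru, void *item)
	{
		int nid = page_to_nid(virt_to_page(item));	/* NUMA node of the object */
		struct list_lru_node *nlru = &lru->node[nid];
		struct mem_cgroup *memcg;
		int idx;

		if (!nlru->memcg_lrus)		/* lru was not created memcg aware */
			return &nlru->lru;
		memcg = mem_cgroup_from_kmem(item);
		if (!memcg)			/* root cgroup or kmem accounting disabled */
			return &nlru->lru;
		idx = memcg_cache_id(memcg);
		if (idx < 0)			/* memcg has no kmem id yet */
			return &nlru->lru;
		/* per-memcg list, indexed by memcg_cache_id, under nlru->lock */
		return nlru->memcg_lrus->lru[idx];
	}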
Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/list_lru.h | 52 +++++-- include/linux/memcontrol.h | 14 ++ mm/list_lru.c | 374 ++++++++++++++++++++++++++++++++++++++++++--- mm/memcontrol.c | 20 +++ 4 files changed, 424 insertions(+), 36 deletions(-) (limited to 'mm') diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index ee9486ac0621..305b598abac2 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -11,6 +11,8 @@ #include #include +struct mem_cgroup; + /* list_lru_walk_cb has to always return one of those */ enum lru_status { LRU_REMOVED, /* item removed from list */ @@ -22,11 +24,26 @@ enum lru_status { internally, but has to return locked. */ }; -struct list_lru_node { - spinlock_t lock; +struct list_lru_one { struct list_head list; /* kept as signed so we can catch imbalance bugs */ long nr_items; +}; + +struct list_lru_memcg { + /* array of per cgroup lists, indexed by memcg_cache_id */ + struct list_lru_one *lru[0]; +}; + +struct list_lru_node { + /* protects all lists on the node, including per cgroup */ + spinlock_t lock; + /* global list, used for the root cgroup in cgroup aware lrus */ + struct list_lru_one lru; +#ifdef CONFIG_MEMCG_KMEM + /* for cgroup aware lrus points to per cgroup lists, otherwise NULL */ + struct list_lru_memcg *memcg_lrus; +#endif } ____cacheline_aligned_in_smp; struct list_lru { @@ -37,11 +54,14 @@ struct list_lru { }; void list_lru_destroy(struct list_lru *lru); -int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key); -static inline int list_lru_init(struct list_lru *lru) -{ - return list_lru_init_key(lru, NULL); -} +int __list_lru_init(struct list_lru *lru, bool memcg_aware, + struct lock_class_key *key); + +#define list_lru_init(lru) __list_lru_init((lru), false, NULL) +#define list_lru_init_key(lru, key) __list_lru_init((lru), false, (key)) +#define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL) + +int memcg_update_all_list_lrus(int num_memcgs); /** * list_lru_add: add an element to the lru list's tail @@ -75,20 +95,23 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item); bool list_lru_del(struct list_lru *lru, struct list_head *item); /** - * list_lru_count_node: return the number of objects currently held by @lru + * list_lru_count_one: return the number of objects currently held by @lru * @lru: the lru pointer. * @nid: the node id to count from. + * @memcg: the cgroup to count from. * * Always return a non-negative number, 0 for empty lists. There is no * guarantee that the list is not updated while the count is being computed. * Callers that want such a guarantee need to provide an outer lock. 
*/ +unsigned long list_lru_count_one(struct list_lru *lru, + int nid, struct mem_cgroup *memcg); unsigned long list_lru_count_node(struct list_lru *lru, int nid); static inline unsigned long list_lru_shrink_count(struct list_lru *lru, struct shrink_control *sc) { - return list_lru_count_node(lru, sc->nid); + return list_lru_count_one(lru, sc->nid, sc->memcg); } static inline unsigned long list_lru_count(struct list_lru *lru) @@ -105,9 +128,10 @@ static inline unsigned long list_lru_count(struct list_lru *lru) typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg); /** - * list_lru_walk_node: walk a list_lru, isolating and disposing freeable items. + * list_lru_walk_one: walk a list_lru, isolating and disposing freeable items. * @lru: the lru pointer. * @nid: the node id to scan from. + * @memcg: the cgroup to scan from. * @isolate: callback function that is resposible for deciding what to do with * the item currently being scanned * @cb_arg: opaque type that will be passed to @isolate @@ -125,6 +149,10 @@ typedef enum lru_status * * Return value: the number of objects effectively removed from the LRU. */ +unsigned long list_lru_walk_one(struct list_lru *lru, + int nid, struct mem_cgroup *memcg, + list_lru_walk_cb isolate, void *cb_arg, + unsigned long *nr_to_walk); unsigned long list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk); @@ -133,8 +161,8 @@ static inline unsigned long list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc, list_lru_walk_cb isolate, void *cb_arg) { - return list_lru_walk_node(lru, sc->nid, isolate, cb_arg, - &sc->nr_to_scan); + return list_lru_walk_one(lru, sc->nid, sc->memcg, isolate, cb_arg, + &sc->nr_to_scan); } static inline unsigned long diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index dbc4baa3619c..72dff5fb0d0c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -439,6 +439,8 @@ int memcg_cache_id(struct mem_cgroup *memcg); struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); void __memcg_kmem_put_cache(struct kmem_cache *cachep); +struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr); + int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, unsigned long nr_pages); void memcg_uncharge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages); @@ -535,6 +537,13 @@ static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep) if (memcg_kmem_enabled()) __memcg_kmem_put_cache(cachep); } + +static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) +{ + if (!memcg_kmem_enabled()) + return NULL; + return __mem_cgroup_from_kmem(ptr); +} #else #define for_each_memcg_cache_index(_idx) \ for (; NULL; ) @@ -586,6 +595,11 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) static inline void memcg_kmem_put_cache(struct kmem_cache *cachep) { } + +static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) +{ + return NULL; +} #endif /* CONFIG_MEMCG_KMEM */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/list_lru.c b/mm/list_lru.c index a9021cb3ccde..79aee70c3b9d 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -10,6 +10,7 @@ #include #include #include +#include #ifdef CONFIG_MEMCG_KMEM static LIST_HEAD(list_lrus); @@ -38,16 +39,71 @@ static void list_lru_unregister(struct list_lru *lru) } #endif /* CONFIG_MEMCG_KMEM */ +#ifdef CONFIG_MEMCG_KMEM +static inline bool list_lru_memcg_aware(struct list_lru *lru) +{ + return 
!!lru->node[0].memcg_lrus; +} + +static inline struct list_lru_one * +list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx) +{ + /* + * The lock protects the array of per cgroup lists from relocation + * (see memcg_update_list_lru_node). + */ + lockdep_assert_held(&nlru->lock); + if (nlru->memcg_lrus && idx >= 0) + return nlru->memcg_lrus->lru[idx]; + + return &nlru->lru; +} + +static inline struct list_lru_one * +list_lru_from_kmem(struct list_lru_node *nlru, void *ptr) +{ + struct mem_cgroup *memcg; + + if (!nlru->memcg_lrus) + return &nlru->lru; + + memcg = mem_cgroup_from_kmem(ptr); + if (!memcg) + return &nlru->lru; + + return list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg)); +} +#else +static inline bool list_lru_memcg_aware(struct list_lru *lru) +{ + return false; +} + +static inline struct list_lru_one * +list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx) +{ + return &nlru->lru; +} + +static inline struct list_lru_one * +list_lru_from_kmem(struct list_lru_node *nlru, void *ptr) +{ + return &nlru->lru; +} +#endif /* CONFIG_MEMCG_KMEM */ + bool list_lru_add(struct list_lru *lru, struct list_head *item) { int nid = page_to_nid(virt_to_page(item)); struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; spin_lock(&nlru->lock); - WARN_ON_ONCE(nlru->nr_items < 0); + l = list_lru_from_kmem(nlru, item); + WARN_ON_ONCE(l->nr_items < 0); if (list_empty(item)) { - list_add_tail(item, &nlru->list); - nlru->nr_items++; + list_add_tail(item, &l->list); + l->nr_items++; spin_unlock(&nlru->lock); return true; } @@ -60,12 +116,14 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) { int nid = page_to_nid(virt_to_page(item)); struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; spin_lock(&nlru->lock); + l = list_lru_from_kmem(nlru, item); if (!list_empty(item)) { list_del_init(item); - nlru->nr_items--; - WARN_ON_ONCE(nlru->nr_items < 0); + l->nr_items--; + WARN_ON_ONCE(l->nr_items < 0); spin_unlock(&nlru->lock); return true; } @@ -74,33 +132,58 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) } EXPORT_SYMBOL_GPL(list_lru_del); -unsigned long -list_lru_count_node(struct list_lru *lru, int nid) +static unsigned long __list_lru_count_one(struct list_lru *lru, + int nid, int memcg_idx) { - unsigned long count = 0; struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; + unsigned long count; spin_lock(&nlru->lock); - WARN_ON_ONCE(nlru->nr_items < 0); - count += nlru->nr_items; + l = list_lru_from_memcg_idx(nlru, memcg_idx); + WARN_ON_ONCE(l->nr_items < 0); + count = l->nr_items; spin_unlock(&nlru->lock); return count; } + +unsigned long list_lru_count_one(struct list_lru *lru, + int nid, struct mem_cgroup *memcg) +{ + return __list_lru_count_one(lru, nid, memcg_cache_id(memcg)); +} +EXPORT_SYMBOL_GPL(list_lru_count_one); + +unsigned long list_lru_count_node(struct list_lru *lru, int nid) +{ + long count = 0; + int memcg_idx; + + count += __list_lru_count_one(lru, nid, -1); + if (list_lru_memcg_aware(lru)) { + for_each_memcg_cache_index(memcg_idx) + count += __list_lru_count_one(lru, nid, memcg_idx); + } + return count; +} EXPORT_SYMBOL_GPL(list_lru_count_node); -unsigned long -list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, - void *cb_arg, unsigned long *nr_to_walk) +static unsigned long +__list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx, + list_lru_walk_cb isolate, void *cb_arg, + unsigned long *nr_to_walk) { - struct list_lru_node *nlru = &lru->node[nid]; 
+ struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; struct list_head *item, *n; unsigned long isolated = 0; spin_lock(&nlru->lock); + l = list_lru_from_memcg_idx(nlru, memcg_idx); restart: - list_for_each_safe(item, n, &nlru->list) { + list_for_each_safe(item, n, &l->list) { enum lru_status ret; /* @@ -116,8 +199,8 @@ restart: case LRU_REMOVED_RETRY: assert_spin_locked(&nlru->lock); case LRU_REMOVED: - nlru->nr_items--; - WARN_ON_ONCE(nlru->nr_items < 0); + l->nr_items--; + WARN_ON_ONCE(l->nr_items < 0); isolated++; /* * If the lru lock has been dropped, our list @@ -128,7 +211,7 @@ restart: goto restart; break; case LRU_ROTATE: - list_move_tail(item, &nlru->list); + list_move_tail(item, &l->list); break; case LRU_SKIP: break; @@ -147,36 +230,279 @@ restart: spin_unlock(&nlru->lock); return isolated; } + +unsigned long +list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg, + list_lru_walk_cb isolate, void *cb_arg, + unsigned long *nr_to_walk) +{ + return __list_lru_walk_one(lru, nid, memcg_cache_id(memcg), + isolate, cb_arg, nr_to_walk); +} +EXPORT_SYMBOL_GPL(list_lru_walk_one); + +unsigned long list_lru_walk_node(struct list_lru *lru, int nid, + list_lru_walk_cb isolate, void *cb_arg, + unsigned long *nr_to_walk) +{ + long isolated = 0; + int memcg_idx; + + isolated += __list_lru_walk_one(lru, nid, -1, isolate, cb_arg, + nr_to_walk); + if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) { + for_each_memcg_cache_index(memcg_idx) { + isolated += __list_lru_walk_one(lru, nid, memcg_idx, + isolate, cb_arg, nr_to_walk); + if (*nr_to_walk <= 0) + break; + } + } + return isolated; +} EXPORT_SYMBOL_GPL(list_lru_walk_node); -int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key) +static void init_one_lru(struct list_lru_one *l) +{ + INIT_LIST_HEAD(&l->list); + l->nr_items = 0; +} + +#ifdef CONFIG_MEMCG_KMEM +static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus, + int begin, int end) +{ + int i; + + for (i = begin; i < end; i++) + kfree(memcg_lrus->lru[i]); +} + +static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus, + int begin, int end) +{ + int i; + + for (i = begin; i < end; i++) { + struct list_lru_one *l; + + l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL); + if (!l) + goto fail; + + init_one_lru(l); + memcg_lrus->lru[i] = l; + } + return 0; +fail: + __memcg_destroy_list_lru_node(memcg_lrus, begin, i - 1); + return -ENOMEM; +} + +static int memcg_init_list_lru_node(struct list_lru_node *nlru) +{ + int size = memcg_nr_cache_ids; + + nlru->memcg_lrus = kmalloc(size * sizeof(void *), GFP_KERNEL); + if (!nlru->memcg_lrus) + return -ENOMEM; + + if (__memcg_init_list_lru_node(nlru->memcg_lrus, 0, size)) { + kfree(nlru->memcg_lrus); + return -ENOMEM; + } + + return 0; +} + +static void memcg_destroy_list_lru_node(struct list_lru_node *nlru) +{ + __memcg_destroy_list_lru_node(nlru->memcg_lrus, 0, memcg_nr_cache_ids); + kfree(nlru->memcg_lrus); +} + +static int memcg_update_list_lru_node(struct list_lru_node *nlru, + int old_size, int new_size) +{ + struct list_lru_memcg *old, *new; + + BUG_ON(old_size > new_size); + + old = nlru->memcg_lrus; + new = kmalloc(new_size * sizeof(void *), GFP_KERNEL); + if (!new) + return -ENOMEM; + + if (__memcg_init_list_lru_node(new, old_size, new_size)) { + kfree(new); + return -ENOMEM; + } + + memcpy(new, old, old_size * sizeof(void *)); + + /* + * The lock guarantees that we won't race with a reader + * (see list_lru_from_memcg_idx). 
+ * + * Since list_lru_{add,del} may be called under an IRQ-safe lock, + * we have to use IRQ-safe primitives here to avoid deadlock. + */ + spin_lock_irq(&nlru->lock); + nlru->memcg_lrus = new; + spin_unlock_irq(&nlru->lock); + + kfree(old); + return 0; +} + +static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru, + int old_size, int new_size) +{ + /* do not bother shrinking the array back to the old size, because we + * cannot handle allocation failures here */ + __memcg_destroy_list_lru_node(nlru->memcg_lrus, old_size, new_size); +} + +static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) +{ + int i; + + for (i = 0; i < nr_node_ids; i++) { + if (!memcg_aware) + lru->node[i].memcg_lrus = NULL; + else if (memcg_init_list_lru_node(&lru->node[i])) + goto fail; + } + return 0; +fail: + for (i = i - 1; i >= 0; i--) + memcg_destroy_list_lru_node(&lru->node[i]); + return -ENOMEM; +} + +static void memcg_destroy_list_lru(struct list_lru *lru) +{ + int i; + + if (!list_lru_memcg_aware(lru)) + return; + + for (i = 0; i < nr_node_ids; i++) + memcg_destroy_list_lru_node(&lru->node[i]); +} + +static int memcg_update_list_lru(struct list_lru *lru, + int old_size, int new_size) +{ + int i; + + if (!list_lru_memcg_aware(lru)) + return 0; + + for (i = 0; i < nr_node_ids; i++) { + if (memcg_update_list_lru_node(&lru->node[i], + old_size, new_size)) + goto fail; + } + return 0; +fail: + for (i = i - 1; i >= 0; i--) + memcg_cancel_update_list_lru_node(&lru->node[i], + old_size, new_size); + return -ENOMEM; +} + +static void memcg_cancel_update_list_lru(struct list_lru *lru, + int old_size, int new_size) +{ + int i; + + if (!list_lru_memcg_aware(lru)) + return; + + for (i = 0; i < nr_node_ids; i++) + memcg_cancel_update_list_lru_node(&lru->node[i], + old_size, new_size); +} + +int memcg_update_all_list_lrus(int new_size) +{ + int ret = 0; + struct list_lru *lru; + int old_size = memcg_nr_cache_ids; + + mutex_lock(&list_lrus_mutex); + list_for_each_entry(lru, &list_lrus, list) { + ret = memcg_update_list_lru(lru, old_size, new_size); + if (ret) + goto fail; + } +out: + mutex_unlock(&list_lrus_mutex); + return ret; +fail: + list_for_each_entry_continue_reverse(lru, &list_lrus, list) + memcg_cancel_update_list_lru(lru, old_size, new_size); + goto out; +} +#else +static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) +{ + return 0; +} + +static void memcg_destroy_list_lru(struct list_lru *lru) +{ +} +#endif /* CONFIG_MEMCG_KMEM */ + +int __list_lru_init(struct list_lru *lru, bool memcg_aware, + struct lock_class_key *key) { int i; size_t size = sizeof(*lru->node) * nr_node_ids; + int err = -ENOMEM; + + memcg_get_cache_ids(); lru->node = kzalloc(size, GFP_KERNEL); if (!lru->node) - return -ENOMEM; + goto out; for (i = 0; i < nr_node_ids; i++) { spin_lock_init(&lru->node[i].lock); if (key) lockdep_set_class(&lru->node[i].lock, key); - INIT_LIST_HEAD(&lru->node[i].list); - lru->node[i].nr_items = 0; + init_one_lru(&lru->node[i].lru); + } + + err = memcg_init_list_lru(lru, memcg_aware); + if (err) { + kfree(lru->node); + goto out; } + list_lru_register(lru); - return 0; +out: + memcg_put_cache_ids(); + return err; } -EXPORT_SYMBOL_GPL(list_lru_init_key); +EXPORT_SYMBOL_GPL(__list_lru_init); void list_lru_destroy(struct list_lru *lru) { /* Already destroyed or not yet initialized? 
*/ if (!lru->node) return; + + memcg_get_cache_ids(); + list_lru_unregister(lru); + + memcg_destroy_list_lru(lru); kfree(lru->node); lru->node = NULL; + + memcg_put_cache_ids(); } EXPORT_SYMBOL_GPL(list_lru_destroy); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6706e5fa5ac0..afa55bb38cbd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2571,6 +2571,8 @@ static int memcg_alloc_cache_id(void) size = MEMCG_CACHES_MAX_SIZE; err = memcg_update_all_caches(size); + if (!err) + err = memcg_update_all_list_lrus(size); if (!err) memcg_nr_cache_ids = size; @@ -2765,6 +2767,24 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order) memcg_uncharge_kmem(memcg, 1 << order); page->mem_cgroup = NULL; } + +struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr) +{ + struct mem_cgroup *memcg = NULL; + struct kmem_cache *cachep; + struct page *page; + + page = virt_to_head_page(ptr); + if (PageSlab(page)) { + cachep = page->slab_cache; + if (!is_root_cache(cachep)) + memcg = cachep->memcg_params->memcg; + } else + /* page allocated by alloc_kmem_pages */ + memcg = page->mem_cgroup; + + return memcg; +} #endif /* CONFIG_MEMCG_KMEM */ #ifdef CONFIG_TRANSPARENT_HUGEPAGE -- cgit v1.2.3 From f7ce3190c4a35bf887adb7a1aa1ba899b679872d Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:20 -0800 Subject: slab: embed memcg_cache_params to kmem_cache Currently, kmem_cache stores a pointer to struct memcg_cache_params instead of embedding it. The rationale is to save memory when kmem accounting is disabled. However, the memcg_cache_params has shrivelled drastically since it was first introduced: * Initially: struct memcg_cache_params { bool is_root_cache; union { struct kmem_cache *memcg_caches[0]; struct { struct mem_cgroup *memcg; struct list_head list; struct kmem_cache *root_cache; bool dead; atomic_t nr_pages; struct work_struct destroy; }; }; }; * Now: struct memcg_cache_params { bool is_root_cache; union { struct { struct rcu_head rcu_head; struct kmem_cache *memcg_caches[0]; }; struct { struct mem_cgroup *memcg; struct kmem_cache *root_cache; }; }; }; So the memory saving does not seem to be a clear win anymore. OTOH, keeping a pointer to memcg_cache_params struct instead of embedding it results in touching one more cache line on kmem alloc/free hot paths. Besides, it makes linking kmem caches in a list chained by a field of struct memcg_cache_params really painful due to a level of indirection, while I want to make them linked in the following patch. That said, let us embed it. 
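The effect on the accessors is one pointer chase and one NULL check less on the hot path; for example, is_root_cache() in mm/slab.h (taken from the diff below) changes from

	return !s->memcg_params || s->memcg_params->is_root_cache;

to

	return s->memcg_params.is_root_cache;

and every s->memcg_params->memcg or s->memcg_params->root_cache access becomes a plain field access on the embedded structure.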
Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Cc: Dan Carpenter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 17 +++---- include/linux/slab_def.h | 2 +- include/linux/slub_def.h | 2 +- mm/memcontrol.c | 11 ++-- mm/slab.h | 48 +++++++++--------- mm/slab_common.c | 129 ++++++++++++++++++++++++++--------------------- mm/slub.c | 5 +- 7 files changed, 111 insertions(+), 103 deletions(-) (limited to 'mm') diff --git a/include/linux/slab.h b/include/linux/slab.h index 2e3b448cfa2d..1e03c11bbfbd 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -473,14 +473,14 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) #ifndef ARCH_SLAB_MINALIGN #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long) #endif + +struct memcg_cache_array { + struct rcu_head rcu; + struct kmem_cache *entries[0]; +}; + /* * This is the main placeholder for memcg-related information in kmem caches. - * struct kmem_cache will hold a pointer to it, so the memory cost while - * disabled is 1 pointer. The runtime cost while enabled, gets bigger than it - * would otherwise be if that would be bundled in kmem_cache: we'll need an - * extra pointer chase. But the trade off clearly lays in favor of not - * penalizing non-users. - * * Both the root cache and the child caches will have it. For the root cache, * this will hold a dynamically allocated array large enough to hold * information about the currently limited memcgs in the system. To allow the @@ -495,10 +495,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) struct memcg_cache_params { bool is_root_cache; union { - struct { - struct rcu_head rcu_head; - struct kmem_cache *memcg_caches[0]; - }; + struct memcg_cache_array __rcu *memcg_caches; struct { struct mem_cgroup *memcg; struct kmem_cache *root_cache; diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index b869d1662ba3..33d049066c3d 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -70,7 +70,7 @@ struct kmem_cache { int obj_offset; #endif /* CONFIG_DEBUG_SLAB */ #ifdef CONFIG_MEMCG_KMEM - struct memcg_cache_params *memcg_params; + struct memcg_cache_params memcg_params; #endif struct kmem_cache_node *node[MAX_NUMNODES]; diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index d82abd40a3c0..9abf04ed0999 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -85,7 +85,7 @@ struct kmem_cache { struct kobject kobj; /* For sysfs */ #endif #ifdef CONFIG_MEMCG_KMEM - struct memcg_cache_params *memcg_params; + struct memcg_cache_params memcg_params; int max_attr_size; /* for propagation, maximum size of a stored attr */ #ifdef CONFIG_SYSFS struct kset *memcg_kset; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index afa55bb38cbd..6f3c0fcd7a2d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -332,7 +332,7 @@ struct mem_cgroup { struct cg_proto tcp_mem; #endif #if defined(CONFIG_MEMCG_KMEM) - /* Index in the kmem_cache->memcg_params->memcg_caches array */ + /* Index in the kmem_cache->memcg_params.memcg_caches array */ int kmemcg_id; #endif @@ -531,7 +531,7 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) #ifdef CONFIG_MEMCG_KMEM /* - * This will be the memcg's index in each cache's ->memcg_params->memcg_caches. + * This will be the memcg's index in each cache's ->memcg_params.memcg_caches. 
* The main reason for not using cgroup id for this: * this works better in sparse environments, where we have a lot of memcgs, * but only a few kmem-limited. Or also, if we have, for instance, 200 @@ -2667,8 +2667,7 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) struct mem_cgroup *memcg; struct kmem_cache *memcg_cachep; - VM_BUG_ON(!cachep->memcg_params); - VM_BUG_ON(!cachep->memcg_params->is_root_cache); + VM_BUG_ON(!is_root_cache(cachep)); if (current->memcg_kmem_skip_account) return cachep; @@ -2702,7 +2701,7 @@ out: void __memcg_kmem_put_cache(struct kmem_cache *cachep) { if (!is_root_cache(cachep)) - css_put(&cachep->memcg_params->memcg->css); + css_put(&cachep->memcg_params.memcg->css); } /* @@ -2778,7 +2777,7 @@ struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr) if (PageSlab(page)) { cachep = page->slab_cache; if (!is_root_cache(cachep)) - memcg = cachep->memcg_params->memcg; + memcg = cachep->memcg_params.memcg; } else /* page allocated by alloc_kmem_pages */ memcg = page->mem_cgroup; diff --git a/mm/slab.h b/mm/slab.h index 90430d6f665e..53a623f85931 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -86,8 +86,6 @@ extern struct kmem_cache *create_kmalloc_cache(const char *name, size_t size, extern void create_boot_cache(struct kmem_cache *, const char *name, size_t size, unsigned long flags); -struct mem_cgroup; - int slab_unmergeable(struct kmem_cache *s); struct kmem_cache *find_mergeable(size_t size, size_t align, unsigned long flags, const char *name, void (*ctor)(void *)); @@ -167,14 +165,13 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer, #ifdef CONFIG_MEMCG_KMEM static inline bool is_root_cache(struct kmem_cache *s) { - return !s->memcg_params || s->memcg_params->is_root_cache; + return s->memcg_params.is_root_cache; } static inline bool slab_equal_or_root(struct kmem_cache *s, - struct kmem_cache *p) + struct kmem_cache *p) { - return (p == s) || - (s->memcg_params && (p == s->memcg_params->root_cache)); + return p == s || p == s->memcg_params.root_cache; } /* @@ -185,37 +182,30 @@ static inline bool slab_equal_or_root(struct kmem_cache *s, static inline const char *cache_name(struct kmem_cache *s) { if (!is_root_cache(s)) - return s->memcg_params->root_cache->name; + s = s->memcg_params.root_cache; return s->name; } /* * Note, we protect with RCU only the memcg_caches array, not per-memcg caches. - * That said the caller must assure the memcg's cache won't go away. Since once - * created a memcg's cache is destroyed only along with the root cache, it is - * true if we are going to allocate from the cache or hold a reference to the - * root cache by other means. Otherwise, we should hold either the slab_mutex - * or the memcg's slab_caches_mutex while calling this function and accessing - * the returned value. + * That said the caller must assure the memcg's cache won't go away by either + * taking a css reference to the owner cgroup, or holding the slab_mutex. */ static inline struct kmem_cache * cache_from_memcg_idx(struct kmem_cache *s, int idx) { struct kmem_cache *cachep; - struct memcg_cache_params *params; - - if (!s->memcg_params) - return NULL; + struct memcg_cache_array *arr; rcu_read_lock(); - params = rcu_dereference(s->memcg_params); + arr = rcu_dereference(s->memcg_params.memcg_caches); /* * Make sure we will access the up-to-date value. The code updating * memcg_caches issues a write barrier to match this (see - * memcg_register_cache()). + * memcg_create_kmem_cache()). 
*/ - cachep = lockless_dereference(params->memcg_caches[idx]); + cachep = lockless_dereference(arr->entries[idx]); rcu_read_unlock(); return cachep; @@ -225,7 +215,7 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) { if (is_root_cache(s)) return s; - return s->memcg_params->root_cache; + return s->memcg_params.root_cache; } static __always_inline int memcg_charge_slab(struct kmem_cache *s, @@ -235,7 +225,7 @@ static __always_inline int memcg_charge_slab(struct kmem_cache *s, return 0; if (is_root_cache(s)) return 0; - return memcg_charge_kmem(s->memcg_params->memcg, gfp, 1 << order); + return memcg_charge_kmem(s->memcg_params.memcg, gfp, 1 << order); } static __always_inline void memcg_uncharge_slab(struct kmem_cache *s, int order) @@ -244,9 +234,13 @@ static __always_inline void memcg_uncharge_slab(struct kmem_cache *s, int order) return; if (is_root_cache(s)) return; - memcg_uncharge_kmem(s->memcg_params->memcg, 1 << order); + memcg_uncharge_kmem(s->memcg_params.memcg, 1 << order); } -#else + +extern void slab_init_memcg_params(struct kmem_cache *); + +#else /* !CONFIG_MEMCG_KMEM */ + static inline bool is_root_cache(struct kmem_cache *s) { return true; @@ -282,7 +276,11 @@ static inline int memcg_charge_slab(struct kmem_cache *s, gfp_t gfp, int order) static inline void memcg_uncharge_slab(struct kmem_cache *s, int order) { } -#endif + +static inline void slab_init_memcg_params(struct kmem_cache *s) +{ +} +#endif /* CONFIG_MEMCG_KMEM */ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x) { diff --git a/mm/slab_common.c b/mm/slab_common.c index 23f5fcde6043..7cc32cf126ef 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -106,62 +106,66 @@ static inline int kmem_cache_sanity_check(const char *name, size_t size) #endif #ifdef CONFIG_MEMCG_KMEM -static int memcg_alloc_cache_params(struct mem_cgroup *memcg, - struct kmem_cache *s, struct kmem_cache *root_cache) +void slab_init_memcg_params(struct kmem_cache *s) { - size_t size; + s->memcg_params.is_root_cache = true; + RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL); +} + +static int init_memcg_params(struct kmem_cache *s, + struct mem_cgroup *memcg, struct kmem_cache *root_cache) +{ + struct memcg_cache_array *arr; - if (!memcg_kmem_enabled()) + if (memcg) { + s->memcg_params.is_root_cache = false; + s->memcg_params.memcg = memcg; + s->memcg_params.root_cache = root_cache; return 0; + } - if (!memcg) { - size = offsetof(struct memcg_cache_params, memcg_caches); - size += memcg_nr_cache_ids * sizeof(void *); - } else - size = sizeof(struct memcg_cache_params); + slab_init_memcg_params(s); - s->memcg_params = kzalloc(size, GFP_KERNEL); - if (!s->memcg_params) - return -ENOMEM; + if (!memcg_nr_cache_ids) + return 0; - if (memcg) { - s->memcg_params->memcg = memcg; - s->memcg_params->root_cache = root_cache; - } else - s->memcg_params->is_root_cache = true; + arr = kzalloc(sizeof(struct memcg_cache_array) + + memcg_nr_cache_ids * sizeof(void *), + GFP_KERNEL); + if (!arr) + return -ENOMEM; + RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr); return 0; } -static void memcg_free_cache_params(struct kmem_cache *s) +static void destroy_memcg_params(struct kmem_cache *s) { - kfree(s->memcg_params); + if (is_root_cache(s)) + kfree(rcu_access_pointer(s->memcg_params.memcg_caches)); } -static int memcg_update_cache_params(struct kmem_cache *s, int num_memcgs) +static int update_memcg_params(struct kmem_cache *s, int new_array_size) { - int size; - struct memcg_cache_params *new_params, 
*cur_params; + struct memcg_cache_array *old, *new; - BUG_ON(!is_root_cache(s)); - - size = offsetof(struct memcg_cache_params, memcg_caches); - size += num_memcgs * sizeof(void *); + if (!is_root_cache(s)) + return 0; - new_params = kzalloc(size, GFP_KERNEL); - if (!new_params) + new = kzalloc(sizeof(struct memcg_cache_array) + + new_array_size * sizeof(void *), GFP_KERNEL); + if (!new) return -ENOMEM; - cur_params = s->memcg_params; - memcpy(new_params->memcg_caches, cur_params->memcg_caches, - memcg_nr_cache_ids * sizeof(void *)); - - new_params->is_root_cache = true; - - rcu_assign_pointer(s->memcg_params, new_params); - if (cur_params) - kfree_rcu(cur_params, rcu_head); + old = rcu_dereference_protected(s->memcg_params.memcg_caches, + lockdep_is_held(&slab_mutex)); + if (old) + memcpy(new->entries, old->entries, + memcg_nr_cache_ids * sizeof(void *)); + rcu_assign_pointer(s->memcg_params.memcg_caches, new); + if (old) + kfree_rcu(old, rcu); return 0; } @@ -172,10 +176,7 @@ int memcg_update_all_caches(int num_memcgs) mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) { - if (!is_root_cache(s)) - continue; - - ret = memcg_update_cache_params(s, num_memcgs); + ret = update_memcg_params(s, num_memcgs); /* * Instead of freeing the memory, we'll just leave the caches * up to this point in an updated state. @@ -187,13 +188,13 @@ int memcg_update_all_caches(int num_memcgs) return ret; } #else -static inline int memcg_alloc_cache_params(struct mem_cgroup *memcg, - struct kmem_cache *s, struct kmem_cache *root_cache) +static inline int init_memcg_params(struct kmem_cache *s, + struct mem_cgroup *memcg, struct kmem_cache *root_cache) { return 0; } -static inline void memcg_free_cache_params(struct kmem_cache *s) +static inline void destroy_memcg_params(struct kmem_cache *s) { } #endif /* CONFIG_MEMCG_KMEM */ @@ -311,7 +312,7 @@ do_kmem_cache_create(char *name, size_t object_size, size_t size, size_t align, s->align = align; s->ctor = ctor; - err = memcg_alloc_cache_params(memcg, s, root_cache); + err = init_memcg_params(s, memcg, root_cache); if (err) goto out_free_cache; @@ -327,7 +328,7 @@ out: return s; out_free_cache: - memcg_free_cache_params(s); + destroy_memcg_params(s); kmem_cache_free(kmem_cache, s); goto out; } @@ -439,11 +440,15 @@ static int do_kmem_cache_shutdown(struct kmem_cache *s, #ifdef CONFIG_MEMCG_KMEM if (!is_root_cache(s)) { - struct kmem_cache *root_cache = s->memcg_params->root_cache; - int memcg_id = memcg_cache_id(s->memcg_params->memcg); - - BUG_ON(root_cache->memcg_params->memcg_caches[memcg_id] != s); - root_cache->memcg_params->memcg_caches[memcg_id] = NULL; + int idx; + struct memcg_cache_array *arr; + + idx = memcg_cache_id(s->memcg_params.memcg); + arr = rcu_dereference_protected(s->memcg_params.root_cache-> + memcg_params.memcg_caches, + lockdep_is_held(&slab_mutex)); + BUG_ON(arr->entries[idx] != s); + arr->entries[idx] = NULL; } #endif list_move(&s->list, release); @@ -481,27 +486,32 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache *root_cache) { static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */ - int memcg_id = memcg_cache_id(memcg); + struct memcg_cache_array *arr; struct kmem_cache *s = NULL; char *cache_name; + int idx; get_online_cpus(); get_online_mems(); mutex_lock(&slab_mutex); + idx = memcg_cache_id(memcg); + arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches, + lockdep_is_held(&slab_mutex)); + /* * Since per-memcg caches are created asynchronously on first * 
allocation (see memcg_kmem_get_cache()), several threads can try to * create the same cache, but only one of them may succeed. */ - if (cache_from_memcg_idx(root_cache, memcg_id)) + if (arr->entries[idx]) goto out_unlock; cgroup_name(mem_cgroup_css(memcg)->cgroup, memcg_name_buf, sizeof(memcg_name_buf)); cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name, - memcg_cache_id(memcg), memcg_name_buf); + idx, memcg_name_buf); if (!cache_name) goto out_unlock; @@ -525,7 +535,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, * initialized. */ smp_wmb(); - root_cache->memcg_params->memcg_caches[memcg_id] = s; + arr->entries[idx] = s; out_unlock: mutex_unlock(&slab_mutex); @@ -545,7 +555,7 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) mutex_lock(&slab_mutex); list_for_each_entry_safe(s, s2, &slab_caches, list) { - if (is_root_cache(s) || s->memcg_params->memcg != memcg) + if (is_root_cache(s) || s->memcg_params.memcg != memcg) continue; /* * The cgroup is about to be freed and therefore has no charges @@ -564,7 +574,7 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) void slab_kmem_cache_release(struct kmem_cache *s) { - memcg_free_cache_params(s); + destroy_memcg_params(s); kfree(s->name); kmem_cache_free(kmem_cache, s); } @@ -640,6 +650,9 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name, size_t siz s->name = name; s->size = s->object_size = size; s->align = calculate_alignment(flags, ARCH_KMALLOC_MINALIGN, size); + + slab_init_memcg_params(s); + err = __kmem_cache_create(s, flags); if (err) @@ -980,7 +993,7 @@ int memcg_slab_show(struct seq_file *m, void *p) if (p == slab_caches.next) print_slabinfo_header(m); - if (!is_root_cache(s) && s->memcg_params->memcg == memcg) + if (!is_root_cache(s) && s->memcg_params.memcg == memcg) cache_show(s, m); return 0; } diff --git a/mm/slub.c b/mm/slub.c index 8b8508adf9c2..75d55fdfe3a1 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3577,6 +3577,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache) p->slab_cache = s; #endif } + slab_init_memcg_params(s); list_add(&s->list, &slab_caches); return s; } @@ -4964,7 +4965,7 @@ static void memcg_propagate_slab_attrs(struct kmem_cache *s) if (is_root_cache(s)) return; - root_cache = s->memcg_params->root_cache; + root_cache = s->memcg_params.root_cache; /* * This mean this cache had no attribute written. Therefore, no point @@ -5044,7 +5045,7 @@ static inline struct kset *cache_kset(struct kmem_cache *s) { #ifdef CONFIG_MEMCG_KMEM if (!is_root_cache(s)) - return s->memcg_params->root_cache->memcg_kset; + return s->memcg_params.root_cache->memcg_kset; #endif return slab_kset; } -- cgit v1.2.3 From 426589f571f7d6d5ab2ca33ece73164149279ca1 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:23 -0800 Subject: slab: link memcg caches of the same kind into a list Sometimes, we need to iterate over all memcg copies of a particular root kmem cache. Currently, we use memcg_cache_params->memcg_caches array for that, because it contains all existing memcg caches. However, it's a bad practice to keep all caches, including those that belong to offline cgroups, in this array, because it will be growing beyond any bounds then. I'm going to wipe away dead caches from it to save space. To still be able to perform iterations over all memcg caches of the same kind, let us link them into a list. 
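A minimal sketch of the new iteration idiom, assuming a hypothetical helper update_all_children(); the macro and the locking rule come from the diff below.

	/* Caller must hold slab_mutex (see the comment added in mm/slab.h). */
	static void update_all_children(struct kmem_cache *root_cache)
	{
		struct kmem_cache *c;

		for_each_memcg_cache(c, root_cache)	/* walks memcg_params.list */
			tune_one_child(c);		/* hypothetical per-child work */
	}

Compared with the old for_each_memcg_cache_index() loop, there is no need to skip NULL array slots, and caches of dead cgroups can later be dropped from the array without breaking iteration.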
Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 4 ++++ mm/slab.c | 13 +++++-------- mm/slab.h | 17 +++++++++++++++++ mm/slab_common.c | 21 ++++++++++----------- mm/slub.c | 19 +++++-------------- 5 files changed, 41 insertions(+), 33 deletions(-) (limited to 'mm') diff --git a/include/linux/slab.h b/include/linux/slab.h index 1e03c11bbfbd..26d99f41b410 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -491,9 +491,13 @@ struct memcg_cache_array { * * @memcg: pointer to the memcg this cache belongs to * @root_cache: pointer to the global, root cache, this cache was derived from + * + * Both root and child caches of the same kind are linked into a list chained + * through @list. */ struct memcg_cache_params { bool is_root_cache; + struct list_head list; union { struct memcg_cache_array __rcu *memcg_caches; struct { diff --git a/mm/slab.c b/mm/slab.c index 65b5dcb6f671..7894017bc160 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3708,8 +3708,7 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit, int batchcount, int shared, gfp_t gfp) { int ret; - struct kmem_cache *c = NULL; - int i = 0; + struct kmem_cache *c; ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp); @@ -3719,12 +3718,10 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit, if ((ret < 0) || !is_root_cache(cachep)) return ret; - VM_BUG_ON(!mutex_is_locked(&slab_mutex)); - for_each_memcg_cache_index(i) { - c = cache_from_memcg_idx(cachep, i); - if (c) - /* return value determined by the parent cache only */ - __do_tune_cpucache(c, limit, batchcount, shared, gfp); + lockdep_assert_held(&slab_mutex); + for_each_memcg_cache(c, cachep) { + /* return value determined by the root cache only */ + __do_tune_cpucache(c, limit, batchcount, shared, gfp); } return ret; diff --git a/mm/slab.h b/mm/slab.h index 53a623f85931..0a56d76ac0e9 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -163,6 +163,18 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos); #ifdef CONFIG_MEMCG_KMEM +/* + * Iterate over all memcg caches of the given root cache. The caller must hold + * slab_mutex. 
+ */ +#define for_each_memcg_cache(iter, root) \ + list_for_each_entry(iter, &(root)->memcg_params.list, \ + memcg_params.list) + +#define for_each_memcg_cache_safe(iter, tmp, root) \ + list_for_each_entry_safe(iter, tmp, &(root)->memcg_params.list, \ + memcg_params.list) + static inline bool is_root_cache(struct kmem_cache *s) { return s->memcg_params.is_root_cache; @@ -241,6 +253,11 @@ extern void slab_init_memcg_params(struct kmem_cache *); #else /* !CONFIG_MEMCG_KMEM */ +#define for_each_memcg_cache(iter, root) \ + for ((void)(iter), (void)(root); 0; ) +#define for_each_memcg_cache_safe(iter, tmp, root) \ + for ((void)(iter), (void)(tmp), (void)(root); 0; ) + static inline bool is_root_cache(struct kmem_cache *s) { return true; diff --git a/mm/slab_common.c b/mm/slab_common.c index 7cc32cf126ef..989784bd88be 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -109,6 +109,7 @@ static inline int kmem_cache_sanity_check(const char *name, size_t size) void slab_init_memcg_params(struct kmem_cache *s) { s->memcg_params.is_root_cache = true; + INIT_LIST_HEAD(&s->memcg_params.list); RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL); } @@ -449,6 +450,7 @@ static int do_kmem_cache_shutdown(struct kmem_cache *s, lockdep_is_held(&slab_mutex)); BUG_ON(arr->entries[idx] != s); arr->entries[idx] = NULL; + list_del(&s->memcg_params.list); } #endif list_move(&s->list, release); @@ -529,6 +531,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, goto out_unlock; } + list_add(&s->memcg_params.list, &root_cache->memcg_params.list); + /* * Since readers won't lock (see cache_from_memcg_idx()), we need a * barrier here to ensure nobody will see the kmem_cache partially @@ -581,11 +585,13 @@ void slab_kmem_cache_release(struct kmem_cache *s) void kmem_cache_destroy(struct kmem_cache *s) { - int i; + struct kmem_cache *c, *c2; LIST_HEAD(release); bool need_rcu_barrier = false; bool busy = false; + BUG_ON(!is_root_cache(s)); + get_online_cpus(); get_online_mems(); @@ -595,10 +601,8 @@ void kmem_cache_destroy(struct kmem_cache *s) if (s->refcount) goto out_unlock; - for_each_memcg_cache_index(i) { - struct kmem_cache *c = cache_from_memcg_idx(s, i); - - if (c && do_kmem_cache_shutdown(c, &release, &need_rcu_barrier)) + for_each_memcg_cache_safe(c, c2, s) { + if (do_kmem_cache_shutdown(c, &release, &need_rcu_barrier)) busy = true; } @@ -932,16 +936,11 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info) { struct kmem_cache *c; struct slabinfo sinfo; - int i; if (!is_root_cache(s)) return; - for_each_memcg_cache_index(i) { - c = cache_from_memcg_idx(s, i); - if (!c) - continue; - + for_each_memcg_cache(c, s) { memset(&sinfo, 0, sizeof(sinfo)); get_slabinfo(c, &sinfo); diff --git a/mm/slub.c b/mm/slub.c index 75d55fdfe3a1..1e5a4636cb23 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3636,13 +3636,10 @@ struct kmem_cache * __kmem_cache_alias(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *)) { - struct kmem_cache *s; + struct kmem_cache *s, *c; s = find_mergeable(size, align, flags, name, ctor); if (s) { - int i; - struct kmem_cache *c; - s->refcount++; /* @@ -3652,10 +3649,7 @@ __kmem_cache_alias(const char *name, size_t size, size_t align, s->object_size = max(s->object_size, (int)size); s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *))); - for_each_memcg_cache_index(i) { - c = cache_from_memcg_idx(s, i); - if (!c) - continue; + for_each_memcg_cache(c, s) { c->object_size = s->object_size; c->inuse = max_t(int, c->inuse, ALIGN(size, 
sizeof(void *))); @@ -4921,7 +4915,7 @@ static ssize_t slab_attr_store(struct kobject *kobj, err = attribute->store(s, buf, len); #ifdef CONFIG_MEMCG_KMEM if (slab_state >= FULL && err >= 0 && is_root_cache(s)) { - int i; + struct kmem_cache *c; mutex_lock(&slab_mutex); if (s->max_attr_size < len) @@ -4944,11 +4938,8 @@ static ssize_t slab_attr_store(struct kobject *kobj, * directly either failed or succeeded, in which case we loop * through the descendants with best-effort propagation. */ - for_each_memcg_cache_index(i) { - struct kmem_cache *c = cache_from_memcg_idx(s, i); - if (c) - attribute->store(c, buf, len); - } + for_each_memcg_cache(c, s) + attribute->store(c, buf, len); mutex_unlock(&slab_mutex); } #endif -- cgit v1.2.3 From f1008365bbe4931d6a94dcfc11cf4cdada359664 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:29 -0800 Subject: slab: use css id for naming per memcg caches Currently, we use mem_cgroup->kmemcg_id to guarantee kmem_cache->name uniqueness. This is correct, because kmemcg_id is only released on css free after destroying all per memcg caches. However, I am going to change that and release kmemcg_id on css offline, because it is not wise to keep it for so long, wasting valuable entries of memcg_cache_params->memcg_caches arrays. Therefore, to preserve cache name uniqueness, let us switch to css->id. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slab_common.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/slab_common.c b/mm/slab_common.c index 989784bd88be..6087b1f9a385 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -488,6 +488,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache *root_cache) { static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */ + struct cgroup_subsys_state *css = mem_cgroup_css(memcg); struct memcg_cache_array *arr; struct kmem_cache *s = NULL; char *cache_name; @@ -510,10 +511,9 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, if (arr->entries[idx]) goto out_unlock; - cgroup_name(mem_cgroup_css(memcg)->cgroup, - memcg_name_buf, sizeof(memcg_name_buf)); + cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf)); cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name, - idx, memcg_name_buf); + css->id, memcg_name_buf); if (!cache_name) goto out_unlock; -- cgit v1.2.3 From 2a4db7eb9391a544ff58f4fa11d35246e87c87af Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:32 -0800 Subject: memcg: free memcg_caches slot on css offline We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only on allocations, so there is no need to have the array entries set until css free - we can clear them on css offline. This will allow us to reuse array entries more efficiently and avoid costly array relocations. 
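An illustrative condensation of the resulting lifecycle, based on the hooks added in the diff below (not new code):

	/*
	 * css offline:  memcg_deactivate_kmem(memcg)
	 *                 -> memcg->kmem_acct_active = false
	 *                 -> memcg_deactivate_kmem_caches(memcg), which clears
	 *                    arr->entries[idx] of every root cache under slab_mutex
	 * css free:     memcg_destroy_kmem(memcg)
	 *                 -> memcg_destroy_kmem_caches(memcg), which destroys the
	 *                    per-memcg caches themselves
	 */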
Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 10 +++++----- mm/memcontrol.c | 38 ++++++++++++++++++++++++++++++++------ mm/slab_common.c | 39 ++++++++++++++++++++++++++++----------- 3 files changed, 65 insertions(+), 22 deletions(-) (limited to 'mm') diff --git a/include/linux/slab.h b/include/linux/slab.h index 26d99f41b410..ed2ffaab59ea 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -115,13 +115,12 @@ int slab_is_available(void); struct kmem_cache *kmem_cache_create(const char *, size_t, size_t, unsigned long, void (*)(void *)); -#ifdef CONFIG_MEMCG_KMEM -void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); -void memcg_destroy_kmem_caches(struct mem_cgroup *); -#endif void kmem_cache_destroy(struct kmem_cache *); int kmem_cache_shrink(struct kmem_cache *); -void kmem_cache_free(struct kmem_cache *, void *); + +void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); +void memcg_deactivate_kmem_caches(struct mem_cgroup *); +void memcg_destroy_kmem_caches(struct mem_cgroup *); /* * Please use this macro to create slab caches. Simply specify the @@ -288,6 +287,7 @@ static __always_inline int kmalloc_index(size_t size) void *__kmalloc(size_t size, gfp_t flags); void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags); +void kmem_cache_free(struct kmem_cache *, void *); #ifdef CONFIG_NUMA void *__kmalloc_node(size_t size, gfp_t flags, int node); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6f3c0fcd7a2d..abfe0135bfdc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -334,6 +334,7 @@ struct mem_cgroup { #if defined(CONFIG_MEMCG_KMEM) /* Index in the kmem_cache->memcg_params.memcg_caches array */ int kmemcg_id; + bool kmem_acct_active; #endif int last_scanned_node; @@ -354,7 +355,7 @@ struct mem_cgroup { #ifdef CONFIG_MEMCG_KMEM bool memcg_kmem_is_active(struct mem_cgroup *memcg) { - return memcg->kmemcg_id >= 0; + return memcg->kmem_acct_active; } #endif @@ -585,7 +586,7 @@ static void memcg_free_cache_id(int id); static void disarm_kmem_keys(struct mem_cgroup *memcg) { - if (memcg_kmem_is_active(memcg)) { + if (memcg->kmemcg_id >= 0) { static_key_slow_dec(&memcg_kmem_enabled_key); memcg_free_cache_id(memcg->kmemcg_id); } @@ -2666,6 +2667,7 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) { struct mem_cgroup *memcg; struct kmem_cache *memcg_cachep; + int kmemcg_id; VM_BUG_ON(!is_root_cache(cachep)); @@ -2673,10 +2675,11 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) return cachep; memcg = get_mem_cgroup_from_mm(current->mm); - if (!memcg_kmem_is_active(memcg)) + kmemcg_id = ACCESS_ONCE(memcg->kmemcg_id); + if (kmemcg_id < 0) goto out; - memcg_cachep = cache_from_memcg_idx(cachep, memcg_cache_id(memcg)); + memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id); if (likely(memcg_cachep)) return memcg_cachep; @@ -3318,8 +3321,8 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, int err = 0; int memcg_id; - if (memcg_kmem_is_active(memcg)) - return 0; + BUG_ON(memcg->kmemcg_id >= 0); + BUG_ON(memcg->kmem_acct_active); /* * For simplicity, we won't allow this to be disabled. It also can't @@ -3362,6 +3365,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, * patched. 
*/ memcg->kmemcg_id = memcg_id; + memcg->kmem_acct_active = true; out: return err; } @@ -4041,6 +4045,22 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) return mem_cgroup_sockets_init(memcg, ss); } +static void memcg_deactivate_kmem(struct mem_cgroup *memcg) +{ + if (!memcg->kmem_acct_active) + return; + + /* + * Clear the 'active' flag before clearing memcg_caches arrays entries. + * Since we take the slab_mutex in memcg_deactivate_kmem_caches(), it + * guarantees no cache will be created for this cgroup after we are + * done (see memcg_create_kmem_cache()). + */ + memcg->kmem_acct_active = false; + + memcg_deactivate_kmem_caches(memcg); +} + static void memcg_destroy_kmem(struct mem_cgroup *memcg) { memcg_destroy_kmem_caches(memcg); @@ -4052,6 +4072,10 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) return 0; } +static void memcg_deactivate_kmem(struct mem_cgroup *memcg) +{ +} + static void memcg_destroy_kmem(struct mem_cgroup *memcg) { } @@ -4608,6 +4632,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) spin_unlock(&memcg->event_list_lock); vmpressure_cleanup(&memcg->vmpressure); + + memcg_deactivate_kmem(memcg); } static void mem_cgroup_css_free(struct cgroup_subsys_state *css) diff --git a/mm/slab_common.c b/mm/slab_common.c index 6087b1f9a385..0873bcc61c7a 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -440,18 +440,8 @@ static int do_kmem_cache_shutdown(struct kmem_cache *s, *need_rcu_barrier = true; #ifdef CONFIG_MEMCG_KMEM - if (!is_root_cache(s)) { - int idx; - struct memcg_cache_array *arr; - - idx = memcg_cache_id(s->memcg_params.memcg); - arr = rcu_dereference_protected(s->memcg_params.root_cache-> - memcg_params.memcg_caches, - lockdep_is_held(&slab_mutex)); - BUG_ON(arr->entries[idx] != s); - arr->entries[idx] = NULL; + if (!is_root_cache(s)) list_del(&s->memcg_params.list); - } #endif list_move(&s->list, release); return 0; @@ -499,6 +489,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, mutex_lock(&slab_mutex); + /* + * The memory cgroup could have been deactivated while the cache + * creation work was pending. + */ + if (!memcg_kmem_is_active(memcg)) + goto out_unlock; + idx = memcg_cache_id(memcg); arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches, lockdep_is_held(&slab_mutex)); @@ -548,6 +545,26 @@ out_unlock: put_online_cpus(); } +void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) +{ + int idx; + struct memcg_cache_array *arr; + struct kmem_cache *s; + + idx = memcg_cache_id(memcg); + + mutex_lock(&slab_mutex); + list_for_each_entry(s, &slab_caches, list) { + if (!is_root_cache(s)) + continue; + + arr = rcu_dereference_protected(s->memcg_params.memcg_caches, + lockdep_is_held(&slab_mutex)); + arr->entries[idx] = NULL; + } + mutex_unlock(&slab_mutex); +} + void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) { LIST_HEAD(release); -- cgit v1.2.3 From 3f97b163207c67a3b35931494ad3db1de66356f0 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:35 -0800 Subject: list_lru: add helpers to isolate items Currently, the isolate callback passed to the list_lru_walk family of functions is supposed to just delete an item from the list upon returning LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by __list_lru_walk_one after the callback returns. 
Since the callback is allowed to drop the lock after removing an item (it has to return LRU_REMOVED_RETRY then), the nr_items can be less than the actual number of elements on the list even if we check them under the lock. This makes it difficult to move items from one list_lru_one to another, which is required for per-memcg list_lru reparenting - we can't just splice the lists, we have to move entries one by one. This patch therefore introduces helpers that must be used by callback functions to isolate items instead of raw list_del/list_move. These are list_lru_isolate and list_lru_isolate_move. They not only remove the entry from the list, but also fix the nr_items counter, making sure nr_items always reflects the actual number of elements on the list if checked under the appropriate lock. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/dcache.c | 21 +++++++++++---------- fs/gfs2/quota.c | 5 +++-- fs/inode.c | 8 ++++---- fs/xfs/xfs_buf.c | 6 ++++-- fs/xfs/xfs_qm.c | 5 +++-- include/linux/list_lru.h | 9 +++++++-- mm/list_lru.c | 19 ++++++++++++++++--- mm/workingset.c | 3 ++- 8 files changed, 50 insertions(+), 26 deletions(-) (limited to 'mm') diff --git a/fs/dcache.c b/fs/dcache.c index 56c5da89f58a..d04be762b216 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -400,19 +400,20 @@ static void d_shrink_add(struct dentry *dentry, struct list_head *list) * LRU lists entirely, while shrink_move moves it to the indicated * private list. */ -static void d_lru_isolate(struct dentry *dentry) +static void d_lru_isolate(struct list_lru_one *lru, struct dentry *dentry) { D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST); dentry->d_flags &= ~DCACHE_LRU_LIST; this_cpu_dec(nr_dentry_unused); - list_del_init(&dentry->d_lru); + list_lru_isolate(lru, &dentry->d_lru); } -static void d_lru_shrink_move(struct dentry *dentry, struct list_head *list) +static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry, + struct list_head *list) { D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST); dentry->d_flags |= DCACHE_SHRINK_LIST; - list_move_tail(&dentry->d_lru, list); + list_lru_isolate_move(lru, &dentry->d_lru, list); } /* @@ -869,8 +870,8 @@ static void shrink_dentry_list(struct list_head *list) } } -static enum lru_status -dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) +static enum lru_status dentry_lru_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct list_head *freeable = arg; struct dentry *dentry = container_of(item, struct dentry, d_lru); @@ -890,7 +891,7 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) * another pass through the LRU. 
*/ if (dentry->d_lockref.count) { - d_lru_isolate(dentry); + d_lru_isolate(lru, dentry); spin_unlock(&dentry->d_lock); return LRU_REMOVED; } @@ -921,7 +922,7 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) return LRU_ROTATE; } - d_lru_shrink_move(dentry, freeable); + d_lru_shrink_move(lru, dentry, freeable); spin_unlock(&dentry->d_lock); return LRU_REMOVED; @@ -951,7 +952,7 @@ long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc) } static enum lru_status dentry_lru_isolate_shrink(struct list_head *item, - spinlock_t *lru_lock, void *arg) + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct list_head *freeable = arg; struct dentry *dentry = container_of(item, struct dentry, d_lru); @@ -964,7 +965,7 @@ static enum lru_status dentry_lru_isolate_shrink(struct list_head *item, if (!spin_trylock(&dentry->d_lock)) return LRU_SKIP; - d_lru_shrink_move(dentry, freeable); + d_lru_shrink_move(lru, dentry, freeable); spin_unlock(&dentry->d_lock); return LRU_REMOVED; diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c index c15d6b216d0b..3aa17d4d1cfc 100644 --- a/fs/gfs2/quota.c +++ b/fs/gfs2/quota.c @@ -145,7 +145,8 @@ static void gfs2_qd_dispose(struct list_head *list) } -static enum lru_status gfs2_qd_isolate(struct list_head *item, spinlock_t *lock, void *arg) +static enum lru_status gfs2_qd_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct list_head *dispose = arg; struct gfs2_quota_data *qd = list_entry(item, struct gfs2_quota_data, qd_lru); @@ -155,7 +156,7 @@ static enum lru_status gfs2_qd_isolate(struct list_head *item, spinlock_t *lock, if (qd->qd_lockref.count == 0) { lockref_mark_dead(&qd->qd_lockref); - list_move(&qd->qd_lru, dispose); + list_lru_isolate_move(lru, &qd->qd_lru, dispose); } spin_unlock(&qd->qd_lockref.lock); diff --git a/fs/inode.c b/fs/inode.c index 524a32c2b0c6..86c612b92c6f 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -685,8 +685,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) * LRU does not have strict ordering. Hence we don't want to reclaim inodes * with this flag set because they are the inodes that are out of order. 
*/ -static enum lru_status -inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) +static enum lru_status inode_lru_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct list_head *freeable = arg; struct inode *inode = container_of(item, struct inode, i_lru); @@ -704,7 +704,7 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) */ if (atomic_read(&inode->i_count) || (inode->i_state & ~I_REFERENCED)) { - list_del_init(&inode->i_lru); + list_lru_isolate(lru, &inode->i_lru); spin_unlock(&inode->i_lock); this_cpu_dec(nr_unused); return LRU_REMOVED; @@ -738,7 +738,7 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) WARN_ON(inode->i_state & I_NEW); inode->i_state |= I_FREEING; - list_move(&inode->i_lru, freeable); + list_lru_isolate_move(lru, &inode->i_lru, freeable); spin_unlock(&inode->i_lock); this_cpu_dec(nr_unused); diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 15c9d224c721..1790b00bea7a 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1488,6 +1488,7 @@ xfs_buf_iomove( static enum lru_status xfs_buftarg_wait_rele( struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) @@ -1509,7 +1510,7 @@ xfs_buftarg_wait_rele( */ atomic_set(&bp->b_lru_ref, 0); bp->b_state |= XFS_BSTATE_DISPOSE; - list_move(item, dispose); + list_lru_isolate_move(lru, item, dispose); spin_unlock(&bp->b_lock); return LRU_REMOVED; } @@ -1546,6 +1547,7 @@ xfs_wait_buftarg( static enum lru_status xfs_buftarg_isolate( struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { @@ -1569,7 +1571,7 @@ xfs_buftarg_isolate( } bp->b_state |= XFS_BSTATE_DISPOSE; - list_move(item, dispose); + list_lru_isolate_move(lru, item, dispose); spin_unlock(&bp->b_lock); return LRU_REMOVED; } diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index 4f4b1274e144..53cc2aaf8d2b 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -430,6 +430,7 @@ struct xfs_qm_isolate { static enum lru_status xfs_qm_dquot_isolate( struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) __releases(lru_lock) __acquires(lru_lock) @@ -450,7 +451,7 @@ xfs_qm_dquot_isolate( XFS_STATS_INC(xs_qm_dqwants); trace_xfs_dqreclaim_want(dqp); - list_del_init(&dqp->q_lru); + list_lru_isolate(lru, &dqp->q_lru); XFS_STATS_DEC(xs_qm_dquot_unused); return LRU_REMOVED; } @@ -494,7 +495,7 @@ xfs_qm_dquot_isolate( xfs_dqunlock(dqp); ASSERT(dqp->q_nrefs == 0); - list_move_tail(&dqp->q_lru, &isol->dispose); + list_lru_isolate_move(lru, &dqp->q_lru, &isol->dispose); XFS_STATS_DEC(xs_qm_dquot_unused); trace_xfs_dqreclaim_done(dqp); XFS_STATS_INC(xs_qm_dqreclaims); diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index 305b598abac2..7edf9c9ab9eb 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -125,8 +125,13 @@ static inline unsigned long list_lru_count(struct list_lru *lru) return count; } -typedef enum lru_status -(*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg); +void list_lru_isolate(struct list_lru_one *list, struct list_head *item); +void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item, + struct list_head *head); + +typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, + struct list_lru_one *list, spinlock_t *lock, void *cb_arg); + /** * list_lru_walk_one: walk a list_lru, isolating and disposing freeable items. * @lru: the lru pointer. 
diff --git a/mm/list_lru.c b/mm/list_lru.c index 79aee70c3b9d..8d9d168c6c38 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -132,6 +132,21 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) } EXPORT_SYMBOL_GPL(list_lru_del); +void list_lru_isolate(struct list_lru_one *list, struct list_head *item) +{ + list_del_init(item); + list->nr_items--; +} +EXPORT_SYMBOL_GPL(list_lru_isolate); + +void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item, + struct list_head *head) +{ + list_move(item, head); + list->nr_items--; +} +EXPORT_SYMBOL_GPL(list_lru_isolate_move); + static unsigned long __list_lru_count_one(struct list_lru *lru, int nid, int memcg_idx) { @@ -194,13 +209,11 @@ restart: break; --*nr_to_walk; - ret = isolate(item, &nlru->lock, cb_arg); + ret = isolate(item, l, &nlru->lock, cb_arg); switch (ret) { case LRU_REMOVED_RETRY: assert_spin_locked(&nlru->lock); case LRU_REMOVED: - l->nr_items--; - WARN_ON_ONCE(l->nr_items < 0); isolated++; /* * If the lru lock has been dropped, our list diff --git a/mm/workingset.c b/mm/workingset.c index d4fa7fb10a52..aa017133744b 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -302,6 +302,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, } static enum lru_status shadow_lru_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { @@ -332,7 +333,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, goto out; } - list_del_init(item); + list_lru_isolate(lru, item); spin_unlock(lru_lock); /* -- cgit v1.2.3 From 2788cf0c401c268b4819c5407493a8769b7007aa Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:38 -0800 Subject: memcg: reparent list_lrus and free kmemcg_id on css offline Now, the only reason to keep kmemcg_id till css free is list_lru, which uses it to distribute elements between per-memcg lists. However, it can be easily sorted out - we only need to change kmemcg_id of an offline cgroup to its parent's id, making further list_lru_add()'s add elements to the parent's list, and then move all elements from the offline cgroup's list to the one of its parent. It will work, because a racing list_lru_del() does not need to know the list it is deleting the element from. It can decrement the wrong nr_items counter though, but the ongoing reparenting will fix it. After list_lru reparenting is done we are free to release kmemcg_id saving a valuable slot in a per-memcg array for new cgroups. 
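The isolate helpers added above and the child-to-parent move just described can be illustrated with a self-contained userspace toy (invented names throughout; the kernel version splices the whole list and transfers the counter in one go under the lru lock, the sketch moves entries one at a time only to stay short):

    #include <stdio.h>

    struct toy_item { struct toy_item *prev, *next; };

    struct toy_lru_one {
        struct toy_item list;   /* circular list head */
        long nr_items;          /* kept signed, as in the kernel */
    };

    static void toy_lru_init(struct toy_lru_one *l)
    {
        l->list.prev = l->list.next = &l->list;
        l->nr_items = 0;
    }

    static void toy_lru_add(struct toy_lru_one *l, struct toy_item *it)
    {
        it->prev = l->list.prev;
        it->next = &l->list;
        l->list.prev->next = it;
        l->list.prev = it;
        l->nr_items++;
    }

    /* analogue of list_lru_isolate(): unlink and fix the counter together */
    static void toy_lru_isolate(struct toy_lru_one *l, struct toy_item *it)
    {
        it->prev->next = it->next;
        it->next->prev = it->prev;
        it->prev = it->next = it;
        l->nr_items--;
    }

    /* analogue of draining a child lru into its parent on reparenting */
    static void toy_lru_drain(struct toy_lru_one *child, struct toy_lru_one *parent)
    {
        while (child->list.next != &child->list) {
            struct toy_item *it = child->list.next;
            toy_lru_isolate(child, it);
            toy_lru_add(parent, it);
        }
        /* both counters again match the actual list lengths */
    }

    int main(void)
    {
        struct toy_lru_one child, parent;
        struct toy_item a, b;

        toy_lru_init(&child);
        toy_lru_init(&parent);
        toy_lru_add(&child, &a);
        toy_lru_add(&child, &b);
        toy_lru_drain(&child, &parent);
        printf("child=%ld parent=%ld\n", child.nr_items, parent.nr_items);
        return 0;
    }

The point of routing every removal through an isolate-style helper is that the list and nr_items can never drift apart, which is exactly what the reparenting relies on.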
Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/list_lru.h | 3 ++- mm/list_lru.c | 46 +++++++++++++++++++++++++++++++++++++++++++--- mm/memcontrol.c | 39 ++++++++++++++++++++++++++++++++++----- 3 files changed, 79 insertions(+), 9 deletions(-) (limited to 'mm') diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index 7edf9c9ab9eb..2a6b9947aaa3 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -26,7 +26,7 @@ enum lru_status { struct list_lru_one { struct list_head list; - /* kept as signed so we can catch imbalance bugs */ + /* may become negative during memcg reparenting */ long nr_items; }; @@ -62,6 +62,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, #define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL) int memcg_update_all_list_lrus(int num_memcgs); +void memcg_drain_all_list_lrus(int src_idx, int dst_idx); /** * list_lru_add: add an element to the lru list's tail diff --git a/mm/list_lru.c b/mm/list_lru.c index 8d9d168c6c38..909eca2c820e 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -100,7 +100,6 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) spin_lock(&nlru->lock); l = list_lru_from_kmem(nlru, item); - WARN_ON_ONCE(l->nr_items < 0); if (list_empty(item)) { list_add_tail(item, &l->list); l->nr_items++; @@ -123,7 +122,6 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) if (!list_empty(item)) { list_del_init(item); l->nr_items--; - WARN_ON_ONCE(l->nr_items < 0); spin_unlock(&nlru->lock); return true; } @@ -156,7 +154,6 @@ static unsigned long __list_lru_count_one(struct list_lru *lru, spin_lock(&nlru->lock); l = list_lru_from_memcg_idx(nlru, memcg_idx); - WARN_ON_ONCE(l->nr_items < 0); count = l->nr_items; spin_unlock(&nlru->lock); @@ -458,6 +455,49 @@ fail: memcg_cancel_update_list_lru(lru, old_size, new_size); goto out; } + +static void memcg_drain_list_lru_node(struct list_lru_node *nlru, + int src_idx, int dst_idx) +{ + struct list_lru_one *src, *dst; + + /* + * Since list_lru_{add,del} may be called under an IRQ-safe lock, + * we have to use IRQ-safe primitives here to avoid deadlock. 
+ */ + spin_lock_irq(&nlru->lock); + + src = list_lru_from_memcg_idx(nlru, src_idx); + dst = list_lru_from_memcg_idx(nlru, dst_idx); + + list_splice_init(&src->list, &dst->list); + dst->nr_items += src->nr_items; + src->nr_items = 0; + + spin_unlock_irq(&nlru->lock); +} + +static void memcg_drain_list_lru(struct list_lru *lru, + int src_idx, int dst_idx) +{ + int i; + + if (!list_lru_memcg_aware(lru)) + return; + + for (i = 0; i < nr_node_ids; i++) + memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx); +} + +void memcg_drain_all_list_lrus(int src_idx, int dst_idx) +{ + struct list_lru *lru; + + mutex_lock(&list_lrus_mutex); + list_for_each_entry(lru, &list_lrus, list) + memcg_drain_list_lru(lru, src_idx, dst_idx); + mutex_unlock(&list_lrus_mutex); +} #else static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index abfe0135bfdc..419c06b1794a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -334,6 +334,7 @@ struct mem_cgroup { #if defined(CONFIG_MEMCG_KMEM) /* Index in the kmem_cache->memcg_params.memcg_caches array */ int kmemcg_id; + bool kmem_acct_activated; bool kmem_acct_active; #endif @@ -582,14 +583,10 @@ void memcg_put_cache_ids(void) struct static_key memcg_kmem_enabled_key; EXPORT_SYMBOL(memcg_kmem_enabled_key); -static void memcg_free_cache_id(int id); - static void disarm_kmem_keys(struct mem_cgroup *memcg) { - if (memcg->kmemcg_id >= 0) { + if (memcg->kmem_acct_activated) static_key_slow_dec(&memcg_kmem_enabled_key); - memcg_free_cache_id(memcg->kmemcg_id); - } /* * This check can't live in kmem destruction function, * since the charges will outlive the cgroup @@ -3322,6 +3319,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, int memcg_id; BUG_ON(memcg->kmemcg_id >= 0); + BUG_ON(memcg->kmem_acct_activated); BUG_ON(memcg->kmem_acct_active); /* @@ -3365,6 +3363,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, * patched. */ memcg->kmemcg_id = memcg_id; + memcg->kmem_acct_activated = true; memcg->kmem_acct_active = true; out: return err; @@ -4047,6 +4046,10 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) static void memcg_deactivate_kmem(struct mem_cgroup *memcg) { + struct cgroup_subsys_state *css; + struct mem_cgroup *parent, *child; + int kmemcg_id; + if (!memcg->kmem_acct_active) return; @@ -4059,6 +4062,32 @@ static void memcg_deactivate_kmem(struct mem_cgroup *memcg) memcg->kmem_acct_active = false; memcg_deactivate_kmem_caches(memcg); + + kmemcg_id = memcg->kmemcg_id; + BUG_ON(kmemcg_id < 0); + + parent = parent_mem_cgroup(memcg); + if (!parent) + parent = root_mem_cgroup; + + /* + * Change kmemcg_id of this cgroup and all its descendants to the + * parent's id, and then move all entries from this cgroup's list_lrus + * to ones of the parent. After we have finished, all list_lrus + * corresponding to this cgroup are guaranteed to remain empty. The + * ordering is imposed by list_lru_node->lock taken by + * memcg_drain_all_list_lrus(). 
+ */ + css_for_each_descendant_pre(css, &memcg->css) { + child = mem_cgroup_from_css(css); + BUG_ON(child->kmemcg_id != kmemcg_id); + child->kmemcg_id = parent->kmemcg_id; + if (!memcg->use_hierarchy) + break; + } + memcg_drain_all_list_lrus(kmemcg_id, parent->kmemcg_id); + + memcg_free_cache_id(kmemcg_id); } static void memcg_destroy_kmem(struct mem_cgroup *memcg) -- cgit v1.2.3 From 832f37f5d5f5c7281880c21eb09508750b67f540 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:41 -0800 Subject: slub: never fail to shrink cache SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but also tries to rearrange the partial lists to place slabs filled up most to the head to cope with fragmentation. To achieve that, it allocates a temporary array of lists used to sort slabs by the number of objects in use. If the allocation fails, the whole procedure is aborted. This is unacceptable for the kernel memory accounting extension of the memory cgroup, where we want to make sure that kmem_cache_shrink() successfully discarded empty slabs. Although the allocation failure is utterly unlikely with the current page allocator implementation, which retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not to rely on that. This patch therefore makes __kmem_cache_shrink() allocate the array on stack instead of calling kmalloc, which may fail. The array size is chosen to be equal to 32, because most SLUB caches store not more than 32 objects per slab page. Slab pages with <= 32 free objects are sorted using the array by the number of objects in use and promoted to the head of the partial list, while slab pages with > 32 free objects are left in the end of the list without any ordering imposed on them. Signed-off-by: Vladimir Davydov Acked-by: Christoph Lameter Acked-by: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Huang Ying Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 58 +++++++++++++++++++++++++++++++--------------------------- 1 file changed, 31 insertions(+), 27 deletions(-) (limited to 'mm') diff --git a/mm/slub.c b/mm/slub.c index 1e5a4636cb23..d97b692165d2 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3358,11 +3358,12 @@ void kfree(const void *x) } EXPORT_SYMBOL(kfree); +#define SHRINK_PROMOTE_MAX 32 + /* - * kmem_cache_shrink removes empty slabs from the partial lists and sorts - * the remaining slabs by the number of items in use. The slabs with the - * most items in use come first. New allocations will then fill those up - * and thus they can be removed from the partial lists. + * kmem_cache_shrink discards empty slabs and promotes the slabs filled + * up most to the head of the partial lists. New allocations will then + * fill those up and thus they can be removed from the partial lists. * * The slabs with the least items are placed last. 
This results in them * being allocated from last increasing the chance that the last objects @@ -3375,51 +3376,57 @@ int __kmem_cache_shrink(struct kmem_cache *s) struct kmem_cache_node *n; struct page *page; struct page *t; - int objects = oo_objects(s->max); - struct list_head *slabs_by_inuse = - kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL); + struct list_head discard; + struct list_head promote[SHRINK_PROMOTE_MAX]; unsigned long flags; - if (!slabs_by_inuse) - return -ENOMEM; - flush_all(s); for_each_kmem_cache_node(s, node, n) { if (!n->nr_partial) continue; - for (i = 0; i < objects; i++) - INIT_LIST_HEAD(slabs_by_inuse + i); + INIT_LIST_HEAD(&discard); + for (i = 0; i < SHRINK_PROMOTE_MAX; i++) + INIT_LIST_HEAD(promote + i); spin_lock_irqsave(&n->list_lock, flags); /* - * Build lists indexed by the items in use in each slab. + * Build lists of slabs to discard or promote. * * Note that concurrent frees may occur while we hold the * list_lock. page->inuse here is the upper limit. */ list_for_each_entry_safe(page, t, &n->partial, lru) { - list_move(&page->lru, slabs_by_inuse + page->inuse); - if (!page->inuse) + int free = page->objects - page->inuse; + + /* Do not reread page->inuse */ + barrier(); + + /* We do not keep full slabs on the list */ + BUG_ON(free <= 0); + + if (free == page->objects) { + list_move(&page->lru, &discard); n->nr_partial--; + } else if (free <= SHRINK_PROMOTE_MAX) + list_move(&page->lru, promote + free - 1); } /* - * Rebuild the partial list with the slabs filled up most - * first and the least used slabs at the end. + * Promote the slabs filled up most to the head of the + * partial list. */ - for (i = objects - 1; i > 0; i--) - list_splice(slabs_by_inuse + i, n->partial.prev); + for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--) + list_splice(promote + i, &n->partial); spin_unlock_irqrestore(&n->list_lock, flags); /* Release empty slabs */ - list_for_each_entry_safe(page, t, slabs_by_inuse, lru) + list_for_each_entry_safe(page, t, &discard, lru) discard_slab(s, page); } - kfree(slabs_by_inuse); return 0; } @@ -4686,12 +4693,9 @@ static ssize_t shrink_show(struct kmem_cache *s, char *buf) static ssize_t shrink_store(struct kmem_cache *s, const char *buf, size_t length) { - if (buf[0] == '1') { - int rc = kmem_cache_shrink(s); - - if (rc) - return rc; - } else + if (buf[0] == '1') + kmem_cache_shrink(s); + else return -EINVAL; return length; } -- cgit v1.2.3 From ce3712d74d8ed531a9fd0fbb711ff8fefbacdd9f Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:44 -0800 Subject: slub: fix kmem_cache_shrink return value It is supposed to return 0 if the cache has no remaining objects and 1 otherwise, while currently it always returns 0. Fix it. 
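The promotion scheme used by the __kmem_cache_shrink() rewrite above can be mimicked with plain arrays: bucket each partially used slab by its number of free objects up to a fixed maximum, discard the empty ones, and rebuild the list with the fullest slabs first. Everything below (toy_slab, TOY_PROMOTE_MAX, the sample data) is a hypothetical userspace sketch, not the SLUB code, and it ignores the concurrent frees the real code has to tolerate.

    #include <stdio.h>

    #define TOY_PROMOTE_MAX 32
    #define TOY_OBJECTS     64              /* objects per toy slab */

    struct toy_slab { int inuse; };

    int main(void)
    {
        /* a toy partial list; full slabs never sit on it */
        struct toy_slab partial[] = { {10}, {63}, {0}, {40} };
        int n = sizeof(partial) / sizeof(partial[0]);
        const struct toy_slab *promote[TOY_PROMOTE_MAX + 1][8];
        int count[TOY_PROMOTE_MAX + 1] = { 0 };
        int discarded = 0;

        for (int i = 0; i < n; i++) {
            int free = TOY_OBJECTS - partial[i].inuse;

            if (free == TOY_OBJECTS) {
                discarded++;                /* empty slab: release it */
                continue;
            }
            if (free <= TOY_PROMOTE_MAX)
                promote[free][count[free]++] = &partial[i];
            /* slabs with more free objects keep their old position */
        }

        /* splice back so the almost-full slabs (free == 1) end up at the head */
        printf("discarded %d empty slab(s); promoted order:", discarded);
        for (int free = 1; free <= TOY_PROMOTE_MAX; free++)
            for (int j = 0; j < count[free]; j++)
                printf(" inuse=%d", promote[free][j]->inuse);
        printf("\n");
        return 0;
    }

Using a fixed number of buckets is what lets the real code keep the scratch lists on the stack and never fail, at the price of leaving very sparse slabs unsorted at the tail.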
Signed-off-by: Vladimir Davydov Acked-by: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/slub.c b/mm/slub.c index d97b692165d2..7fa27aee9b6e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3379,6 +3379,7 @@ int __kmem_cache_shrink(struct kmem_cache *s) struct list_head discard; struct list_head promote[SHRINK_PROMOTE_MAX]; unsigned long flags; + int ret = 0; flush_all(s); for_each_kmem_cache_node(s, node, n) { @@ -3425,9 +3426,12 @@ int __kmem_cache_shrink(struct kmem_cache *s) /* Release empty slabs */ list_for_each_entry_safe(page, t, &discard, lru) discard_slab(s, page); + + if (slabs_node(s, node)) + ret = 1; } - return 0; + return ret; } static int slab_mem_going_offline_callback(void *arg) -- cgit v1.2.3 From d6e0b7fa11862433773d986b5f995ffdf47ce672 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:47 -0800 Subject: slub: make dead caches discard free slabs immediately To speed up further allocations SLUB may store empty slabs in per cpu/node partial lists instead of freeing them immediately. This prevents per memcg caches destruction, because kmem caches created for a memory cgroup are only destroyed after the last page charged to the cgroup is freed. To fix this issue, this patch resurrects approach first proposed in [1]. It forbids SLUB to cache empty slabs after the memory cgroup that the cache belongs to was destroyed. It is achieved by setting kmem_cache's cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so that it would drop frozen empty slabs immediately if cpu_partial = 0. The runtime overhead is minimal. From all the hot functions, we only touch relatively cold put_cpu_partial(): we make it call unfreeze_partials() after freezing a slab that belongs to an offline memory cgroup. Since slab freezing exists to avoid moving slabs from/to a partial list on free/alloc, and there can't be allocations from dead caches, it shouldn't cause any overhead. We do have to disable preemption for put_cpu_partial() to achieve that though. The original patch was accepted well and even merged to the mm tree. However, I decided to withdraw it due to changes happening to the memcg core at that time. I had an idea of introducing per-memcg shrinkers for kmem caches, but now, as memcg has finally settled down, I do not see it as an option, because SLUB shrinker would be too costly to call since SLUB does not keep free slabs on a separate list. Besides, we currently do not even call per-memcg shrinkers for offline memcgs. Overall, it would introduce much more complexity to both SLUB and memcg than this small patch. Regarding to SLAB, there's no problem with it, because it shrinks per-cpu/node caches periodically. Thanks to list_lru reparenting, we no longer keep entries for offline cgroups in per-memcg arrays (such as memcg_cache_params->memcg_caches), so we do not have to bother if a per-memcg cache will be shrunk a bit later than it could be. 
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650 Signed-off-by: Vladimir Davydov Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slab.c | 4 ++-- mm/slab.h | 2 +- mm/slab_common.c | 15 +++++++++++++-- mm/slob.c | 2 +- mm/slub.c | 31 ++++++++++++++++++++++++++----- 5 files changed, 43 insertions(+), 11 deletions(-) (limited to 'mm') diff --git a/mm/slab.c b/mm/slab.c index 7894017bc160..c4b89eaf4c96 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2382,7 +2382,7 @@ out: return nr_freed; } -int __kmem_cache_shrink(struct kmem_cache *cachep) +int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate) { int ret = 0; int node; @@ -2404,7 +2404,7 @@ int __kmem_cache_shutdown(struct kmem_cache *cachep) { int i; struct kmem_cache_node *n; - int rc = __kmem_cache_shrink(cachep); + int rc = __kmem_cache_shrink(cachep, false); if (rc) return rc; diff --git a/mm/slab.h b/mm/slab.h index 0a56d76ac0e9..4c3ac12dd644 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -138,7 +138,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size, #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS) int __kmem_cache_shutdown(struct kmem_cache *); -int __kmem_cache_shrink(struct kmem_cache *); +int __kmem_cache_shrink(struct kmem_cache *, bool); void slab_kmem_cache_release(struct kmem_cache *); struct seq_file; diff --git a/mm/slab_common.c b/mm/slab_common.c index 0873bcc61c7a..1a1cc89acaa3 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -549,10 +549,13 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) { int idx; struct memcg_cache_array *arr; - struct kmem_cache *s; + struct kmem_cache *s, *c; idx = memcg_cache_id(memcg); + get_online_cpus(); + get_online_mems(); + mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) { if (!is_root_cache(s)) @@ -560,9 +563,17 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) arr = rcu_dereference_protected(s->memcg_params.memcg_caches, lockdep_is_held(&slab_mutex)); + c = arr->entries[idx]; + if (!c) + continue; + + __kmem_cache_shrink(c, true); arr->entries[idx] = NULL; } mutex_unlock(&slab_mutex); + + put_online_mems(); + put_online_cpus(); } void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) @@ -649,7 +660,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep) get_online_cpus(); get_online_mems(); - ret = __kmem_cache_shrink(cachep); + ret = __kmem_cache_shrink(cachep, false); put_online_mems(); put_online_cpus(); return ret; diff --git a/mm/slob.c b/mm/slob.c index 96a86206a26b..94a7fede6d48 100644 --- a/mm/slob.c +++ b/mm/slob.c @@ -618,7 +618,7 @@ int __kmem_cache_shutdown(struct kmem_cache *c) return 0; } -int __kmem_cache_shrink(struct kmem_cache *d) +int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate) { return 0; } diff --git a/mm/slub.c b/mm/slub.c index 7fa27aee9b6e..06cdb1829dc9 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) int pages; int pobjects; + preempt_disable(); do { pages = 0; pobjects = 0; @@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage); + if (unlikely(!s->cpu_partial)) { + unsigned long flags; + + local_irq_save(flags); + unfreeze_partials(s, this_cpu_ptr(s->cpu_slab)); + local_irq_restore(flags); + } + preempt_enable(); 
#endif } @@ -3369,7 +3378,7 @@ EXPORT_SYMBOL(kfree); * being allocated from last increasing the chance that the last objects * are freed in them. */ -int __kmem_cache_shrink(struct kmem_cache *s) +int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate) { int node; int i; @@ -3381,11 +3390,23 @@ int __kmem_cache_shrink(struct kmem_cache *s) unsigned long flags; int ret = 0; + if (deactivate) { + /* + * Disable empty slabs caching. Used to avoid pinning offline + * memory cgroups by kmem pages that can be freed. + */ + s->cpu_partial = 0; + s->min_partial = 0; + + /* + * s->cpu_partial is checked locklessly (see put_cpu_partial), + * so we have to make sure the change is visible. + */ + kick_all_cpus_sync(); + } + flush_all(s); for_each_kmem_cache_node(s, node, n) { - if (!n->nr_partial) - continue; - INIT_LIST_HEAD(&discard); for (i = 0; i < SHRINK_PROMOTE_MAX; i++) INIT_LIST_HEAD(promote + i); @@ -3440,7 +3461,7 @@ static int slab_mem_going_offline_callback(void *arg) mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) - __kmem_cache_shrink(s); + __kmem_cache_shrink(s, false); mutex_unlock(&slab_mutex); return 0; -- cgit v1.2.3 From 372549c2a3778fd3df445819811c944ad54609ca Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Thu, 12 Feb 2015 14:59:50 -0800 Subject: mm/compaction: fix wrong order check in compact_finished() What we want to check here is whether there is highorder freepage in buddy list of other migratetype in order to steal it without fragmentation. But, current code just checks cc->order which means allocation request order. So, this is wrong. Without this fix, non-movable synchronous compaction below pageblock order would not stopped until compaction is complete, because migratetype of most pageblocks are movable and high order freepage made by compaction is usually on movable type buddy list. There is some report related to this bug. See below link. http://www.spinics.net/lists/linux-mm/msg81666.html Although the issued system still has load spike comes from compaction, this makes that system completely stable and responsive according to his report. stress-highalloc test in mmtests with non movable order 7 allocation doesn't show any notable difference in allocation success rate, but, it shows more compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 18.47 : 28.94 Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available") Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Reviewed-by: Zhang Yanfei Cc: Mel Gorman Cc: David Rientjes Cc: Rik van Riel Cc: [3.7+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/compaction.c b/mm/compaction.c index b68736c8a1ce..4954e196680c 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1173,7 +1173,7 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc, return COMPACT_PARTIAL; /* Job done if allocation would set block type */ - if (cc->order >= pageblock_order && area->nr_free) + if (order >= pageblock_order && area->nr_free) return COMPACT_PARTIAL; } -- cgit v1.2.3 From 932ff6bbbdcadd85b309ef4fd59d4d8a77329b8b Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Thu, 12 Feb 2015 14:59:53 -0800 Subject: mm/compaction: stop the isolation when we isolate enough freepage Currently, freepage isolation in one pageblock doesn't consider how many freepages we isolate. 
When I traced flow of compaction, compaction sometimes isolates more than 256 freepages to migrate just 32 pages. In this patch, freepage isolation is stopped at the point that we have more isolated freepage than isolated page for migration. This results in slowing down free page scanner and make compaction success rate higher. stress-highalloc test in mmtests with non movable order 7 allocation shows increase of compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 27.13 : 31.82 pfn where both scanners meets on compaction complete (separate test due to enormous tracepoint buffer) (zone_start=4096, zone_end=1048576) 586034 : 654378 In fact, I didn't fully understand why this patch results in such good result. There was a guess that not used freepages are released to pcp list and on next compaction trial we won't isolate them again so compaction success rate would decrease. To prevent this effect, I tested with adding pcp drain code on release_freepages(), but, it has no good effect. Anyway, this patch reduces waste time to isolate unneeded freepages so seems reasonable. Vlastimil said: : I briefly tried it on top of the pivot-changing series and with order-9 : allocations it reduced free page scanned counter by almost 10%. No effect : on success rates (maybe because pivot changing already took care of the : scanners meeting problem) but the scanning reduction is good on its own. : : It also explains why e14c720efdd7 ("mm, compaction: remember position : within pageblock in free pages scanner") had less than expected : improvements. It would only actually stop within pageblock in case of : async compaction detecting contention. I guess that's also why the : infinite loop problem fixed by 1d5bfe1ffb5b affected so relatively few : people. Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Tested-by: Vlastimil Babka Reviewed-by: Zhang Yanfei Cc: Mel Gorman Cc: David Rientjes Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/compaction.c b/mm/compaction.c index 4954e196680c..782772df62c8 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -490,6 +490,13 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, /* If a page was split, advance to the end of it */ if (isolated) { + cc->nr_freepages += isolated; + if (!strict && + cc->nr_migratepages <= cc->nr_freepages) { + blockpfn += isolated; + break; + } + blockpfn += isolated - 1; cursor += isolated - 1; continue; @@ -899,7 +906,6 @@ static void isolate_freepages(struct compact_control *cc) unsigned long isolate_start_pfn; /* exact pfn we start at */ unsigned long block_end_pfn; /* end of current pageblock */ unsigned long low_pfn; /* lowest pfn scanner is able to scan */ - int nr_freepages = cc->nr_freepages; struct list_head *freelist = &cc->freepages; /* @@ -924,11 +930,11 @@ static void isolate_freepages(struct compact_control *cc) * pages on cc->migratepages. We stop searching if the migrate * and free page scanners meet or enough free pages are isolated. 
*/ - for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages; + for (; block_start_pfn >= low_pfn && + cc->nr_migratepages > cc->nr_freepages; block_end_pfn = block_start_pfn, block_start_pfn -= pageblock_nr_pages, isolate_start_pfn = block_start_pfn) { - unsigned long isolated; /* * This can iterate a massively long zone without finding any @@ -953,9 +959,8 @@ static void isolate_freepages(struct compact_control *cc) continue; /* Found a block suitable for isolating free pages from. */ - isolated = isolate_freepages_block(cc, &isolate_start_pfn, + isolate_freepages_block(cc, &isolate_start_pfn, block_end_pfn, freelist, false); - nr_freepages += isolated; /* * Remember where the free scanner should restart next time, @@ -987,8 +992,6 @@ static void isolate_freepages(struct compact_control *cc) */ if (block_start_pfn < low_pfn) cc->free_pfn = cc->migrate_pfn; - - cc->nr_freepages = nr_freepages; } /* -- cgit v1.2.3 From f48b80a5e22200347e91f96b8b237b24b93c7192 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:56 -0800 Subject: memcg: cleanup static keys decrement Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of wrapper functions. Although this patch moves static keys decrement from __mem_cgroup_free to mem_cgroup_css_free, it does not introduce any functional changes, because the keys are incremented on setting the limit (tcp or kmem), which can only happen after successful mem_cgroup_css_online. Signed-off-by: Vladimir Davydov Cc: Glauber Costa Cc: KAMEZAWA Hiroyuki Cc: Eric W. Biederman Cc: David S. Miller Cc: Johannes Weiner Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/net/sock.h | 5 ----- mm/memcontrol.c | 38 +++++--------------------------------- net/ipv4/tcp_memcontrol.c | 4 ++++ 3 files changed, 9 insertions(+), 38 deletions(-) (limited to 'mm') diff --git a/include/net/sock.h b/include/net/sock.h index e13824570b0f..ab186b1d31ff 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1077,11 +1077,6 @@ static inline bool memcg_proto_active(struct cg_proto *cg_proto) return test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); } -static inline bool memcg_proto_activated(struct cg_proto *cg_proto) -{ - return test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags); -} - #ifdef SOCK_REFCNT_DEBUG static inline void sk_refcnt_debug_inc(struct sock *sk) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 419c06b1794a..d18d3a6e7337 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -519,16 +519,6 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(tcp_proto_cgroup); -static void disarm_sock_keys(struct mem_cgroup *memcg) -{ - if (!memcg_proto_activated(&memcg->tcp_mem)) - return; - static_key_slow_dec(&memcg_socket_limit_enabled); -} -#else -static void disarm_sock_keys(struct mem_cgroup *memcg) -{ -} #endif #ifdef CONFIG_MEMCG_KMEM @@ -583,28 +573,8 @@ void memcg_put_cache_ids(void) struct static_key memcg_kmem_enabled_key; EXPORT_SYMBOL(memcg_kmem_enabled_key); -static void disarm_kmem_keys(struct mem_cgroup *memcg) -{ - if (memcg->kmem_acct_activated) - static_key_slow_dec(&memcg_kmem_enabled_key); - /* - * This check can't live in kmem destruction function, - * since the charges will outlive the cgroup - */ - WARN_ON(page_counter_read(&memcg->kmem)); -} -#else -static void disarm_kmem_keys(struct mem_cgroup *memcg) -{ -} #endif /* CONFIG_MEMCG_KMEM */ -static void 
disarm_static_keys(struct mem_cgroup *memcg) -{ - disarm_sock_keys(memcg); - disarm_kmem_keys(memcg); -} - static struct mem_cgroup_per_zone * mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) { @@ -4092,7 +4062,11 @@ static void memcg_deactivate_kmem(struct mem_cgroup *memcg) static void memcg_destroy_kmem(struct mem_cgroup *memcg) { - memcg_destroy_kmem_caches(memcg); + if (memcg->kmem_acct_activated) { + memcg_destroy_kmem_caches(memcg); + static_key_slow_dec(&memcg_kmem_enabled_key); + WARN_ON(page_counter_read(&memcg->kmem)); + } mem_cgroup_sockets_destroy(memcg); } #else @@ -4523,8 +4497,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) free_mem_cgroup_per_zone_info(memcg, node); free_percpu(memcg->stat); - - disarm_static_keys(memcg); kfree(memcg); } diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c index c2a75c6957a1..2379c1b4efb2 100644 --- a/net/ipv4/tcp_memcontrol.c +++ b/net/ipv4/tcp_memcontrol.c @@ -47,6 +47,10 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg) return; percpu_counter_destroy(&cg_proto->sockets_allocated); + + if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) + static_key_slow_dec(&memcg_socket_limit_enabled); + } EXPORT_SYMBOL(tcp_destroy_cgroup); -- cgit v1.2.3 From fc5199d1a9c9ed14e22651d0fd3b10c79e7e1f6d Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:02 -0800 Subject: mm/internal.h: don't split printk call in two All users of mminit_dprintk pass a compile-time constant as level, so this just makes gcc emit a single printk call instead of two. Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/internal.h b/mm/internal.h index c4d6c9b43491..a96da5b0029d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -351,8 +351,10 @@ extern int mminit_loglevel; #define mminit_dprintk(level, prefix, fmt, arg...) \ do { \ if (level < mminit_loglevel) { \ - printk(level <= MMINIT_WARNING ? KERN_WARNING : KERN_DEBUG); \ - printk(KERN_CONT "mminit::" prefix " " fmt, ##arg); \ + if (level <= MMINIT_WARNING) \ + printk(KERN_WARNING "mminit::" prefix " " fmt, ##arg); \ + else \ + printk(KERN_DEBUG "mminit::" prefix " " fmt, ##arg); \ } \ } while (0) -- cgit v1.2.3 From 061f67bc4d053e03970a268fca99a55b6859f301 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:06 -0800 Subject: mm/page_alloc.c: pull out init code from build_all_zonelists Pulling the code protected by if (system_state == SYSTEM_BOOTING) into its own helper allows us to shrink .text a little. This relies on build_all_zonelists already having a __ref annotation. Add a comment explaining why so one doesn't have to track it down through git log. 
The real saving comes in 3/5, ("mm/mm_init.c: Mark mminit_verify_zonelist as __init"), where we save about 400 bytes Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8d52ab18fe0d..375327b8e932 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3871,18 +3871,29 @@ static int __build_all_zonelists(void *data) return 0; } +static noinline void __init +build_all_zonelists_init(void) +{ + __build_all_zonelists(NULL); + mminit_verify_zonelist(); + cpuset_init_current_mems_allowed(); +} + /* * Called with zonelists_mutex held always * unless system_state == SYSTEM_BOOTING. + * + * __ref due to (1) call of __meminit annotated setup_zone_pageset + * [we're only called with non-NULL zone through __meminit paths] and + * (2) call of __init annotated helper build_all_zonelists_init + * [protected by SYSTEM_BOOTING]. */ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) { set_zonelist_order(); if (system_state == SYSTEM_BOOTING) { - __build_all_zonelists(NULL); - mminit_verify_zonelist(); - cpuset_init_current_mems_allowed(); + build_all_zonelists_init(); } else { #ifdef CONFIG_MEMORY_HOTPLUG if (zone) -- cgit v1.2.3 From 0e2342c709aa568b90cde3387d6e588ca862a0ba Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:09 -0800 Subject: mm/mm_init.c: park mminit_verify_zonelist as __init The only caller of mminit_verify_zonelist is build_all_zonelists_init, which is annotated with __init, so it should be safe to also mark the former as __init, saving ~400 bytes of .text. Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mm_init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/mm_init.c b/mm/mm_init.c index 4074caf9936b..e17c758b27bf 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -21,7 +21,7 @@ int mminit_loglevel; #endif /* The zonelists are simply reported, validation is manual. */ -void mminit_verify_zonelist(void) +void __init mminit_verify_zonelist(void) { int nid; -- cgit v1.2.3 From 194e81512063e96763025a990f841f5ab24815ee Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:12 -0800 Subject: mm/mm_init.c: mark mminit_loglevel __meminitdata mminit_loglevel is only referenced from __init and __meminit functions, so we can mark it __meminitdata. 
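The annotations used by the last three patches (__init, __meminitdata) come down to grouping boot-only symbols in linker sections that can be kept or dropped as a unit. A rough userspace analogue, assuming GCC/Clang on an ELF target and using a made-up section name rather than the kernel's .meminit.data, looks like this:

    #include <stdio.h>

    /* toy stand-in for __meminitdata: park init-only data in its own section */
    #define toy_initdata __attribute__((section(".toy.init.data")))

    static int toy_initdata toy_loglevel = 2;   /* only read while "booting" */

    static void toy_boot_setup(void)
    {
        if (toy_loglevel > 1)
            printf("verbose boot-time checks enabled\n");
    }

    int main(void)
    {
        toy_boot_setup();
        /*
         * In the kernel the linker script collects .meminit.data and either
         * keeps it (memory hotplug configured in) or frees it after boot.
         * Userspace has no equivalent discard step; the grouping is only
         * visible in the object file, e.g. objdump -t | grep toy_loglevel.
         */
        return 0;
    }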
Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mm_init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/mm_init.c b/mm/mm_init.c index e17c758b27bf..5f420f7fafa1 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -14,7 +14,7 @@ #include "internal.h" #ifdef CONFIG_DEBUG_MEMORY_INIT -int mminit_loglevel; +int __meminitdata mminit_loglevel; #ifndef SECTIONS_SHIFT #define SECTIONS_SHIFT 0 -- cgit v1.2.3 From 9cb12d7b4ccaa976f97ce0c5fd0f1b6a83bc2a75 Mon Sep 17 00:00:00 2001 From: Grazvydas Ignotas Date: Thu, 12 Feb 2015 15:00:19 -0800 Subject: mm/memory.c: actually remap enough memory For whatever reason, generic_access_phys() only remaps one page, but actually allows to access arbitrary size. It's quite easy to trigger large reads, like printing out large structure with gdb, which leads to a crash. Fix it by remapping correct size. Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure") Signed-off-by: Grazvydas Ignotas Cc: Rik van Riel Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index f7886ab036e7..99275325f303 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3462,7 +3462,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, if (follow_phys(vma, addr, write, &prot, &phys_addr)) return -EINVAL; - maddr = ioremap_prot(phys_addr, PAGE_SIZE, prot); + maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot); if (write) memcpy_toio(maddr + offset, buf, len); else -- cgit v1.2.3 From 84109e15ddc8fe00c832d5228ca2aedf95d13b97 Mon Sep 17 00:00:00 2001 From: Yaowei Bai Date: Thu, 12 Feb 2015 15:00:22 -0800 Subject: mm/page_alloc: fix comment Add a necessary 'leave'. Signed-off-by: Yaowei Bai Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 375327b8e932..cb4758263f6b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -172,7 +172,7 @@ static void __free_pages_ok(struct page *page, unsigned int order); * 1G machine -> (16M dma, 784M normal, 224M high) * NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA * HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL - * HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA + * HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA * * TBD: should special case ZONE_DMA32 machines here - in those we normally * don't need any ZONE_NORMAL reservation -- cgit v1.2.3 From 9ab3b598d2dfbdb0153ffa7e4b1456bbff59a25d Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Thu, 12 Feb 2015 15:00:25 -0800 Subject: mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page() A race condition starts to be visible in recent mmotm, where a PG_hwpoison flag is set on a migration source page *before* it's back in the buddy page pool. This is problematic because no page flag is supposed to be set when freeing (see __free_one_page().) So the user-visible effect of this race is that it could trigger the BUG_ON() when soft-offlining is called.
The root cause is that we call lru_add_drain_all() to make sure that the page is in buddy, but that doesn't work because this function just schedules a work item and doesn't wait for its completion. drain_all_pages() does the draining directly, so simply dropping lru_add_drain_all() solves this problem. Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining") Signed-off-by: Naoya Horiguchi Cc: Andi Kleen Cc: Tony Luck Cc: Chen Gong Cc: [3.11+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'mm') diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 1a735fad2a13..d487f8dc6d39 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1646,8 +1646,6 @@ static int __soft_offline_page(struct page *page, int flags) * source page should be freed back to buddy before * setting PG_hwpoison. */ - if (!is_free_buddy_page(page)) - lru_add_drain_all(); if (!is_free_buddy_page(page)) drain_all_pages(page_zone(page)); SetPageHWPoison(page); -- cgit v1.2.3 From ff59909a077b3c51c168cb658601c6b63136a347 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Thu, 12 Feb 2015 15:00:28 -0800 Subject: mm: fix negative nr_isolated counts The vmstat interfaces are good at hiding negative counts (at least when CONFIG_SMP); but if you peer behind the curtain, you find that nr_isolated_anon and nr_isolated_file soon go negative, and grow ever more negative: so they can absorb larger and larger numbers of isolated pages, yet still appear to be zero. I'm happy to avoid a congestion_wait() when too_many_isolated() myself; but I guess it's there for a good reason, in which case we ought to get too_many_isolated() working again. The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case: putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot to call acct_isolated() to increment them. It is possible that the bug which this patch fixes could cause OOM kills when the system still has a lot of reclaimable page cache. Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()") Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka Acked-by: Joonsoo Kim Cc: [3.18+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/compaction.c b/mm/compaction.c index 782772df62c8..d50d6de6f1b6 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1103,8 +1103,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn, isolate_mode); - if (!low_pfn || cc->contended) + if (!low_pfn || cc->contended) { + acct_isolated(zone, cc); return ISOLATE_ABORT; + } /* * Either we isolated something and proceed with migration. Or -- cgit v1.2.3 From 3eba0c6a56c04f2b017b43641a821f1ebfb7fb4c Mon Sep 17 00:00:00 2001 From: Ganesh Mahendran Date: Thu, 12 Feb 2015 15:00:51 -0800 Subject: mm/zpool: add name argument to create zpool Currently the underlay of zpool: zsmalloc/zbud, do not know who creates them. There is not a method to let zsmalloc/zbud find which caller they belong to. Now we want to add statistics collection in zsmalloc. We need to name the debugfs dir for each pool created. The way suggested by Minchan Kim is to use a name passed by the caller (such as zram) to create the zsmalloc pool.
/sys/kernel/debug/zsmalloc/zram0 This patch adds an argument `name' to zs_create_pool() and other related functions. Signed-off-by: Ganesh Mahendran Acked-by: Minchan Kim Cc: Seth Jennings Cc: Nitin Gupta Cc: Dan Streetman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 8 +++++--- include/linux/zpool.h | 5 +++-- include/linux/zsmalloc.h | 2 +- mm/zbud.c | 3 ++- mm/zpool.c | 6 ++++-- mm/zsmalloc.c | 6 +++--- mm/zswap.c | 5 +++-- 7 files changed, 21 insertions(+), 14 deletions(-) (limited to 'mm') diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index eca4b67274c1..8e233edd7a09 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -327,9 +327,10 @@ static void zram_meta_free(struct zram_meta *meta, u64 disksize) kfree(meta); } -static struct zram_meta *zram_meta_alloc(u64 disksize) +static struct zram_meta *zram_meta_alloc(int device_id, u64 disksize) { size_t num_pages; + char pool_name[8]; struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL); if (!meta) @@ -342,7 +343,8 @@ static struct zram_meta *zram_meta_alloc(u64 disksize) goto out_error; } - meta->mem_pool = zs_create_pool(GFP_NOIO | __GFP_HIGHMEM); + snprintf(pool_name, sizeof(pool_name), "zram%d", device_id); + meta->mem_pool = zs_create_pool(pool_name, GFP_NOIO | __GFP_HIGHMEM); if (!meta->mem_pool) { pr_err("Error creating memory pool\n"); goto out_error; @@ -783,7 +785,7 @@ static ssize_t disksize_store(struct device *dev, return -EINVAL; disksize = PAGE_ALIGN(disksize); - meta = zram_meta_alloc(disksize); + meta = zram_meta_alloc(zram->disk->first_minor, disksize); if (!meta) return -ENOMEM; diff --git a/include/linux/zpool.h b/include/linux/zpool.h index f14bd75f08b3..56529b34dc63 100644 --- a/include/linux/zpool.h +++ b/include/linux/zpool.h @@ -36,7 +36,8 @@ enum zpool_mapmode { ZPOOL_MM_DEFAULT = ZPOOL_MM_RW }; -struct zpool *zpool_create_pool(char *type, gfp_t gfp, struct zpool_ops *ops); +struct zpool *zpool_create_pool(char *type, char *name, + gfp_t gfp, struct zpool_ops *ops); char *zpool_get_type(struct zpool *pool); @@ -80,7 +81,7 @@ struct zpool_driver { atomic_t refcount; struct list_head list; - void *(*create)(gfp_t gfp, struct zpool_ops *ops); + void *(*create)(char *name, gfp_t gfp, struct zpool_ops *ops); void (*destroy)(void *pool); int (*malloc)(void *pool, size_t size, gfp_t gfp, diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index 05c214760977..3283c6a55425 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -36,7 +36,7 @@ enum zs_mapmode { struct zs_pool; -struct zs_pool *zs_create_pool(gfp_t flags); +struct zs_pool *zs_create_pool(char *name, gfp_t flags); void zs_destroy_pool(struct zs_pool *pool); unsigned long zs_malloc(struct zs_pool *pool, size_t size); diff --git a/mm/zbud.c b/mm/zbud.c index 4e387bea702e..2ee4e4520493 100644 --- a/mm/zbud.c +++ b/mm/zbud.c @@ -130,7 +130,8 @@ static struct zbud_ops zbud_zpool_ops = { .evict = zbud_zpool_evict }; -static void *zbud_zpool_create(gfp_t gfp, struct zpool_ops *zpool_ops) +static void *zbud_zpool_create(char *name, gfp_t gfp, + struct zpool_ops *zpool_ops) { return zbud_create_pool(gfp, zpool_ops ? &zbud_zpool_ops : NULL); } diff --git a/mm/zpool.c b/mm/zpool.c index 739cdf0d183a..bacdab6e47de 100644 --- a/mm/zpool.c +++ b/mm/zpool.c @@ -129,6 +129,7 @@ static void zpool_put_driver(struct zpool_driver *driver) /** * zpool_create_pool() - Create a new zpool * @type The type of the zpool to create (e.g. 
zbud, zsmalloc) + * @name The name of the zpool (e.g. zram0, zswap) * @gfp The GFP flags to use when allocating the pool. * @ops The optional ops callback. * @@ -140,7 +141,8 @@ static void zpool_put_driver(struct zpool_driver *driver) * * Returns: New zpool on success, NULL on failure. */ -struct zpool *zpool_create_pool(char *type, gfp_t gfp, struct zpool_ops *ops) +struct zpool *zpool_create_pool(char *type, char *name, gfp_t gfp, + struct zpool_ops *ops) { struct zpool_driver *driver; struct zpool *zpool; @@ -168,7 +170,7 @@ struct zpool *zpool_create_pool(char *type, gfp_t gfp, struct zpool_ops *ops) zpool->type = driver->type; zpool->driver = driver; - zpool->pool = driver->create(gfp, ops); + zpool->pool = driver->create(name, gfp, ops); zpool->ops = ops; if (!zpool->pool) { diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index b72403927aa4..2359e61b02bf 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -246,9 +246,9 @@ struct mapping_area { #ifdef CONFIG_ZPOOL -static void *zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_ops) +static void *zs_zpool_create(char *name, gfp_t gfp, struct zpool_ops *zpool_ops) { - return zs_create_pool(gfp); + return zs_create_pool(name, gfp); } static void zs_zpool_destroy(void *pool) @@ -1148,7 +1148,7 @@ EXPORT_SYMBOL_GPL(zs_free); * On success, a pointer to the newly created pool is returned, * otherwise NULL. */ -struct zs_pool *zs_create_pool(gfp_t flags) +struct zs_pool *zs_create_pool(char *name, gfp_t flags) { int i; struct zs_pool *pool; diff --git a/mm/zswap.c b/mm/zswap.c index 0cfce9bc51e4..4249e82ff934 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -906,11 +906,12 @@ static int __init init_zswap(void) pr_info("loading zswap\n"); - zswap_pool = zpool_create_pool(zswap_zpool_type, gfp, &zswap_zpool_ops); + zswap_pool = zpool_create_pool(zswap_zpool_type, "zswap", gfp, + &zswap_zpool_ops); if (!zswap_pool && strcmp(zswap_zpool_type, ZSWAP_ZPOOL_DEFAULT)) { pr_info("%s zpool not available\n", zswap_zpool_type); zswap_zpool_type = ZSWAP_ZPOOL_DEFAULT; - zswap_pool = zpool_create_pool(zswap_zpool_type, gfp, + zswap_pool = zpool_create_pool(zswap_zpool_type, "zswap", gfp, &zswap_zpool_ops); } if (!zswap_pool) { -- cgit v1.2.3 From 0f050d997e275cf0e47ddc7006284eaa3c6fe049 Mon Sep 17 00:00:00 2001 From: Ganesh Mahendran Date: Thu, 12 Feb 2015 15:00:54 -0800 Subject: mm/zsmalloc: add statistics support Keeping zsmalloc fragmentation at a low level is our target. For now, though, we still need debug code in zsmalloc to get quantitative data. This patch adds a new configuration CONFIG_ZSMALLOC_STAT to enable statistics collection for developers. Currently only the object statistics in each class are collected. Users can get the information via debugfs: cat /sys/kernel/debug/zsmalloc/zram0/...
For example, after I copied "jdk-8u25-linux-x64.tar.gz" to zram with an ext4 filesystem:
 class  size  obj_allocated  obj_used  pages_used
     0    32              0         0           0
     1    48            256        12           3
     2    64             64        14           1
     3    80             51         7           1
     4    96            128         5           3
     5   112             73         5           2
     6   128             32         4           1
     7   144              0         0           0
     8   160              0         0           0
     9   176              0         0           0
    10   192              0         0           0
    11   208              0         0           0
    12   224              0         0           0
    13   240              0         0           0
    14   256             16         1           1
    15   272             15         9           1
    16   288              0         0           0
    17   304              0         0           0
    18   320              0         0           0
    19   336              0         0           0
    20   352              0         0           0
    21   368              0         0           0
    22   384              0         0           0
    23   400              0         0           0
    24   416              0         0           0
    25   432              0         0           0
    26   448              0         0           0
    27   464              0         0           0
    28   480              0         0           0
    29   496             33         1           4
    30   512              0         0           0
    31   528              0         0           0
    32   544              0         0           0
    33   560              0         0           0
    34   576              0         0           0
    35   592              0         0           0
    36   608              0         0           0
    37   624              0         0           0
    38   640              0         0           0
    40   672              0         0           0
    42   704              0         0           0
    43   720             17         1           3
    44   736              0         0           0
    46   768              0         0           0
    49   816              0         0           0
    51   848              0         0           0
    52   864             14         1           3
    54   896              0         0           0
    57   944             13         1           3
    58   960              0         0           0
    62  1024              4         1           1
    66  1088             15         2           4
    67  1104              0         0           0
    71  1168              0         0           0
    74  1216              0         0           0
    76  1248              0         0           0
    83  1360              3         1           1
    91  1488             11         1           4
    94  1536              0         0           0
   100  1632              5         1           2
   107  1744              0         0           0
   111  1808              9         1           4
   126  2048              4         4           2
   144  2336              7         3           4
   151  2448              0         0           0
   168  2720             15        15          10
   190  3072             28        27          21
   202  3264              0         0           0
   254  4096          36209     36209       36209
 Total              37022     36326       36288
We can calculate the overall fragmentation from the last line:
 Total              37022     36326       36288
 (37022 - 36326) / 37022 = 1.87%
Also, by analysing the objects allocated in every class, we know why we got such low fragmentation: most of the allocated objects are in class 254, and there is only 1 page in a class 254 zspage, so no fragmentation will be introduced by allocating objects in class 254. In the future, we can collect other zsmalloc statistics as needed and analyse them. Signed-off-by: Ganesh Mahendran Suggested-by: Minchan Kim Acked-by: Minchan Kim Cc: Nitin Gupta Cc: Seth Jennings Cc: Dan Streetman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/Kconfig | 10 +++ mm/zsmalloc.c | 233 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 239 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/Kconfig b/mm/Kconfig index 4395b12869c8..de5239c152f9 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -602,6 +602,16 @@ config PGTABLE_MAPPING You can check speed with zsmalloc benchmark: https://github.com/spartacus06/zsmapbench +config ZSMALLOC_STAT + bool "Export zsmalloc statistics" + depends on ZSMALLOC + select DEBUG_FS + help + This option enables code in the zsmalloc to collect various + statistics about whats happening in zsmalloc and exports that + information to userspace via debugfs. + If unsure, say N.
+ config GENERIC_EARLY_IOREMAP bool diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 2359e61b02bf..0dec1fa5f656 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -91,6 +91,7 @@ #include #include #include +#include #include #include @@ -168,6 +169,22 @@ enum fullness_group { ZS_FULL }; +enum zs_stat_type { + OBJ_ALLOCATED, + OBJ_USED, + NR_ZS_STAT_TYPE, +}; + +#ifdef CONFIG_ZSMALLOC_STAT + +static struct dentry *zs_stat_root; + +struct zs_size_stat { + unsigned long objs[NR_ZS_STAT_TYPE]; +}; + +#endif + /* * number of size_classes */ @@ -200,6 +217,10 @@ struct size_class { /* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */ int pages_per_zspage; +#ifdef CONFIG_ZSMALLOC_STAT + struct zs_size_stat stats; +#endif + spinlock_t lock; struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS]; @@ -217,10 +238,16 @@ struct link_free { }; struct zs_pool { + char *name; + struct size_class **size_class; gfp_t flags; /* allocation flags used when growing pool */ atomic_long_t pages_allocated; + +#ifdef CONFIG_ZSMALLOC_STAT + struct dentry *stat_dentry; +#endif }; /* @@ -942,6 +969,166 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage) return true; } +#ifdef CONFIG_ZSMALLOC_STAT + +static inline void zs_stat_inc(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ + class->stats.objs[type] += cnt; +} + +static inline void zs_stat_dec(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ + class->stats.objs[type] -= cnt; +} + +static inline unsigned long zs_stat_get(struct size_class *class, + enum zs_stat_type type) +{ + return class->stats.objs[type]; +} + +static int __init zs_stat_init(void) +{ + if (!debugfs_initialized()) + return -ENODEV; + + zs_stat_root = debugfs_create_dir("zsmalloc", NULL); + if (!zs_stat_root) + return -ENOMEM; + + return 0; +} + +static void __exit zs_stat_exit(void) +{ + debugfs_remove_recursive(zs_stat_root); +} + +static int zs_stats_size_show(struct seq_file *s, void *v) +{ + int i; + struct zs_pool *pool = s->private; + struct size_class *class; + int objs_per_zspage; + unsigned long obj_allocated, obj_used, pages_used; + unsigned long total_objs = 0, total_used_objs = 0, total_pages = 0; + + seq_printf(s, " %5s %5s %13s %10s %10s\n", "class", "size", + "obj_allocated", "obj_used", "pages_used"); + + for (i = 0; i < zs_size_classes; i++) { + class = pool->size_class[i]; + + if (class->index != i) + continue; + + spin_lock(&class->lock); + obj_allocated = zs_stat_get(class, OBJ_ALLOCATED); + obj_used = zs_stat_get(class, OBJ_USED); + spin_unlock(&class->lock); + + objs_per_zspage = get_maxobj_per_zspage(class->size, + class->pages_per_zspage); + pages_used = obj_allocated / objs_per_zspage * + class->pages_per_zspage; + + seq_printf(s, " %5u %5u %10lu %10lu %10lu\n", i, + class->size, obj_allocated, obj_used, pages_used); + + total_objs += obj_allocated; + total_used_objs += obj_used; + total_pages += pages_used; + } + + seq_puts(s, "\n"); + seq_printf(s, " %5s %5s %10lu %10lu %10lu\n", "Total", "", + total_objs, total_used_objs, total_pages); + + return 0; +} + +static int zs_stats_size_open(struct inode *inode, struct file *file) +{ + return single_open(file, zs_stats_size_show, inode->i_private); +} + +static const struct file_operations zs_stat_size_ops = { + .open = zs_stats_size_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int zs_pool_stat_create(char *name, struct zs_pool *pool) +{ + struct dentry *entry; + + if (!zs_stat_root) + 
return -ENODEV; + + entry = debugfs_create_dir(name, zs_stat_root); + if (!entry) { + pr_warn("debugfs dir <%s> creation failed\n", name); + return -ENOMEM; + } + pool->stat_dentry = entry; + + entry = debugfs_create_file("obj_in_classes", S_IFREG | S_IRUGO, + pool->stat_dentry, pool, &zs_stat_size_ops); + if (!entry) { + pr_warn("%s: debugfs file entry <%s> creation failed\n", + name, "obj_in_classes"); + return -ENOMEM; + } + + return 0; +} + +static void zs_pool_stat_destroy(struct zs_pool *pool) +{ + debugfs_remove_recursive(pool->stat_dentry); +} + +#else /* CONFIG_ZSMALLOC_STAT */ + +static inline void zs_stat_inc(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ +} + +static inline void zs_stat_dec(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ +} + +static inline unsigned long zs_stat_get(struct size_class *class, + enum zs_stat_type type) +{ + return 0; +} + +static int __init zs_stat_init(void) +{ + return 0; +} + +static void __exit zs_stat_exit(void) +{ +} + +static inline int zs_pool_stat_create(char *name, struct zs_pool *pool) +{ + return 0; +} + +static inline void zs_pool_stat_destroy(struct zs_pool *pool) +{ +} + +#endif + unsigned long zs_get_total_pages(struct zs_pool *pool) { return atomic_long_read(&pool->pages_allocated); @@ -1074,7 +1261,10 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) set_zspage_mapping(first_page, class->index, ZS_EMPTY); atomic_long_add(class->pages_per_zspage, &pool->pages_allocated); + spin_lock(&class->lock); + zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage( + class->size, class->pages_per_zspage)); } obj = (unsigned long)first_page->freelist; @@ -1088,6 +1278,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) kunmap_atomic(vaddr); first_page->inuse++; + zs_stat_inc(class, OBJ_USED, 1); /* Now move the zspage to another fullness group, if required */ fix_fullness_group(pool, first_page); spin_unlock(&class->lock); @@ -1128,6 +1319,12 @@ void zs_free(struct zs_pool *pool, unsigned long obj) first_page->inuse--; fullness = fix_fullness_group(pool, first_page); + + zs_stat_dec(class, OBJ_USED, 1); + if (fullness == ZS_EMPTY) + zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage( + class->size, class->pages_per_zspage)); + spin_unlock(&class->lock); if (fullness == ZS_EMPTY) { @@ -1158,9 +1355,16 @@ struct zs_pool *zs_create_pool(char *name, gfp_t flags) if (!pool) return NULL; + pool->name = kstrdup(name, GFP_KERNEL); + if (!pool->name) { + kfree(pool); + return NULL; + } + pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *), GFP_KERNEL); if (!pool->size_class) { + kfree(pool->name); kfree(pool); return NULL; } @@ -1210,6 +1414,9 @@ struct zs_pool *zs_create_pool(char *name, gfp_t flags) pool->flags = flags; + if (zs_pool_stat_create(name, pool)) + goto err; + return pool; err: @@ -1222,6 +1429,8 @@ void zs_destroy_pool(struct zs_pool *pool) { int i; + zs_pool_stat_destroy(pool); + for (i = 0; i < zs_size_classes; i++) { int fg; struct size_class *class = pool->size_class[i]; @@ -1242,6 +1451,7 @@ void zs_destroy_pool(struct zs_pool *pool) } kfree(pool->size_class); + kfree(pool->name); kfree(pool); } EXPORT_SYMBOL_GPL(zs_destroy_pool); @@ -1250,17 +1460,30 @@ static int __init zs_init(void) { int ret = zs_register_cpu_notifier(); - if (ret) { - zs_unregister_cpu_notifier(); - return ret; - } + if (ret) + goto notifier_fail; init_zs_size_classes(); #ifdef CONFIG_ZPOOL zpool_register_driver(&zs_zpool_driver); #endif + + ret = 
zs_stat_init(); + if (ret) { + pr_err("zs stat initialization failed\n"); + goto stat_fail; + } return 0; + +stat_fail: +#ifdef CONFIG_ZPOOL + zpool_unregister_driver(&zs_zpool_driver); +#endif +notifier_fail: + zs_unregister_cpu_notifier(); + + return ret; } static void __exit zs_exit(void) @@ -1269,6 +1492,8 @@ static void __exit zs_exit(void) zpool_unregister_driver(&zs_zpool_driver); #endif zs_unregister_cpu_notifier(); + + zs_stat_exit(); } module_init(zs_init); -- cgit v1.2.3
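To make the two calculations behind the stats output concrete, here is a small userspace sketch (not kernel code) of the per-class pages_used formula used by zs_stats_size_show() and of the overall fragmentation figure quoted in the changelog. The objs_per_zspage value for class 1 is derived here as 4096 / 48 and is an assumption; in the kernel it comes from get_maxobj_per_zspage(). The totals are taken from the example table above.

/* Userspace illustration only; mirrors the arithmetic described in the
 * zsmalloc statistics changelog above. */
#include <stdio.h>

int main(void)
{
	/* Per-class: pages_used = obj_allocated / objs_per_zspage *
	 * pages_per_zspage, as in zs_stats_size_show().  Class 1
	 * (size 48) from the example table has 256 objects allocated;
	 * objs_per_zspage = 85 and pages_per_zspage = 1 are assumed. */
	unsigned long obj_allocated = 256;
	unsigned long objs_per_zspage = 85, pages_per_zspage = 1;
	printf("pages_used = %lu\n",
	       obj_allocated / objs_per_zspage * pages_per_zspage);
	/* prints 3, matching the pages_used column for class 1 */

	/* Overall fragmentation from the "Total" line:
	 * (obj_allocated - obj_used) / obj_allocated. */
	unsigned long total_alloc = 37022, total_used = 36326;
	printf("fragmentation = %.2f%%\n",
	       100.0 * (total_alloc - total_used) / total_alloc);
	/* prints about 1.88%, i.e. the ~1.87% quoted above up to rounding */
	return 0;
}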