summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2016-07-26shmem: split huge pages beyond i_size under memory pressureKirill A. Shutemov1-0/+175
Even if user asked to allocate huge pages always (huge=always), we should be able to free up some memory by splitting pages which are partly byound i_size if memory presure comes or once we hit limit on filesystem size (-o size=). In order to do this we maintain per-superblock list of inodes, which potentially have huge pages on the border of file size. Per-fs shrinker can reclaim memory by splitting such pages. If we hit -ENOSPC during shmem_getpage_gfp(), we try to split a page to free up space on the filesystem and retry allocation if it succeed. Link: http://lkml.kernel.org/r/1466021202-61880-37-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHEKirill A. Shutemov5-20/+32
For file mappings, we don't deposit page tables on THP allocation because it's not strictly required to implement split_huge_pmd(): we can just clear pmd and let following page faults to reconstruct the page table. But Power makes use of deposited page table to address MMU quirk. Let's hide THP page cache, including huge tmpfs, under separate config option, so it can be forbidden on Power. We can revert the patch later once solution for Power found. Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26khugepaged: add support of collapse for tmpfs/shmem pagesKirill A. Shutemov2-16/+475
This patch extends khugepaged to support collapse of tmpfs/shmem pages. We share fair amount of infrastructure with anon-THP collapse. Few design points: - First we are looking for VMA which can be suitable for mapping huge page; - If the VMA maps shmem file, the rest scan/collapse operations operates on page cache, not on page tables as in anon VMA case. - khugepaged_scan_shmem() finds a range which is suitable for huge page. The scan is lockless and shouldn't disturb system too much. - once the candidate for collapse is found, collapse_shmem() attempts to create a huge page: + scan over radix tree, making the range point to new huge page; + new huge page is not-uptodate, locked and freezed (refcount is 0), so nobody can touch them until we say so. + we swap in pages during the scan. khugepaged_scan_shmem() filters out ranges with more than khugepaged_max_ptes_swap swapped out pages. It's HPAGE_PMD_NR/8 by default. + old pages are isolated, unmapped and put to local list in case to be restored back if collapse failed. - if collapse succeed, we retract pte page tables from VMAs where huge pages mapping is possible. The huge page will be mapped as PMD on next minor fault into the range. Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26shmem: make shmem_inode_info::lock irq-safeKirill A. Shutemov1-24/+26
We are going to need to call shmem_charge() under tree_lock to get accoutning right on collapse of small tmpfs pages into a huge one. The problem is that tree_lock is irq-safe and lockdep is not happy, that we take irq-unsafe lock under irq-safe[1]. Let's convert the lock to irq-safe. [1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()Kirill A. Shutemov1-15/+10
Both variants of khugepaged_alloc_page() do up_read(&mm->mmap_sem) first: no point keep it inside the function. Link: http://lkml.kernel.org/r/1466021202-61880-33-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: extract khugepaged from mm/huge_memory.cKirill A. Shutemov3-1485/+1500
khugepaged implementation grew to the point when it deserve separate file in source. Let's move it to mm/khugepaged.c. Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappingsKirill A. Shutemov2-17/+22
Let's wire up existing madvise() hugepage hints for file mappings. MADV_HUGEPAGE advise shmem to allocate huge page on page fault in the VMA. It only has effect if the filesystem is mounted with huge=advise or huge=within_size. MADV_NOHUGEPAGE prevents hugepage from being allocated on page fault in the VMA. It doesn't prevent a huge page from being allocated by other means, i.e. page fault into different mapping or write(2) into file. Link: http://lkml.kernel.org/r/1466021202-61880-31-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26shmem: add huge pages supportKirill A. Shutemov7-70/+326
Here's basic implementation of huge pages support for shmem/tmpfs. It's all pretty streight-forward: - shmem_getpage() allcoates huge page if it can and try to inserd into radix tree with shmem_add_to_page_cache(); - shmem_add_to_page_cache() puts the page onto radix-tree if there's space for it; - shmem_undo_range() removes huge pages, if it fully within range. Partial truncate of huge pages zero out this part of THP. This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour. As we don't really create hole in this case, lseek(SEEK_HOLE) may have inconsistent results depending what pages happened to be allocated. - no need to change shmem_fault: core-mm will map an compound page as huge if VMA is suitable; Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26shmem: get_unmapped_area align huge pageHugh Dickins2-2/+112
Provide a shmem_get_unmapped_area method in file_operations, called at mmap time to decide the mapping address. It could be conditional on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making it unconditional. shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area (which we treat as a black box, highly dependent on architecture and config and executable layout). Lots of conditions, and in most cases it just goes with the address that chose; but when our huge stars are rightly aligned, yet that did not provide a suitable address, go back to ask for a larger arena, within which to align the mapping suitably. There have to be some direct calls to shmem_get_unmapped_area(), not via the file_operations: because of the way shmem_zero_setup() is called to create a shmem object late in the mmap sequence, when MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero. Though this only matters when /proc/sys/vm/shmem_huge has been set. Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26shmem: prepare huge= mount option and sysfs knobKirill A. Shutemov2-0/+164
This patch adds new mount option "huge=". It can have following values: - "always": Attempt to allocate huge pages every time we need a new page; - "never": Do not allocate huge pages; - "within_size": Only allocate huge page if it will be fully within i_size. Also respect fadvise()/madvise() hints; - "advise: Only allocate huge pages if requested with fadvise()/madvise(); Default is "never" for now. "mount -o remount,huge= /mountpoint" works fine after mount: remounting huge=never will not attempt to break up huge pages at all, just stop more from being allocated. No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which is the appropriate option to protect those who don't want the new bloat, and with which we shall share some pmd code. Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it explicit). Allow enabling THP only if the machine has_transparent_hugepage(). But what about Shmem with no user-visible mount? SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. Though unlikely to suit all usages, provide sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with huge on those. And allow shmem_enabled two further values: - "deny": For use in emergencies, to force the huge option off from all mounts; - "force": Force the huge option on for all - very useful for testing; Based on patch by Hugh Dickins. Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm, rmap: account shmem thp pagesKirill A. Shutemov4-7/+32
Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and smaps. It indicates how many times we allocate and map shmem THP. NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS. Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26truncate: handle file thpKirill A. Shutemov1-5/+23
For shmem/tmpfs we only need to tweak truncate_inode_page() and invalidate_mapping_pages(). truncate_inode_pages_range() and invalidate_inode_pages2_range() are adjusted to use page_to_pgoff(). Link: http://lkml.kernel.org/r/1466021202-61880-26-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26filemap: prepare find and delete operations for huge pagesKirill A. Shutemov1-56/+122
For now, we would have HPAGE_PMD_NR entries in radix tree for every huge page. That's suboptimal and it will be changed to use Matthew's multi-order entries later. 'add' operation is not changed, because we don't need it to implement hugetmpfs: shmem uses its own implementation. Link: http://lkml.kernel.org/r/1466021202-61880-25-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26vmscan: split file huge pages before paging them outKirill A. Shutemov1-0/+6
This is preparation of vmscan for file huge pages. We cannot write out huge pages, so we need to split them on the way out. Link: http://lkml.kernel.org/r/1466021202-61880-22-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp, mlock: do not mlock PTE-mapped file huge pagesKirill A. Shutemov4-9/+42
As with anon THP, we only mlock file huge pages if we can prove that the page is not mapped with PTE. This way we can avoid mlock leak into non-mlocked vma on split. We rely on PageDoubleMap() under lock_page() to check if the the page may be PTE mapped. PG_double_map is set by page_add_file_rmap() when the page mapped with PTEs. Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: file pages support for split_huge_page()Kirill A. Shutemov2-45/+117
Basic scheme is the same as for anon THP. Main differences: - File pages are on radix-tree, so we have head->_count offset by HPAGE_PMD_NR. The count got distributed to small pages during split. - mapping->tree_lock prevents non-lockless access to pages under split over radix-tree; - Lockless access is prevented by setting the head->_count to 0 during split; - After split, some pages can be beyond i_size. We drop them from radix-tree. - We don't setup migration entries. Just unmap pages. It helps handling cases when i_size is in the middle of the page: no need handle unmap pages beyond i_size manually. Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: run vma_adjust_trans_huge() outside i_mmap_rwsemKirill A. Shutemov1-2/+2
vma_addjust_trans_huge() splits pmd if it's crossing VMA boundary. During split we munlock the huge page which requires rmap walk. rmap wants to take the lock on its own. Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this. Link: http://lkml.kernel.org/r/1466021202-61880-19-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: prepare change_huge_pmd() for file thpKirill A. Shutemov1-1/+2
change_huge_pmd() has assert which is not relvant for file page. For shared mapping it's perfectly fine to have page table entry writable, without explicit mkwrite. Link: http://lkml.kernel.org/r/1466021202-61880-18-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: skip file huge pmd on copy_huge_pmd()Kirill A. Shutemov1-18/+16
copy_page_range() has a check for "Don't copy ptes where a page fault will fill them correctly." It works on VMA level. We still copy all page table entries from private mappings, even if they map page cache. We can simplify copy_huge_pmd() a bit by skipping file PMDs. We don't map file private pages with PMDs, so they only can map page cache. It's safe to skip them as they can be re-faulted later. Link: http://lkml.kernel.org/r/1466021202-61880-17-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: handle file COW faultsKirill A. Shutemov1-0/+5
File COW for THP is handled on pte level: just split the pmd. It's not clear how benefitial would be allocation of huge pages on COW faults. And it would require some code to make them work. I think at some point we can consider teaching khugepaged to collapse pages in COW mappings, but allocating huge on fault is probably overkill. Link: http://lkml.kernel.org/r/1466021202-61880-16-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: handle file pages in split_huge_pmd()Kirill A. Shutemov1-2/+10
Splitting THP PMD is simple: just unmap it as in DAX case. This way we can avoid memory overhead on page table allocation to deposit. It's probably a good idea to try to allocation page table with GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area, but clearing pmd should be good enough for now. Unlike DAX, we also remove the page from rmap and drop reference. pmd_young() is transfered to PageReferenced(). Link: http://lkml.kernel.org/r/1466021202-61880-15-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp: support file pages in zap_huge_pmd()Kirill A. Shutemov1-3/+9
split_huge_pmd() for file mappings (and DAX too) is implemented by just clearing pmd entry as we can re-fill this area from page cache on pte level later. This means we don't need deposit page tables when file THP is mapped. Therefore we shouldn't try to withdraw a page table on zap_huge_pmd() file THP PMD. Link: http://lkml.kernel.org/r/1466021202-61880-14-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26thp, vmstats: add counters for huge file pagesKirill A. Shutemov2-0/+3
THP_FILE_ALLOC: how many times huge page was allocated and put page cache. THP_FILE_MAPPED: how many times file huge page was mapped. Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: introduce do_set_pmd()Kirill A. Shutemov3-8/+72
With postponed page table allocation we have chance to setup huge pages. do_set_pte() calls do_set_pmd() if following criteria met: - page is compound; - pmd entry in pmd_none(); - vma has suitable size and alignment; Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26rmap: support file thpKirill A. Shutemov5-19/+51
Naive approach: on mapping/unmapping the page as compound we update ->_mapcount on each 4k page. That's not efficient, but it's not obvious how we can optimize this. We can look into optimization later. PG_double_map optimization doesn't work for file pages since lifecycle of file pages is different comparing to anon pages: file page can be mapped again at any time. Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: postpone page table allocation until we have page to mapKirill A. Shutemov2-126/+188
The idea (and most of code) is borrowed again: from Hugh's patchset on huge tmpfs[1]. Instead of allocation pte page table upfront, we postpone this until we have page to map in hands. This approach opens possibility to map the page as huge if filesystem supports this. Comparing to Hugh's patch I've pushed page table allocation a bit further: into do_set_pte(). This way we can postpone allocation even in faultaround case without moving do_fault_around() after __do_fault(). do_set_pte() got renamed to alloc_set_pte() as it can allocate page table if required. [1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: introduce fault_envKirill A. Shutemov5-471/+426
The idea borrowed from Peter's patch from patchset on speculative page faults[1]: Instead of passing around the endless list of function arguments, replace the lot with a single structure so we can change context without endless function signature changes. The changes are mostly mechanical with exception of faultaround code: filemap_map_pages() got reworked a bit. This patch is preparation for the next one. [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: do not pass mm_struct into handle_mm_faultKirill A. Shutemov3-12/+11
We always have vma->vm_mm around. Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26khugepaged: recheck pmd after mmap_sem re-acquiredKirill A. Shutemov1-0/+6
Vlastimil noted[1] that pmd can be no longer valid after we drop mmap_sem. We need recheck it once mmap_sem taken again. [1] http://lkml.kernel.org/r/12918dcd-a695-c6f4-e06f-69141c5f357f@suse.cz Link: http://lkml.kernel.org/r/1466021202-61880-6-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm, thp: fix locking inconsistency in collapse_huge_pageEbru Akagunduz1-4/+10
After creating revalidate vma function, locking inconsistency occured due to directing the code path to wrong label. This patch directs to correct label and fix the inconsistency. Related commit that caused inconsistency: http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0 Link: http://lkml.kernel.org/r/1464956884-4644-1-git-send-email-ebru.akagunduz@gmail.com Link: http://lkml.kernel.org/r/1466021202-61880-4-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm, thp: make swapin readahead under down_read of mmap_semEbru Akagunduz1-29/+62
Currently khugepaged makes swapin readahead under down_write. This patch supplies to make swapin readahead under down_read instead of down_write. The patch was tested with a test program that allocates 800MB of memory, writes to it, and then sleeps. The system was forced to swap out all. Afterwards, the test program touches the area by writing, it skips a page in each 20 pages of the area. [akpm@linux-foundation.org: update comment to match new code] [kirill.shutemov@linux.intel.com: passing 'vma' to hugepage_vma_revlidate() is useless] Link: http://lkml.kernel.org/r/20160530095058.GA53044@black.fi.intel.com Link: http://lkml.kernel.org/r/1466021202-61880-3-git-send-email-kirill.shutemov@linux.intel.com Link: http://lkml.kernel.org/r/1464335964-6510-4-git-send-email-ebru.akagunduz@gmail.com Link: http://lkml.kernel.org/r/1466021202-61880-2-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: make swapin readahead to improve thp collapse rateEbru Akagunduz3-4/+45
This patch makes swapin readahead to improve thp collapse rate. When khugepaged scanned pages, there can be a few of the pages in swap area. With the patch THP can collapse 4kB pages into a THP when there are up to max_ptes_swap swap ptes in a 2MB range. The patch was tested with a test program that allocates 400B of memory, writes to it, and then sleeps. I force the system to swap out all. Afterwards, the test program touches the area by writing, it skips a page in each 20 pages of the area. Without the patch, system did not swap in readahead. THP rate was %65 of the program of the memory, it did not change over time. With this patch, after 10 minutes of waiting khugepaged had collapsed %99 of the program's memory. [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function] [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter] [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in] Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: make optimistic check for swapin readaheadEbru Akagunduz1-3/+42
Introduce a new sysfs integer knob /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap which makes optimistic check for swapin readahead to increase thp collapse rate. Before getting swapped out pages to memory, checks them and allows up to a certain number. It also prints out using tracepoints amount of unmapped ptes. [vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE] [sfr@canb.auug.org.au: build fix] Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm/memblock.c:memblock_add_range(): if nr_new is 0 just returnnimisolo1-0/+3
If nr_new is 0 which means there's no region would be added, so just return to the caller. Signed-off-by: nimisolo <nimisolo@gmail.com> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm, memcg: use consistent gfp flags during readaheadMichal Hocko1-7/+6
Vladimir has noticed that we might declare memcg oom even during readahead because read_pages only uses GFP_KERNEL (with mapping_gfp restriction) while __do_page_cache_readahead uses page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from OOMs. This gfp mask discrepancy is really unfortunate and easily fixable. Drop page_cache_alloc_readahead() which only has one user and outsource the gfp_mask logic into readahead_gfp_mask and propagate this mask from __do_page_cache_readahead down to read_pages. This alone would have only very limited impact as most filesystems are implementing ->readpages and the common implementation mpage_readpages does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to use readahead_gfp_mask instead as this function is called only during readahead as well. The same applies to read_cache_pages. ext4 has its own ext4_mpage_readpages but the path which has pages != NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are doing a very similar pattern to mpage_readpages so the same can be applied to them as well. [akpm@linux-foundation.org: coding-style fixes] [mhocko@suse.com: restrict gfp mask in mpage_alloc] Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Chris Mason <clm@fb.com> Cc: Steve French <sfrench@samba.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Cc: Mike Marshall <hubcap@omnibond.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Changman Lee <cm224.lee@samsung.com> Cc: Chao Yu <yuchao0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm, oom_reaper: make sure that mmput_async is called only when memory was reapedMichal Hocko1-7/+18
Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: add NR_ZSMALLOC to vmstatMinchan Kim2-2/+9
zram is very popular for some of the embedded world (e.g., TV, mobile phones). On those system, zsmalloc's consumed memory size is never trivial (one of example from real product system, total memory: 800M, zsmalloc consumed: 150M), so we have used this out of tree patch to monitor system memory behavior via /proc/vmstat. With zsmalloc in vmstat, it helps in tracking down system behavior due to memory usage. [minchan@kernel.org: zsmalloc: follow up zsmalloc vmstat] Link: http://lkml.kernel.org/r/20160607091737.GC23435@bbox [akpm@linux-foundation.org: fix build with CONFIG_ZSMALLOC=m] Link: http://lkml.kernel.org/r/1464919731-13255-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Chanho Min <chanho.min@lge.com> Cc: Chan Gyun Jeong <chan.jeong@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm, frontswap: convert frontswap_enabled to static keyVlastimil Babka2-21/+16
I have noticed that frontswap.h first declares "frontswap_enabled" as extern bool variable, and then overrides it with "#define frontswap_enabled (1)" for CONFIG_FRONTSWAP=Y or (0) when disabled. The bool variable isn't actually instantiated anywhere. This all looks like an unfinished attempt to make frontswap_enabled reflect whether a backend is instantiated. But in the current state, all frontswap hooks call unconditionally into frontswap.c just to check if frontswap_ops is non-NULL. This should at least be checked inline, but we can further eliminate the overhead when CONFIG_FRONTSWAP is enabled and no backend registered, using a static key that is initially disabled, and gets enabled only upon first backend registration. Thus, checks for "frontswap_enabled" are replaced with "frontswap_enabled()" wrapping the static key check. There are two exceptions: - xen's selfballoon_process() was testing frontswap_enabled in code guarded by #ifdef CONFIG_FRONTSWAP, which was effectively always true when reachable. The patch just removes this check. Using frontswap_enabled() does not sound correct here, as this can be true even without xen's own backend being registered. - in SYSCALL_DEFINE2(swapon), change the check to IS_ENABLED(CONFIG_FRONTSWAP) as it seems the bitmap allocation cannot currently be postponed until a backend is registered. This means that frontswap will still have some memory overhead by being configured, but without a backend. After the patch, we can expect that some functions in frontswap.c are called only when frontswap_ops is non-NULL. Change the checks there to VM_BUG_ONs. While at it, convert other BUG_ONs to VM_BUG_ONs as frontswap has been stable for some time. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1463152235-9717-1-git-send-email-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm,oom: remove unused argument from oom_scan_process_thread().Tetsuo Handa2-3/+3
oom_scan_process_thread() does not use totalpages argument. oom_badness() uses it. Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: memcontrol: teach uncharge_list to deal with kmem pagesVladimir Davydov1-18/+24
Page table pages are batched-freed in release_pages on most architectures. If we want to charge them to kmemcg (this is what is done later in this series), we need to teach mem_cgroup_uncharge_list to handle kmem pages. Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: charge/uncharge kmemcg from generic page allocator pathsVladimir Davydov4-60/+20
Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: memcontrol: cleanup kmem charge functionsVladimir Davydov3-27/+73
- Handle memcg_kmem_enabled check out to the caller. This reduces the number of function definitions making the code easier to follow. At the same time it doesn't result in code bloat, because all of these functions are used only in one or two places. - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't have to dive deep into memcg implementation to see which allocations are charged and which are not. - Refresh comments. Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm/mmu_gather: track page size with mmu gather and force flush if page size ↵Aneesh Kumar K.V3-3/+11
change This allows an arch which needs to do special handing with respect to different page size when flushing tlb to implement the same in mmu gather. Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: change the interface for __tlb_remove_page()Aneesh Kumar K.V1-6/+13
This updates the generic and arch specific implementation to return true if we need to do a tlb flush. That means if a __tlb_remove_page indicate a flush is needed, the page we try to remove need to be tracked and added again after the flush. We need to track it because we have already update the pte to none and we can't just loop back. This change is done to enable us to do a tlb_flush when we try to flush a range that consists of different page sizes. For architectures like ppc64, we can do a range based tlb flush and we need to track page size for that. When we try to remove a huge page, we will force a tlb flush and starts a new mmu gather. [aneesh.kumar@linux.vnet.ibm.com: mm-change-the-interface-for-__tlb_remove_page-v3] Link: http://lkml.kernel.org/r/1465049193-22197-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1464860389-29019-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm/hugetlb: simplify hugetlb unmapAneesh Kumar K.V1-33/+21
For hugetlb like THP (and unlike regular page), we do tlb flush after dropping ptl. Because of the above, we don't need to track force_flush like we do now. Instead we can simply call tlb_remove_page() which will do the flush if needed. No functionality change in this patch. Link: http://lkml.kernel.org/r/1465049193-22197-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm: thp: check pmd_trans_unstable() after split_huge_pmd()Naoya Horiguchi4-3/+6
split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing to pte entries, which can be checked with pmd_trans_unstable(). Some callers make this assertion and some do it differently and some not, so let's do it in a unified manner. Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm/page_isolation: clean up confused codeJoonsoo Kim1-3/+3
When there is an isolated_page, post_alloc_hook() is called with page but __free_pages() is called with isolated_page. Since they are the same so no problem but it's very confusing. To reduce it, this patch changes isolated_page to boolean type and uses page variable consistently. Link: http://lkml.kernel.org/r/1466150259-27727-10-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm/page_alloc: introduce post allocation processing on page allocatorJoonsoo Kim4-19/+18
This patch is motivated from Hugh and Vlastimil's concern [1]. There are two ways to get freepage from the allocator. One is using normal memory allocation API and the other is __isolate_free_page() which is internally used for compaction and pageblock isolation. Later usage is rather tricky since it doesn't do whole post allocation processing done by normal API. One problematic thing I already know is that poisoned page would not be checked if it is allocated by __isolate_free_page(). Perhaps, there would be more. We could add more debug logic for allocated page in the future and this separation would cause more problem. I'd like to fix this situation at this time. Solution is simple. This patch commonize some logic for newly allocated page and uses it on all sites. This will solve the problem. [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3] Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm/page_owner: use stackdepot to store stacktraceJoonsoo Kim1-19/+123
Currently, we store each page's allocation stacktrace on corresponding page_ext structure and it requires a lot of memory. This causes the problem that memory tight system doesn't work well if page_owner is enabled. Moreover, even with this large memory consumption, we cannot get full stacktrace because we allocate memory at boot time and just maintain 8 stacktrace slots to balance memory consumption. We could increase it to more but it would make system unusable or change system behaviour. To solve the problem, this patch uses stackdepot to store stacktrace. It obviously provides memory saving but there is a drawback that stackdepot could fail. stackdepot allocates memory at runtime so it could fail if system has not enough memory. But, most of allocation stack are generated at very early time and there are much memory at this time. So, failure would not happen easily. And, one failure means that we miss just one page's allocation stacktrace so it would not be a big problem. In this patch, when memory allocation failure happens, we store special stracktrace handle to the page that is failed to save stacktrace. With it, user can guess memory usage properly even if failure happens. Memory saving looks as following. (4GB memory system with page_owner) (before the patch -> after the patch) static allocation: 92274688 bytes -> 25165824 bytes dynamic allocation after boot + kernel build: 0 bytes -> 327680 bytes total: 92274688 bytes -> 25493504 bytes 72% reduction in total. Note that implementation looks complex than someone would imagine because there is recursion issue. stackdepot uses page allocator and page_owner is called at page allocation. Using stackdepot in page_owner could re-call page allcator and then page_owner. That is a recursion. To detect and avoid it, whenever we obtain stacktrace, recursion is checked and page_owner is set to dummy information if found. Dummy information means that this page is allocated for page_owner feature itself (such as stackdepot) and it's understandable behavior for user. [iamjoonsoo.kim@lge.com: mm-page_owner-use-stackdepot-to-store-stacktrace-v3] Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1466150259-27727-7-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26mm/page_owner: introduce split_page_owner and replace manual handlingJoonsoo Kim2-13/+9
split_page() calls set_page_owner() to set up page_owner to each pages. But, it has a drawback that head page and the others have different stacktrace because callsite of set_page_owner() is slightly differnt. To avoid this problem, this patch copies head page's page_owner to the others. It needs to introduce new function, split_page_owner() but it also remove the other function, get_page_owner_gfp() so looks good to do. Link: http://lkml.kernel.org/r/1464230275-25791-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>