summaryrefslogtreecommitdiffstats
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2016-05-05mtd: kill the nand_ecclayout structBoris Brezillon2-21/+1
Now that all MTD drivers have moved to the mtd_ooblayout_ops model we can safely remove the struct nand_ecclayout definition, and all the remaining places where it was still used. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-05-05mtd: nand: kill the ecc->layout fieldBoris Brezillon1-2/+0
Now that all NAND drivers have switched to mtd_ooblayout_ops, we can kill the ecc->layout field. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-05-05mtd: onenand: switch to mtd_ooblayout_opsBoris Brezillon1-2/+0
Implementing the mtd_ooblayout_ops interface is the new way of exposing ECC/OOB layout to MTD users. Modify the onenand drivers to switch to this approach. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-05-05mtd: nand: fsmc: get rid of the fsmc_nand_eccplace structBoris Brezillon1-18/+0
Now that mtd_ooblayout_ecc() returns the ECC byte position using the OOB free method, we can get rid of the fsmc_nand_eccplace struct. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-05-05mtd: nand: sharpsl: switch to mtd_ooblayout_opsBoris Brezillon1-1/+1
Implementing the mtd_ooblayout_ops interface is the new way of exposing ECC/OOB layout to MTD users. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19mtd: nand: implement the default mtd_ooblayout_opsBoris Brezillon1-0/+3
Replace the default nand_ecclayout definitions for large and small page devices with the equivalent mtd_ooblayout_ops. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19mtd: create an mtd_ooblayout_ops struct to ease ECC layout definitionBoris Brezillon1-4/+28
ECC layout definitions are currently exposed using the nand_ecclayout struct which embeds oobfree and eccpos arrays with predefined size. This approach was acceptable when NAND chips were providing relatively small OOB regions, but MLC and TLC now provide OOB regions of several hundreds of bytes, which implies a non negligible overhead for everybody even those who only need to support legacy NANDs. Create an mtd_ooblayout_ops interface providing the same functionality (expose the ECC and oobfree layout) without the need for this huge structure. The mtd->ecclayout is now deprecated and should be replaced by the equivalent mtd_ooblayout_ops. In the meantime we provide a wrapper around the ->ecclayout field to ease migration to this new model. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19mtd: add mtd_set_ecclayout() helper functionBoris Brezillon1-0/+6
Add an mtd_set_ecclayout() helper function to avoid direct accesses to the mtd->ecclayout field. This will ease future reworks of ECC layout definition. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19mtd: add mtd_ooblayout_xxx() helper functionsBoris Brezillon1-0/+33
In order to make the ecclayout definition completely dynamic we need to rework the way the OOB layout are defined and iterated. Create a few mtd_ooblayout_xxx() helpers to ease OOB bytes manipulation and hide ecclayout internals to their users. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19mtd: nand: export default read/write oob functionsBoris Brezillon1-0/+14
Export the default read/write oob functions (for the standard and syndrome scheme), so that drivers can use them for their raw implementation and implement their own functions for the normal oob operation. This is required if your ECC engine is capable of fixing some of the OOB data. In this case you have to overload the ->read_oob() and ->write_oob(), but if you don't specify the ->read/write_oob_raw() functions they are assigned to the ->read/write_oob() implementation, which is not what you want. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19mtd/ifc: Add support for IFC controller version 2.0Raghav Dogra1-15/+30
The new IFC controller version 2.0 has a different memory map page. Upto IFC 1.4 PAGE size is 4 KB and from IFC2.0 PAGE size is 64KB. This patch segregates the IFC global and runtime registers to appropriate PAGE sizes. Signed-off-by: Jaiprakash Singh <b44839@freescale.com> Signed-off-by: Raghav Dogra <raghav@freescale.com> Acked-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Raghav Dogra <raghav.dogra@nxp.com> Acked-by: Scott Wood <oss@buserror.net> Acked-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19of: mtd: prepare helper reading NAND ECC algo from DTRafał Miłecki1-0/+6
NAND subsystem is being slightly reworked to store ECC details in separated fields. In future we'll want to add support for more DT properties as specifying every possible setup with a single "nand-ecc-mode" is a pretty bad idea. To allow this let's add a helper that will support something like "nand-ecc-algo" in future. Right now we use it for keeping backward compatibility. Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19mtd: nand: add new enum for storing ECC algorithmRafał Miłecki1-0/+8
Our nand_ecc_modes_t is already a bit abused by value NAND_ECC_SOFT_BCH. This enum should store ECC mode only and putting algorithm details there is a bad idea. It would result in too many values impossible to support in a sane way. To solve this problem let's add a new enum. We'll have to modify all drivers to set it properly but once it's done it'll be possible to drop NAND_ECC_SOFT_BCH. That will result in a cleaner design and more possibilities like setting ECC algorithm for hardware ECC mode. Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-04-19Merge branch 'mtd-nand-trigger' of ↵Boris Brezillon2-11/+7
git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds into nand/next Pull leds-trigger changes from Jacek Anaszewski. Create a generic mtd led-trigger to replace the exisitng nand led-trigger implementation. * 'mtd-nand-trigger' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: mtd: Hook I/O activity to the MTD LED trigger mtd: nand: Remove the "nand-disk" LED trigger leds: trigger: Introduce a MTD (NAND/NOR) trigger mtd: Uninline mtd_write_oob and move it to mtdcore.c leds: trigger: Introduce a kernel panic LED trigger
2016-04-15mtd: nand: omap2: Implement NAND ready using gpiolibRoger Quadros1-1/+1
The GPMC WAIT pin status are now available over gpiolib. Update the omap_dev_ready() function to use gpio instead of directly accessing GPMC register space. Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Acked-by: Tony Lindgren <tony@atomide.com>
2016-04-15memory: omap-gpmc: Prevent GPMC_STATUS from being accessed via gpmc_regsRoger Quadros1-1/+2
GPMC_STATUS register is private to the GPMC module and must not be accessed directly by NAND driver through the gpmc_regs. They must use gpmc_omap_get_nand_ops() instead. Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Tony Lindgren <tony@atomide.com>
2016-04-15mtd: nand: omap: Clean up device tree supportRoger Quadros1-2/+1
Move NAND specific device tree parsing to NAND driver. The NAND controller node must have a compatible id, register space resource and interrupt resource. Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Tony Lindgren <tony@atomide.com>
2016-04-15mtd: nand: omap: Use gpmc_omap_get_nand_ops() to get NAND registersRoger Quadros1-1/+3
Deprecate nand register passing via platform data and use gpmc_omap_get_nand_ops() instead. Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Brian Norris <computersforpeace@gmail.com> Acked-by: Tony Lindgren <tony@atomide.com>
2016-04-15memory: omap-gpmc: Implement IRQ domain for NAND IRQsRoger Quadros1-2/+3
GPMC provides 2 interrupts for NAND use. i.e. fifoevent and termcount. Use IRQ domain for this. NAND device tree node can then get the necessary interrupts by using gpmc as the interrupt parent. Legacy boot uses gpmc_get_client_irq to get the NAND interrupts from the GPMC IRQ domain. Get rid of custom bitmasks and use IRQ domain for that as well. Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Rob Herring <robh@kernel.org> Acked-by: Tony Lindgren <tony@atomide.com>
2016-04-15memory: omap-gpmc: Introduce GPMC to NAND interfaceRoger Quadros1-2/+33
The OMAP GPMC module has certain registers dedicated for NAND access and some NAND bits mixed with other GPMC functionality. For the NAND dedicated registers we have the struct gpmc_nand_regs. The NAND driver needs to access NAND specific bits from the following non-dedicated registers - EMPTYWRITEBUFFERSTATUS from GPMC_STATUS For accessing these bits we introduce the struct gpmc_nand_ops. Add gpmc_omap_get_nand_ops() that returns the gpmc_nand_ops along with updating the gpmc_nand_regs. This API will be called by the OMAP NAND driver to access the necessary bits in GPMC register space. Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Tony Lindgren <tony@atomide.com>
2016-04-15ARM: OMAP2+: gpmc: Add gpmc timings and settings to platform dataRoger Quadros2-139/+142
Add device_timings, gpmc_timings and gpmc_setting to gpmc platform data. Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Tony Lindgren <tony@atomide.com>
2016-04-15ARM: OMAP2+: gpmc: Add platform dataRoger Quadros2-2/+31
Add a platform data structure for GPMC. It contains all the necessary platform information that needs to be passed from platform init code to GPMC driver. Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Tony Lindgren <tony@atomide.com>
2016-04-13leds: trigger: Introduce a MTD (NAND/NOR) triggerEzequiel Garcia1-0/+6
This commit introduces a MTD trigger for flash (NAND/NOR) device activity. The implementation is copied from IDE disk. This trigger deprecates the "nand-disk" LED trigger, but for backwards compatibility, we still keep the "nand-disk" trigger around. The motivation for deprecating the "nand-disk" LED trigger is that it only works for NAND drivers, whereas the "mtd" LED trigger is more generic (in fact, "nand-disk" currently only works for certain NAND drivers). Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Jacek Anaszewski <j.anaszewski@samsung.com>
2016-04-13mtd: Uninline mtd_write_oob and move it to mtdcore.cEzequiel Garcia1-11/+1
There's no reason for having mtd_write_oob inlined in mtd.h header. Move it to mtdcore.c where it belongs. Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Jacek Anaszewski <j.anaszewski@samsung.com>
2016-03-26Merge branch 'for-linus' of ↵Linus Torvalds5-18/+45
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is quite a bit here, including some overdue refactoring and cleanup on the mon_client and osd_client code from Ilya, scattered writeback support for CephFS and a pile of bug fixes from Zheng, and a few random cleanups and fixes from others" [ I already decided not to pull this because of it having been rebased recently, but ended up changing my mind after all. Next time I'll really hold people to it. Oh well. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits) libceph: use KMEM_CACHE macro ceph: use kmem_cache_zalloc rbd: use KMEM_CACHE macro ceph: use lookup request to revalidate dentry ceph: kill ceph_get_dentry_parent_inode() ceph: fix security xattr deadlock ceph: don't request vxattrs from MDS ceph: fix mounting same fs multiple times ceph: remove unnecessary NULL check ceph: avoid updating directory inode's i_size accidentally ceph: fix race during filling readdir cache libceph: use sizeof_footer() more ceph: kill ceph_empty_snapc ceph: fix a wrong comparison ceph: replace CURRENT_TIME by current_fs_time() ceph: scattered page writeback libceph: add helper that duplicates last extent operation libceph: enable large, variable-sized OSD requests libceph: osdc->req_mempool should be backed by a slab pool libceph: make r_request msg_size calculation clearer ...
2016-03-26Merge tag 'ntb-4.6' of git://github.com/jonmason/ntbLinus Torvalds1-2/+8
Pull NTB bug fixes from Jon Mason: "NTB bug fixes for tasklet from spinning forever, link errors, translation window setup, NULL ptr dereference, and ntb-perf errors. Also, a modification to the driver API that makes _addr functions optional" * tag 'ntb-4.6' of git://github.com/jonmason/ntb: NTB: Remove _addr functions from ntb_hw_amd NTB: Make _addr functions optional in the API NTB: Fix incorrect clean up routine in ntb_perf NTB: Fix incorrect return check in ntb_perf ntb: fix possible NULL dereference ntb: add missing setup of translation window ntb: stop link work when we do not have memory ntb: stop tasklet from spinning forever during shutdown. ntb: perf test: fix address space confusion
2016-03-26Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-12/+3
Pull more SCSI updates from James Bottomley: "The only new stuff which missed the first pull request is an update to the UFS driver. The rest is an assortment of bug fixes and minor tweaks which appeared recently (some are fixes for recent code and some are stuff spotted recently by the checkers or the new gcc-6 compiler [most of Arnd's stuff])" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits) scsi_common: do not clobber fixed sense information scsi: ufs: select CONFIG_NLS scsi: fc: use get/put_unaligned64 for wwn access fnic: move printk()s outside of the critical code section. qla2xxx: avoid maybe_uninitialized warning megaraid_sas: add missing curly braces in ioctl handler lpfc: fix misleading indentation scsi_transport_sas: add 'scsi_target_id' sysfs attribute scsi_dh_alua: uninitialized variable in alua_check_vpd() scsi: ufs-qcom: add printouts of testbus debug registers scsi: ufs-qcom: enable/disable the device ref clock scsi: ufs-qcom: set PA_Local_TX_LCC_Enable before link startup scsi: ufs: add device quirk delay before putting UFS rails in LPM scsi: ufs: fix leakage during link off state scsi: ufs: tune UniPro parameters to optimize hibern8 exit time scsi: ufs: handle non spec compliant bkops behaviour by device scsi: ufs: add retry for query descriptors scsi: ufs: add error recovery after DL NAC error scsi: ufs: make error handling bit faster scsi: ufs: disable vccq if it's not needed by UFS device ...
2016-03-25mm, kasan: stackdepot implementation. Enable stackdepot for SLABAlexander Potapenko1-0/+32
Implement the stack depot and provide CONFIG_STACKDEPOT. Stack depot will allow KASAN store allocation/deallocation stack traces for memory chunks. The stack traces are stored in a hash table and referenced by handles which reside in the kasan_alloc_meta and kasan_free_meta structures in the allocated memory chunks. IRQ stack traces are cut below the IRQ entry point to avoid unnecessary duplication. Right now stackdepot support is only enabled in SLAB allocator. Once KASAN features in SLAB are on par with those in SLUB we can switch SLUB to stackdepot as well, thus removing the dependency on SLUB stack bookkeeping, which wastes a lot of memory. This patch is based on the "mm: kasan: stack depots" patch originally prepared by Dmitry Chernenkov. Joonsoo has said that he plans to reuse the stackdepot code for the mm/page_owner.c debugging facility. [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t] [aryabinin@virtuozzo.com: comment style fixes] Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25arch, ftrace: for KASAN put hard/soft IRQ entries into separate sectionsAlexander Potapenko3-12/+31
KASAN needs to know whether the allocation happens in an IRQ handler. This lets us strip everything below the IRQ entry point to reduce the number of unique stack traces needed to be stored. Move the definition of __irq_entry to <linux/interrupt.h> so that the users don't need to pull in <linux/ftrace.h>. Also introduce the __softirq_entry macro which is similar to __irq_entry, but puts the corresponding functions to the .softirqentry.text section. Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, kasan: add GFP flags to KASAN APIAlexander Potapenko2-10/+13
Add GFP flags to KASAN hooks for future patches to use. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, kasan: SLAB supportAlexander Potapenko4-0/+43
Add KASAN hooks to SLAB allocator. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill()Tetsuo Handa1-2/+0
A leftover from commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path raceless"). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom, oom_reaper: protect oom_reaper_list using simpler wayTetsuo Handa1-2/+0
"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it by simply checking tsk->oom_reaper_list != NULL. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom: make oom_reaper_list single linkedVladimir Davydov1-1/+1
Entries are only added/removed from oom_reaper_list at head so we can use a single linked list and hence save a word in task_struct. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom, oom_reaper: disable oom_reaper for oom_kill_allocating_taskMichal Hocko1-0/+2
Tetsuo has reported that oom_kill_allocating_task=1 will cause oom_reaper_list corruption because oom_kill_process doesn't follow standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue the same task multiple times - e.g. by sacrificing the same child multiple times. This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag which is set in oom_kill_process atomically and oom reaper is disabled if the flag was already set. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, oom_reaper: implement OOM victims queuingMichal Hocko1-0/+3
wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address spaceMichal Hocko1-1/+1
When oom_reaper manages to unmap all the eligible vmas there shouldn't be much of the freable memory held by the oom victim left anymore so it makes sense to clear the TIF_MEMDIE flag for the victim and allow the OOM killer to select another task. The lack of TIF_MEMDIE also means that the victim cannot access memory reserves anymore but that shouldn't be a problem because it would get the access again if it needs to allocate and hits the OOM killer again due to the fatal_signal_pending resp. PF_EXITING check. We can safely hide the task from the OOM killer because it is clearly not a good candidate anymore as everyhing reclaimable has been torn down already. This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE and thus hold off further global OOM killer actions granted the oom reaper is able to take mmap_sem for the associated mm struct. This is not guaranteed now but further steps should make sure that mmap_sem for write should be blocked killable which will help to reduce such a lock contention. This is not done by this patch. Note that exit_oom_victim might be called on a remote task from __oom_reap_task now so we have to check and clear the flag atomically otherwise we might race and underflow oom_victims or wake up waiters too early. Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25mm, oom: introduce oom reaperMichal Hocko1-0/+2
This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25sched: add schedule_timeout_idle()Andrew Morton1-0/+1
This will be needed in the patch "mm, oom: introduce oom reaper". Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ceph: fix security xattr deadlockYan, Zheng1-1/+2
When security is enabled, security module can call filesystem's getxattr/setxattr callbacks during d_instantiate(). For cephfs, d_instantiate() is usually called by MDS' dispatch thread, while handling MDS reply. If the MDS reply does not include xattrs and corresponding caps, getxattr/setxattr need to send a new request to MDS and waits for the reply. This makes MDS' dispatch sleep, nobody handles later MDS replies. The fix is make sure lookup/atomic_open reply include xattrs and corresponding caps. So getxattr can be handled by cached xattrs. This requires some modification to both MDS and request message. (Client tells MDS what caps it wants; MDS encodes proper caps in the reply) Smack security module may call setxattr during d_instantiate(). Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL to us. So just make setxattr return error when called by MDS' dispatch thread. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25libceph: add helper that duplicates last extent operationYan, Zheng1-0/+2
This helper duplicates last extent operation in OSD request, then adjusts the new extent operation's offset and length. The helper is for scatterd page writeback, which adds nonconsecutive dirty pages to single OSD request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: enable large, variable-sized OSD requestsIlya Dryomov1-2/+4
Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: move r_reply_op_{len,result} into struct ceph_osd_req_opYan, Zheng1-2/+3
This avoids defining large array of r_reply_op_{len,result} in in struct ceph_osd_request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: rename ceph_osd_req_op::payload_len to indata_lenIlya Dryomov1-1/+1
Follow userspace nomenclature on this - the next commit adds outdata_len. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: monc hunt rate is 3s with backoff up to 30sIlya Dryomov2-0/+6
Unless we are in the process of setting up a client (i.e. connecting to the monitor cluster for the first time), apply a backoff: every time we want to reopen a session, increase our timeout by a multiple (currently 2); when we complete the connection, reduce that multipler by 50%. Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: monc ping rate is 10sIlya Dryomov1-2/+3
Split ping interval and ping timeout: ping interval is 10s; keepalive timeout is 30s. Make monc_ping_timeout a constant while at it - it's not actually exported as a mount option (and the rest of tick-related settings won't be either), so it's got no place in ceph_options. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: revamp subs code, switch to SUBSCRIBE2 protocolIlya Dryomov3-10/+24
It is currently hard-coded in the mon_client that mdsmap and monmap subs are continuous, while osdmap sub is always "onetime". To better handle full clusters/pools in the osd_client, we need to be able to issue continuous osdmap subs. Revamp subs code to allow us to specify for each sub whether it should be continuous or not. Although not strictly required for the above, switch to SUBSCRIBE2 protocol while at it, eliminating the ambiguity between a request for "every map since X" and a request for "just the latest" when we don't have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now required - it's been supported since pre-argonaut (2010). Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling in before we validate the epoch and successfully install the new map can mess up mon_client sub state. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds2-4/+4
Pull drm fixes from Dave Airlie: "Just a couple of dma-buf related fixes and some amdgpu fixes, along with a regression fix for radeon off but default feature, but makes my 30" monitor happy again" * 'drm-next' of git://people.freedesktop.org/~airlied/linux: drm/radeon/mst: cleanup code indentation drm/radeon/mst: fix regression in lane/link handling. drm/amdgpu: add invalidate_page callback for userptrs drm/amdgpu: Revert "remove the userptr rmn->lock" drm/amdgpu: clean up path handling for powerplay drm/amd/powerplay: fix memory leak of tdp_table dma-buf/fence: fix fence_is_later v2 dma-buf: Update docs for SYNC ioctl drm: remove excess description dma-buf, drm, ion: Propagate error code from dma_buf_start_cpu_access() drm/atmel-hlcdc: use helper to get crtc state drm/atomic: use helper to get crtc state
2016-03-24Merge tag 'asm-generic-4.6' of ↵Linus Torvalds4-8/+1
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic updates from Arnd Bergmann: "There are only three patches this time, most other changes to files in include/asm-generic tend to go through the tree of whoever depends on the change. Two patches are cleanups for stuff that is no longer needed, the main change is to adapt the generic version of BUG_ON() for CONFIG_BUG=n to make it behave consistently with BUG(). This avoids undefined behavior along with a number of warnings about that undefined behavior in randconfig builds when we keep going on after hitting a BUG_ON()" * tag 'asm-generic-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: asm-generic: remove old nonatomic-io wrapper files asm-generic: default BUG_ON(x) to if(x)BUG() asm-generic: page.h: Remove useless get_user_page and free_user_page
2016-03-24Merge tag 'pm+acpi-4.6-rc1-2' of ↵Linus Torvalds1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management and ACPI updates from Rafael Wysocki: "The second batch of power management and ACPI updates for v4.6. Included are fixups on top of the previous PM/ACPI pull request and other material that didn't make into it but still should go into 4.6. Among other things, there's a fix for an intel_pstate driver issue uncovered by recent cpufreq changes, a workaround for a boot hang on Skylake-H related to the handling of deep C-states by the platform and a PCI/ACPI fix for the handling of IO port resources on non-x86 architectures plus some new device IDs and similar. Specifics: - Fix for an intel_pstate driver issue related to the handling of MSR updates uncovered by the recent cpufreq rework (Rafael Wysocki). - cpufreq core cleanups related to starting governors and frequency synchronization during resume from system suspend and a locking fix for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran). - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang, Michael Neuling, Richard Cochran, Shilpasri Bhat). - intel_idle driver update preventing some Skylake-H systems from hanging during initialization by disabling deep C-states mishandled by the platform in the problematic configurations (Len Brown). - Intel Xeon Phi Processor x200 support for intel_idle (Dasaratharaman Chandramouli). - cpuidle menu governor updates to make it always honor PM QoS latency constraints (and prevent C1 from being used as the fallback C-state on x86 when they are set below its exit latency) and to restore the previous behavior to fall back to C1 if the next timer event is set far enough in the future that was changed in 4.4 which led to an energy consumption regression (Rik van Riel, Rafael Wysocki). - New device ID for a future AMD UART controller in the ACPI driver for AMD SoCs (Wang Hongcheng). - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage scaling (AVS) driver (David Wu). - ACPI PCI resources management fix for the handling of IO space resources on architectures where the IO space is memory mapped (IA64 and ARM64) broken by the introduction of common ACPI resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi). - Fix for the ACPI backend of the generic device properties API to make it parse non-device (data node only) children of an ACPI device correctly (Irina Tirdea). - Fixes for the handling of global suspend flags (introduced in 4.4) during hibernation and resume from it (Lukas Wunner). - Support for obtaining configuration information from Device Trees in the PM clocks framework (Jon Hunter). - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian King, Geert Uytterhoeven)" * tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) PM / AVS: rockchip-io: add io selectors and supplies for rk3399 intel_idle: Support for Intel Xeon Phi Processor x200 Product Family intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled ACPI / PM: Runtime resume devices when waking from hibernate PM / sleep: Clear pm_suspend_global_flags upon hibernate cpufreq: governor: Always schedule work on the CPU running update cpufreq: Always update current frequency before startig governor cpufreq: Introduce cpufreq_update_current_freq() cpufreq: Introduce cpufreq_start_governor() cpufreq: powernv: Add sysfs attributes to show throttle stats cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static PCI: ACPI: IA64: fix IO port generic range check ACPI / util: cast data to u64 before shifting to fix sign extension cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path cpuidle: menu: Fall back to polling if next timer event is near cpufreq: acpi-cpufreq: Clean up hot plug notifier callback intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts cpufreq: Make cpufreq_quick_get() safe to call ACPI / property: fix data node parsing in acpi_get_next_subnode() ACPI / APD: Add device HID for future AMD UART controller ...