linux - Linux Kernel (branches are rebased on master from time to time)

Age	Commit message (Collapse)	Author	Files	Lines
2007-05-10	add upper-32-bits macro	Andrew Morton	1	-0/+10
	We keep on getting "right shift count >= width of type" warnings when doing things like sector_t s; x = s >> 56; because with CONFIG_LBD=n, s is only 32-bit. Similar problems can occur with dma_addr_t's. So add a simple wrapper function which code can use to avoid this warning. The above example would become x = upper_32_bits(s) >> 24; The first user is in fact AFS. Cc: James Bottomley <James.Bottomley@SteelEye.com> Cc: "Cameron, Steve" <Steve.Cameron@hp.com> Cc: "Miller, Mike (OS Dev)" <Mike.Miller@hp.com> Cc: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10	slub: support concurrent local and remote frees and allocs on a slab	Christoph Lameter	1	-2/+5
	Avoid atomic overhead in slab_alloc and slab_free SLUB needs to use the slab_lock for the per cpu slabs to synchronize with potential kfree operations. This patch avoids that need by moving all free objects onto a lockless_freelist. The regular freelist continues to exist and will be used to free objects. So while we consume the lockless_freelist the regular freelist may build up objects. If we are out of objects on the lockless_freelist then we may check the regular freelist. If it has objects then we move those over to the lockless_freelist and do this again. There is a significant savings in terms of atomic operations that have to be performed. We can even free directly to the lockless_freelist if we know that we are running on the same processor. So this speeds up short lived objects. They may be allocated and freed without taking the slab_lock. This is particular good for netperf. In order to maximize the effect of the new faster hotpath we extract the hottest performance pieces into inlined functions. These are then inlined into kmem_cache_alloc and kmem_cache_free. So hotpath allocation and freeing no longer requires a subroutine call within SLUB. [I am not sure that it is worth doing this because it changes the easy to read structure of slub just to reduce atomic ops. However, there is someone out there with a benchmark on 4 way and 8 way processor systems that seems to show a 5% regression vs. Slab. Seems that the regression is due to increased atomic operations use vs. SLAB in SLUB). I wonder if this is applicable or discernable at all in a real workload?] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	Merge branch 'for-linus' of ↵	Linus Torvalds	9	-24/+1203
	master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband * 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband: IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters IB: Put rlimit accounting struct in struct ib_umem IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules
2007-05-09	Revert "md: improve partition detection in md array"	Linus Torvalds	1	-0/+1
	This reverts commit 5b479c91da90eef605f851508744bfe8269591a0. Quoth Neil Brown: "It causes an oops when auto-detecting raid arrays, and it doesn't seem easy to fix. The array may not be 'open' when do_md_run is called, so bdev->bd_disk might be NULL, so bd_set_size can oops. This whole approach of opening an md device before it has been assembled just seems to get more and more painful. I think I'm going to have to come up with something clever to provide both backward comparability with usage expectation, and sane integration into the rest of the kernel." Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	Merge branch 'upstream' of ↵	Jeff Garzik	1	-0/+2
	git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 into upstream
2007-05-09	Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6	Linus Torvalds	1	-43/+50
	* master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6: ide: fix PIO setup on resume for ATAPI devices ide: legacy PCI bus order probing fixes ide: add ide_proc_register_port() ide: add "initializing" argument to ide_register_hw() ide: cable detection fixes (take 2) ide: move IDE settings handling to ide-proc.c ide: split off ioctl handling from IDE settings (v2) ide: make /proc/ide/ optional ide: add ide_tune_dma() helper ide: rework the code for selecting the best DMA transfer mode (v3) ide: fix UDMA/MWDMA/SWDMA masks (v3)
2007-05-10	ide: legacy PCI bus order probing fixes	Bartlomiej Zolnierkiewicz	1	-0/+5
	IDE PCI host drivers should register themselves with IDE core only when IDE driver is built-in, otherwise (IDE driver is modular and thus IDE PCI host drivers are also modular) the code has no effect and just complicates the probing. Fix it by adding new config option CONFIG_IDEPCI_PCIBUS (defined only when needed and invisible to the user) and covering by #ifdef/#endif the code in question. It turned out that "ide=reverse" was silently accepted but did nothing in case when IDE driver was modular, this is fixed now. Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: add ide_proc_register_port()	Bartlomiej Zolnierkiewicz	1	-6/+4
	* create_proc_ide_interfaces() tries to add /proc entries for every probed and initialized IDE port, replace it by ide_proc_register_port() which does it only for the given port (also rename destroy_proc_ide_interface() to ide_proc_unregister_port() for consistency) * convert {create,destroy}_proc_ide_interface[s]() users to use new functions * pmac driver depended on proc_ide_create() to add /proc port entries, fix it * au1xxx-ide, swarm and cs5520 drivers depended indirectly on ide-generic driver (CONFIG_IDE_GENERIC=y) to add port /proc entries, fix them * there is now no need to add /proc entries for IDE ports in proc_ide_create() so don't do it * proc_ide_create() needs now to be called before drivers are probed - fix it, while at it make proc_ide_create() create /proc "ide" directory Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: add "initializing" argument to ide_register_hw()	Bartlomiej Zolnierkiewicz	1	-2/+3
	Add "initializing" argument to ide_register_hw() and use it instead of ide.c wide variable of the same name. Update all users of ide_register_hw() accordingly. Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: cable detection fixes (take 2)	Bartlomiej Zolnierkiewicz	1	-0/+1
	Tejun's recent eighty_ninty_three() fix has inspired me to do more thorough review of the cable detection code... * print user-friendly warning about limiting the maximum transfer speed to UDMA33 (and the reason behind it) when 80-wire cable is not detected, also while at it cleanup eighty_ninty_three() a bit * use eighty_ninty_three() in ide_ata66_check(), this actually fixes 3 bugs: - bit 14 (word 93 validity check) == 1 && bit 13 (80-wire cable test) == 1 were used as 80-wire cable present test for CONFIG_IDEDMA_IVB=n case (please see FIXME comment in eighty_ninty_three() for more details) - CONFIG_IDEDMA_IVB=y/n cases were interchanged - check for SATA devices was missing * remove private cable warnings from pdc_202xx{old,new} drivers now that core code provides this functionality (plus, in pdc202xx_new case the test could give false warnings for ATAPI devices because pdc202xx_new driver doesn't even support ATAPI DMA) Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: move IDE settings handling to ide-proc.c	Bartlomiej Zolnierkiewicz	1	-16/+23
	* move __ide_add_setting() ide_add_setting() __ide_remove_setting() auto_remove_settings() ide_find_setting_by_name() ide_read_setting() ide_write_setting() set_xfer_rate() ide_add_generic_settings() ide_register_subdriver() ide_unregister_subdriver() from ide.c to ide-proc.c * set_{io_32bit,pio_mode,using_dma}() cannot be marked static now, fix it * rename ide_[un]register_subdriver() to ide_proc_[un]register_driver(), update device drivers to use new names * add CONFIG_IDE_PROC_FS=n versions of ide_proc_[un]register_driver() and ide_add_generic_settings() * make ide_find_setting_by_name(), ide_{read,write}_setting() and ide_{add,remove}_proc_entries() static * cover IDE settings code in device drivers with CONFIG_IDE_PROC_FS #ifdef, also while at it cover with CONFIG_IDE_PROC_FS #ifdef ide_driver_t.proc * remove bogus comment from ide.h * cover with CONFIG_IDE_PROC_FS #ifdef .proc and .settings in ide_drive_t Besides saner code this patch results in the IDE core smaller by ~2 kB (on x86-32) and IDE disk driver by ~1 kB (ditto) when CONFIG_IDE_PROC_FS=n. Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: split off ioctl handling from IDE settings (v2)	Bartlomiej Zolnierkiewicz	1	-12/+4
	* do write permission and min/max checks in ide_procset_t functions * ide-disk.c: drive->id is always available so cleanup "multcount" setting accordingly * ide-disk.c: "address" setting was incorrectly defined as type TYPE_INTA, fix it by using type TYPE_BYTE and updating ide_drive_t->adressing field, the bug didn't trigger because this IDE setting uses custom ->set function * ide.c: add set_ksettings() for handling HDIO_SET_KEEPSETTINGS ioctl * ide.c: add set_unmaskirq() for handling HDIO_SET_UNMASKINTR ioctl * handle ioctls directly in generic_ide_ioclt() and idedisk_ioctl() instead of using IDE settings to deal with them * remove no longer needed ide_find_setting_by_ioctl() and {read,write}_ioctl fields from ide_settings_t, also remove now unused TYPE_INTA handling v2: * add missing EXPORT_SYMBOL_GPL(ide_setting_sem) needed now for ide-disk Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: make /proc/ide/ optional	Bartlomiej Zolnierkiewicz	1	-6/+8
	All important information/features should be already available through sysfs and ioctl interfaces. Add CONFIG_IDE_PROC_FS (CONFIG_SCSI_PROC_FS rip-off) config option, disabling it makes IDE driver ~5 kB smaller (on x86-32). While at it add CONFIG_PROC_FS=n versions of proc_ide_{create,destroy}() and remove no longer needed #ifdefs. Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: add ide_tune_dma() helper	Bartlomiej Zolnierkiewicz	1	-0/+2
	After reworking the code responsible for selecting the best DMA transfer mode it is now possible to add generic ide_tune_dma() helper. Convert some IDE PCI host drivers to use it (the ones left need more work). Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: rework the code for selecting the best DMA transfer mode (v3)	Bartlomiej Zolnierkiewicz	1	-6/+4
	Depends on the "ide: fix UDMA/MWDMA/SWDMA masks" patch. * add ide_hwif_t.udma_filter hook for filtering UDMA mask (use it in alim15x3, hpt366, siimage and serverworks drivers) * add ide_max_dma_mode() for finding best DMA mode for the device (loosely based on some older libata-core.c code) * convert ide_dma_speed() users to use ide_max_dma_mode() * make ide_rate_filter() take "ide_drive_t drive" as an argument instead of "u8 mode" and teach it to how to use UDMA mask to do filtering use ide_rate_filter() in hpt366 driver * remove no longer needed ide_dma_speed() and _ratemask() unexport eighty_ninty_three() v2: * rename ->filter_udma_mask to ->udma_filter [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ] v3: * updated for scc_pata driver (fixes XFER_UDMA_6 filtering for user-space originated transfer mode change requests when 100MHz clock is used) Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-10	ide: fix UDMA/MWDMA/SWDMA masks (v3)	Bartlomiej Zolnierkiewicz	1	-0/+1
	* use 0x00 instead of 0x80 to disable ->{ultra,mwdma,swdma}_mask * add udma_mask field to ide_pci_device_t and use it to initialize ->ultra_mask in aec62xx, cmd64x, pdc202xx_{new,old} and piix drivers * fix UDMA masks to match with chipset specific _ratemask() (alim15x3, hpt366, serverworks and siimage drivers need UDMA mask filtering method - done in the next patch) v2: piix: fix cable detection for 82801AA_1 and 82372FB_1 [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ] * cmd64x: use hwif->cds->udma_mask [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ] * aec62xx: fix newly introduced bug - check DMA status not command register [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ] v3: * piix: use hwif->cds->udma_mask [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ] Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-09	Merge git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6	Linus Torvalds	6	-3/+21
	* git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] wire up pselect, ppoll [IA64] Add TIF_RESTORE_SIGMASK [IA64] unwind did not work for processes born with CLONE_STOPPED [IA64] Optional method to purge the TLB on SN systems [IA64] SPIN_LOCK_UNLOCKED macro cleanup in arch/ia64 [IA64-SN2][KJ] mmtimer.c-kzalloc [IA64] fix stack alignment for ia32 signal handlers [IA64] - Altix: hotplug after intr redirect can crash system [IA64] save and restore cpus_allowed in cpu_idle_wait [IA64] Removal of percpu TR cleanup in kexec code [IA64] Fix some section mismatch errors
2007-05-09	Merge git://git.infradead.org/mtd-2.6	Linus Torvalds	3	-7/+20
	* git://git.infradead.org/mtd-2.6: (21 commits) [MTD] [CHIPS] Remove MTD_OBSOLETE_CHIPS (jedec, amd_flash, sharp) [MTD] Delete allegedly obsolete "bank_size" field of mtd_info. [MTD] Remove unnecessary user space check from mtd.h. [MTD] [MAPS] Remove flash maps for no longer supported 405LP boards [MTD] [MAPS] Fix missing printk() parameter in physmap_of.c MTD driver [MTD] [NAND] platform NAND driver: add driver [MTD] [NAND] platform NAND driver: update header [JFFS2] Simplify and clean up jffs2_add_tn_to_tree() some more. [JFFS2] Remove another bogus optimisation in jffs2_add_tn_to_tree() [JFFS2] Remove broken insert_point optimisation in jffs2_add_tn_to_tree() [JFFS2] Remember to calculate overlap on nodes which replace older nodes [JFFS2] Don't advance c->wbuf_ofs to next eraseblock after wbuf flush [MTD] [NAND] at91_nand.c: CMDLINE_PARTS support [MTD] [NAND] Tidy up handling of page number in nand_block_bad() [MTD] block2mtd_paramline[] mustn't be __initdata [MTD] [NAND] Support multiple chips in CAFÉ driver [MTD] [NAND] Rename cafe.c to cafe_nand.c and remove the multi-obj magic [MTD] [NAND] Use rslib for CAFÉ ECC [RSLIB] Support non-canonical GF representations [JFFS2] Remove dead file histo_mips.h ...
2007-05-09	Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6	Linus Torvalds	8	-64/+72
	* master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6: sh: Fix stacktrace simplification fallout. sh: SH7760 DMABRG support. sh: clockevent/clocksource/hrtimers/nohz TMU support. sh: Truncate MAX_ACTIVE_REGIONS for the common case. rtc: rtc-sh: Fix rtc_dev pointer for rtc_update_irq(). sh: Convert to common die chain. sh: Wire up utimensat syscall. sh: landisk mv_nr_irqs definition. sh: Fixup ndelay() xloops calculation for alternate HZ. sh: Add 32-bit opcode feature CPU flag. sh: Fix PC adjustments for varying opcode length. sh: Support for SH-2A 32-bit opcodes. sh: Kill off redundant __div64_32 symbol export. sh: Share exception vector table for SH-3/4. sh: Always define TRAPA_BUG_OPCODE. sh: __GFP_REPEAT for pte allocations, too. rtc: rtc-sh: Fix up dev_dbg() warnings. sh: generic quicklist support.
2007-05-09	Merge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm	Linus Torvalds	34	-797/+531
	* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (28 commits) ARM: OMAP: Fix GCC-reported compile time bug ARM: OMAP: restore CONFIG_GENERIC_TIME ARM: OMAP: partial LED fixes ARM: OMAP: add SoSSI clock (call propagate_rate for childrens) ARM: OMAP: FB sync with N800 tree (support for dynamic SRAM allocations) ARM: OMAP: Sync framebuffer headers with N800 tree ARM: OMAP: Mostly cosmetic to sync up with linux-omap tree ARM: OMAP: Fix gpmc header ARM: OMAP: Add mailbox support for IVA [ARM] armv7: add Makefile and Kconfig entries [ARM] armv7: add support for asid-tagged VIVT I-cache [ARM] armv7: add dedicated ARMv7 barrier instructions [ARM] armv7: Add ARMv7 cacheid macros [ARM] armv7: add support for ARMv7 cores. [ARM] Fix ARM branch relocation range [ARM] 4363/1: AT91: Remove legacy PIO definitions [ARM] 4361/1: AT91: Build error ARM: OMAP: Sync core code with linux-omap ARM: OMAP: Sync headers with linux-omap ARM: OMAP: h4 must have blinky leds!! ...
2007-05-09	Merge branch 'master' of ↵	Linus Torvalds	8	-71/+93
	git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: [POWERPC] Further fixes for the removal of 4level-fixup hack from ppc32 [POWERPC] EEH: log all PCI-X and PCI-E AER registers [POWERPC] EEH: capture and log pci state on error [POWERPC] EEH: Split up long error msg [POWERPC] EEH: log error only after driver notification. [POWERPC] fsl_soc: Make mac_addr const in fs_enet_of_init(). [POWERPC] Don't use SLAB/SLUB for PTE pages [POWERPC] Spufs support for 64K LS mappings on 4K kernels [POWERPC] Add ability to 4K kernel to hash in 64K pages [POWERPC] Introduce address space "slices" [POWERPC] Small fixes & cleanups in segment page size demotion [POWERPC] iSeries: Make HVC_ISERIES the default [POWERPC] iSeries: suppress build warning in lparmap.c [POWERPC] Mark pages that don't exist as nosave [POWERPC] swsusp: Introduce register_nosave_region_late
2007-05-09	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial	Linus Torvalds	61	-63/+63
	* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits) sound: convert "sound" subdirectory to UTF-8 MAINTAINERS: Add cxacru website/mailing list include files: convert "include" subdirectory to UTF-8 general: convert "kernel" subdirectory to UTF-8 documentation: convert the Documentation directory to UTF-8 Convert the toplevel files CREDITS and MAINTAINERS to UTF-8. remove broken URLs from net drivers' output Magic number prefix consistency change to Documentation/magic-number.txt trivial: s/i_sem /i_mutex/ fix file specification in comments drivers/base/platform.c: fix small typo in doc misc doc and kconfig typos Remove obsolete fat_cvf help text Fix occurrences of "the the " Fix minor typoes in kernel/module.c Kconfig: Remove reference to external mqueue library Kconfig: A couple of grammatical fixes in arch/i386/Kconfig Correct comments in genrtc.c to refer to correct /proc file. Fix more "deprecated" spellos. Fix "deprecated" typoes. ... Fix trivial comment conflict in kernel/relay.c.
2007-05-09	Merge branch 'for-linus' of git://www.atmel.no/~hskinnemoen/linux/kernel/avr32	Linus Torvalds	2	-2/+4
	* 'for-linus' of git://www.atmel.no/~hskinnemoen/linux/kernel/avr32: [AVR32] Wire up sys_utimensat [AVR32] Fix section mismatch .taglist -> .init.text [AVR32] Implement dma_{alloc,free}_writecombine() AVR32: Spinlock initializer cleanup [AVR32] Use correct config symbol when setting cpuflags
2007-05-09	i386: msr.h: be paranoid about types and parentheses	H. Peter Anvin	1	-27/+22
	When implementing things as macros, make sure we use typecasts and parentheses where needed. The macros as defined were vulnerable to surreptitious promotion causing problems. Avoid macros where practical; e.g. wrmsr() can be an inline instead. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	i386: remove unused rdtsc() macro	H. Peter Anvin	2	-12/+0
	All users to the two-part rdtsc() macro have already switched to using rdtscl() or rdtscll(). Remove the now-obsolete macro. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	md: improve partition detection in md array	NeilBrown	1	-1/+0
	md currently uses ->media_changed to make sure rescan_partitions is call on md array after they are assembled. However that doesn't happen until the array is opened, which is later than some people would like. So use blkdev_ioctl to do the rescan immediately that the array has been assembled. This means we can remove all the ->change infrastructure as it was only used to trigger a partition rescan. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	fbdev: add support for AVR32	Haavard Skinnemoen	1	-1/+1
	Provide framebuffer page protection flags and definitions of fb_readl/fb_writel for AVR32. Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	svgalib: move fb_get_caps to svgalib	Antonino A. Daplas	1	-0/+2
	Move fb_get_caps() method to svgalib.c as svga_get_caps() so it can be used by s3fb, arkfb and vt8623fb. Signed-off-by: Antonino Daplas <adaplas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	i386 mmzone: use __maybe_unused	David Rientjes	1	-3/+3
	Replace automatic variable instances of __attribute__ ((unused)) with __maybe_unused. Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	sh: dma: use __maybe_unused	David Rientjes	3	-3/+3
	There is no such thing as labeling a variable as __attribute__((used)). Since ts_shift is not referenced in inline assembly, we assume that we're simply suppressing a warning here if the variable is declared but unreferenced. Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	compiler: introduce __used and __maybe_unused	David Rientjes	4	-6/+25
	__used is defined to be __attribute__((unused)) for all pre-3.3 gcc compilers to suppress warnings for unused functions because perhaps they are referenced only in inline assembly. It is defined to be __attribute__((used)) for gcc 3.3 and later so that the code is still emitted for such functions. __maybe_unused is defined to be __attribute__((unused)) for both function and variable use if it could possibly be unreferenced due to the evaluation of preprocessor macros. Function prototypes shall be marked with __maybe_unused if the actual definition of the function is dependant on preprocessor macros. No update to compiler-intel.h is necessary because ICC supports both __attribute__((used)) and __attribute__((unused)) as specified by the gcc manual. __attribute_used__ is deprecated and will be removed once all current code is converted to using __used. Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	rename thread_info to stack	Roman Zippel	7	-18/+18
	This finally renames the thread_info field in task structure to stack, so that the assumptions about this field are gone and archs have more freedom about placing the thread_info structure. Nonbroken archs which have a proper thread pointer can do the access to both current thread and task structure via a single pointer. It'll allow for a few more cleanups of the fork code, from which e.g. ia64 could benefit. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> [akpm@linux-foundation.org: build fix] Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Andi Kleen <ak@muc.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	wrap access to thread_info	Roman Zippel	5	-5/+5
	Recently a few direct accesses to the thread_info in the task structure snuck back, so this wraps them with the appropriate wrapper. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	Allow arch to initialize arch field of the module structure	Roman Zippel	1	-0/+3
	This will later allow an arch to add module specific information via linker generated tables instead of poking directly in the module object structure. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	clocksource: fix resume logic	Thomas Gleixner	1	-0/+3
	We need to make sure that the clocksources are resumed, when timekeeping is resumed. The current resume logic does not guarantee this. Add a resume function pointer to the clocksource struct, so clocksource drivers which need to reinitialize the clocksource can provide a resume function. Add a resume function, which calls the maybe available clocksource resume functions and resets the watchdog function, so a stable TSC can be used accross suspend/resume. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Andi Kleen <ak@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	Move remote node draining out of slab allocators	Christoph Lameter	2	-5/+4
	Currently the slab allocators contain callbacks into the page allocator to perform the draining of pagesets on remote nodes. This requires SLUB to have a whole subsystem in order to be compatible with SLAB. Moving node draining out of the slab allocators avoids a section of code in SLUB. Move the node draining so that is is done when the vm statistics are updated. At that point we are already touching all the cachelines with the pagesets of a processor. Add a expire counter there. If we have to update per zone or global vm statistics then assume that the pageset will require subsequent draining. The expire counter will be decremented on each vm stats update pass until it reaches zero. Then we will drain one batch from the pageset. The draining will cause vm counter updates which will then cause another expiration until the pcp is empty. So we will drain a batch every 3 seconds. Note that remote node draining is a somewhat esoteric feature that is required on large NUMA systems because otherwise significant portions of system memory can become trapped in pcp queues. The number of pcp is determined by the number of processors and nodes in a system. A system with 4 processors and 2 nodes has 8 pcps which is okay. But a system with 1024 processors and 512 nodes has 512k pcps with a high potential for large amount of memory being caught in them. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	vmstat: use our own timer events	Christoph Lameter	1	-3/+0
	vmstat is currently using the cache reaper to periodically bring the statistics up to date. The cache reaper does only exists in SLUB as a way to provide compatibility with SLAB. This patch removes the vmstat calls from the slab allocators and provides its own handling. The advantage is also that we can use a different frequency for the updates. Refreshing vm stats is a pretty fast job so we can run this every second and stagger this by only one tick. This will lead to some overlap in large systems. F.e a system running at 250 HZ with 1024 processors will have 4 vm updates occurring at once. However, the vm stats update only accesses per node information. It is only necessary to stagger the vm statistics updates per processor in each node. Vm counter updates occurring on distant nodes will not cause cacheline contention. We could implement an alternate approach that runs the first processor on each node at the second and then each of the other processor on a node on a subsequent tick. That may be useful to keep a large amount of the second free of timer activity. Maybe the timer folks will have some feedback on this one? [jirislaby@gmail.com: add missing break] Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	Add suspend-related notifications for CPU hotplug	Rafael J. Wysocki	1	-0/+12
	Since nonboot CPUs are now disabled after tasks and devices have been frozen and the CPU hotplug infrastructure is used for this purpose, we need special CPU hotplug notifications that will help the CPU-hotplug-aware subsystems distinguish normal CPU hotplug events from CPU hotplug events related to a system-wide suspend or resume operation in progress. This patch introduces such notifications and causes them to be used during suspend and resume transitions. It also changes all of the CPU-hotplug-aware subsystems to take these notifications into consideration (for now they are handled in the same way as the corresponding "normal" ones). [oleg@tv-sign.ru: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	fs: deprecate memclear_highpage_flush	Nate Diller	1	-2/+1
	Now that all the in-tree users are converted over to zero_user_page(), deprecate the old memclear_highpage_flush() call. Signed-off-by: Nate Diller <nate.diller@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	fs: convert core functions to zero_user_page	Nate Diller	1	-9/+19
	It's very common for file systems to need to zero part or all of a page, the simplist way is just to use kmap_atomic() and memset(). There's actually a library function in include/linux/highmem.h that does exactly that, but it's confusingly named memclear_highpage_flush(), which is descriptive of how it does the work rather than what the purpose is. So this patchset renames the function to zero_user_page(), and calls it from the various places that currently open code it. This first patch introduces the new function call, and converts all the core kernel callsites, both the open-coded ones and the old memclear_highpage_flush() ones. Following this patch is a series of conversions for each file system individually, per AKPM, and finally a patch deprecating the old call. The diffstat below shows the entire patchset. [akpm@linux-foundation.org: fix a few things] Signed-off-by: Nate Diller <nate.diller@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	FUTEX: new PRIVATE futexes	Eric Dumazet	1	-3/+26
	Analysis of current linux futex code : -------------------------------------- A central hash table futex_queues[] holds all contexts (futex_q) of waiting threads. Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to perform lookups or insert/deletion of a futex_q. When a futex_wait() is done, calling thread has to : 1) - Obtain a read lock on mmap_sem to be able to validate the user pointer (calling find_vma()). This validation tells us if the futex uses an inode based store (mapped file), or mm based store (anonymous mem) 2) - compute a hash key 3) - Atomic increment of reference counter on an inode or a mm_struct 4) - lock part of futex_queues[] hash table 5) - perform the test on value of futex. (rollback is value != expected_value, returns EWOULDBLOCK) (various loops if test triggers mm faults) 6) queue the context into hash table, release the lock got in 4) 7) - release the read_lock on mmap_sem <block> 8) Eventually unqueue the context (but rarely, as this part may be done by the futex_wake()) Futexes were designed to improve scalability but current implementation has various problems : - Central hashtable : This means scalability problems if many processes/threads want to use futexes at the same time. This means NUMA unbalance because this hashtable is located on one node. - Using mmap_sem on every futex() syscall : Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic ops on mmap_sem, dirtying cache line : - lot of cache line ping pongs on SMP configurations. mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()) Highly threaded processes might suffer from mmap_sem contention. mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded programs because of contention on the mmap_sem cache line. - Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter: It's also a cache line ping pong on SMP. It also increases mmap_sem hold time because of cache misses. Most of these scalability problems come from the fact that futexes are in one global namespace. As we use a central hash table, we must make sure they are all using the same reference (given by the mm subsystem). We chose to force all futexes be 'shared'. This has a cost. But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and optimal performance if carefuly implemented. Time has come for linux to have better threading performance. The goal is to permit new futex commands to avoid : - Taking the mmap_sem semaphore, conflicting with other subsystems. - Modifying a ref_count on mm or an inode, still conflicting with mm or fs. This is possible because, for one process using PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their virtual address, no matter the underlying mm storage is. If glibc wants to exploit this new infrastructure, it should use new _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And be prepared to fallback on old subcommands for old kernels. Using one global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK. PTHREAD_PROCESS_SHARED futexes should still use the old subcommands. Compatibility with old applications is preserved, they still hit the scalability problems, but new applications can fly :) Note : the same SHARED futex (mapped on a file) can be used by old binaries and new binaries, because both binaries will use the old subcommands. Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic, as this is the default semantic. Almost all applications should benefit of this changes (new kernel and updated libc) Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine) /* calling futex_wait(addr, value) with value != addr / 433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes) 424 cycles per futex(FUTEX_WAIT) call (using one futex) 334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes) 334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex) For reference : 187 cycles per getppid() call 188 cycles per umask() call 181 cycles per ni_syscall() call Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Pierre Peiffer <pierre.peiffer@bull.net> Cc: "Ulrich Drepper" <drepper@gmail.com> Cc: "Nick Piggin" <nickpiggin@yahoo.com.au> Cc: "Ingo Molnar" <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	futex_requeue_pi optimization	Pierre Peiffer	1	-1/+8
	This patch provides the futex_requeue_pi functionality, which allows some threads waiting on a normal futex to be requeued on the wait-queue of a PI-futex. This provides an optimization, already used for (normal) futexes, to be used with the PI-futexes. This optimization is currently used by the glibc in pthread_broadcast, when using "normal" mutexes. With futex_requeue_pi, it can be used with PRIO_INHERIT mutexes too. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	Make futex_wait() use an hrtimer for timeout	Pierre Peiffer	1	-1/+3
	This patch modifies futex_wait() to use an hrtimer + schedule() in place of schedule_timeout(). schedule_timeout() is tick based, therefore the timeout granularity is the tick (1 ms, 4 ms or 10 ms depending on HZ). By using a high resolution timer for timeout wakeup, we can attain a much finer timeout granularity (in the microsecond range). This parallels what is already done for futex_lock_pi(). The timeout passed to the syscall is no longer converted to jiffies and is therefore passed to do_futex() and futex_wait() as an absolute ktime_t therefore keeping nanosecond resolution. Also this removes the need to pass the nanoseconds timeout part to futex_lock_pi() in val2. In futex_wait(), if there is no timeout then a regular schedule() is performed. Otherwise, an hrtimer is fired before schedule() is called. [akpm@linux-foundation.org: fix `make headers_check'] Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	declare struct ktime	Andrew Morton	1	-2/+4
	Some smarty went and inflicted ktime_t as a typedef upon us, so we cannot forward declare it. Create a new `union ktime', map ktime_t onto that. Now we need to kill off this ktime_t thing. Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	aio is unlikely	Andrew Morton	1	-1/+2
	Stick an unlikely() around is_aio(): I assert that most IO is synchronous. Cc: Suparna Bhattacharya <suparna@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	RPC: add wrapper for svc_reserve to account for checksum	Jeff Layton	1	-0/+19
	When the kernel calls svc_reserve to downsize the expected size of an RPC reply, it fails to account for the possibility of a checksum at the end of the packet. If a client mounts a NFSv2/3 with sec=krb5i/p, and does I/O then you'll generally see messages similar to this in the server's ring buffer: RPC request reserved 164 but used 208 While I was never able to verify it, I suspect that this problem is also the root cause of some oopses I've seen under these conditions: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227726 This is probably also a problem for other sec= types and for NFSv4. The large reserved size for NFSv4 compound packets seems to generally paper over the problem, however. This patch adds a wrapper for svc_reserve that accounts for the possibility of a checksum. It also fixes up the appropriate callers of svc_reserve to call the wrapper. For now, it just uses a hardcoded value that I determined via testing. That value may need to be revised upward as things change, or we may want to eventually add a new auth_op that attempts to calculate this somehow. Unfortunately, there doesn't seem to be a good way to reliably determine the expected checksum length prior to actually calculating it, particularly with schemes like spkm3. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Acked-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	knfsd: rename sk_defer_lock to sk_lock	NeilBrown	1	-1/+2
	Now that sk_defer_lock protects two different things, make the name more generic. Also don't bother with disabling _bh as the lock is only ever taken from process context. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	remove nfs4_acl_add_ace()	Adrian Bunk	1	-1/+0
	nfs4_acl_add_ace() can now be removed. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Neil Brown <neilb@cse.unsw.edu.au> Acked-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	change kernel threads to ignore signals instead of blocking them	Oleg Nesterov	1	-0/+1
	Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against signals. This doesn't prevent the signal delivery, this only blocks signal_wake_up(). Every "killall -33 kthreadd" means a "struct siginfo" leak. Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking them (make a new helper ignore_signals() for that). If the kernel thread needs some signal, it should use allow_signal() anyway, and in that case it should not use CLONE_SIGHAND. Note that we can't change daemonize() (should die!) in the same way, because it can be used along with CLONE_SIGHAND. This means that allow_signal() still should unblock the signal to work correctly with daemonize()ed threads. However, disallow_signal() doesn't block the signal any longer but ignores it. NOTE: with or without this patch the kernel threads are not protected from handle_stop_signal(), this seems harmless, but not good. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09	kthread: don't depend on work queues	Eric W. Biederman	1	-0/+3
	Currently there is a circular reference between work queue initialization and kthread initialization. This prevents the kthread infrastructure from initializing until after work queues have been initialized. We want the properties of tasks created with kthread_create to be as close as possible to the init_task and to not be contaminated by user processes. The later we start our kthreadd that creates these tasks the harder it is to avoid contamination from user processes and the more of a mess we have to clean up because the defaults have changed on us. So this patch modifies the kthread support to not use work queues but to instead use a simple list of structures, and to have kthreadd start from init_task immediately after our kernel thread that execs /sbin/init. By being a true child of init_task we only have to change those process settings that we want to have different from init_task, such as our process name, the cpus that are allowed, blocking all signals and setting SIGCHLD to SIG_IGN so that all of our children are reaped automatically. By being a true child of init_task we also naturally get our ppid set to 0 and do not wind up as a child of PID == 1. Ensuring that tasks generated by kthread_create will not slow down the functioning of the wait family of functions. [akpm@linux-foundation.org: use interruptible sleeps] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>