summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2006-03-22[TCP]: Do not use inet->id of global tcp_socket when sending RST.Alexey Kuznetsov1-5/+1
The problem is in ip_push_pending_frames(), which uses: if (!df) { __ip_select_ident(iph, &rt->u.dst, 0); } else { iph->id = htons(inet->id++); } instead of ip_select_ident(). Right now I think the code is a nonsense. Most likely, I copied it from old ip_build_xmit(), where it was really special, we had to decide whether to generate unique ID when generating the first (well, the last) fragment. In ip_push_pending_frames() it does not make sense, it should use plain ip_select_ident() instead. Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: Fix undefined references to get_h225_addrPatrick McHardy1-2/+2
get_h225_addr is exported, but declared static, which fails when linking statically. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: futher {ip,ip6,arp}_tables unificationDmitry Mishin5-140/+66
This patch moves {ip,ip6,arp}t_entry_{match,target} definitions to x_tables.h. This move simplifies code and future compatibility fixes. Signed-off-by: Dmitry Mishin <dim@openvz.org> Acked-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: Fix xt_policy address matchingPatrick McHardy1-3/+3
Fix missing inversion in address matching, it was broken during the conversion to x_tables. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: nf_conntrack: support for layer 3 protocol load on demandPablo Neira Ayuso9-0/+130
x_tables matches and targets that require nf_conntrack_ipv[4|6] to work don't have enough information to load on demand these modules. This patch introduces the following changes to solve this issue: o nf_ct_l3proto_try_module_get: try to load the layer 3 connection tracker module and increases the refcount. o nf_ct_l3proto_module put: drop the refcount of the module. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: x_tables: set the protocol family in x_tables targets/matchesPablo Neira Ayuso32-163/+235
Set the family field in xt_[matches|targets] registered. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: conntrack: cleanup the conntrack ID initializationPablo Neira Ayuso2-4/+4
Currently the first conntrack ID assigned is 2, use 1 instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: nfnetlink_queue: fix nfnetlink message sizePablo Neira Ayuso1-9/+10
Fix oversized message, use NLMSG_SPACE just one since it reserves space for the netlink header and NFA_SPACE for every attribute. Thanks to Harald Welte for the feedback Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: ctnetlink: Fix expectaction mask dumpingPablo Neira Ayuso2-39/+117
The expectation mask has some particularities that requires a different handling. The protocol number fields can be set to non-valid protocols, ie. l3num is set to 0xFFFF. Since that protocol does not exist, the mask tuple will not be dumped. Moreover, this results in a kernel panic when nf_conntrack accesses the array of protocol handlers, that is PF_MAX (0x1F) long. This patch introduces the function ctnetlink_exp_dump_mask, that correctly dumps the expectation mask. Such function uses the l3num value from the expectation tuple that is a valid layer 3 protocol number. The value of the l3num mask isn't dumped since it is meaningless from the userspace side. Thanks to Yasuyuki Kozakai and Patrick McHardy for the feedback. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: Fix Kconfig typosThomas Vögtle1-3/+3
Signed-off-by: Thomas Vögtle <tv@lio96.de> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22[NETFILTER]: Fix ip6tables breakage from {get,set}sockopt compat layerPatrick McHardy1-2/+2
do_ipv6_getsockopt returns -EINVAL for unknown options, not -ENOPROTOOPT as do_ipv6_setsockopt. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/perex/alsaLinus Torvalds202-2755/+4939
* git://git.kernel.org/pub/scm/linux/kernel/git/perex/alsa: (124 commits) [ALSA] version 1.0.11rc4 [PATCH] Intruduce DMA_28BIT_MASK [ALSA] hda-codec - Add support for ASUS P4GPL-X [ALSA] hda-codec - Add support for HP nx9420 laptop [ALSA] Fix memory leaks in error path of control.c [ALSA] AMD Au1x00: AC'97 controller is memory mapped [ALSA] AMD Au1x00: fix DMA init/cleanup [ALSA] hda-codec - Fix generic auto-configurator [ALSA] hda-codec - Fix BIOS auto-configuration [ALSA] Fixes typos in Audiophile-USB.txt [ALSA] ice1712 - typo fixes for dxr_enable module option [ALSA] AMD Au1x00: make driver build after cleanup [ALSA] ice1712 - Fix wrong value types for enum items [ALSA] fix resource leak in usbmixer [ALSA] Fix gus_pcm dereference before NULL [ALSA] Fix seq_clientmgr dereferences before NULL check [ALSA] hda-codec - Fix for Samsung R65 and ASUS A6J [ALSA] hda-codec - Add support for VAIO FE550G and SZ110 [ALSA] usb-audio: add Maya44 mixer control names [ALSA] usb-audio: add Casio PL-40R support ...
2006-03-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds22-23/+136
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: fixed path to moved file in include/linux/device.h Fix spelling in E1000_DISABLE_PACKET_SPLIT Kconfig description Documentation/dvb/get_dvb_firmware: fix firmware URL Documentation: Update to BUG-HUNTING Remove superfluous NOTIFY_COOKIE_LEN define add "tags" to .gitignore Fix "frist", "fisrt", typos fix rwlock usage example It's UTF-8
2006-03-22Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6Linus Torvalds24-222/+491
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6: [SPARC64]: Add a secondary TSB for hugepage mappings. [SPARC]: Respect vm_page_prot in io_remap_page_range().
2006-03-22Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds12-31/+366
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [TG3]: Bump driver version and reldate. [TG3]: Skip phy power down on some devices [TG3]: Fix SRAM access during tg3_init_one() [X25]: dte facilities 32 64 ioctl conversion [X25]: allow ITU-T DTE facilities for x25 [X25]: fix kernel error message 64 bit kernel [X25]: ioctl conversion 32 bit user to 64 bit kernel [NET]: socket timestamp 32 bit handler for 64 bit kernel [NET]: allow 32 bit socket ioctl in 64 bit kernel [BLUETOOTH]: Return negative error constant
2006-03-22Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds132-42698/+35515
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (138 commits) [SCSI] libata: implement minimal transport template for ->eh_timed_out [SCSI] eliminate rphy allocation in favour of expander/end device allocation [SCSI] convert mptsas over to end_device/expander allocations [SCSI] allow displaying and setting of cache type via sysfs [SCSI] add scsi_mode_select to scsi_lib.c [SCSI] 3ware 9000 add big endian support [SCSI] qla2xxx: update MAINTAINERS [SCSI] scsi: move target_destroy call [SCSI] fusion - bump version [SCSI] fusion - expander hotplug suport in mptsas module [SCSI] fusion - exposing raid components in mptsas [SCSI] fusion - memory leak, and initializing fields [SCSI] fusion - exclosure misspelled [SCSI] fusion - cleanup mptsas event handling functions [SCSI] fusion - removing target_id/bus_id from the VirtDevice structure [SCSI] fusion - static fix's [SCSI] fusion - move some debug firmware event debug msgs to verbose level [SCSI] fusion - loginfo header update [SCSI] add scsi_reprobe_device [SCSI] megaraid_sas: fix extended timeout handling ...
2006-03-22[PATCH] SELinux: add slab cache for inode security structJames Morris1-2/+8
Add a slab cache for the SELinux inode security struct, one of which is allocated for every inode instantiated by the system. The memory savings are considerable. On 64-bit, instead of the size-128 cache, we have a slab object of 96 bytes, saving 32 bytes per object. After booting, I see about 4000 of these and then about 17,000 after a kernel compile. With this patch, we save around 530KB of kernel memory in the latter case. On 32-bit, the savings are about half of this. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] SELinux: cleanup stray variable in selinux_inode_init_security()James Morris1-2/+0
Remove an unneded pointer variable in selinux_inode_init_security(). Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] SELinux: fix hard link count for selinuxfs root directoryJames Morris1-5/+9
A further fix is needed for selinuxfs link count management, to ensure that the count is correct for the parent directory when a subdirectory is created. This is only required for the root directory currently, but the code has been updated for the general case. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] selinuxfs cleanups: sel_make_avc_filesJames Morris1-5/+2
Fix copy & paste error in sel_make_avc_files(), removing a supurious call to d_genocide() in the error path. All of this will be cleaned up by kill_litter_super(). Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] selinuxfs cleanups: sel_make_boolsJames Morris1-4/+1
Remove the call to sel_make_bools() from sel_fill_super(), as policy needs to be loaded before the boolean files can be created. Policy will never be loaded during sel_fill_super() as selinuxfs is kernel mounted during init and the only means to load policy is via selinuxfs. Also, the call to d_genocide() on the error path of sel_make_bools() is incorrect and replaced with sel_remove_bools(). Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] selinuxfs cleanups: sel_fill_super exit pathJames Morris1-17/+24
Unify the error path of sel_fill_super() so that all errors pass through the same point and generate an error message. Also, removes a spurious dput() in the error path which breaks the refcounting for the filesystem (litter_kill_super() will correctly clean things up itself on error). Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] selinuxfs cleanups: use sel_make_dir()James Morris1-8/+4
Use existing sel_make_dir() helper to create booleans directory rather than duplicating the logic. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] selinuxfs cleanups: fix hard link countJames Morris1-0/+4
Fix the hard link count for selinuxfs directories, which are currently one short. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] selinux: simplify sel_read_boolStephen Smalley1-19/+1
Simplify sel_read_bool to use the simple_read_from_buffer helper, like the other selinuxfs functions. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] sem2mutex: security/Ingo Molnar3-16/+19
Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Stephen Smalley <sds@epoch.ncsc.mil> Cc: James Morris <jmorris@namei.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] selinux: Disable automatic labeling of new inodes when no policy is ↵Stephen Smalley1-1/+1
loaded This patch disables the automatic labeling of new inodes on disk when no policy is loaded. Discussion is here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180296 In short, we're changing the behavior so that when no policy is loaded, SELinux does not label files at all. Currently it does add an 'unlabeled' label in this case, which we've found causes problems later. SELinux always maintains a safe internal label if there is none, so with this patch, we just stick with that and wait until a policy is loaded before adding a persistent label on disk. The effect is simply that if you boot with SELinux enabled but no policy loaded and create a file in that state, SELinux won't try to set a security extended attribute on the new inode on the disk. This is the only sane behavior for SELinux in that state, as it cannot determine the right label to assign in the absence of a policy. That state usually doesn't occur, but the rawhide installer seemed to be misbehaving temporarily so it happened to show up on a test install. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] page migration reorgChristoph Lameter10-660/+741
Centralize the page migration functions in anticipation of additional tinkering. Creates a new file mm/migrate.c 1. Extract buffer_migrate_page() from fs/buffer.c 2. Extract central migration code from vmscan.c 3. Extract some components from mempolicy.c 4. Export pageout() and remove_from_swap() from vmscan.c 5. Make it possible to configure NUMA systems without page migration and non-NUMA systems with page migration. I had to so some #ifdeffing in mempolicy.c that may need a cleanup. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: slab cache interleave rotor fixPaul Jackson1-1/+1
The alien cache rotor in mm/slab.c assumes that the first online node is node 0. Eventually for some archs, especially with hotplug, this will no longer be true. Fix the interleave rotor to handle the general case of node numbering. Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: hugetlb alloc_fresh_huge_page bogus node loop fixPaul Jackson1-1/+3
Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was assuming that nodes are numbered contiguously from 0 to num_online_nodes(). Once the hotplug folks get this far, that will be false. Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] fix swap cluster offsetAkinobu Mita1-1/+1
When we've allocated SWAPFILE_CLUSTER pages, ->cluster_next should be the first index of swap cluster. But current code probably sets it wrong offset. Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] drain_node_pages: interrupt latency reduction / optimizationChristoph Lameter1-4/+8
1. Only disable interrupts if there is actually something to free 2. Only dirty the pcp cacheline if we actually freed something. 3. Disable interrupts for each single pcp and not for cleaning all the pcps in all zones of a node. drain_node_pages is called every 2 seconds from cache_reap. This fix should avoid most disabling of interrupts. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: fix drain_array() so that it works correctly with the shared_arrayChristoph Lameter1-9/+12
The list_lock also protects the shared array and we call drain_array() with the shared array. Therefore we cannot go as far as I wanted to but have to take the lock in a way so that it also protects the array_cache in drain_pages. (Note: maybe we should make the array_cache locking more consistent? I.e. always take the array cache lock for shared arrays and disable interrupts for the per cpu arrays?) Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: remove drain_array_lockedChristoph Lameter1-21/+10
Remove drain_array_locked and use that opportunity to limit the time the l3 lock is taken further. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: make drain_array more universal by adding more parametersChristoph Lameter1-9/+11
And a parameter to drain_array to control the freeing of all objects and then use drain_array() to replace instances of drain_array_locked with drain_array. Doing so will avoid taking locks in those locations if the arrays are empty. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: cache_reap(): further reduction in interrupt holdoffChristoph Lameter1-14/+43
cache_reap takes the l3->list_lock (disabling interrupts) unconditionally and then does a few checks and maybe does some cleanup. This patch makes cache_reap() only take the lock if there is work to do and then the lock is taken and released for each cleaning action. The checking of when to do the next reaping is done without any locking and becomes racy. Should not matter since reaping can also be skipped if the slab mutex cannot be acquired. The same is true for the touched processing. If we get this wrong once in awhile then we will mistakenly clean or not clean the shared cache. This will impact performance slightly. Note that the additional drain_array() function introduced here will fall out in a subsequent patch since array cleaning will now be very similar from all callers. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: make shrink_all_memory try harderRafael J. Wysocki1-0/+7
Make shrink_all_memory() repeat the attempts to free more memory if there seems to be no pages to free. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] optimize follow_hugetlb_pageChen, Kenneth W1-8/+17
follow_hugetlb_page() walks a range of user virtual address and then fills in list of struct page * into an array that is passed from the argument list. It also gets a reference count via get_page(). For compound page, get_page() actually traverse back to head page via page_private() macro and then adds a reference count to the head page. Since we are doing a virt to pte look up, kernel already has a struct page pointer into the head page. So instead of traverse into the small unit page struct and then follow a link back to the head page, optimize that with incrementing the reference count directly on the head page. The benefit is that we don't take a cache miss on accessing page struct for the corresponding user address and more importantly, not to pollute the cache with a "not very useful" round trip of pointer chasing. This adds a moderate performance gain on an I/O intensive database transaction workload. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] convert hugetlbfs_counter to atomicChen, Kenneth W1-16/+2
Implementation of hugetlbfs_counter() is functionally equivalent to atomic_inc_return(). Use the simpler atomic form. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: is_aligned_hugepage_range() cleanupDavid Gibson8-69/+16
Quite a long time back, prepare_hugepage_range() replaced is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to verify if an address range is suitable for a hugepage mapping. is_aligned_hugepage_range() stuck around, but only to implement prepare_hugepage_range() on archs which didn't implement their own. Most archs (everything except ia64 and powerpc) used the same implementation of is_aligned_hugepage_range(). On powerpc, which implements its own prepare_hugepage_range(), the custom version was never used. In addition, "is_aligned_hugepage_range()" was a bad name, because it suggests it returns true iff the given range is a good hugepage range, whereas in fact it returns 0-or-error (so the sense is reversed). This patch cleans up by abolishing is_aligned_hugepage_range(). Instead prepare_hugepage_range() is defined directly. Most archs use the default version, which simply checks the given region is aligned to the size of a hugepage. ia64 and powerpc define custom versions. The ia64 one simply checks that the range is in the correct address space region in addition to being suitably aligned. The powerpc version (just as previously) checks for suitable addresses, and if necessary performs low-level MMU frobbing to set up new areas for use by hugepages. No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Move hugetlb_free_pgd_range() prototype to hugetlb.hDavid Gibson2-3/+4
The optional hugepage callback, hugetlb_free_pgd_range() is presently implemented non-trivially only on ia64 (but I plan to add one for powerpc shortly). It has its own prototype for the function in asm-ia64/pgtable.h. However, since the function is called from generic code, it make sense for its prototype to be in the generic hugetlb.h header file, as the protypes other arch callbacks already are (prepare_hugepage_range(), set_huge_pte_at(), etc.). This patch makes it so. Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Fix hugepage logic in free_pgtables() harderDavid Gibson1-1/+1
Turns out the hugepage logic in free_pgtables() was doubly broken. The loop coalescing multiple normal page VMAs into one call to free_pgd_range() had an off by one error, which could mean it would coalesce one hugepage VMA into the same bundle (checking 'vma' not 'next' in the loop). I transferred this bug into the new is_vm_hugetlb_page() based version. Here's the fix. This one didn't bite on powerpc previously for the same reason the is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range() is identical to free_pgd_range(). It didn't bite on ia64 because the hugepage region is distant enough from any other region that the separated PMD_SIZE distance test would always prevent coalescing the two together. No libhugetlbfs testsuite regressions (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Fix hugepage logic in free_pgtables()David Gibson4-12/+8
free_pgtables() has special logic to call hugetlb_free_pgd_range() instead of the normal free_pgd_range() on hugepage VMAs. However, the test it uses to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized range at the start of the vma. is_hugepage_only_range() will return true if the given range has any intersection with a hugepage address region, and in this case the given region need not be hugepage aligned. So, for example, this test can return true if called on, say, a 4k VMA immediately preceding a (nicely aligned) hugepage VMA. At present we get away with this because the powerpc version of hugetlb_free_pgd_range() is just a call to free_pgd_range(). On ia64 (the only other arch with a non-trivial is_hugepage_only_range()) we get away with it for a different reason; the hugepage area is not contiguous with the rest of the user address space, and VMAs are not permitted in between, so the test can't return a false positive there. Nonetheless this should be fixed. We do that in the patch below by replacing the is_hugepage_only_range() test with an explicit test of the VMA using is_vm_hugetlb_page(). This in turn changes behaviour for platforms where is_hugepage_only_range() returns false always (everything except powerpc and ia64). We address this by ensuring that hugetlb_free_pgd_range() is defined to be identical to free_pgd_range() (instead of a no-op) on everything except ia64. Even so, it will prevent some otherwise possible coalescing of calls down to free_pgd_range(). Since this only happens for hugepage VMAs, removing this small optimization seems unlikely to cause any trouble. This patch causes no regressions on the libhugetlbfs testsuite - ppc64 POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Make {alloc,free}_huge_page() localDavid Gibson2-16/+13
Originally, mm/hugetlb.c just handled the hugepage physical allocation path and its {alloc,free}_huge_page() functions were used from the arch specific hugepage code. These days those functions are only used with mm/hugetlb.c itself. Therefore, this patch makes them static and removes their prototypes from hugetlb.h. This requires a small rearrangement of code in mm/hugetlb.c to avoid a forward declaration. This patch causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Strict page reservation for hugepage inodesDavid Gibson3-64/+154
These days, hugepages are demand-allocated at first fault time. There's a somewhat dubious (and racy) heuristic when making a new mmap() to check if there are enough available hugepages to fully satisfy that mapping. A particularly obvious case where the heuristic breaks down is where a process maps its hugepages not as a single chunk, but as a bunch of individually mmap()ed (or shmat()ed) blocks without touching and instantiating the pages in between allocations. In this case the size of each block is compared against the total number of available hugepages. It's thus easy for the process to become overcommitted, because each block mapping will succeed, although the total number of hugepages required by all blocks exceeds the number available. In particular, this defeats such a program which will detect a mapping failure and adjust its hugepage usage downward accordingly. The patch below addresses this problem, by strictly reserving a number of physical hugepages for hugepage inodes which have been mapped, but not instatiated. MAP_SHARED mappings are thus "safe" - they will fail on mmap(), not later with an OOM SIGKILL. MAP_PRIVATE mappings can still trigger an OOM. (Actually SHARED mappings can technically still OOM, but only if the sysadmin explicitly reduces the hugepage pool between mapping and instantiation) This patch appears to address the problem at hand - it allows DB2 to start correctly, for instance, which previously suffered the failure described above. This patch causes no regressions on the libhugetblfs testsuite, and makes a test (designed to catch this problem) pass which previously failed (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: serialize hugepage allocation and instantiationDavid Gibson1-7/+18
Currently, no lock or mutex is held between allocating a hugepage and inserting it into the pagetables / page cache. When we do go to insert the page into pagetables or page cache, we recheck and may free the newly allocated hugepage. However, since the number of hugepages in the system is strictly limited, and it's usualy to want to use all of them, this can still lead to spurious allocation failures. For example, suppose two processes are both mapping (MAP_SHARED) the same hugepage file, large enough to consume the entire available hugepage pool. If they race instantiating the last page in the mapping, they will both attempt to allocate the last available hugepage. One will fail, of course, returning OOM from the fault and thus causing the process to be killed, despite the fact that the entire mapping can, in fact, be instantiated. The patch fixes this race by the simple method of adding a (sleeping) mutex to serialize the hugepage fault path between allocation and insertion into pagetables and/or page cache. It would be possible to avoid the serialization by catching the allocation failures, waiting on some condition, then rechecking to see if someone else has instantiated the page for us. Given the likely frequency of hugepage instantiations, it seems very doubtful it's worth the extra complexity. This patch causes no regression on the libhugetlbfs testsuite, and one test, which can trigger this race now passes where it previously failed. Actually, the test still sometimes fails, though less often and only as a shmat() failure, rather processes getting OOM killed by the VM. The dodgy heuristic tests in fs/hugetlbfs/inode.c for whether there's enough hugepage space aren't protected by the new mutex, and would be ugly to do so, so there's still a race there. Another patch to replace those tests with something saner for this reason as well as others coming... Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Small fixes to hugepage clear/copy pathDavid Gibson1-7/+26
Move the loops used in mm/hugetlb.c to clear and copy hugepages to their own functions for clarity. As we do so, we add some checks of need_resched - we are, after all copying megabytes of memory here. We also add might_sleep() accordingly. We generally dropped locks around the clear and copy, already but not everyone has PREEMPT enabled, so we should still be checking explicitly. For this to work, we need to remove the clear_huge_page() from alloc_huge_page(), which is called with the page_table_lock held in the COW path. We move the clear_huge_page() to just after the alloc_huge_page() in the hugepage no-page path. In the COW path, the new page is about to be copied over, so clearing it was just a waste of time anyway. So as a side effect we also fix the fact that we held the page_table_lock for far too long in this path by calling alloc_huge_page() under it. It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] Enable mprotect on huge pagesZhang, Yanmin6-13/+43
2.6.16-rc3 uses hugetlb on-demand paging, but it doesn_t support hugetlb mprotect. From: David Gibson <david@gibson.dropbear.id.au> Remove a test from the mprotect() path which checks that the mprotect()ed range on a hugepage VMA is hugepage aligned (yes, really, the sense of is_aligned_hugepage_range() is the opposite of what you'd guess :-/). In fact, we don't need this test. If the given addresses match the beginning/end of a hugepage VMA they must already be suitably aligned. If they don't, then mprotect_fixup() will attempt to split the VMA. The very first test in split_vma() will check for a badly aligned address on a hugepage VMA and return -EINVAL if necessary. From: "Chen, Kenneth W" <kenneth.w.chen@intel.com> On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE. The identify of hugetlb pte is lost when changing page protection via mprotect. A page fault occurs later will trigger a bug check in huge_pte_alloc(). The fix is to always make new pte a hugetlb pte and also to clean up legacy code where _PAGE_PRESENT is forced on in the pre-faulting day. Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] readahead: fix initial window size calculationSteven Pratt1-3/+3
The current current get_init_ra_size is not optimal across different IO sizes and max_readahead values. Here is a quick summary of sizes computed under current design and under the attached patch. All of these assume 1st IO at offset 0, or 1st detected sequential IO. 32k max, 4k request old new ----------------- 8k 8k 16k 16k 32k 32k 128k max, 4k request old new ----------------- 32k 16k 64k 32k 128k 64k 128k 128k 128k max, 32k request old new ----------------- 32k 64k <----- 64k 128k 128k 128k 512k max, 4k request old new ----------------- 4k 32k <---- 16k 64k 64k 128k 128k 256k 512k 512k Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] readahead: ->prev_page can overrun the ahead windowOleg Nesterov1-6/+20
If get_next_ra_size() does not grow fast enough, ->prev_page can overrun the ahead window. This means the caller will read the pages from ->ahead_start + ->ahead_size to ->prev_page synchronously. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>