ex442
------------------------------------
- [ ] Use utilities such as vmstat, iostat, mpstat, sar, gnome-system-monitor, top, powertop and others to analyze and report system and application behavior
- [ ] Configure systems to provide performance metrics using utilities such as Performance Co-Pilot (PCP)
- [ ] Use the Pluggable Authentication Modules (PAM) mechanism to implement restrictions on critical system resources
- [ ] Use /proc/sys, sysctl and /sys to examine, modify and set kernel run-time parameters
- [ ] Use utilities such as dmesg, dmidecode, x86info, sosreport etc. to profile system hardware configurations
- [ ] Analyze system and application behavior using tools such as ps, strace, top and Valgrind
- [ ] Configure systems to run SystemTap scripts
- [ ] Alter process priorities of both new and existing processes
- [x] Configure systems to support alternate page sizes for applications that use large amounts of memory
- [ ] Given multiple versions of applications that perform the same or similar tasks, choose which version of the application to run on a system based on its observed performance characteristics
- [ ] Configure disk subsystems for optimal performance using mechanisms such as swap partition placement, I/O scheduling algorithm selection, file system layout and others
- [ ] Configure kernel behavior by altering module parameters
- [x] Calculate network buffer sizes based on known quantities such as bandwidth and round-trip time and set system buffer sizes based on those calculations
- [ ] Select and configure tuned profiles
- [ ] Manage system resource usage using control groups

rhel6 objectives:
- [ ] Use utilities such as vmstat, iostat, mpstat, sar, gnome-system-monitor, top, powertop, and others to analyze and report system and application behavior
- [ ] Configure systems to provide performance metrics using utilities such as the round robin database tool (RRDtool)
- [ ] Use the pluggable authentication modules (PAM) mechanism to implement restrictions on critical system resources
- [ ] Use /proc/sys, sysctl, and /sys to examine, modify, and set kernel run-time parameters
- [ ] Use utilities such as dmesg, dmidecode, x86info, sosreport, and more to profile system hardware configurations
- [ ] Analyze system and application behavior using tools such as ps, strace, top, and Valgrind
- [ ] Configure systems to run SystemTap scripts
- [x] Alter process priorities of both new and existing processes
- [ ] Configure systems to support alternate page sizes for applications that use large amounts of memory
- [ ] Given multiple versions of applications that perform the same or similar tasks, choose which version of the application to run on a system based on its observed performance characteristics
- [ ] Configure disk subsystems for optimal performance using mechanisms such as swap partition placement, I/O scheduling algorithm selection, file system layout, and others
- [x] Configure kernel behavior by altering module parameters
- [ ] Calculate network buffer sizes based on known quantities such as bandwidth and round-trip time and set system buffer sizes based on those calculations
- [ ] Select and configure tuned profiles
- [ ] Manage system resource usage using control groups

Metrics
------------------------------------
1 byte = 8 bits

Decimal:
1 kilobyte (kB) = 1000 bytes
1 megabyte (MB) = 1,000,000 bytes
1 gigabyte (GB) = 1,000,000,000 bytes (one billion)

Binary:
1 kibibyte (KiB) = 1024 bytes
1 mebibyte (MiB) = 1024*1024 bytes
1 gibibyte (GiB) = 1024*1024*1024 bytes

Every application reports sizes a little differently; for instance,
see output from `man lvs`:

    --units hHbBsSkKmMgGtTpPeE
        All sizes are output in these units: (h)uman-readable, (b)ytes, (s)ectors, (k)ilobytes,
        (m)egabytes, (g)igabytes, (t)erabytes, (p)etabytes, (e)xabytes. Capitalise to use
        multiples of 1000 (S.I.) instead of 1024. Can also specify custom units e.g. --units 3M

Alter process priorities of both new and existing processes
------------------------------------
Linux priorities run from 0 - 139: 0 - 39 are termed conventional processes, and 40 - 139 are termed realtime processes. The higher the priority number, the better.

The main difference between conventional and realtime processes is their scheduling method. Conventional processes are scheduled by "niceness": the CPU scheduler uses the niceness value to determine how much CPU time each process receives. A "nicer" process consumes LESS CPU time (and therefore plays nicer with other processes). The default priority of a conventional process is *20* (this can, however, be dynamically adjusted by the scheduler if required).

(Please note the difference between pri and priority: pri shows the global priority on the global 0-139 scale, while priority shows the priority within the conventional process space, 0-39, where 0 is apparently the best!)
6# ps -eo pid,pri,priority,nice,command | awk 'NR == 1 || /[h]ttpd|[n]amed/'
  PID PRI PRI  NI COMMAND
 2319  19  20   0 /usr/sbin/httpd
16174  19  20   0 /usr/sbin/named -u named

Let's decrease niceness to -10 and then -20 and see what happens:

6# renice -n -10 16174
16174: old priority 0, new priority -10
6# ps -eo pid,pri,priority,nice,command | awk 'NR == 1 || /[h]ttpd|[n]amed/'
  PID PRI PRI  NI COMMAND
 2319  19  20   0 /usr/sbin/httpd
16174  29  10 -10 /usr/sbin/named -u named
6# renice -n -20 16174
16174: old priority -20, new priority -20
6# ps -eo pid,sched,pri,priority,nice,command | awk 'NR == 1 || /[1]6174/'
  PID SCH PRI PRI  NI COMMAND
16174   0  39   0 -20 /usr/sbin/named -u named

Increasing our niceness to the max:

6# renice -n 20 16174
16174: old priority -20, new priority 19
6# ps -eo pid,sched,pri,priority,nice,command | awk 'NR == 1 || /[1]6174/'
  PID SCH PRI PRI  NI COMMAND
16174   0   0  39  19 /usr/sbin/named -u named

Priority here is absolute rock bottom, i.e. 0...

Realtime processes are used for processes that should be granted more CPU time, and there are a few algorithms for juggling usage between them all. Niceness does not apply. rtprio ranges from 1 - 99 (the higher the better) and this makes up the range 40 - 139 on the global Linux kernel priority scale.

6# chrt -m
SCHED_OTHER min/max priority : 0/0  -- This is our standard conventional mode (niceness etc.)
SCHED_FIFO min/max priority  : 1/99 -- Once a process starts to be scheduled, the kernel will NOT pre-empt or timeslice it until it stops requesting CPU time OR a realtime process with a higher priority value requests CPU time.
SCHED_RR min/max priority    : 1/99 -- Round Robin between processes of equal priority.
SCHED_BATCH min/max priority : 0/0  -- Does not preempt nearly as often as regular tasks would, thereby allowing tasks to run longer and make better use of caches, but at the cost of interactivity. This is well suited for batch jobs.
SCHED_IDLE min/max priority  : 0/0  -- Even weaker than nice 19 (don't think this gets used much); per man sched_setscheduler, 'for running very low priority background jobs'.

From man sched_setscheduler:

Processes scheduled under one of the real-time policies (SCHED_FIFO, SCHED_RR) have a sched_priority value in the range 1 (low) to 99 (high). (As the numbers imply, real-time processes always have higher priority than normal processes.) Conceptually, the scheduler maintains a list of runnable processes for each possible sched_priority value. In order to determine which process runs next, the scheduler looks for the non-empty list with the highest static priority and selects the process at the head of this list.

A process's scheduling policy determines where it will be inserted into the list of processes with equal static priority and how it will move inside this list. All scheduling is preemptive: if a process with a higher static priority becomes ready to run, the currently running process will be preempted and returned to the wait list for its static priority level. The scheduling policy only determines the ordering within the list of runnable processes with equal static priority.

SCHED_FIFO: First In-First Out scheduling

SCHED_FIFO can only be used with static priorities higher than 0, which means that when a SCHED_FIFO process becomes runnable, it will always immediately preempt any currently running SCHED_OTHER, SCHED_BATCH, or SCHED_IDLE process. SCHED_FIFO is a simple scheduling algorithm without time slicing. For processes scheduled under the SCHED_FIFO policy, the following rules apply:

* A SCHED_FIFO process that has been preempted by another process of higher priority will stay at the head of the list for its priority and will resume execution as soon as all processes of higher priority are blocked again.
* When a SCHED_FIFO process becomes runnable, it will be inserted at the end of the list for its priority.
* A call to sched_setscheduler() or sched_setparam(2) will put the SCHED_FIFO (or SCHED_RR) process identified by pid at the start of the list if it was runnable. As a consequence, it may preempt the currently running process if it has the same priority. (POSIX.1-2001 specifies that the process should go to the end of the list.)
* A process calling sched_yield(2) will be put at the end of the list.

No other events will move a process scheduled under the SCHED_FIFO policy in the wait list of runnable processes with equal static priority. A SCHED_FIFO process runs until either it is blocked by an I/O request, it is preempted by a higher priority process, or it calls sched_yield(2).

SCHED_RR: Round Robin scheduling

SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described above for SCHED_FIFO also applies to SCHED_RR, except that each process is only allowed to run for a maximum time quantum. If a SCHED_RR process has been running for a time period equal to or longer than the time quantum, it will be put at the end of the list for its priority. A SCHED_RR process that has been preempted by a higher priority process and subsequently resumes execution as a running process will complete the unexpired portion of its round robin time quantum. The length of the time quantum can be retrieved using sched_rr_get_interval(2).

SCHED_OTHER: Default Linux time-sharing scheduling

SCHED_OTHER can only be used at static priority 0. SCHED_OTHER is the standard Linux time-sharing scheduler that is intended for all processes that do not require the special realtime mechanisms. The process to run is chosen from the static priority 0 list based on a dynamic priority that is determined only inside this list. The dynamic priority is based on the nice value (set by nice(2) or setpriority(2)) and increased for each time quantum the process is ready to run, but denied to run by the scheduler. This ensures fair progress among all SCHED_OTHER processes.
SCHED_BATCH: Scheduling batch processes (Since Linux 2.6.16.)

SCHED_BATCH can only be used at static priority 0. This policy is similar to SCHED_OTHER in that it schedules the process according to its dynamic priority (based on the nice value). The difference is that this policy will cause the scheduler to always assume that the process is CPU-intensive. Consequently, the scheduler will apply a small scheduling penalty with respect to wakeup behaviour, so that this process is mildly disfavored in scheduling decisions. This policy is useful for workloads that are non-interactive, but do not want to lower their nice value, and for workloads that want a deterministic scheduling policy without interactivity causing extra preemptions (between the workload's tasks).

SCHED_IDLE: Scheduling very low priority jobs (Since Linux 2.6.23.)

SCHED_IDLE can only be used at static priority 0; the process nice value has no influence for this policy. This policy is intended for running jobs at extremely low priority (lower even than a +19 nice value with the SCHED_OTHER or SCHED_BATCH policies).

chrt is our tool here. Set one of the httpd children to RR 35 (see how pri changes!):

6# ps -eo pid,policy,rtprio,pri,nice,command | awk 'NR == 1 || /[2]0469|[2]0471/'
  PID POL RTPRIO PRI NI COMMAND
20469 TS       -  19  0 /usr/sbin/httpd
20471 TS       -  19  0 /usr/sbin/httpd
6# chrt -p -r 35 20471
6# ps -eo pid,policy,rtprio,pri,nice,command | awk 'NR == 1 || /[2]0469|[2]0471/'
  PID POL RTPRIO PRI NI COMMAND
20469 TS       -  19  0 /usr/sbin/httpd
20471 RR      35  75  - /usr/sbin/httpd   <-- see how the priority has increased!
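renice and `chrt -p` above alter existing processes; the objective also covers new ones. A minimal sketch (the niceness and rtprio values here are illustrative, not from the transcripts above):

```shell
#!/bin/sh
# Start a NEW process at an adjusted priority, rather than renicing it later.

# Conventional: run `nice` itself at niceness 10; it prints the niceness it
# inherited, so from a default niceness of 0 this prints "10".
nice -n 10 nice

# Realtime: chrt without -p sets the policy for a brand-new process
# (requires root / CAP_SYS_NICE); e.g. start under SCHED_RR at rtprio 35:
# chrt -r 35 /usr/sbin/httpd
```

Unprivileged users can only lower priority (positive niceness); raising it, or setting a realtime policy, needs root.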
Tunables:

6# cat /proc/sys/kernel/sched_rr_timeslice_ms
100   <-- the timeslice period for RR timeslicing; how long an RR process can run before we round-robin to another process of equal priority

There is some protection from runaway processes provided via:

/proc/sys/kernel/sched_rt_period_us:  The scheduling period that is equivalent to 100% CPU bandwidth
/proc/sys/kernel/sched_rt_runtime_us: A global limit on how much time realtime scheduling may use.

The default values are sched_rt_period_us = 1000000 (1s) and sched_rt_runtime_us = 950000 (0.95s). This leaves 0.05s per period to be used by SCHED_OTHER (non-RT) tasks. These defaults were chosen so that a runaway realtime task will not lock up the machine but will leave a little time to recover it. Setting runtime to -1 restores the old (unlimited) behaviour.

Controlling CPU Binding / Selection:
------------------------------------
You can remove CPUs from the scheduler via this boot flag:

isolcpus=2,5-7

This prevents the scheduler from scheduling any user-space threads on these CPUs. Once a CPU is isolated, you must manually assign processes to the isolated CPU, either with the CPU affinity system calls or the numactl command.

Use taskset to bind processes to specific CPUs. Locking down nagios to one core:

[root@baasreporter kernel]# ps aux | grep [3]183
nagios 3183 9.0 4.3 478196 44996 ? SNsl Jun15 885:12 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
[root@baasreporter kernel]# taskset -c -p 3183
pid 3183's current affinity list: 0-3
[root@baasreporter kernel]# taskset -c -p 3 3183
pid 3183's current affinity list: 0-3
pid 3183's new affinity list: 3
[root@baasreporter kernel]# taskset -c -p 3183
pid 3183's current affinity list: 3

BONUS: CPU/NUMA - CPU Architecture
------------------------------------
Symmetric Multi-Processor (SMP) topology==
SMP topology allows all processors to access memory in the same amount of time.
However, because shared and equal memory access inherently forces serialized memory accesses from all the CPUs, SMP system scaling constraints are now generally viewed as unacceptable. For this reason, practically all modern server systems are NUMA machines.

SMP systems have centralized shared memory called Main Memory (MM) operating under a single operating system with two or more homogeneous processors. Usually each processor has an associated private high-speed memory known as cache memory (or cache) to speed up MM data access and to reduce system bus traffic. Processors may be interconnected using buses, crossbar switches or on-chip mesh networks. The bottleneck in the scalability of SMP using buses or crossbar switches is the bandwidth and power consumption of the interconnect among the various processors, the memory, and the disk arrays.

Non-Uniform Memory Access (NUMA) topology==
NUMA topology was developed more recently than SMP topology. In a NUMA system, multiple processors are physically grouped on a socket. Each socket has a dedicated area of memory, and processors that have local access to that memory are referred to collectively as a node. Processors on the same node have high speed access to that node's memory bank, and slower access to memory banks not on their node; therefore, there is a performance penalty for accessing non-local memory.

Given this performance penalty, performance sensitive applications on a system with NUMA topology should access memory that is on the same node as the processor executing the application, and should avoid accessing remote memory wherever possible. When tuning application performance on a system with NUMA topology, it is therefore important to consider where the application is being executed, and which memory bank is closest to the point of execution.

In a system with NUMA topology, the /sys file system contains information about how processors, memory, and peripheral devices are connected.
The /sys/devices/system/cpu directory contains details about how processors in the system are connected to each other. The /sys/devices/system/node directory contains information about NUMA nodes in the system, and the relative distances between those nodes.

The cells of the NUMA system are connected together with some sort of system interconnect, e.g. a crossbar or point-to-point link. Both of these types of interconnects can be aggregated to create NUMA platforms with cells at multiple distances from other cells.

# ls -al /sys/devices/system/node/   <-- list the NUMA nodes that the kernel knows about
drwxr-xr-x 2 root root 0 Jun 22 10:33 node0
drwxr-xr-x 2 root root 0 Jun 22 10:35 node1
drwxr-xr-x 2 root root 0 Jun 22 10:35 node2
drwxr-xr-x 2 root root 0 Jun 22 10:35 node3
# cat /sys/devices/system/node/node0/cpulist   <-- CPUs in that NUMA node
0-7,32-39
# cat /sys/devices/system/node/node0/distance
10 20 20 20
# for file in /sys/devices/system/node/*; do echo -n "$file: "; cat ${file}/distance; done
/sys/devices/system/node/node0: 10 20 20 20   <-- shows the "distance" between nodes,
as NUMA nodes can be chained along.
/sys/devices/system/node/node1: 20 10 20 20
/sys/devices/system/node/node2: 20 20 10 20
/sys/devices/system/node/node3: 20 20 20 10

$ cat /sys/devices/system/node/node1/numastat | grep numa_miss
numa_miss 56442251
$ cat /sys/devices/system/node/node1/numastat | grep numa_miss
numa_miss 56453178

man numastat (from numactl):
numa_hit is the number of allocations where an allocation was intended for that node and succeeded there.
numa_miss shows how often an allocation was intended for this node, but ended up on another node due to low memory.

numad can be run as a daemon to dynamically balance processes across NUMA nodes; numactl starts processes in a NUMA-aware fashion.

Configure systems to support alternate page sizes for applications that use large amounts of memory
------------------------------------
By default, memory is allocated in 4k pages. The page is the smallest addressable unit of memory. The 'page table' is used by the OS to hold the mappings between process-relative virtual memory addresses and actual physical memory addresses:

6# grep PageTabl /proc/meminfo
PageTables: 9176 kB   <-- here we have used 9176 kB of memory just to map virtual memory addresses to physical ones.

The TLB (translation lookaside buffer) is a cache on the CPU which holds a subset -- the most recently used -- of these mappings. When a virtual address needs to be translated into a physical address, the TLB is always searched first:
-- If a match is found (TLB hit), the physical address is returned and process execution continues.
-- If there is no match (TLB miss), the handler performs a 'page walk', searching the page table to check whether a mapping already exists.
-- If a mapping already exists, it is written back to the TLB AND the faulting instruction is restarted.

To reference a 4MB space of RAM, this is broken down into 1024 4K pages. This consumes valuable space in the TLB and increases the page table size in memory.
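The arithmetic above can be checked with a quick shell calculation (the 4 MiB region size is the one used in the example):

```shell
#!/bin/sh
# How many page-table/TLB entries are needed to map a 4 MiB region?
region=$((4 * 1024 * 1024))              # 4 MiB region, as in the text
echo $(( region / (4 * 1024) ))          # with 4 KiB pages  -> 1024 entries
echo $(( region / (2 * 1024 * 1024) ))   # with 2 MiB pages  -> 2 entries
```

One huge page replaces 512 small pages, which is why huge pages cut both TLB pressure and page-table overhead.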
Huge pages allow you to increase the page size (to 2MB). In this instance, 4MB of space in a HugePage zone would take only 2 huge pages (and 2 entries in the TLB/page table), etc.

There are a few ways to enable huge pages:

# Create group:
% groupadd my-hugetlbfs
% getent group my-hugetlbfs
my-hugetlbfs:x:2021:
% adduser franklin my-hugetlbfs

sysctl:
# Allocate 256*2MiB for HugePageTables (YMMV)
vm.nr_hugepages = 256
# Members of group my-hugetlbfs(2021) can allocate "huge" shared memory segments
vm.hugetlb_shm_group = 2021

# Filesystem mount
mkdir /hugepages
hugetlbfs /hugepages hugetlbfs mode=1770,gid=2021 0 0

MySQL example:

[root@baasreporter vm]# head -n2 /etc/my.cnf
[mysqld]
large-pages
[root@baasreporter vm]# tail -n2 /etc/sysctl.conf
vm.hugetlb_shm_group = 27
vm.nr_hugepages = 32
[root@baasreporter vm]# id mysql
uid=27(mysql) gid=27(mysql) groups=27(mysql)
[root@baasreporter vm]# grep mysql /etc/security/limits.conf
mysql soft memlock unlimited
mysql hard memlock unlimited

Result:

[root@baasreporter vm]# grep -i huge /proc/meminfo
HugePages_Total:    32   <-- Total huge pages
HugePages_Free:     30   <-- Free huge pages (2 used)
HugePages_Rsvd:      7   <-- Pages which have been 'reserved': a commitment to allocate from the pool has been made, but no allocation has yet been made. Reserved huge pages guarantee that an application will be able to allocate a huge page from the pool at fault time.
Hugepagesize:     2048 kB

If the mmap syscall is used, then mount a special pseudo-filesystem:

[root@baasreporter tmp]# mount -t hugetlbfs hugetlbfs /tmp/huge/ -o mode=1770,gid=27

Note: While read system calls are supported on files that reside on hugetlb file systems, write system calls are not. Regular chown, chgrp, and chmod commands (with the right permissions) can be used to change file attributes on hugetlbfs. Also, it is important to note that no such mount command is required if applications are going to use only shmat/shmget system calls or mmap with MAP_HUGETLB.
For an example of how to use mmap with MAP_HUGETLB see map_hugetlb below. Users who wish to use hugetlb memory via a shared memory segment should be members of a supplementary group, and the system admin needs to configure that gid into /proc/sys/vm/hugetlb_shm_group.

If the mmap syscall is used, you need either to mount a dedicated FS:
mount -t hugetlbfs -o size=256M none /mnt/
and mmap(/mnt/h, MAP_PRIVATE), or, without mounting hugetlbfs:
mmap(/mnt/h, MAP_HUGETLB | MAP_PRIVATE)

If shmget is used, nothing more is needed from a system administration point of view (MAP_SHARED to be given in the flags parameter to shmget).

SYSV Shared Memory:

/proc/sys/kernel/shmall: This file contains the system-wide limit on the total number of pages of System V shared memory.
/proc/sys/kernel/shmmax: This file can be used to query and set the run time limit on the maximum (System V IPC) shared memory segment size that can be created. Shared memory segments up to 1GB are now supported in the kernel. This value defaults to SHMMAX.
/proc/sys/kernel/shmmni: (available in Linux 2.4 and onwards) This file specifies the system-wide maximum number of System V shared memory segments that can be created.

SHMMAX is the maximum size (in bytes) any particular shared memory segment can be. SHMALL is the system-wide limit on all shared memory segments, specified in multiples of page size (getconf PAGE_SIZE).

Transparent Huge Pages
------------------------------------
Transparent Huge Pages (THP) are enabled by default in RHEL 6 for all applications. The kernel attempts to allocate hugepages whenever possible, and any Linux process will receive 2MB pages if the mmap region is 2MB naturally aligned. The main kernel address space itself is mapped with hugepages, reducing TLB pressure from kernel code. For general information on hugepages, see: What are Huge Pages and what are the advantages of using them? The kernel will always attempt to satisfy a memory allocation using hugepages.
If no hugepages are available (due to the unavailability of physically contiguous memory, for example) the kernel will fall back to regular 4KB pages. THP are also swappable (unlike hugetlbfs). This is achieved by breaking the huge page into smaller 4KB pages, which are then swapped out normally.

But to use hugepages effectively, the kernel must find physically contiguous areas of memory big enough to satisfy the request, and also properly aligned. For this, a khugepaged kernel thread has been added. This thread occasionally attempts to substitute smaller pages currently in use with a hugepage allocation, thus maximizing THP usage.

In userland, no modifications to the applications are necessary (hence "transparent"), but there are ways to optimize its use. For applications that want to use hugepages, posix_memalign() can help ensure that large allocations are aligned to huge page (2MB) boundaries.

# grep -i anonhuge /proc/meminfo
AnonHugePages: 1486848 kB
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] never
# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] never

khugepaged is automatically started when transparent_hugepage/enabled is set to "always" or "madvise", and automatically shut down if it is set to "never". The redhat_transparent_hugepage/defrag parameter takes the same values and controls whether the kernel should make aggressive use of memory compaction to make more hugepages available.

Identify processes using huge pages:

grep -e AnonHugePages /proc/*/smaps | awk '{ if($2>4) print $0} ' | awk -F "/" '{print $0; system("ps -fp " $3)} '

To disable THP, pass transparent_hugepage=never as a kernel parameter.

Useful `ps` switches:
------------------------------------
ps -eo stat,pid,command : get a greppable list (^D or ^R) of all processes on the run queue or in uninterruptible sleep

# ps -eo stat,class,rtprio,pri,nice,command | awk 'NR == 1 || /nagios|httpd/'
STAT CLS RTPRIO PRI  NI COMMAND
S    RR      10  50   - /usr/sbin/httpd
SN   TS       -   5  15 /usr/sbin/nagios -d /etc/nagios/nagios.cfg

class  -> shows the scheduling class, i.e.:
          TS  SCHED_OTHER
          FF  SCHED_FIFO
          RR  SCHED_RR
rtprio -> shows the real time priority
pri    -> shows the global priority (between 0 - 139)

// "pri" (0..139)  // "priority" (-100..39)

I prefer 'pri' as this goes 0..139, the higher the better; 0 - 39 are conventional processes (TS), and 40 - 139 are realtime. 'priority' shows this inverted:
lower priority is better: -100 maps to rtprio 99, and 39 maps to nice +19. ('priority' is the kernel's p->prio; pri = 39 - priority.)

Bonus: Compiling a Kernel:
------------------------------------
make mrproper        --> cleans the build tree; should run first to revert the build state to original
make menuconfig      --> interactive menu to configure kernel options (can take a long time)
make config          --> one-by-one options
make defconfig       --> default config
make all{yes/no/mod} --> say yes/no/module to everything
make oldconfig       --> use supplied .config file
make
make modules
make modules_install --> installs into /lib/modules/{KERNEL VERSION}/
make install

Exploring OOM
------------------------------------
Overcommitting virtual memory:

/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit

In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4 any non-zero value implies mode 1. In mode 2 (available since Linux 2.6), the total virtual address space on the system is limited to (SS + RAM*(r/100)), where SS is the size of the swap space, RAM is the size of the physical memory, and r is the contents of the file /proc/sys/vm/overcommit_ratio.

/proc/sys/vm/overcommit_ratio
See the description of /proc/sys/vm/overcommit_memory.

Turning off overcommitting:

[root@baasreporter vm]# cat overcommit_memory
2
[root@baasreporter vm]# cat overcommit_ratio
0
[root@baasreporter vm]# grep -i ^commit /proc/meminfo
CommitLimit:  4192956 kB
Committed_AS:  808632 kB

(In this instance, I can only allocate 4GB of virtual memory space.)

Taking this a silly step further: I removed the 4GB swap, and set the ratio to 45% ..
[root@baasreporter build]# cat /proc/sys/vm/overcommit_memory
2
[root@baasreporter build]# cat /proc/sys/vm/overcommit_ratio
45
[root@baasreporter build]# grep -i commit /proc/meminfo
CommitLimit:   461668 kB
Committed_AS:  421676 kB

Of my 1000MB of RAM, I can only allocate 450MB of virtual memory, and I only have 40MB of virtual memory space free...

Malloc program for testing virtual memory space:

#include <stdio.h>
#include <stdlib.h>
#define MEGABYTE 1024*1024
int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;
    while (1) {
        myblock = (void *) malloc(MEGABYTE);
        if (!myblock) break;
        printf("Currently allocating %d MB\n", ++count);
    }
    exit(0);
}

Testing resident size (memset touches every page, so the allocation actually becomes resident):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MEGABYTE 1024*1024
int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;
    while (1) {
        myblock = (void *) malloc(MEGABYTE);
        if (!myblock) break;
        memset(myblock, 1, MEGABYTE);
        printf("Currently allocating %d MB\n", ++count);
    }
    exit(0);
}

Configure kernel behavior by altering module parameters
------------------------------------
modprobe -l   --> list available modules
lsmod         --> list loaded modules
modprobe (-r) --> load (/remove) a module
modinfo       --> get info on a module

Disable auto-loading of a module:
echo "blacklist moduleName" >> /etc/modprobe.d/blacklist.conf
OR (older method) echo "alias moduleName off" >> /etc/modprobe.conf

Modules are located in:
# ls /lib/modules/$(uname -r)/kernel

As an example, if you install the e1000 driver for two PRO/1000 adapters (eth0 and eth1) and want to set the speed and duplex to 10/full and 100/half, add the following to modprobe.conf:

# modinfo e1000 | grep ^parm: | grep "Speed\|Duplex"
parm: Speed:Speed setting (array of int)
parm: Duplex:Duplex setting (array of int)
# cat /etc/modprobe.conf
alias eth0 e1000
alias eth1 e1000
options e1000 Speed=10,100 Duplex=2,1

If you have the kernel source, you can potentially find more descriptions of module options, e.g.:
# file /usr/src/linux-2.6.19/drivers/net/e1000/e1000_param.c

The blacklist command does not stop a kernel
module being loaded manually or as a dependency. The "hard" way to stop a kernel module being loaded completely is via the following:

echo "install pppol2tp /bin/true" > /etc/modprobe.d/pppol2tp.conf

In order to configure the kernel to automatically load kernel modules, add a file under /etc/sysconfig/modules/X.modules, e.g. /etc/sysconfig/modules/bluez-uinput.modules:

#!/bin/sh
if [ ! -c /dev/input/uinput ] ; then
    exec /sbin/modprobe uinput >/dev/null 2>&1
fi

Bonus: Interrupt Tuning
------------------------------------
Hardware issues interrupts to request immediate CPU time from the processor. Different hardware devices operate on different IRQs (interrupt request lines):

/proc/interrupts shows the distribution of interrupts over all CPUs.

Use /proc/irq/IRQX/smp_affinity to tune the CPU affinity of the interrupt handlers:

CPU   VAL
CPU0  1
CPU1  2
CPU2  4
CPU3  8

Add the values together; so to bind to CPU0 and CPU1, echo 3 (1+2) into the smp_affinity file.

Calculate network buffer sizes based on known quantities such as bandwidth and round-trip time and set system buffer sizes based on those calculations
------------------------------------
In data communications, the bandwidth-delay product (BDP) refers to the product of a data link's capacity (in bits per second) and its round-trip delay time (in seconds). The result, an amount of data measured in bits (or bytes), is equivalent to the maximum amount of data on the network circuit at any given time, i.e. data that has been transmitted but not yet acknowledged.

BDP = Bandwidth (bits/s) * Delay (s)

E.g. a 2Mbps WAN link with 300ms delay:
BDP = 2,000,000 * 0.3 = 600,000 bits
600,000 / 8 = 75,000 bytes

The TCP window size should be a minimum of 75,000 bytes.
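The worked example above can be scripted; the link speed and delay are the figures from the example:

```shell
#!/bin/sh
# Bandwidth-delay product: bits in flight = bandwidth (bits/s) * RTT (s).
# Dividing by 8 converts to bytes, giving the minimum TCP window size.
bandwidth=2000000                    # 2 Mbps WAN link
rtt_ms=300                           # 300 ms round-trip delay
bdp_bits=$(( bandwidth * rtt_ms / 1000 ))
echo $(( bdp_bits / 8 ))             # prints 75000 (bytes)
```

Plugging in your own link speed and measured RTT (e.g. from ping) gives the target for the tcp_rmem/tcp_wmem maximums discussed below.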
Downsides:
-- More unacknowledged data in flight/transit at any one time; more sensitive to packet loss
-- Socket buffer sizes increase

To check whether network stack autotuning is enabled:

$ cat /proc/sys/net/ipv4/tcp_moderate_rcvbuf
1

# allow testing with buffers up to 64MB
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
--> These are the hard limits, in bytes, on send/receive socket buffer space. This is 64MB per socket, which is a lot.

# increase Linux autotuning TCP buffer limit to 32MB
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
--> These arrays are the minimum, default initial, and maximum buffer sizes (in bytes) that TCP autotuning can use.

Valgrind
------------------------------------
Check for memory leaks:
valgrind --tool=memcheck --leak-check=full -v /tmp/p/a.out

Profile cache usage for an application:
valgrind --tool=cachegrind -v /tmp/p/a.out