Tuesday, 29 October, 2013 22:09 Written by Brian B
The last blog post gave us some brief descriptions of the various scheduling classes in Solaris. I focused on the Time Sharing (TS) class since it is the default. Hopefully we can see that the TS class (and the IA class for that matter) makes its decisions based on how the threads are using the CPU. Are we CPU intensive or are we I/O intensive? It works well, but it doesn’t give the administrator fine-grained control over resource management.
To address this, the Fair Share Scheduler (FSS) was added to Solaris in the Solaris 9 release.
The primary benefit of FSS is that it lets the admin identify and dispatch processes and their threads based on their importance, as determined by the business and implemented by the administrator.
We saw the complexity of the TS dispatch table in the earlier post. Here we see the FSS table has no such complexity.
In FSS we use the concept of CPU shares. These shares allow the admin a fine level of granularity to carve up CPU resources. We are no longer limited to allocating an entire CPU. The admin designates the importance of a workload by assigning it a number of shares: the more important the workload, the larger the number of shares. Shares ARE NOT the same as CPU caps or CPU resource usage. Shares simply define the importance of a workload relative to other workloads, whereas CPU resource usage is an actual measurement of consumption. A workload may be given 50% of the shares yet at a point in time may be consuming only 5% of the CPU. I look at a CPU share as a minimum guarantee of CPU allocation, not as a cap on CPU consumption.
When we assign shares to a workload, we need to be aware of the shares that are already assigned. What matters is not the absolute number but the ratio of the shares assigned to one workload to the shares assigned to all of the workloads.
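To make the ratio idea concrete, here is a minimal sketch in Python. It is only an arithmetic model of what is described above, not anything FSS exposes; the workload names and share counts are made up for illustration, and it assumes every workload is busy and competing for CPU at the same moment.

```python
def cpu_entitlements(shares):
    """Map each workload to its fraction of the CPU when every workload is busy."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one workload needs a positive share count")
    return {name: count / total for name, count in shares.items()}


# Hypothetical workloads and share counts, chosen only for illustration.
shares = {"database": 50, "web": 30, "batch": 20}
entitlements = cpu_entitlements(shares)

# Adding a fourth workload dilutes everyone else, even though none of the
# existing share counts changed. Only the ratio matters.
diluted = cpu_entitlements({**shares, "reporting": 100})

for name in shares:
    print(f"{name}: {entitlements[name]:.0%} before, {diluted[name]:.0%} after")
```

The same 50 shares are worth half of the CPU in the first case and a quarter in the second, which is why we need to know what is already assigned before handing out more.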
I speak of FSS in a “Horizontal” and a “Vertical” aspect when I’m delivering for Oracle University. In Solaris 9 we were able to define projects in the /etc/project file. This is the vertical aspect. Solaris 10 introduced Non-Global Zones, and with them the Horizontal aspect. I assign shares horizontally across the various zones and then vertically within each zone in the /etc/project file if needed.
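One way to picture the horizontal and vertical aspects together is as two levels of the same ratio. The Python sketch below is a simplified model under stated assumptions: the zone names, project names, and share counts are hypothetical, every zone and project is assumed to be busy, and a project's slice is modeled as its ratio within the zone multiplied by the zone's ratio across the system.

```python
def fraction(shares, name):
    """Ratio of one entry's shares to the total shares in the same group."""
    return shares[name] / sum(shares.values())


# Horizontal: hypothetical shares assigned across zones.
zone_shares = {"global": 10, "webzone": 30, "dbzone": 60}

# Vertical: hypothetical project shares inside one zone.
project_shares = {"dbzone": {"oltp": 3, "reports": 1}}


def effective_fraction(zone, project):
    """Modeled CPU fraction for a project: zone ratio times project ratio."""
    return fraction(zone_shares, zone) * fraction(project_shares[zone], project)


print(f"dbzone gets {fraction(zone_shares, 'dbzone'):.0%} of the system")
print(f"oltp gets {effective_fraction('dbzone', 'oltp'):.0%} of the system")
```

Note that the share numbers inside a zone do not need to be on the same scale as the numbers across zones, since each level is only compared against its own siblings.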
By default the Non-Global zones use the default scheduling class. If the system is updated with a new default class, they will obtain the new setting when booted or rebooted. The recommended scheduler to use with Non-Global Zones is the FSS. The preferred way is to set the system default scheduler to FSS and all zones then inherit it.
To display information about the loaded scheduling classes, run priocntl -l
SYS (System Class)
TS (Time Sharing)
Configured TS User Priority Range: -60 through 60
SDC (System Duty-Cycle Class)
FX (Fixed priority)
Configured FX User Priority Range: 0 through 60
IA (Interactive)
Configured IA User Priority Range: -60 through 60
priocntl can be used to view or set scheduling parameters for a specified process.
To determine the global priority of a process run ps -ecl
root@solaris:~# ps -ecl    # The c option displays scheduler properties: the class (CLS) and the priority (PRI)
F S UID PID PPID CLS PRI ADDR SZ WCHAN TTY TIME CMD
1 T 0 0 0 SYS 96 ? 0 ? 0:01 sched
1 S 0 5 0 SDC 99 ? 0 ? ? 0:02 zpool-rp
1 S 0 6 0 SDC 99 ? 0 ? ? 0:00 kmem_tas
0 S 0 1 0 TS 59 ? 720 ? ? 0:00 init
1 S 0 2 0 SYS 98 ? 0 ? ? 0:00 pageout
1 S 0 3 0 SYS 60 ? 0 ? ? 0:01 fsflush
1 S 0 7 0 SYS 60 ? 0 ? ? 0:00 intrd
1 S 0 8 0 SYS 60 ? 0 ? ? 0:00 vmtasks
0 S 0 869 1 TS 59 ? 1461 ? ? 0:05 nscd
0 S 0 11 1 TS 59 ? 3949 ? ? 0:11 svc.star
0 S 0 13 1 TS 59 ? 5007 ? ? 0:32 svc.conf
0 S 0 164 1 TS 59 ? 822 ? ? 0:00 vbiosd
0 S 16 460 1 TS 59 ? 1323 ? ? 0:00 nwamd
To set the default scheduling class use dispadmin -d FSS and then dispadmin -d to ensure it changed. Then run dispadmin -l to see that it loaded.
root@solaris:~# dispadmin -d
dispadmin: Default scheduling class is not set
root@solaris:~# dispadmin -d FSS
root@solaris:~# dispadmin -d
FSS (Fair Share)
root@solaris:~# dispadmin -l
CONFIGURED CLASSES
==================
SYS (System Class)
TS (Time Sharing)
SDC (System Duty-Cycle Class)
FX (Fixed Priority)
IA (Interactive)
FSS (Fair Share)
Manually move all of the running processes into the FSS class and then verify with the ps command.
root@solaris:~# priocntl -s -c FSS -i all
root@solaris:~# ps -ef -o class,zone,fname | grep -v CLS | sort -k2 | more
FSS global auditd
FSS global automoun
FSS global automoun
FSS global bash
FSS global bash
FSS global bonobo-a
FSS global clock-ap
FSS global console-
FSS global cron
FSS global cupsd
FSS global dbus-dae
FSS global dbus-dae
FSS global dbus-lau
FSS global dbus-lau
Finally move init over to the FSS class so all children will inherit.
root@solaris:~# ps -ecf | grep init
root 1 0 TS 59 16:33:44 ? 0:00 /usr/sbin/init
root@solaris:~# priocntl -s -c FSS -i pid 1
root@solaris:~# ps -ecf | grep init
root 1 0 FSS 29 16:33:44 ? 0:00 /usr/sbin/init
With the FSS all set, we now assign shares to our Non-Global Zones
zonecfg -z
set cpu-shares=number of shares
exit zonecfg
To display CPU consumption run prstat -Z
Tuesday, 22 October, 2013 20:24 Written by Brian B
The Oracle Solaris kernel has a number of process scheduling classes available.
A brief review.
Timesharing (TS) This is the default class for processes and their associated kernel threads. Priorities in the class are dynamically adjusted based upon CPU utilization in an attempt to allocate processor resources evenly.
Interactive (IA) This is an enhanced version of TS. Some texts reference this in conjunction with TS, i.e. TS/IA. This class applies to the in-focus window in the GUI. It provides extra resources to processes associated with that specific window.
Fair Share Scheduler (FSS) This class is “share based” rather than priority based. The threads associated with this class are scheduled based on the associated shares assigned to them and the processor’s utilization.
Fixed-Priority (FX) Priorities for these threads are fixed regardless of how they interact with the CPU. They do not vary dynamically over the life of the thread.
System (SYS) Used to schedule kernel threads. These threads are bound, meaning that unlike the userland threads listed above they are not context switched off the CPU when their time quantum is consumed. They run until they are blocked or they complete.
Real-Time (RT) These threads are fixed-priority with a fixed time duration. This is one of the highest-priority classes, with only interrupts carrying a higher priority.
As for the priority ranges of the scheduling classes, the userland classes (TS/IA/FX/FSS) carry the lowest priorities, 0-59. The SYS class is next, ranging from 60-99. At the top (ignoring INT) is the RT class at 100-159.
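The ranges above can be written down as a small lookup. This is only a sketch in Python for reading the PRI column of ps -ecl output; the function name and the band labels are my own, and the ranges are simply the ones quoted above, with INT left out.

```python
def priority_band(global_priority):
    """Return the band a global priority falls into, per the ranges in the text."""
    if 0 <= global_priority <= 59:
        return "userland (TS/IA/FX/FSS)"
    if 60 <= global_priority <= 99:
        return "SYS"
    if 100 <= global_priority <= 159:
        return "RT"
    # INT sits above RT but is deliberately ignored here, as in the text.
    raise ValueError(f"priority {global_priority} is outside the ranges covered here")


# Sample priorities: 59 (a TS process at the top of the userland range),
# 60 (the bottom of the SYS range), and 100 (the bottom of the RT range).
for pri in (59, 60, 100):
    print(pri, "->", priority_band(pri))
```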
We can mix scheduling classes on the same system but there are some considerations to keep in mind.
TS and IA can be in the same processor set, as can FSS and RT.
We can look at how the TS class (default) makes its decisions by looking at the dispatch table itself
This table is indexed by the priority level of the thread. To understand an entry, let’s use priority 30 as an example.
The leftmost column is marked as ts_quantum -> Timesharing quantum. This specifies the time in milliseconds (identified by the RES=1000) that the thread will be allocated before it is involuntarily context switched off the CPU.
A context switch is the process of storing and restoring the state of a process so that execution can be resumed at the same point at a later time. We store this state in the Light Weight Process (LWP) that the thread was bound to. Basically a thread binds to an LWP. An LWP binds to a kernel thread (kthr) and the kernel thread is presented to the kernel dispatcher. When the thread is placed on the CPU (hardware strand/thread) the contents of the LWP are loaded onto the CPU and the CPU starts execution at that point. When the thread is removed from the CPU (it is preempted, the time quantum is consumed, or it sleeps) the contents of the CPU registers are saved into the LWP, and the thread then returns to the dispatch queue to compete again, based on priority, with the other threads that request access to the CPU.
So, at a priority of 30 the thread has 80 milliseconds to complete its work or it will be forced off the CPU. In the event that it does not complete its work, the system will context switch the thread off the CPU AND change its priority. We see the next column ts_tqexp -> Timesharing time quantum expired. This column identifies the new priority of the thread. In this case ts_tqexp is 20. So, we consumed our time quantum, we were involuntarily context switched off the CPU, and we had our priority lowered when we returned to the dispatch queue. At a priority of 20 our time quantum is now 120 milliseconds. The priority is lowered to keep the thread from “hogging” the CPU, but the time quantum is increased in the hope that when we do get back on the CPU we have more time to complete our work.
The next column identifies our new priority when the thread returns from a sleep state. There is no reason to keep a thread on a CPU if there is no work to be done. When the thread enters a sleep state we leave the CPU; this is a VOLUNTARY context switch, and we are placed on the sleep queue. When we leave the sleep queue we are not placed back on the CPU. We are placed back on the dispatch queue to compete with the other threads to gain access to the CPU. Since we have been off the CPU for a period of time, we advance the priority. In this case we were initially dispatched at 30, we voluntarily context switched off the CPU, and when we woke we were given a priority of 53. Notice that at a priority of 53, our new time quantum is 40. The priority increased from 30 to 53 but the time quantum decreased from 80 to 40. We get you back on the CPU faster but limit the amount of time you get on the CPU.
Context switches and involuntary context switches can be seen in the output of the mpstat command, in the csw and icsw columns.
The last two columns deal with the attempt to prevent CPU starvation. ts_maxwait is a measurement (in seconds); if it is exceeded without access to the CPU, the thread is assigned the priority in ts_lwait. Notice that for all but priority 59 this value is set to 0. So when we exceed 0 (meaning we have been off the CPU for 1 second) we are assigned the value of ts_lwait. Again using 30 as our example, we would go from a priority of 30 to a priority of 53 if we were prevented access to the CPU for 1 second.
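The priority 30 walkthrough can be replayed as a small Python sketch. This is a model of the table lookup only, not kernel code. The class name, function name, and event names are my own; the priority 30 row uses the values quoted above, and for priorities 20 and 53 only the quantum is known from the text, so those rows are deliberately left incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TsEntry:
    ts_quantum: int  # milliseconds allowed on the CPU (RES=1000)
    ts_tqexp: int    # new priority when the time quantum expires
    ts_slpret: int   # new priority on return from sleep
    ts_maxwait: int  # seconds off the CPU before ts_lwait applies
    ts_lwait: int    # new priority once ts_maxwait is exceeded


# The priority 30 row, using the values from the walkthrough.
PRIORITY_30 = TsEntry(ts_quantum=80, ts_tqexp=20, ts_slpret=53,
                      ts_maxwait=0, ts_lwait=53)

# Quanta (in milliseconds) quoted in the text for the priorities involved.
QUANTUM_MS = {30: 80, 20: 120, 53: 40}


def next_priority(entry, event):
    """Pick the new priority for a thread based on what just happened to it."""
    if event == "quantum_expired":   # involuntary context switch
        return entry.ts_tqexp
    if event == "sleep_return":      # woke up after a voluntary context switch
        return entry.ts_slpret
    if event == "starved":           # waited longer than ts_maxwait
        return entry.ts_lwait
    raise ValueError(f"unknown event: {event}")


for event in ("quantum_expired", "sleep_return", "starved"):
    new_pri = next_priority(PRIORITY_30, event)
    print(f"30 --{event}--> {new_pri} (quantum {QUANTUM_MS[new_pri]} ms)")
```

Reading the loop output against the table shows the trade the TS class makes: CPU hogs drift down in priority with longer quanta, while sleepers and starved threads jump up with shorter ones.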
In the middle of all of this we have preemption. The Solaris Kernel is fully preemptible. All threads, even SYS threads, will be preempted if a higher priority thread hits the kernel dispatcher while a lower priority thread is running. The thread isn’t allowed to complete its time quantum, it is context switched off the CPU.
And don’t forget, there are IA, FSS, FX, SYS, RT, and INT threads that add to the chaos if allowed, which is why I provided the guidance I listed earlier.
We see some use of FX and quite a bit more of the FSS with Solaris zones. I’ll talk about FSS in another post.
I’ll spend a better part of a day whiteboarding all of this in the Oracle University Solaris Performance Management Class.
Monday, 30 September, 2013 18:37 Written by Brian B
I was honored to be asked by Rick Ramsey (Twitter @OldManRamsey) of Oracle Technology Network to be interviewed to discuss “What’s the biggest change data centers are facing today, and what does Collier IT recommend?” We were pressed for time; we only had 10 – 12 minutes and didn’t get much Solaris talk in. We are speaking of starting up a recurring Google Places environment so we can continue these short snippets of Oracle Technology with a focus around Operating Environments.
Here is the interview. Personal thanks to Oracle Corporation, Oracle Partner Network, Oracle Technology Network, Oracle University, and my employer Collier IT for allowing me this opportunity.