The Oracle Solaris kernel has a number of process scheduling classes available.
A brief review.
Timesharing (TS) This is the default class for processes and their associated kernel threads. Priorities in the class are dynamically adjusted based upon CPU utilization in an attempt to allocate processor resources evenly.
Interactive (IA) This is an enhanced version of TS. Some texts reference this in conjunction with TS, i.e. TS/IA. This class applies to the in-focus window in the GUI. It provides extra resources to processes associated with that specific window.
Fair Share Scheduler (FSS) This class is “share based” rather than priority based. The threads associated with this class are scheduled based on the associated shares assigned to them and the processor’s utilization.
Fixed-Priority (FX) Priorities for these threads are fixed regardless of how they interact with the CPU. They do not vary dynamically over the life of the thread.
System (SYS) Used to schedule kernel threads. These threads are bound meaning unlike the userland threads listed above they do not context switch off the CPU if their time quantum is consumed. They run until they are blocked or they complete.
Real-Time (RT) These threads are fixed-priority with a fixed time duration. They are the one of the highest priority classes with only interrupts carrying a higher priority.
As it relates to the priority ranges for the scheduling classes the userland classes (TS/IA/FX/FSS) carry the lowest priorities, 0-59. The SYS class is next ranging from 60-99. At the top (ignoring INT) is the RT class at 100-159.
We can mix scheduling classes on the same system but there are some considerations to keep in mind.
- Avoid having the FSS, TS, IA, and FX classes share the same processor set (pset)
- All processes that run on a processor set must be in the same scheduling class so they do not compete for the same CPUs
- To avoid starving applications, use processor sets for FSS and FX class applications
TS and IA as well as FSS and RT can be in the same processor set.
We can look at how the TS class (default) makes its decisions by looking at the dispatch table itself
This table is indexed by the priority level of the thread. To understand an entry lets use priority 30 as an example.
The left most column is marked as ts_quantum -> Timesharing quantum. This specifies the time in milliseconds (identified by the RES=1000) that the thread will be allocated before it will be involuntary context-switched off the CPU.
A context switch is the process of storing and restoring the state of a process so that execution can be resumed at the same point at a later time. We store this state in the Light Weight Process (LWP) that the thread was bound to. Basically a thread binds to a LWP. A LWP binds to a kernel thread (kthr) and the kernel thread is presented to the kernel dispatcher. When the thread is placed on the CPU (hardware strand/thread) the contents of the LWP is loaded onto the CPU and the CPU starts execution at that point. When the thread is removed from the CPU (it is preempted, the time quantum is consumed, it sleeps) the contents of the CPU registers are loaded into the LWP and then it is removed from the CPU and returns to the dispatch queue to compete again based on priority with the other threads that request access to the CPU.
So, at a priority of 30 the thread has 80 milliseconds to complete its work or it will be forced off the CPU. In the event that it does not complete its work, the system will context switch the thread off the CPU AND change its priority. We see the next column ts_tqexp ->Timesharing time quantum expired. This column identifies the new priority of the thread. In this case ts_tqexp is now 20. So, we consumed our time quantum, we were involuntary context switched off the CPU, and we had our priority lowered when we returned to the dispatch queue. At a priority of 20 our time quantum is now 120 milliseconds. Lowered priority to keep the thread from “hogging” the CPU but an increase in the time quantum in hopes that when we do get back on the CPU we have more time to complete our work.
The next column identifies our new priority when the thread returns from a sleep state. There is no reason to keep a thread on a CPU if there is no work to be done. When the thread enters a sleep state we leave the CPU, this is a VOLUNTARY context switch and we are placed on the sleep queue. When we leave the sleep queue we are not placed back on the CPU. We are placed back on the dispatch queue to compete with the other threads to gain access to the CPU. Since we have been off the CPU for a period of time we advance the priority in this case we were initially dispatched at 30, we voluntary context switched off the CPU, and when we woke we were given the priority of 53. Notice that at a priority of 53, or new time quantum is 40. The priority increased from 30 to 53 but the time quantum decreased from 80 to 40. We get you back on the CPU faster but limit the amount of time you get on the CPU.
The context and involuntary context switching can be seen in a mpstat commmand. csw and icsw
The last two columns deal with the attempt to prevent CPU starvation. ts_maxwait is a measurement (in seconds) that if exceed without access to the CPU the value of ts_lwait is assigned. notice that for all but priority 59 that this value is set to 0. So when we exceed 0 (meaning we have been off the CPU for 1 second) we are assigned the value of ts_lwait. Again using 30 as our example we would go from a priority of 30 to a priority of 53 if we were prevented access to the CPU for 1 second.
In the middle of all of this we have preemption. The Solaris Kernel is fully preemptible. All threads, even SYS threads, will be preempted if a higher priority thread hits the kernel dispatcher while a lower priority thread is running. The thread isn’t allowed to complete its time quantum, it is context switched off the CPU.
And don’t forget, there is IA, FSS, FX, SYS, RT, and INT threads that adds to the chaos if allowed and why I provided some of the guidance I listed earlier.
We see some use of FX and quite a bit more of the FSS with Solaris zones. I’ll talk about FSS in another post.
I’ll spend a better part of a day whiteboarding all of this in the Oracle University Solaris Performance Management Class.