Tuesday, 24 September, 2013 10:35 Written by Brian B
Had a fantastic day one at Oracle OpenWorld.
Here is the Collier IT group watching Mr. Ellison’s kickoff.
And here is a partial view of the humanity that is OpenWorld.
From the Engineered Systems perspective, the big announcement on day 1 was the SPARC M6 – Terabyte Scale Computing.
Some of the specifics/features:
12 S3 cores running at 3.6 GHz
48 MB shared L3 cache
Scalable to 32 processors
2 integrated 2×8 PCIe 3.0 interfaces
4.27 billion transistors
3 Terabytes per second bandwidth
1.4 Terabytes per second memory bandwidth
1 Terabyte per second I/O bandwidth
384 cores in a full 32-processor configuration
3072 threads
32 Terabytes of system memory
This thing is quite the beast.
We were told that the M7/T7 chip is running in the lab now. It extends the software on silicon to include:
Database query acceleration
Application data protection
Java acceleration
Data decompression
The first day shows us that Oracle continues its growth in the SPARC space. Exciting things are coming very soon in the Solaris/SPARC world.
Wednesday, 18 September, 2013 17:37 Written by Brian B
We are finishing up our 3-day Oracle University Partner Summit in New Orleans. Part of the gathering includes an awards ceremony.
Collier IT was the recipient of 3 awards this year. A great deal of teamwork, time, effort, and dedication is required to ensure that each of our clients receives the training and skill sets they require. These awards are a reflection of that effort, and I want to thank the entire Collier IT ownership, management, and staff for providing an atmosphere that allows these successes.
The first award is an exciting one for us. It acknowledges the work surrounding our sales efforts: architecting a learning-path solution that evaluates current skills and aligns additional training with company business needs.
Our second award is one that we have been privileged to win for 3 consecutive years. The Instructor Quality Award is a testament to the instructors on our staff, who also function as consulting engineers and provide top-quality content delivery to our customers.
Our final award this year seems to be a personal award, but it is one that requires an entire team effort within Collier IT as well as a strong Oracle University support staff. This is my second consecutive award and I am quite humbled to be named Oracle University Instructor of the Year.
My thanks to all of the students that attended classes in Collier IT's Roseville, MN location. Without each of you, your feedback, and your positive comments, none of this would be possible. My thanks to all of our clients that entrust their employees and their training dollars to us. We value your continued business.
Thursday, 12 September, 2013 07:16 Written by Brian B
Teaching Solaris Performance Management this week, we got into a lengthy discussion about T-Series CPUs, multi-threaded vs. multi-process applications, Multiple Page Size Support (MPSS), and the Translation Lookaside Buffer (TLB).
Solaris processes run in a virtual memory address space. When we use that address space, something needs to map each virtual address to an actual physical address. On the SPARC platform, the Hardware Address Translation (HAT) layer, named SFMMU (Spitfire Memory Management Unit), performs this function. The MMU divides the virtual address space into pages, and because Solaris supports MPSS we can change the size of these pages on both SPARC and x86.
The pagesize -a command will display the available page sizes on the system.
$ uname -a
SunOS chicago 5.10 Generic_127112-11 i86pc i386 i86pc Solaris
$ pagesize -a
4096
2097152

$ uname -a
SunOS niagara 5.10 Generic_138888-03 sun4v sparc SUNW Solaris
$ pagesize -a
8192
65536
4194304
268435456
Virtual Memory would not be very effective if every memory address had to be translated by looking up the associated physical page in memory. The solution is to cache the recent translations in a Translation Lookaside Buffer (TLB). A TLB has a fixed number of slots that contain Translation Table Entries (TTE), which map virtual addresses to physical addresses.
Modern servers have multiple cores with multiple hardware strands, allowing the system to dispatch a large number of threads to the CPUs. Each of the processes associated with these threads needs access to physical memory locations, placing a burden on the TLB and the HAT. Simply put, there may not be enough space in the TLB to hold all of the Translation Table Entries (TTEs) required by the large number of running processes.
To speed up handling of TLB miss traps, the processor provides a hardware-assisted lookup mechanism called the Translation Storage Buffer (TSB). The TSB is a virtually indexed, direct-mapped, physically contiguous, and size-aligned region of physical memory which is used to cache recently used Translation Table Entries (TTEs) after retrieval from the page tables. When a TLB miss occurs, the hardware uses the virtual address of the miss combined with the contents of a TSB base address register (which is pre-programmed on context switch) to calculate the pointer into the TSB of the entry corresponding to the virtual address. If the TSB entry tag matches the virtual address of the miss, the TTE is loaded into the TLB by the TLB miss handler, and the trapped instruction is retried. If no match is found, the trap handler branches to a slow path routine called the TSB miss handler. Quite a bit of complex work to handle these “misses”.
Starting with Solaris 10 Update 1, Out-Of-the-Box (OOB) Large Page Support turns on MPSS automatically for an application's heap and text (libraries). The advantage is that it improves the performance of your userland applications by reducing the CPU cycles required to service dTLB and iTLB misses. With larger pages, each TLB entry covers more memory: a 512-entry TLB can map only 4 MB worth of 8K pages, but 2 GB worth of 4M pages.
For example, if the heap size of a process is 256M, on a Niagara (UltraSPARC-T1) box it will be mapped onto a single 256M page. On a system that doesn't support large pages, it will be mapped onto 32,768 8K pages.
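You can also steer page-size selection explicitly rather than rely on OOB, using the ppgsz(1) wrapper or the mpss.so.1 preload library. A minimal sketch (the program name is illustrative, and the requested sizes must appear in the pagesize -a output for your system):

# Launch with 4M heap pages and 64K stack pages via the ppgsz wrapper
$ ppgsz -o heap=4M,stack=64K ./testprog

# The same request via the mpss.so.1 preload library
$ LD_PRELOAD=mpss.so.1 MPSSHEAP=4M MPSSSTACK=64K ./testprog

Re-running the pmap command shown next should then report the larger sizes in the Pgsz column.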
The pmap command displays the page sizes of memory mappings within the address space of a process. The -sx option directs pmap to show the page size for each mapping.
sol10# pmap -sx `pgrep testprog`
2909:   ./testprog
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  dev:277,83 ino:114875
00020000       8       8       8       -   8K rwx--  dev:277,83 ino:114875
00022000  131088  131088  131088       -   8K rwx--    [ heap ]
FF280000     120     120       -       -   8K r-x--  libc.so.1
FF29E000     136     128       -       -    - r-x--  libc.so.1
FF2C0000      72      72       -       -   8K r-x--  libc.so.1
FF2D2000     192     192       -       -    - r-x--  libc.so.1
FF302000     112     112       -       -   8K r-x--  libc.so.1
FF31E000      48      32       -       -    - r-x--  libc.so.1
FF33A000      24      24      24       -   8K rwx--  libc.so.1
FF340000       8       8       8       -   8K rwx--  libc.so.1
FF390000       8       8       -       -   8K r-x--  libc_psr.so.1
FF3A0000       8       8       -       -   8K r-x--  libdl.so.1
FF3B0000       8       8       8       -   8K rwx--    [ anon ]
FF3C0000     152     152       -       -   8K r-x--  ld.so.1
FF3F6000       8       8       8       -   8K rwx--  ld.so.1
FFBFA000      24      24      24       -   8K rwx--    [ stack ]
-------- ------- ------- ------- -------
total Kb  132024  132000  131168       -
There may be instances where OOB causes poor performance in some of your applications, or even application crashes, if an application makes improper assumptions regarding page sizes. If you run into this scenario, there are adjustments that can be made in /etc/system to enable or disable OOB support.
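As a sketch of what those adjustments look like (tunable names as documented for Solaris 10; verify them against your release before touching /etc/system, and note the change takes effect only after a reboot):

* Disable OOB large-page selection for the heap and the stack
set use_brk_lpg = 0
set use_stk_lpg = 0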
It can also introduce challenges with some caches and their coherency. In multi-threaded applications running on CMP and SMP systems, threads from a common PID can be dispatched to different CPUs, each holding its own TTEs in its TLB. When a thread unmaps virtual memory, we have to perform a cleanup: a stale mapping left in another CPU's TLB would point at an invalid physical memory location and, if allowed to remain, could lead to corruption. During a munmap, only the CPUs that have actually run the process are crosscalled for this cleanup, rather than broadcasting to every running CPU. Even so, as we add more processors, the cleanup can take longer. If you suspect this is occurring after migrating to a larger system, consider processor sets or CPU binding to see if that provides some relief, as in the sketch below.
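For instance, pbind(1M) binds a process to a single processor, and psrset(1M) confines it to a set of processors, limiting which TLBs can ever hold its TTEs. The PIDs and CPU IDs here are illustrative, and I assume the newly created set is given ID 1:

# Bind PID 2909 to processor 4
sol10# pbind -b 4 2909

# Or create a processor set from CPUs 8-11 and bind the process into it
sol10# psrset -c 8-11
sol10# psrset -b 1 2909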
We can use the trapstat command to gain some insight into the performance of our dTLB and iTLB hit rates. By specifying the -T option, trapstat shows TLB misses broken down by page size. In this example, CPU 0 is spending 7.9 percent of its time handling user-mode TLB misses on 8K pages, and another 2.3 percent of its time handling user-mode TLB misses on 64K pages.
example# trapstat -T -c 0
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
  0 u   8k|      1300  0.1        15  0.0 |    104897  7.9        90  0.0 | 8.0
  0 u  64k|         0  0.0         0  0.0 |     29935  2.3         7  0.0 | 2.3
  0 u 512k|         0  0.0         0  0.0 |      3569  0.2         2  0.0 | 0.2
  0 u   4m|         0  0.0         0  0.0 |       233  0.0         2  0.0 | 0.0
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
  0 k   8k|        13  0.0         0  0.0 |     71733  6.5       110  0.0 | 6.5
  0 k  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k 512k|         0  0.0         0  0.0 |         0  0.0       206  0.1 | 0.1
  0 k   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
==========+===============================+===============================+====
      ttl |      1313  0.1        15  0.0 |    210367 17.1       417  0.2 |17.5
By specifying the -e option, trapstat displays statistics for only specific trap types. Using this option minimizes the probe effect when seeking specific data. This example yields statistics for only the dtlb-prot and syscall-32 traps on CPUs 12 through 15:
example# trapstat -e dtlb-prot,syscall-32 -c 12-15
vct name                |   cpu12   cpu13   cpu14   cpu15
------------------------+------------------------------------
 6c dtlb-prot           |     817     754    1018     560
108 syscall-32          |    1426    1647    2186    1142

vct name                |   cpu12   cpu13   cpu14   cpu15
------------------------+------------------------------------
 6c dtlb-prot           |    1085     996     800     707
108 syscall-32          |    2578    2167    1638    1452
cpustat provides another way to monitor events on the CPU, including the workings of the TLB. The following command displays the three CPUs with the highest DTLB_miss rates:
example% cpustat -c DTLB_miss -k DTLB_miss -n 3 1 1
   time cpu event DTLB_miss
  1.040 115  tick       107
  1.006  18  tick        98
  1.045 126  tick        31
  1.046  96 total       236

         event DTLB_miss
         total       236
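cpustat reports system-wide, per-CPU counts. To attribute misses to a single process instead, cputrack(1) accepts the same platform-specific counter names; a sketch, with an illustrative PID and a one-second sampling interval:

# Sample the DTLB_miss counter for PID 2909 once per second
example% cputrack -T 1 -c DTLB_miss -p 2909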
There is quite a bit more to think about relating to MPSS and the TLBs; I hope this post serves as a starting point for those running CMP/SMP systems with multi-threaded applications to perform a deeper dive.
Take a look at pmap -sx, ppgsz, pagesize, and mpss.so.1 for additional direction.