Performance Troubleshooting


Abstract[edit]

The following document details the steps that an HP-UX L1 engineer is required to follow to handle a Performance Troubleshooting case in the context of the Response Center contractual obligation with its customers. The procedure details how to identify a performance case as well as how to perform a cursory analysis to profile the problem.

Performance[edit]

Read the following statements:

  • All HP-UX systems, and indeed all computers bought by HP customers regardless of form factor or CPU architecture, perform computation tasks where source data is transformed into result data. Most likely both the source and the transformed data will require temporary or permanent storage for current or future retrieval through any available presentation medium
  • Performance is defined and measured by the amount of data that can be transformed, and possibly stored, in a given time frame by the resources assigned to the task by a given system. Several, if not hundreds or thousands, of independent data transformations happen concurrently
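
For example (hypothetical numbers, used only to illustrate the definition above): a nightly batch job that reads 20 GB of source data and writes 20 GB of results in 2 hours sustains roughly 40 GB / 7200 s ≈ 5.7 MB/s of combined throughput; the performance question is whether the CPU, memory, storage and network resources assigned to that job can sustain that rate alongside all the other work running concurrently.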

Compare those statements with the following HP-UX problem descriptions:

  • "The system has a performance problem!"
  • "The users sessions hangs intermittently!"
  • "The application is taking three times longer to complete the task"
  • "Why is my System kernel | user processes so high"
  • "I have 128 GB of memory, but my database suddenly closes with 'cannot malloc' error messages"

If you feel that you don't know where to start to offer a proper answer to the customer and the phrase "It depends" immediately pops up, do not worry. The questions are incorrectly stated; you need knowledge and practice to tackle even a baseline performance analysis, and even those cases where strong performance lingo is used may have little relation to performance best practices.

The bad news (don't panic!) is that you, as the L1 engineer, are fully responsible for:

  1. Identifying performance cases
  2. Framing those cases
  3. Performing data collection, and
  4. Analyzing within the contractual limits of the support contract.

The good news is that HP-UX L2 | WTEC engineers are here to help you attack complex problems or identify product defects that require the development of fixes.

Performance outline[edit]

The following is the HP Education "HP-UX performance and tuning" course content. It constitutes an excellent overview of the common topics that may arise during a performance case analysis and is an excellent starting point to determine which topics you need to handle. A small sketch of inspecting a few of the related kernel tunables follows the outline.

  • Introduction
    • What is a performance problem
    • The "System centric" view of performance
    • Measuring performance
    • The first rule of interpreting metrics
    • Types of performance
    • Multiprocessor scaling
    • Bottlenecks
    • Baseline
    • Queuing and response times
    • Queuing theory and performance
    • Increasing CPU counts and utilization
  • Tools
    • Performance tools
    • Sources of data
    • Glance
    • GPM or as it is now known, xglance
    • Customizing lists in GPM
    • Alarms
    • HP Performance Agent and Manager
    • HP Performance Manager
    • HP PM web-based reports
    • Caliper
    • Using caliper
  • CPUs and Performance
    • Types of CPU bottlenecks
    • CPUs and performance
    • Tuning for data latency
    • Performance and system size
    • Memory types on cell based servers
    • Configuring and Using CLM
    • Launch policies
    • Address translation delays
    • Virtual to physical address translation
    • Measuring TLB misses
    • Tuning for TLB misses
    • Variable page size kernel parameters
    • The change attributes command
    • Hyperthreading
    • Shared caches
    • Compiler optimizations
    • CPUs and performance
    • Process Resource Manager
  • Process management
    • The HP-UX operating system
    • Virtual address process space
    • Physical process components
    • Life cycle of a process and process states
    • CPU scheduler
    • Context switching, priority queues and time share
    • Parent-child process relationship
  • CPU management
    • Processor module
    • Symmetric multiprocessing
    • CPU processor
    • CPU and TLB cache
    • TLB, Cache, and Memory
    • HP-UX 11.00 performance optimized page sizes
    • CPU metrics to monitor system-wide and per process
    • glance reports and timex command
    • Activities that utilize the CPU
    • Tuning a CPU-bound system
    • Processor affinity
  • IO Performance
    • IO and performance
    • Causes of IO performance problems
    • The system call interface
    • System calls take time
    • The filesystem layer
    • VxFS performance topics
    • Defragmenting OnlineJFS filesystems
    • Understanding your IO workload
    • Caching controls
    • DSYNC
    • Locks on kernel data structures
    • Performance implications of locks
    • Large directories, improving performance
    • Buffered IO, reading ahead
    • Writing behind at 11.23 and earlier
    • Writing behind with HP-UX 11.31
    • Direct IO
    • Caching
    • Caching improves performance
    • Caching improves performance when
    • Tuning the cache
    • Volume managers
    • Mirroring and performance
    • Striping
    • Multipathing
    • Load balancing policies
    • Device caching
  • Memory
    • System memory management
    • When does memory affect performance
    • Memory usage
    • Virtual memory
    • Memory allocation
    • vhand (the page daemon)
    • Memory Resource Groups
    • File/Buffer cache paging differences
    • Diagnosing memory problems
    • Tuning the swap environment
    • malloc
    • Freeing memory
    • Expanding the heap
    • The Small Block Allocator (SBA)
    • The Global Cache Exchange
    • Tuning the Global Cache Exchange
    • Protection ID faults
  • Network Performance
    • Types of performance
    • Latency
    • Latency and response time
    • Bandwidth
    • Layers within networking
    • Mapping the physical network
    • Measuring network speed
    • ttcp
    • netperf
    • LAN cards and CPU interrupts
    • ping
    • traceroute
    • netstat
    • lsof
    • Wireshark
  • Kernel Parameters
    • Kernel parameter groups
    • Process management
    • Process management parameters continued
    • Memory management
    • Swap space management
    • Introduction to SYS V IPC parameters
    • Message queues
    • Understanding the message queue parameters
    • Semaphores
    • Shared memory
    • Signals related parameters
    • Networking
    • NFS related parameters
    • Auditing and security
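
Many of the topics in the outline above (variable page sizes, buffer cache, swap, SYS V IPC limits) ultimately map to kernel tunables. The following is a minimal sketch, assuming an HP-UX 11i v2 or v3 system where kctune(1M) is available (use kmtune on older releases); the tunable names are only examples and should be adapted to the case at hand.

# Query the current value of a few commonly discussed tunables
kctune maxdsiz          # per-process data segment limit (32-bit processes)
kctune shmmax           # maximum System V shared memory segment size
kctune filecache_max    # 11.31 unified file cache upper bound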

Types of Performance Cases[edit]

There are at least three types of Performance Troubleshooting cases that are normally handled by the HP-UX Response Center engineers:

  • Informational: The customer is not aware of any current performance problem but wishes to collect performance data to establish a system baseline and possibly detect areas of improvement.
  • Resource Contention: The customer experiences degraded AND | OR diminishing user application performance, measured or perceived against historical data (the baseline). Requires assistance to establish the role of the Operating System in the problem.
  • Break and Fix: Resource contention (CPU, Memory, I/O) not related to user applications, such as in the kernel or bundled HP-UX software, that cannot be correlated with the application load. Requires framing of the problem, and possibly the development of a fix or patch.

Informational Cases[edit]

The HP-UX Response Center is not in the educational business; HP Education is. For customers who are just starting to learn how to perform performance data collection, the following are good documents that those System Administrators should consult (that includes you!).

HP Education:

Kernel Parameters:

Books:

Old references:

Resource Contention (Bottleneck) Cases[edit]

The HP Response Center does not provide Performance Analysis Services, since this is beyond the business unit scope (break and fix).

Nonetheless, legitimate customer concerns very often arise about the behavior of a particular HP-UX system with respect to application resources. Agents are required to know how to perform a detailed resource contention analysis, so they can point the customer in the right direction if further processing capacity such as CPU, Memory or Storage is required on the system to improve execution time.
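
As a minimal sketch (standard HP-UX commands; the interval and sample counts are arbitrary examples), sustained CPU and memory pressure that would justify such a capacity recommendation can be spotted from the run queue and the paging counters:

# Run queue: runq-sz persistently above the number of CPUs suggests a CPU shortfall
sar -q 60 30

# Paging: sustained page-outs (po) and shrinking free memory suggest memory pressure
vmstat 60 30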

Set the correct expectations[edit]

Customers must be made aware of the limits imposed on the analysis. The following template should be used and, if possible, sent to the customer.

Hello <CUSTOMER NAME>. This is <AGENT NAME> with the <SUPPORT GROUP>. I have been asked to assist you on your performance case.

Since this case requests HP-UX Performance Recommendations, I have to provide the following disclaimer:

Support Disclaimer:

I am not a Performance Analyst and cannot provide Performance analysis or performance tuning as it is beyond the scope of the Response Center. Performance analysis | tuning can be provided on a Time and Materials (T&M) basis should you request it.
So, contractually we are looking for Operating System configuration changes that can eliminate contention for resources and bottlenecks of CPU, Disk or Memory. To that end, I need you to run the performance scripts listed at the bottom of this email so we can get the necessary data to establish a baseline and pursue a preliminary analysis.


Your basic [question | problem description] is:

<State the main performance issue or question.>

Add the most sensible performance gathering tools according to the problem described; a rough mapping from symptom to tool is sketched below.
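
The following mapping is only a rough starting point (all standard HP-UX tools; it is not an exhaustive list):

  • CPU contention or high kernel | user time: top, sar -u, Glance CPU reports
  • Memory pressure or paging: vmstat, swapinfo -tam, Glance memory reports
  • Disk or filesystem latency: sar -d, Glance disk and IO reports
  • Network throughput or latency: netstat -s, ping, netperf or ttcp
  • Per-process detail over time: ps -efl samples, Glance process lists, MeasureWare logs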

Collect preliminary performance data[edit]

Once the correct expectations are set, the next step consists of collecting statistical information on the usage of the system. This information helps to identify the contended resources or, if they have already been identified, to measure the resource contention over a representative period of time.

There are several commands and tools that can be used to accomplish this. At a minimum, L1 engineers are expected to handle the tools of the HP-UX Performance Cookbook.

Be careful at this point. System Administrators are likely to answer customer performance questions based on a single measurement sample taken with one tool, such as a Glance screenshot or the top or sar commands.

This should be explained almost as a mantra: a single measurement has no statistical value. Data for a complete business cycle (one day, week, month, etc.) should be collected to trace proper performance trends.
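
As a minimal sketch of what collecting "a complete business cycle" can look like (the interval and sample count are only examples), sar can record a full day of data in a binary log that can be replayed later:

# 288 samples at 300-second intervals = 24 hours; -o keeps a binary log for later replay
sar -o /var/tmp/sar_$(date +%Y%m%d).dat 300 288 > /dev/null &

# Replay the CPU utilization data from the binary log once collection has finished
sar -u -f /var/tmp/sar_$(date +%Y%m%d).dat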

Check the HP-UX_Enterprise_Frontline:Community_Portal#Performance collection of articles for special application data collection tools for specific resource measurements. In particular, systems where Glance or MeasureWare is installed can provide up to a month of historical data, which can be very useful for looking into the past of a particular system.

Glance is the preferred tool to collect statistical performance data, since report templates can be created to report dozens of historical performance indicators. On systems where Glance is not available, you may need to start collecting performance indicators with command line tools.
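
Where Glance is installed, its adviser mode can log a chosen set of metrics to a plain text file without the interactive interface. The sketch below is only illustrative; the option spellings and metric names (GBL_STATTIME, GBL_CPU_TOTAL_UTIL, GBL_MEM_UTIL, GBL_DISK_PHYS_IO_RATE) should be verified against glance(1) and the adviser metric list on the target system.

# Hypothetical adviser syntax file /tmp/adv.syntax containing a single line such as:
#   print GBL_STATTIME, " ", GBL_CPU_TOTAL_UTIL, " ", GBL_MEM_UTIL, " ", GBL_DISK_PHYS_IO_RATE

# Log the metrics every 60 seconds for one hour without starting the user interface
glance -adviser_only -syntax /tmp/adv.syntax -j 60 -iterations 60 > /tmp/glance_adviser.out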

The following is a very simple POSIX shell script that can be used to collect data to identify resource contention where Glance binary logs are not available.

Save the following script in the /tmp/perf.sh file and execute it from a cron job or in a user session with a while loop (examples of both follow the script).

# Defining UNIX95 enables standards-compliant (XPG4) behavior for commands such as ps
export UNIX95=
export PATH=/usr/sbin:/usr/bin

# Each collector appends a timestamp and five 1-second samples, running in the background
# date >> /tmp/ps.out; ps -Hef >> /tmp/ps.out &                  # optional process tree snapshot
date >> /tmp/sar.out; sar 1 5 >> /tmp/sar.out &                  # CPU utilization
date >> /tmp/sard.out; sar -d 1 5 >> /tmp/sard.out &             # disk (device) activity
date >> /tmp/sarb.out; sar -b 1 5 >> /tmp/sarb.out &             # buffer cache activity
date >> /tmp/sarc.out; sar -c 1 5 >> /tmp/sarc.out &             # system call activity
date >> /tmp/sarw.out; sar -w 1 5 >> /tmp/sarw.out &             # swapping and process switching
date >> /tmp/swapinfo.out; swapinfo -tam >> /tmp/swapinfo.out &  # swap space usage
date >> /tmp/vmstat.out; vmstat -n 1 5 >> /tmp/vmstat.out &      # virtual memory and paging

To run the script in a while loop, use the following commands. The sleep command sets the execution delay between iterations (300 seconds, i.e. 5 minutes, in this example). Interrupt the loop with [Ctrl]+[C].

while true
do
sh /tmp/perf.sh
sleep 300
done
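
If the loop has to survive the user logging out, one alternative is to wrap it in nohup and run it in the background (the quoting below assumes the POSIX shell):

nohup sh -c 'while true; do sh /tmp/perf.sh; sleep 300; done' > /dev/null 2>&1 &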

Run the script for a representative time frame and collect the output files /tmp/*.out
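
Alternatively, a minimal sketch of a crontab entry that runs the collection script every five minutes; note that the classic HP-UX cron does not support the */5 step syntax, so each minute is listed explicitly:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/bin/sh /tmp/perf.sh

Add the entry with crontab -e as the user that should own the output files in /tmp.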

Reference[edit]

Authors[edit]