Capacity Planning Interview

From Wiki-UX.info


Introduction

The purpose of this document is to complement the interview, providing overall insight into my current command of standard tools and my approach to analyzing problems. Using a single case, I try to explore the following topics:

  1. Usage of platform-specific standard tools: OVPA (MeasureWare).
  2. Usage of in-house developed tools for general information review: gnuplot.
  3. Development of custom per-case analysis tools using standard UNIX tools and languages: awk and shell commands.
  4. Data analysis, using a general-to-specific approach.
  5. Communication skills.

A Resource Contention Case under HP-UX

Yesterday afternoon, a major Italian telecommunications company reported diminished performance on one of their largest units: a Superdome SD32 Oracle database system running HP-UX 11.11.

By the time the case was opened, the situation had already returned to normal. Both the customer and the engineer in the region requested that the situation be reviewed, with the purpose of pinpointing the source of the problem as well as its design implications.

Under these conditions, I selected the OVPA (MeasureWare) tool. On HP-UX, commands like sar and iostat use kernel trace points (ktrace), which are basically counters that are updated every time a specific kernel call is made. Those kernel calls correlate directly with the different resources being utilized by the system. OVPA, unlike those commands, keeps a historical log of the accounting information, which allows historical analysis to be performed.

The first step is to collect the binary logs. With those in hand, the next step is to create a global report of the most general resource usage indicators.

Overall Resource Utilization

The review covers the period between "05/26/2010 00:00:00" and "05/27/2010 00:00:00". We have been informed that the issue was worst around 15:00.

/opt/perf/bin/extract -g -xp -r global_cpumem.rept \
-f global_cpumem.out,purge \
-b "05/26/2010 00:00:00" -e "05/27/2010 00:00:00"

This is an example of the report output:

# head -n 20 global_cpumem.out
OVPA Export 05/27/10 01:08    Logfile: /var/opt/perf/datafiles/logglob SCOPE/UX C.04.70.000(0) batman
          |        |      |   Phys    |   Phys   |  Free   |Memory| User |      |  Pg Out   |  Pg Out  |
   Date   |  Time  |CPU % |    IOs    |  IO Rt   |   Mem   |  %   |Mem % |Swap %|    KB     |  KB Rt   |
05/26/2010|00:00:00| 85.09|    1199534|    4007.8|    36075| 49.28| 25.94| 33.00|          0|       0.0|
05/26/2010|00:05:00| 84.10|    1204143|    4005.8|    36065| 49.29| 25.95| 33.00|          0|       0.0|
05/26/2010|00:10:00| 81.12|    1446648|    4828.6|    36818| 48.23| 24.89| 32.00|          0|       0.0|
05/26/2010|00:15:00| 74.37|    1572286|    5246.2|    36227| 49.06| 25.72| 32.00|          0|       0.0|
05/26/2010|00:20:00| 51.79|     895526|    2983.1|    36678| 48.43| 25.09| 32.00|          0|       0.0|
05/26/2010|00:25:00| 46.81|     932760|    3109.2|    36867| 48.16| 24.82| 32.00|          0|       0.0|
05/26/2010|00:30:00| 51.15|    1313495|    4375.4|    35564| 49.99| 26.65| 33.00|          0|       0.0|
05/26/2010|00:35:00| 39.44|    2161198|    7208.8|    35121| 50.62| 27.28| 33.00|          0|       0.0|
05/26/2010|00:40:00| 74.49|    2158107|    7188.9|    34285| 51.79| 28.45| 34.00|          0|       0.0|
05/26/2010|00:45:00| 68.29|    2060613|    6855.0|    34967| 50.83| 27.49| 34.00|          0|       0.0|
05/26/2010|00:50:00| 65.78|    2260320|    7534.4|    35541| 50.03| 26.69| 33.00|          0|       0.0|
05/26/2010|00:55:00| 74.88|    2152054|    7178.3|    35354| 50.29| 26.95| 33.00|          0|       0.0|
05/26/2010|01:00:00| 60.65|    2423284|    8091.1|    35734| 49.76| 26.42| 32.00|          0|       0.0|
05/26/2010|01:05:00| 69.55|    2927491|    9716.2|    36085| 49.26| 25.92| 32.00|          0|       0.0|
05/26/2010|01:10:00| 51.73|    2303793|    7694.7|    36196| 49.11| 25.77| 32.00|          0|       0.0|
05/26/2010|01:15:00| 52.94|    1384110|    4613.7|    35608| 49.93| 26.59| 33.00|          0|       0.0|
05/26/2010|01:20:00| 41.89|    1192023|    3966.8|    35995| 49.39| 26.05| 32.00|          0|       0.0|
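
Before plotting, the export can be scanned directly with awk for the busiest interval. A minimal sketch; the inline sample here stands in for the full global_cpumem.out, with only the first columns reproduced:

```shell
# Illustrative sample export: three header lines, then pipe-delimited records
cat > global_cpumem.out << 'EOF'
OVPA Export 05/27/10 01:08    Logfile: /var/opt/perf/datafiles/logglob
          |        |      |   Phys    |
   Date   |  Time  |CPU % |    IOs    |
05/26/2010|00:00:00| 85.09|    1199534|
05/26/2010|00:05:00| 84.10|    1204143|
05/26/2010|00:10:00| 81.12|    1446648|
EOF

# Skip the 3-line header (tail -n +4 is equivalent to the wc/expr construct
# used elsewhere in this document) and track the maximum CPU % in field 3
tail -n +4 global_cpumem.out | \
awk -F'|' '$3+0 > MAX {MAX=$3+0; WHEN=$1" "$2}
           END {printf "Peak CPU %.2f%% at %s\n", MAX, WHEN}'
```

With the sample above this prints `Peak CPU 85.09% at 05/26/2010 00:00:00`; against the real export it pinpoints the interval worth zooming into.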

The following code is used to convert the "OVPA Export" into a graphical representation (plot).

cat > global_cpumem.plot << EOF
<gnuplot>
set size 1, 1
set yrange [0:120]
set xrange [1:300]
set xlabel "Samples"
plot '-' using 1:2 title 'CPU %' with lines, \\
     '-' using 1:3 title 'Memory %' with lines, \\
     '-' using 1:4 title 'User Memory %' with lines, \\
	 '-' using 1:5 title 'SWAP %' with lines
<plotdata>
EOF
 
tail -n $(expr $(wc -l global_cpumem.out | cut -d" " -f 1) - 3 ) global_cpumem.out | \
awk -F'|' 'BEGIN {LINE=1; printf "#Sample\tCPU\tMemory\tUser Memory\tSwap\n"; } { 
   printf "%-6d\t%4.2f\t%4.2f\t%4.2f\t%4.2f\n", LINE, $3, $7, $8, $9
   LINE++
   }' >> global_cpumem.plot
 
cat >> global_cpumem.plot << EOF
</plotdata>
</gnuplot>
EOF

[Gnuplot plot: overall CPU, memory, user memory and swap utilization]

The information shows consistently high CPU usage from around sample 120 (05/26/2010 09:55:00) to sample 214 (05/26/2010 17:45:00).

Note that memory utilization also increased during that time frame, unlike the CPU utilization spikes earlier on that day, where the memory utilization was flat.

This preliminary overview seems to point to new processes that began execution, allocating additional memory and consuming a larger share of CPU cycles.

CPU Bottleneck

Based on the previous information, there is a clear indication of a possible CPU bottleneck.

The following plots are based on similar reports that collect the type of CPU execution as well as the queues.

[Gnuplot plot: user vs. system CPU utilization]

[Gnuplot plot: run queue / priority queue lengths]

Note that during the whole period, System CPU usage is almost flat, while User CPU is responsible for most of the load on the system.

Likewise, it can be seen in the run queue that a large number of entries in the priority queue (max 50) were scheduled for execution. This points to high CPU resource usage by userland applications. In general terms, run queues or priority queues larger than 3 are considered a resource bottleneck.
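
The threshold-of-3 rule of thumb can be turned into a quick awk check. A minimal sketch, assuming a pipe-delimited queue export with the run queue in the third field; the inline data is illustrative, not from the case:

```shell
# Illustrative queue export: Date|Time|Run queue
cat > runq.dat << 'EOF'
05/26/2010|10:00:00|  1.2
05/26/2010|10:05:00| 12.4
05/26/2010|10:10:00| 48.0
05/26/2010|10:15:00|  2.1
EOF

# Count the samples whose run queue exceeds the bottleneck threshold of 3
awk -F'|' '{N++; if ($3+0 > 3) HOT++}
 END {printf "%d of %d samples above run queue 3 (%.0f%%)\n", HOT, N, 100*HOT/N}' runq.dat
```

With the data above this reports 2 of 4 samples (50%) over the threshold, a compact way to express how long the system spent saturated.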

Per Process CPU and I/O Utilization

The Per Process CPU and I/O Utilization report is somewhat large, since it shows the active processes at one-minute resolution. A review can be done where both CPU utilization and priority queue are high, for example between samples 23 (05/26/2010 11:45:00) and 27 (05/26/2010 12:05:00).

Here we hit a problem. Due to the large amount of information that per-process collection creates, OVPA does not keep a large quantity of per-process data, so we will have to concentrate on the information available between 05/26/2010 17:05:01 and 05/26/2010 17:44:00.

/opt/perf/bin/extract -p -xp -r cpuio.rept \
-f cpuio.out,purge \
-b "05/26/2010 09:55:00" -e "05/26/2010 17:45:00"

The following is an example of the level of detail of a single minute of execution:

OVPA Export 05/27/10 01:46    Logfile: /var/opt/perf/datafiles/logglob SCOPE/UX C.04.70.000(0) batman
          |        |          |    Process     |      User      |       | Phys | IO Byte |File System|    VM     |  System   |
   Date   |  Time  |   PID    |      Name      |      Name      | CPU % |IO Rt |  Rate   |    IOs    |    IOs    |    IOs    |
05/26/2010|17:05:01|      1461|ora_p148_dwprp  |oracle          |  11.32| 127.7|   7680.0|          0|          0|       7764|
05/26/2010|17:05:01|     28367|oracledwprp     |oracle          |   0.12|  45.2|   1447.6|          0|          0|       2748|
05/26/2010|17:05:01|     11425|ora_p084_dwprp  |oracle          |   1.11|  36.3|   2209.9|          0|          0|       2207|
05/26/2010|17:05:01|      3387|ora_p155_dwprp  |oracle          |  14.40|   5.0|    304.1|          0|          0|        304|
05/26/2010|17:05:01|      1056|ora_p131_dwprp  |oracle          |   1.03|  63.7|   3891.2|          0|          0|       3872|
05/26/2010|17:05:01|     14933|ora_p166_dwprp  |oracle          |  26.20|  17.0|   1029.0|          0|          0|       1033|
05/26/2010|17:05:01|     20306|ora_p190_dwprp  |oracle          |  54.38|  35.3|   2035.9|          0|          0|       2146|
05/26/2010|17:05:01|      2197|ora_p172_dwprp  |oracle          |   3.14|  24.3|   1404.4|          0|          0|       1477|
05/26/2010|17:05:01|      5502|ora_p006_dacp   |oracle          |   3.16|  31.2|    665.8|          0|          0|       1896|
05/26/2010|17:05:01|     11195|ora_p010_dwprp  |oracle          |  14.50|  45.1|   2768.7|          0|          0|       2742|
05/26/2010|17:05:01|     16573|oracledwprp     |oracle          |   8.40|  16.9|   1018.5|          0|          0|       1027|
05/26/2010|17:05:01|     11421|ora_p082_dwprp  |oracle          |   1.11|  44.0|   2669.6|          0|          0|       2675|
05/26/2010|17:05:01|     11921|oracledwprp     |oracle          |   3.30|  41.3|   1323.2|          0|          0|       2511|
05/26/2010|17:05:01|     11463|ora_p097_dwprp  |oracle          |  16.55| 280.0|  16793.6|          0|          0|      17024|
05/26/2010|17:05:01|     14939|ora_p167_dwprp  |oracle          |  26.46|  16.8|   1012.1|          0|          0|       1021|
05/26/2010|17:05:01|     11183|ora_p007_dwprp  |oracle          |  17.45|  53.3|   3273.2|          0|          0|       3240|
05/26/2010|17:05:01|     20961|oracledwprp     |oracle          |   5.47|   0.0|      1.6|          0|          0|          0|
05/26/2010|17:05:01|     21865|ftpd            |root            |   0.13|   6.0|    268.1|          0|          0|        364|
05/26/2010|17:05:01|     11309|ora_p043_dwprp  |oracle          |   7.33|   7.9|    490.2|          0|          0|        480|
05/26/2010|17:05:01|     17726|ora_p448_dwprp  |oracle          |   2.13|  61.8|   3686.4|          0|          0|       3757|
05/26/2010|17:05:01|     17999|oracledwprp     |oracle          |   9.12|   0.1|      4.7|          0|          0|          6|
05/26/2010|17:05:01|     14236|ora_p426_dwprp  |oracle          |   1.01|  25.2|   1520.3|          0|          0|       1532|
05/26/2010|17:05:01|     11384|ora_p067_dwprp  |oracle          |  72.15|  23.6|   1423.3|          0|          0|       1434|
05/26/2010|17:05:01|     22221|ora_p106_dwprp  |oracle          |   1.04|   7.8|    474.9|          0|          0|        474|
05/26/2010|17:05:01|     11406|ora_p076_dwprp  |oracle          |  85.85|  27.4|   1651.6|          0|          0|       1665|
05/26/2010|17:05:01|      9747|ora_p144_dwprp  |oracle          |  34.88| 106.1|   6348.8|          0|          0|       6450|
05/26/2010|17:05:01|     11250|ora_p025_dwprp  |oracle          |   5.23|   7.9|    490.2|          0|          0|        480|
05/26/2010|17:05:01|      1467|ora_p150_dwprp  |oracle          |  14.65| 174.8|  10444.8|          0|          0|      10627|
05/26/2010|17:05:01|     11301|ora_p040_dwprp  |oracle          |   5.25|   7.9|    490.2|          0|          0|        480|
05/26/2010|17:05:01|     11175|ora_p004_dwprp  |oracle          |  12.33|  34.9|   2138.7|          0|          0|       2121|
05/26/2010|17:05:01|      7947|oracledwprp     |oracle          |  10.47|   0.0|      0.0|          0|          0|          0|
05/26/2010|17:05:01|     11132|ora_dbw0_dwprp  |oracle          |   3.20| 106.3|   3379.2|          0|          0|       6463|
05/26/2010|17:05:01|       169|oracledwprp     |oracle          |  16.96| 108.2|   6451.2|          0|          0|       6578|
05/26/2010|17:05:01|     14234|ora_p425_dwprp  |oracle          |   1.11|  25.0|   1506.1|          0|          0|       1520|
05/26/2010|17:05:01|     13582|ora_p187_dwprp  |oracle          |  63.00|  41.7|   2425.0|          0|          0|       2535|
05/26/2010|17:05:01|     28768|ora_pmon_prpg   |oracle          |   0.00|   0.0|      0.0|          0|          0|          0|
05/26/2010|17:05:01|     21504|oracledwprp     |oracle          |   8.20|   0.1|      3.6|          0|          0|          6|
05/26/2010|17:05:01|      1636|guard_stap      |root            |  38.21|   0.0|      0.0|          0|          0|          0|
05/26/2010|17:05:01|      1469|ora_p151_dwprp  |oracle          |  15.72| 185.8|  11161.6|          0|          0|      11296|
05/26/2010|17:05:01|     11252|ora_p026_dwprp  |oracle          |   6.25|   7.9|    490.2|          0|          0|        480|
05/26/2010|17:05:01|     11461|ora_p096_dwprp  |oracle          |  18.72| 322.6|  19353.6|          0|          0|      19614|
05/26/2010|17:05:01|     11306|ora_p042_dwprp  |oracle          |   5.18|   7.9|    490.2|          0|          0|        480|
05/26/2010|17:05:01|      3385|ora_p154_dwprp  |oracle          |  14.63|   5.0|    304.1|          0|          0|        304|
05/26/2010|17:05:01|      2153|ora_p156_dwprp  |oracle          |   3.07|  20.3|   1175.0|          0|          0|       1234|
05/26/2010|17:05:01|      5499|ora_p005_dacp   |oracle          |   3.16|  25.6|    545.6|          0|          0|       1556|
05/26/2010|17:05:01|     17664|ora_p434_dwprp  |oracle          |   1.01|  33.7|   2028.0|          0|          0|       2048|
05/26/2010|17:05:01|     11411|ora_p078_dwprp  |oracle          |  69.39|  22.4|   1349.5|          0|          0|       1361|
05/26/2010|17:05:01|     26544|ora_p117_dwprp  |oracle          |   0.01|   5.6|    337.3|          0|          0|        340|
05/26/2010|17:05:01|     11465|ora_p098_dwprp  |oracle          |  12.37| 214.2|  12902.4|          0|          0|      13023|
05/26/2010|17:05:01|      9743|ora_p142_dwprp  |oracle          |  38.21|  92.7|   5529.6|          0|          0|       5636|
05/26/2010|17:05:01|     11378|ora_p064_dwprp  |oracle          |  58.98|  19.0|   1147.1|          0|          0|       1155|
05/26/2010|17:05:01|      2761|oracledacp      |oracle          |   1.05|   6.6|    379.3|          0|          0|        401|
05/26/2010|17:05:01|     14923|ora_p161_dwprp  |oracle          |   7.70|   6.7|    404.8|          0|          0|        407|
05/26/2010|17:05:01|     11068|oracledwprp     |oracle          |  22.69| 142.7|   8499.2|          0|          0|       8676|
05/26/2010|17:05:01|     20869|oracledwprp     |oracle          |  17.80| 124.3|   7475.2|          0|          0|       7557|
05/26/2010|17:05:01|      2015|oracledwprp     |oracle          |  22.91| 166.4|   8499.2|          0|          0|      10117|
05/26/2010|17:05:01|     26546|ora_p118_dwprp  |oracle          |   0.01|   6.7|    404.8|          0|          0|        407|
05/26/2010|17:05:01|      1463|ora_p149_dwprp  |oracle          |  13.64| 158.3|   9523.2|          0|          0|       9624|
05/26/2010|17:05:01|      2172|ora_p158_dwprp  |oracle          |   2.13|  17.7|   1023.7|          0|          0|       1076|
05/26/2010|17:05:01|      7370|ora_p152_dwprp  |oracle          |  14.57|   5.0|    304.1|          0|          0|        304|
05/26/2010|17:05:01|     16005|oracledwprp     |oracle          |  10.29| 190.6|  11571.2|          0|          0|      11588|
05/26/2010|17:05:01|      5566|ora_p014_dacp   |oracle          |  10.52| 108.6|   2318.0|          0|          0|       6602|
05/26/2010|17:05:01|      9749|ora_p145_dwprp  |oracle          |  40.98| 159.3|   9523.2|          0|          0|       9685|
05/26/2010|17:05:01|     11431|ora_p086_dwprp  |oracle          |   1.03|  47.0|   2865.7|          0|          0|       2857|
05/26/2010|17:05:01|     11139|ora_ckpt_dwprp  |oracle          |   0.00|   5.9|    186.0|          0|          0|        358|
05/26/2010|17:05:01|     15283|oracledwprp     |oracle          |  15.55|   0.0|      0.0|          0|          0|          0|
05/26/2010|17:05:01|     26587|ora_p123_dwprp  |oracle          |   9.46|  31.8|   1453.9|          0|          0|       1933|
05/26/2010|17:05:01|     11177|ora_p005_dwprp  |oracle          |  19.53|  63.0|   3788.8|          0|          0|       3830|
NOTE: The full raw report cpuio.out is attached to illustrate the complexity of the report.


At this point, the amount of data provided by the report makes human reading challenging. Additional custom tailoring of the standard report will be very useful. The key objectives of this tailoring are:

  1. Group the information by individual processes.
  2. Obtain the average CPU usage of each individual process, to be able to focus attention on the CPU hogs.

Create a per process report

tail -n $(expr $(wc -l cpuio.out | cut -d" " -f 1) - 3 ) cpuio.out | \
awk -F'|' '{print $4}' | sort | uniq > proclist

# Start the report from scratch
> perproc.out

for proc in $(cat proclist)
do
 echo "Processing $proc"
 # Repeat the report header, then append this process's records
 head -n 3 cpuio.out >> perproc.out
 awk -F'|' '{if($4 ~ /^'$proc'/ ) print $0}' cpuio.out > temp
 cat temp >> perproc.out

 # Calculate average CPU usage (field 6 is CPU %)
 lines=$(wc -l temp | cut -d " " -f 1)
 awk -F'|' -v N=$lines -v PROC=$proc \
 'BEGIN {AVGCPU=0} {AVGCPU=AVGCPU + $6} END {printf "AVERAGE                       |%-16s                 |%7.2f\n", PROC, AVGCPU / N }' temp >> perproc.out 
 echo >> perproc.out
done

The following is an example of the custom report perproc.out:

OVPA Export 05/27/10 01:46    Logfile: /var/opt/perf/datafiles/logglob SCOPE/UX C.04.70.000(0) batman
          |        |          |    Process     |      User      |       | Phys | IO Byte |File System|    VM     |  System   |
   Date   |  Time  |   PID    |      Name      |      Name      | CPU % |IO Rt |  Rate   |    IOs    |    IOs    |    IOs    |
05/26/2010|17:09:00|     12530|PatrolAgent     |patmon          |   5.53|   3.6|     18.5|        114|          0|        102|
05/26/2010|17:11:00|     12530|PatrolAgent     |patmon          |   4.54|  28.4|    182.5|       1022|          0|        675|
05/26/2010|17:15:01|     12530|PatrolAgent     |patmon          |   5.15|   4.6|     25.2|        144|          0|        132|
05/26/2010|17:19:01|     12530|PatrolAgent     |patmon          |   7.42|   3.3|     15.6|         95|          0|        101|
05/26/2010|17:21:00|     12530|PatrolAgent     |patmon          |   5.13|  94.3|    569.5|       2907|          0|       2775|
05/26/2010|17:22:00|     12530|PatrolAgent     |patmon          |   2.01|  33.1|    274.9|       1752|          0|        220|
05/26/2010|17:25:02|     12530|PatrolAgent     |patmon          |   4.15|   8.7|     26.3|        124|          0|        398|
05/26/2010|17:29:00|     12530|PatrolAgent     |patmon          |   5.34|   3.3|     17.3|        102|          0|         96|
05/26/2010|17:32:00|     12530|PatrolAgent     |patmon          |   4.26|  21.2|    125.6|        621|          0|        645|
05/26/2010|17:33:00|     12530|PatrolAgent     |patmon          |   4.14|   9.7|     68.2|        471|          0|        114|
05/26/2010|17:34:00|     12530|PatrolAgent     |patmon          |   6.08|   0.8|      4.5|         30|          0|         18|
05/26/2010|17:35:01|     12530|PatrolAgent     |patmon          |   5.20|   3.2|     18.4|         89|          0|        101|
05/26/2010|17:39:00|     12530|PatrolAgent     |patmon          |   5.20|   2.0|     10.8|         54|          0|         66|
05/26/2010|17:40:01|     12530|PatrolAgent     |patmon          |   2.04|   5.1|     16.8|         72|          0|        234|
05/26/2010|17:41:00|     12530|PatrolAgent     |patmon          |   4.14|  10.1|     44.1|        162|          0|        444|
05/26/2010|17:42:00|     12530|PatrolAgent     |patmon          |   3.16|  69.6|    442.6|       2436|          0|       1802|
05/26/2010|17:43:00|     12530|PatrolAgent     |patmon          |   3.10|  15.3|    120.4|        859|          0|         53|
05/26/2010|17:44:00|     12530|PatrolAgent     |patmon          |   5.09|   3.0|     13.1|         60|          0|        121|
AVERAGE                       |PatrolAgent                      |   4.54

OVPA Export 05/27/10 01:46    Logfile: /var/opt/perf/datafiles/logglob SCOPE/UX C.04.70.000(0) batman
          |        |          |    Process     |      User      |       | Phys | IO Byte |File System|    VM     |  System   |
   Date   |  Time  |   PID    |      Name      |      Name      | CPU % |IO Rt |  Rate   |    IOs    |    IOs    |    IOs    |
05/26/2010|17:11:00|      7074|bdf             |patmon          |   1.03|  66.3|   1274.8|          0|          0|       3964|
05/26/2010|17:19:01|     25966|bdf             |patmon          |   1.10|  29.5|    786.2|       1701|          0|         59|
05/26/2010|17:32:00|     11937|bdf             |root            |   1.05|  26.2|    697.5|       1483|          0|         83|
AVERAGE                       |bdf                              |   1.06

OVPA Export 05/27/10 01:46    Logfile: /var/opt/perf/datafiles/logglob SCOPE/UX C.04.70.000(0) batman
          |        |          |    Process     |      User      |       | Phys | IO Byte |File System|    VM     |  System   |
   Date   |  Time  |   PID    |      Name      |      Name      | CPU % |IO Rt |  Rate   |    IOs    |    IOs    |    IOs    |
05/26/2010|17:16:00|     21646|bgsagent        |bgs             |   2.08|  16.6|    413.8|        125|        870|          0|
05/26/2010|17:31:00|     21646|bgsagent        |bgs             |   2.08|  15.6|    363.9|        120|        815|          6|
AVERAGE                       |bgsagent                         |   2.08
NOTE: Complete perproc.out report attached.


Create Average CPU Usage Report

Once we have a per-process resource utilization report, the next step is to concentrate on the major resource hogs. The simplest way is to order the per-process averages. For example, the top 24 processes:

# awk -F'|' '/AVERAGE/ {printf "%s %5.2f\n", $2, $3 }' perproc.out | sort -nrk 2,2 | head -n 24
ora_p065_dwprp                    79.68
ora_p078_dwprp                    79.27
ora_p077_dwprp                    79.19
ora_p076_dwprp                    78.65
ora_p067_dwprp                    78.43
ora_p066_dwprp                    78.24
ora_p064_dwprp                    77.66
ora_p079_dwprp                    77.42
ora_p191_dwprp                    73.56
ora_p188_dwprp                    72.09
ora_p190_dwprp                    70.89
ora_p189_dwprp                    70.69
ora_p186_dwprp                    70.69
ora_p187_dwprp                    70.65
ora_p184_dwprp                    70.37
ora_p185_dwprp                    69.73
sympd                             59.17
mhragent                          50.55
guard_stap                        49.48
ora_p144_dwprp                    39.08
ora_p027_dwprp                    38.55
coda                              37.77
ora_p143_dwprp                    36.89
ora_j008_dwprp                    36.82
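
The sixteen top parallel-query slaves alone account for roughly twelve fully busy CPUs. A back-of-the-envelope check, with the averages copied from the list above (file name illustrative):

```shell
# Averages of the top 16 ora_p* slaves, copied from the sorted report
cat > top16.dat << 'EOF'
ora_p065_dwprp                    79.68
ora_p078_dwprp                    79.27
ora_p077_dwprp                    79.19
ora_p076_dwprp                    78.65
ora_p067_dwprp                    78.43
ora_p066_dwprp                    78.24
ora_p064_dwprp                    77.66
ora_p079_dwprp                    77.42
ora_p191_dwprp                    73.56
ora_p188_dwprp                    72.09
ora_p190_dwprp                    70.89
ora_p189_dwprp                    70.69
ora_p186_dwprp                    70.69
ora_p187_dwprp                    70.65
ora_p184_dwprp                    70.37
ora_p185_dwprp                    69.73
EOF

# Sum the CPU % averages and express the total in whole CPUs
awk '/^ora_p/ {SUM += $2} END {printf "%.1f CPUs\n", SUM/100}' top16.dat
```

This prints `12.0 CPUs`, which gives a concrete sense of how much of the machine these slaves were holding during the window.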

Once the top CPU hogs are established, we can review the actual process historical execution:

OVPA Export 05/27/10 01:46    Logfile: /var/opt/perf/datafiles/logglob SCOPE/UX C.04.70.000(0) batman
          |        |          |    Process     |      User      |       | Phys | IO Byte |File System|    VM     |  System   |
   Date   |  Time  |   PID    |      Name      |      Name      | CPU % |IO Rt |  Rate   |    IOs    |    IOs    |    IOs    |
05/26/2010|17:06:00|     11380|ora_p065_dwprp  |oracle          |  90.21|  29.0|   1758.1|          0|          0|       1725|
05/26/2010|17:07:00|     11380|ora_p065_dwprp  |oracle          |  72.78|  24.3|   1467.8|          0|          0|       1467|
05/26/2010|17:08:00|     11380|ora_p065_dwprp  |oracle          |  69.89|  23.4|   1410.3|          0|          0|       1404|
05/26/2010|17:09:00|     11380|ora_p065_dwprp  |oracle          |  82.40|  25.7|   1550.9|          0|          0|       1544|
05/26/2010|17:10:01|     11380|ora_p065_dwprp  |oracle          |  53.55|  17.8|   1071.6|          0|          0|       1068|
05/26/2010|17:11:00|     11380|ora_p065_dwprp  |oracle          |  78.52|  25.4|   1529.3|          0|          0|       1518|
05/26/2010|17:12:00|     11380|ora_p065_dwprp  |oracle          |  82.50|  27.2|   1638.8|          0|          0|       1640|
05/26/2010|17:13:00|     11380|ora_p065_dwprp  |oracle          |  76.47|  24.4|   1470.4|          0|          0|       1466|
05/26/2010|17:14:00|     11380|ora_p065_dwprp  |oracle          |  84.78|  27.3|   1669.2|          0|          0|       1648|
05/26/2010|17:15:01|     11380|ora_p065_dwprp  |oracle          |  91.19|  32.3|   1947.5|          0|          0|       1947|
05/26/2010|17:16:00|     11380|ora_p065_dwprp  |oracle          |  76.65|  24.7|   1494.2|          0|          0|       1482|
05/26/2010|17:17:00|     11380|ora_p065_dwprp  |oracle          |  69.07|  22.0|   1338.0|          0|          0|       1333|
05/26/2010|17:18:01|     11380|ora_p065_dwprp  |oracle          |  77.72|  24.8|   1494.3|          0|          0|       1505|
05/26/2010|17:19:01|     11380|ora_p065_dwprp  |oracle          |  84.02|  27.2|   1639.7|          0|          0|       1623|
05/26/2010|17:20:01|     11380|ora_p065_dwprp  |oracle          |  84.62|  27.3|   1644.4|          0|          0|       1632|
05/26/2010|17:21:00|     11380|ora_p065_dwprp  |oracle          |  88.72|  30.1|   1813.5|          0|          0|       1812|
05/26/2010|17:22:00|     11380|ora_p065_dwprp  |oracle          |  86.67|  32.7|   1973.7|          0|          0|       1948|
05/26/2010|17:23:00|     11380|ora_p065_dwprp  |oracle          |  69.76|  24.9|   1520.7|          0|          0|       1518|
05/26/2010|17:24:00|     11380|ora_p065_dwprp  |oracle          |  81.26|  26.6|   1601.5|          0|          0|       1606|
05/26/2010|17:25:02|     11380|ora_p065_dwprp  |oracle          |  80.86|  26.2|   1596.7|          0|          0|       1558|
05/26/2010|17:26:00|     11380|ora_p065_dwprp  |oracle          |  80.07|  31.2|   1888.7|          0|          0|       1887|
05/26/2010|17:27:00|     11380|ora_p065_dwprp  |oracle          |  74.90|  26.0|   1587.2|          0|          0|       1580|
05/26/2010|17:28:00|     11380|ora_p065_dwprp  |oracle          |  76.26|  34.8|   2097.3|          0|          0|       2094|
05/26/2010|17:29:00|     11380|ora_p065_dwprp  |oracle          |  69.40|  22.7|   1367.3|          0|          0|       1366|
05/26/2010|17:30:01|     11380|ora_p065_dwprp  |oracle          |  66.47|  21.1|   1275.1|          0|          0|       1263|
05/26/2010|17:31:00|     11380|ora_p065_dwprp  |oracle          |  58.85|  19.8|   1200.1|          0|          0|       1195|
05/26/2010|17:32:00|     11380|ora_p065_dwprp  |oracle          |  72.82|  31.4|   1895.8|          0|          0|       1877|
05/26/2010|17:33:00|     11380|ora_p065_dwprp  |oracle          |  83.68|  26.4|   1607.4|          0|          0|       1597|
05/26/2010|17:34:00|     11380|ora_p065_dwprp  |oracle          |  74.01|  24.3|   1468.0|          0|          0|       1458|
05/26/2010|17:35:01|     11380|ora_p065_dwprp  |oracle          |  73.90|  27.7|   1684.3|          0|          0|       1656|
05/26/2010|17:36:00|     11380|ora_p065_dwprp  |oracle          |  93.10|  29.7|   1790.3|          0|          0|       1805|
05/26/2010|17:37:00|     11380|ora_p065_dwprp  |oracle          |  89.90|  28.7|   1736.9|          0|          0|       1727|
05/26/2010|17:38:00|     11380|ora_p065_dwprp  |oracle          |  93.46|  29.9|   1803.5|          0|          0|       1794|
05/26/2010|17:39:00|     11380|ora_p065_dwprp  |oracle          |  84.69|  27.0|   1628.8|          0|          0|       1620|
05/26/2010|17:40:01|     11380|ora_p065_dwprp  |oracle          |  72.73|  23.5|   1425.7|          0|          0|       1414|
05/26/2010|17:41:00|     11380|ora_p065_dwprp  |oracle          |  95.68|  30.3|   1826.1|          0|          0|       1821|
05/26/2010|17:42:00|     11380|ora_p065_dwprp  |oracle          |  93.32|  29.6|   1791.1|          0|          0|       1802|
05/26/2010|17:43:00|     11380|ora_p065_dwprp  |oracle          |  92.29|  32.6|   1965.1|          0|          0|       1946|
05/26/2010|17:44:00|     11380|ora_p065_dwprp  |oracle          |  80.33|  29.8|   1811.2|          0|          0|       1802|
AVERAGE                       |ora_p065_dwprp                   |  79.68

Now that we know that the Oracle processes are the major players, a complementary data point is the total number of Oracle processes running at any given time, to see whether additional processes are being spawned or whether, on the contrary, a flat number of processes consumes additional CPU resources during the spike.

tail -n $(expr $(wc -l cpuio.out | cut -d" " -f 1) - 3 ) cpuio.out | \
awk -F'|' '{if ($4 ~ /^ora_/) printf "%-8s|%s\n", $2, $4}' | \
cut -d"|" -f 1 > temp

awk -v TIME="$(head -n 1 temp)" 'BEGIN {COUNT=0} {if ($1 ~ TIME) COUNT=COUNT+1; else {printf "%s|%d\n", TIME, COUNT; TIME=$1; COUNT=1}} END {printf "%s|%d\n", TIME, COUNT}' temp
17:05:01|52
17:06:00|108
17:07:00|100
17:08:00|105
17:09:00|95
17:10:01|80
17:11:00|83
17:12:00|84
17:13:00|104
17:14:00|81
17:15:01|76
17:16:00|99
17:17:00|129
17:18:01|114
17:19:01|115
17:20:01|110
17:21:00|104
17:22:00|105
17:23:00|111
17:24:00|82
17:25:02|100
17:26:00|96
17:27:00|106
17:28:00|95
17:29:00|87
17:30:01|112
17:31:00|128
17:32:00|128
17:33:00|104
17:34:00|116
17:35:01|104
17:36:00|104
17:37:00|94
17:38:00|103
17:39:00|96
17:40:01|73
17:41:00|88
17:42:00|67
17:43:00|70

Contrast this with the global process count at each timestamp:

tail -n $(expr $(wc -l cpuio.out | cut -d" " -f 1) - 3 ) cpuio.out | \
awk -F'|' '{printf "%-8s|%s\n", $2, $4}' | \
cut -d"|" -f 1 > temp

awk -v TIME="$(head -n 1 temp)" 'BEGIN {COUNT=0} {if ($1 ~ TIME) COUNT=COUNT+1; else {printf "%s|%d\n", TIME, COUNT; TIME=$1; COUNT=1}} END {printf "%s|%d\n", TIME, COUNT}' temp
17:05:01|68
17:06:00|136
17:07:00|142
17:08:00|139
17:09:00|131
17:10:01|115
17:11:00|145
17:12:00|165
17:13:00|136
17:14:00|111
17:15:01|117
17:16:00|128
17:17:00|159
17:18:01|146
17:19:01|147
17:20:01|139
17:21:00|136
17:22:00|134
17:23:00|144
17:24:00|110
17:25:02|132
17:26:00|123
17:27:00|135
17:28:00|122
17:29:00|118
17:30:01|149
17:31:00|164
17:32:00|164
17:33:00|135
17:34:00|142
17:35:01|137
17:36:00|136
17:37:00|123
17:38:00|131
17:39:00|124
17:40:01|107
17:41:00|120
17:42:00|97
17:43:00|100

There is a strong correlation between the number of running Oracle processes and the overall CPU utilization. That was expected, but a demonstration was required.
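
The correlation can also be quantified. A minimal sketch computing Pearson's r in awk over the first four timestamps of the two counts above (file name illustrative):

```shell
# Time, Oracle process count, global process count (from the two runs above)
cat > counts.dat << 'EOF'
17:05:01 52 68
17:06:00 108 136
17:07:00 100 142
17:08:00 105 139
EOF

# Accumulate the sums needed for the Pearson correlation coefficient
awk '{N++; X+=$2; Y+=$3; XX+=$2*$2; YY+=$3*$3; XY+=$2*$3}
     END {R = (N*XY - X*Y) / sqrt((N*XX - X*X) * (N*YY - Y*Y));
          printf "r = %.3f\n", R}' counts.dat
```

On this four-sample excerpt the coefficient is r = 0.981; running the same one-liner over the full series would make the "strong correlation" claim precise.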

Solution

Reference

Author