Mohamed Houri’s Oracle Notes

March 1, 2018

Diagnosing the past : CPU bound

Filed under: Oracle — hourim @ 6:24 pm

Few weeks ago I have been confronted to a weird real life system performance issue which, I believe is worth a sharing blog post. It started by a customer complaint about a dramatic slow in inserts of commands and products so that the application in its entirety appeared frozen. A very quick look-up over ASH shows no particular wait events or any new TOP SQL that I am not aware of. I know this application to such an extent that I can recognize its top queries via their sql_id. This is why my first impression was that there is no particular unusual wait event to point out except a high CPU consumption.

But the Load Profile of the corresponding one hour AWR report looks very odd:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):              95.2               4.7      0.02     0.02
              DB CPU(s):               7.7               0.1      0.00     0.00
      Background CPU(s):               0.2               0.0      0.00     0.00
      Redo size (bytes):         439,216.3          18,666.4
  Logical read (blocks):       1,774,174.9          75,401,1
          Block changes:           2,465.0             104.8
 Physical read (blocks):           3,609.2             153.4
Physical write (blocks):             135.2               5.8
       Read IO requests:           2,436.8             103.6
      Write IO requests:              53.6               2.3
           Read IO (MB):              28.2               1.2
          Write IO (MB):               1.1               0.0
           IM scan rows:               0.0               0.0
Session Logical Read IM:               
             User calls:           5,423.4             230.5
           Parses (SQL):             244.4              10.4
      Hard parses (SQL):               0.5               0.0
     SQL Work Area (MB):              28.3               1.2
                 Logons:               0.7               0.0
         Executes (SQL):           4,470.6             190.0
              Rollbacks:               0.2               0.0
           Transactions:              23.5

Have you ever seen such a dramatic Load Profile? The foreground sessions spend collectively 95 seconds of time for each wall clock second. All those sessions were either actively working or waiting. They burned 7.7 of CPU per wall clock second. For a machine of 24 cores this seems to be a CPU constrained application. The overall workload is generating 1,77 million of logical blocks read per second which represent a rate of approximatively 6GB of logical blocks read per hour. And only a tiny percentage of those block reads are changed (2,465/1,774,174 = 0,13%).

I was really intrigued by this abnormal situation since I didn’t spot any unusual wait event or SQL query. While I was drilling down the different sections of the AWR report, cross-checking properly the performance key indicators they deliver, something in the TOP 10 Foreground Events captured my attention:

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait      Avg   % DB    Wait
Event                                Waits Time (sec)    Wait(ms)  time Class
------------------------------ ----------- ---------- --------- ------ --------
db file sequential read          2,813,888        32K     11.37    9.3 User I/O
DB CPU                                          27,9K              8.1
db file parallel read              278,494      10,8K     38.67    3.1 User I/O
read by other session              444,473       4785     10.77    1.4 User I/O
direct path read                   311,471     4718,6     15.15    1.4 User I/O
gc cr block 2-way                  577,081     2366,4      4.10     .7 Cluster
log file sync                       89,040     2254,7     25.32     .7 Commit
db file scattered read              138,798    2136,8     15.40     .6 User I/O
gc current block 2-way              368,907    1994,1      5.41     .6 Cluster
gc buffer busy acquire              329,900    1776,8      5.39     .5 Cluster
  • The % DB time column doesn’t sum up to 100% (or near this number) totalling only 26,4%
  • The disk is very slow servicing the physical I/O and background events with a dramatic average time wait.

When the %DB time column doesn’t add up to nearly 100% it generally indicates a symptom of a CPU bounded application. Indeed this is confirmed a little bit upward in the same AWR report via the HOST CPU section:

Host CPU
~~~~~~~~                  Load Average
 CPUs Cores Sockets     Begin       End     %User   %System  %WIO  %Idle
----- ----- ------- --------- --------- --------- --------- ------ -----
   48    24       4     55.57   219.13       92.0       2.3    2.3   5.0

The application was already over CPU bounded at the beginning (55) and completely frozen at the end of the load (219 CPU). For a machine with 24 cores, the DB CPU load is indeed very high

 DB CPU Load = DB CPU time/Elapsed Time
            = 27900/3600 = 7.7

And so is the DB time load


              Snap Id      Snap Time      Sessions Curs/Sess  Instance
            --------- ------------------- -------- --------- ---------
Begin Snap:     28812 06-Feb-18 13:00:01       365      12.5         2
  End Snap:     28813 06-Feb-18 14:00:15       421      12.5         2
   Elapsed:               60.23 (mins)
   DB Time:            5,734.18 (mins)

DB time Load = DB time/Elapsed Time
              = 5734.1/60.23 = 95.2

So what else?

I asked the storage team to check the performance of the underlying disk. The confirmation of the dramatic disk latency has been obviously confirmed by the storage team. But the fundamental and legitimate question is what has changed per regards to the yesterday normal functioning of the application that might explain this performance deterioration?

While I was trying to figure out a solution I asked a help from a DBA who is the primary owner of another critical application. To my surprise this DBA told me that one of his applications is also suffering from similar performance deterioration and that this application is hosted by the same server as the one hosting the application I am trying to troubleshoot. What a coincidence.

A couple of minutes later we found that a critical batch has been triggered in the other DBA application. This batch was extensively executing a parallel query using a very high degree of parallelism (DOP) which was consuming the whole SAN bandwidth. Needless to say that when this batch job has been stopped the initial problem I was trying to fix vanishes by itself and the database returned to its normal functioning

Bottom Line

When you are troubleshooting an application which seems to be completely frozen and where the disk is very badly servicing the database and where the DB time % column of the AWR Top 10 foreground Events section is not summing up to 100% then you are probably CPU bounded. In such case, don’t forget to consider, among the panoply of your check list, whether others applications hosted in the same server are not stressing the disk with very aggressive parallel queries.

Advertisements

1 Comment »

  1. Thanks for sharing . Look forward to learn more from you

    Comment by samba kumar b — March 22, 2018 @ 1:00 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

Blog at WordPress.com.

Tony's Oracle Tips

Tony Hasler's light hearted approach to learning about Oracle

Richard Foote's Oracle Blog

Focusing Specifically On Oracle Indexes, Database Administration and Some Great Music

Hatem Mahmoud Oracle's blog

Just another Oracle blog : Database topics and techniques

Mohamed Houri’s Oracle Notes

Qui se conçoit bien s’énonce clairement

Oracle Diagnostician

Performance troubleshooting as exact science

Raheel's Blog

Things I have learnt as Oracle DBA

Coskan's Approach to Oracle

What I learned about Oracle

So Many Oracle Manuals, So Little Time

“Books to the ceiling, Books to the sky, My pile of books is a mile high. How I love them! How I need them! I'll have a long beard by the time I read them”—Lobel, Arnold. Whiskers and Rhymes. William Morrow & Co, 1988.

EU Careers info

Your career in the European Union

Carlos Sierra's Tools and Tips

Tools and Tips for Oracle Performance and SQL Tuning

Oracle Scratchpad

Just another Oracle weblog

OraStory

Dominic Brooks on Oracle Performance, Tuning, Data Quality & Sensible Design ... (Now with added Sets Appeal)

%d bloggers like this: