A few weeks ago I was confronted with a weird real-life system performance issue which, I believe, is worth sharing in a blog post. It started with a customer complaint about a dramatic slowdown of inserts of orders and products, to the point that the application in its entirety appeared frozen. A very quick look at ASH showed no particular wait events nor any new top SQL that I was not already aware of. I know this application well enough to recognize its top queries by their sql_id. This is why my first impression was that there was no particular unusual wait event to point out, only a high CPU consumption.
But the Load Profile of the corresponding one-hour AWR report looked very odd:
Load Profile                    Per Second   Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~~~~            ---------------   ---------------  ---------  ---------
             DB Time(s):              95.2               4.7       0.02       0.02
              DB CPU(s):               7.7               0.1       0.00       0.00
      Background CPU(s):               0.2               0.0       0.00       0.00
      Redo size (bytes):         439,216.3          18,666.4
  Logical read (blocks):       1,774,174.9          75,401.1
          Block changes:           2,465.0             104.8
 Physical read (blocks):           3,609.2             153.4
Physical write (blocks):             135.2               5.8
       Read IO requests:           2,436.8             103.6
      Write IO requests:              53.6               2.3
           Read IO (MB):              28.2               1.2
          Write IO (MB):               1.1               0.0
           IM scan rows:               0.0               0.0
Session Logical Read IM:
             User calls:           5,423.4             230.5
           Parses (SQL):             244.4              10.4
      Hard parses (SQL):               0.5               0.0
     SQL Work Area (MB):              28.3               1.2
                 Logons:               0.7               0.0
         Executes (SQL):           4,470.6             190.0
              Rollbacks:               0.2               0.0
           Transactions:              23.5
Have you ever seen such a dramatic Load Profile? The foreground sessions collectively spend 95.2 seconds of database time for each wall-clock second, and all of those sessions were either actively working or waiting. They burned 7.7 seconds of CPU per wall-clock second, which for a machine of 24 cores already points to a CPU-constrained application. The overall workload is generating 1.77 million logical block reads per second which, assuming the default 8 KB block size, represents roughly 13.5 GB of logical reads per second. And only a tiny fraction of those blocks are actually changed (2,465/1,774,174 ≈ 0.14%).
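For illustration, here is a minimal sketch of the arithmetic behind those figures; note that the 8 KB block size is an assumption, since this report excerpt does not show db_block_size:

# Minimal sketch of the Load Profile arithmetic (illustrative only, not AWR output).
BLOCK_SIZE_BYTES = 8 * 1024          # assumed default Oracle block size
CPU_CORES = 24                       # from the Host CPU section of the same report

db_time_per_sec = 95.2               # DB Time(s) per second
db_cpu_per_sec = 7.7                 # DB CPU(s) per second
logical_reads_per_sec = 1_774_174.9  # Logical read (blocks) per second
block_changes_per_sec = 2_465.0      # Block changes per second

# 95 average active sessions (working or waiting) competing for 24 cores
print(f"Average active sessions: {db_time_per_sec:.1f} on {CPU_CORES} cores")

# Share of DB time that is actual CPU consumption
print(f"CPU share of DB time   : {db_cpu_per_sec / db_time_per_sec:.0%}")

# Logical read volume expressed in GB per second
gb_per_sec = logical_reads_per_sec * BLOCK_SIZE_BYTES / 1024**3
print(f"Logical read volume    : {gb_per_sec:.1f} GB/s")

# Fraction of read blocks that are actually changed
print(f"Changed block ratio    : {block_changes_per_sec / logical_reads_per_sec:.2%}")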
I was really intrigued by this abnormal situation since I hadn't spotted any unusual wait event or SQL query. While I was drilling down through the different sections of the AWR report, cross-checking the key performance indicators they deliver, something in the Top 10 Foreground Events section caught my attention:
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                 Total Wait       Avg   % DB   Wait
Event                                  Waits     Time (sec)  Wait(ms)   time   Class
------------------------------   -----------    ----------  --------  -----   --------
db file sequential read            2,813,888           32K     11.37    9.3   User I/O
DB CPU                                                27.9K              8.1
db file parallel read                278,494         10.8K     38.67    3.1   User I/O
read by other session                444,473         4,785     10.77    1.4   User I/O
direct path read                     311,471       4,718.6     15.15    1.4   User I/O
gc cr block 2-way                    577,081       2,366.4      4.10     .7   Cluster
log file sync                         89,040       2,254.7     25.32     .7   Commit
db file scattered read               138,798       2,136.8     15.40     .6   User I/O
gc current block 2-way               368,907       1,994.1      5.41     .6   Cluster
gc buffer busy acquire               329,900       1,776.8      5.39     .5   Cluster
- The % DB time column doesn't sum up to 100% (or anywhere near it), totalling only 26.4% (see the quick check after this list).
- The disk is servicing physical I/O very slowly, with dramatic average wait times.
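For the first point, the following small sketch simply sums the % DB time column of the table above (figures copied from the Top 10 list; an illustration, not Oracle output):

# How much of the DB time do the Top 10 foreground events actually explain?
pct_db_time = {
    "db file sequential read": 9.3,
    "DB CPU": 8.1,
    "db file parallel read": 3.1,
    "read by other session": 1.4,
    "direct path read": 1.4,
    "gc cr block 2-way": 0.7,
    "log file sync": 0.7,
    "db file scattered read": 0.6,
    "gc current block 2-way": 0.6,
    "gc buffer busy acquire": 0.5,
}

accounted = sum(pct_db_time.values())
print(f"Accounted for by the Top 10 events: {accounted:.1f}%")       # 26.4%
print(f"Unaccounted DB time               : {100 - accounted:.1f}%")  # 73.6%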
When the % DB time column doesn't add up to nearly 100%, it is generally a symptom of a CPU-bound application: the missing DB time is typically spent waiting in the OS run queue for a CPU, which is not instrumented as a wait event. Indeed, this is confirmed in the Host CPU section of the same AWR report:
Host CPU
~~~~~~~~                Load Average
 CPUs Cores Sockets     Begin       End     %User   %System      %WIO     %Idle
----- ----- ------- --------- --------- --------- --------- --------- ---------
   48    24       4     55.57    219.13      92.0       2.3       2.3       5.0
The server was already CPU-bound at the beginning of the snapshot (load average of 55.57) and completely overwhelmed at the end (219.13), for a machine with only 24 cores. The DB CPU load is indeed very high:
DB CPU Load = DB CPU time/Elapsed Time = 27900/3600 = 7.7
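This figure can be reproduced from the raw report numbers (a small illustrative computation; the 27.9K seconds of DB CPU come from the Top 10 foreground events table above):

# DB CPU load: CPU seconds consumed by foreground sessions per elapsed second.
db_cpu_sec = 27_900        # DB CPU from the Top 10 foreground events (27.9K seconds)
elapsed_sec = 60.23 * 60   # snapshot elapsed time: 60.23 minutes
cores = 24

print(f"DB CPU load: {db_cpu_sec / elapsed_sec:.1f} of {cores} cores")  # ~7.7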
And so is the DB time load:
              Snap Id      Snap Time       Sessions Curs/Sess  Instances
            --------- ------------------- --------- --------- ----------
Begin Snap:     28812  06-Feb-18 13:00:01       365      12.5          2
  End Snap:     28813  06-Feb-18 14:00:15       421      12.5          2
   Elapsed:               60.23 (mins)
   DB Time:            5,734.18 (mins)

DB time Load = DB time/Elapsed Time = 5,734.18/60.23 = 95.2
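The same kind of back-of-the-envelope calculation for the DB time load, together with the OS load average per core (again an illustrative sketch rather than anything produced by Oracle):

# DB time load: average number of active sessions over the snapshot period.
db_time_sec = 5_734.18 * 60   # DB Time : 5,734.18 (mins)
elapsed_sec = 60.23 * 60      # Elapsed : 60.23 (mins)
cores = 24

print(f"DB time load: {db_time_sec / elapsed_sec:.1f} average active sessions "
      f"on {cores} cores")    # ~95.2

# The OS view of the same pressure: runnable tasks per core at snapshot begin/end.
for label, load_avg in (("begin", 55.57), ("end", 219.13)):
    print(f"Load average at {label}: {load_avg / cores:.1f} runnable tasks per core")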
So what else?
I asked the storage team to check the performance of the underlying disks, and they indeed confirmed the dramatic disk latency. But the fundamental and legitimate question was: what had changed, compared to the previous day when the application was functioning normally, that might explain this performance deterioration?
While I was trying to figure out a solution, I asked for help from a DBA who is the primary owner of another critical application. To my surprise, this DBA told me that one of his applications was also suffering from a similar performance deterioration, and that it was hosted on the same server as the application I was trying to troubleshoot. What a coincidence.
A couple of minutes later we found that a critical batch had been triggered in the other DBA's application. This batch was extensively executing a parallel query with a very high degree of parallelism (DOP), which was consuming the whole SAN bandwidth. Needless to say, when this batch job was stopped, the initial problem I was trying to fix vanished by itself and the database returned to its normal functioning.
Bottom Line
When you are troubleshooting an application that seems completely frozen, where the disk is servicing the database very badly, and where the % DB time column of the AWR Top 10 Foreground Events section does not sum up to anywhere near 100%, then you are probably CPU bound. In such a case, don't forget to consider, among the panoply of items on your checklist, whether other applications hosted on the same server are stressing the disk with very aggressive parallel queries.
Thanks for sharing. Looking forward to learning more from you.
Comment by samba kumar b — March 22, 2018 @ 1:00 pm
If the DB Time(s) value is more than 24 (for a machine of 24 cores) then it's a critical situation. But anything less than 24 should be normal, right?
I have a question for you:
Is your CPU_COUNT set to the number of cores or the number of threads?
What is your PARALLEL_THREADS_PER_CPU set to?
Comment by Sangram Dash — June 5, 2018 @ 6:14 am