We received a complaint from a customer about a weekly scheduled overnight Oracle job that took 14 hours to complete during its last execution. The same job had been running smoothly in about 14 minutes for the past several months. So what went wrong this time?
A 30-minute AWR report, collected while that last job execution was in flight, looks like this:
              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     16178 21-May-12 07:00:09       292       2.9
  End Snap:     16179 21-May-12 07:30:16       294       2.9
   Elapsed:               30.11 (mins)
   DB Time:               48.68 (mins)

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:    10,752M    10,752M  Std Block Size:        8K
           Shared Pool Size:     7,744M     7,744M      Log Buffer:   53,972K

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                1.6                0.8       0.01       0.01
       DB CPU(s):                0.9                0.4       0.01       0.00
       Redo size:           36,781.3           18,374.5
   Logical reads:           86,607.6           43,265.8
   Block changes:              172.5               86.2
  Physical reads:            2,590.0            1,293.9
 Physical writes:               26.0               13.0
      User calls:              223.3              111.5
          Parses:               92.0               46.0
     Hard parses:                2.8                1.4
W/A MB processed:                1.6                0.8
          Logons:                0.2                0.1
        Executes:              108.3               54.1
       Rollbacks:                0.0                0.0
    Transactions:                2.0

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:   99.69    In-memory Sort %:  100.00
            Library Hit   %:   93.95        Soft Parse %:   96.91
         Execute to Parse %:   15.02         Latch Hit %:   97.14
Parse CPU to Parse Elapsd %:   82.80     % Non-Parse CPU:   97.99

 Shared Pool Statistics        Begin    End
                              ------ ------
             Memory Usage %:   76.47  75.97
    % SQL with executions>1:   98.62  97.70
  % Memory for SQL w/exec>1:   95.92  94.74

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                            1,561          53.4
db file sequential read             275,799         691      3   23.7 User I/O
direct path read                     32,996         560     17   19.2 User I/O
db file parallel read                   720         124    173    4.3 User I/O
db file scattered read                8,488          20      2     .7 User I/O
So what can we spot from this?
The direct path read wait event looks suspect: 32,996 waits at a catastrophic average of 17 ms per wait. That represents about 9 minutes: (32,996 * 17/1000)/60. The db file sequential read waits, in contrast, may well be a necessary step for the smooth accomplishment of the desired work in this situation. This is why I was quite confident that direct path read was the wait event on which effort should be concentrated here.
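As a quick sanity check, the 9-minute figure and the 19.2% contribution can both be reproduced from the report's own numbers (waits and average wait from the Top 5 section, DB Time from the snapshot header):

```python
waits, avg_wait_ms = 32_996, 17        # direct path read, from the Top 5 above
db_time_s = 48.68 * 60                 # DB Time from the snapshot header

wait_s = waits * avg_wait_ms / 1000    # ~561 s, i.e. ~9.3 minutes
pct_db_time = wait_s / db_time_s * 100 # ~19.2%, matching the report

print(f"{wait_s / 60:.1f} min, {pct_db_time:.1f}% of DB time")
```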
Before embarking on this direct path read wait event, a few words about CPU consumption. The AWR elapsed time is 30.11 minutes (1,806.6 seconds) and the DB CPU time reported in the Top 5 events is 1,561 seconds. Dividing CPU time by elapsed time (1,561/1,806.6) gives about 0.86 CPUs busy on average. Since this server has 32 CPU cores, the database server was not CPU bound: (0.86/32) = about 2.7% CPU utilisation.
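The same back-of-the-envelope check in a few lines (the 32 cores come from the customer's server configuration mentioned above; average active CPUs is DB CPU time divided by elapsed time):

```python
elapsed_s = 30.11 * 60        # AWR snapshot interval: 30.11 minutes
db_cpu_s = 1_561              # DB CPU from the Top 5 events
n_cores = 32

avg_active_cpus = db_cpu_s / elapsed_s        # ~0.86 CPUs busy on average
cpu_utilisation = avg_active_cpus / n_cores   # ~2.7% of the host's CPU

print(f"{avg_active_cpus:.2f} CPUs -> {cpu_utilisation:.1%} utilisation")
```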
Since it was confirmed that neither the PL/SQL code nor the data volume had changed, I asked for the previous week's AWR report, taken at the same time the job was running. The corresponding Top 5 Timed Foreground Events are shown below:
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                            1,082          54.0
db file sequential read             717,957         936      1   46.7 User I/O
direct path read                        371           6     15   0.28 User I/O
db file parallel read                 3,347           4    173   0.18 User I/O
db file scattered read                2,852           3      2   0.14 User I/O
From this it is easy to see that, during a normal job execution, the direct path read contribution to the total database time is negligible: 0.28% (about 0.09 minutes), while it contributed 19.2% (about 9 minutes) during the last job execution. So what made the direct path read wait event grow by such a magnitude that it wrecked the performance of the overnight job?
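Putting the two snapshots side by side makes the regression obvious (a small sketch using the direct path read figures from the two Top 5 sections):

```python
def wait_minutes(waits, avg_wait_ms):
    """Total time spent in a wait event, in minutes."""
    return waits * avg_wait_ms / 1000 / 60

normal = wait_minutes(371, 15)       # previous week's run: ~0.09 min
slow = wait_minutes(32_996, 17)      # last (14-hour) run:  ~9.3 min

print(f"normal: {normal:.2f} min, slow: {slow:.2f} min, "
      f"ratio: {slow / normal:.0f}x")
```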
My suspicions were well founded!
The parts of the AWR report that correlate with the direct path read event, and which are worth investigating, are SQL ordered by Reads, Segments by Physical Reads and (in 11gR1 or later) Segments by Direct Physical Reads.
Drilling down into the SQL ordered by Reads section, I found this:
Physical Reads  Executions  Reads per Exec  %Total  Elapsed Time (s)  %CPU  %IO  SQL Id         Module
--------------  ----------  --------------  ------  ----------------  ----  ---  -------------  ------------
     4,164,466           1    4,164,466.00    89.0             811.1  21.6 82.9  d6w4psbcwgt54  TOAD 9.7.2.5
Without even looking at the SQL text behind it, what kind of overnight batch job would report a huge amount of physical reads coming from TOAD? Was this SQL issued from TOAD part of the job? The answer, of course, was no.
So, clearly, someone was running a huge query from TOAD that was doing an enormous amount of direct path reads, consuming CPU and I/O resources normally dedicated to the db file sequential reads of the job.
The Segments by Direct Physical Reads section of the AWR report also shows correlated information:
-> Total Direct Physical Reads: 4,209,804
-> Captured Segments account for 51.7% of Total

           Tablespace                      Subobject  Obj.       Direct
Owner      Name       Object Name         Name       Type        Reads  %Total
---------- ---------- ------------------- ---------- ----- ----------- -------
MHOADM     MHOMH_2011 MH_TABLES           TABLES_SX5 TABLE     349,197    8.29
MHOADM     MHOMH_2010 MH_TABLES           TABLES_SX5 TABLE     345,069    8.20
MHOADM     MHOMH_2009 MH_TABLES           TABLES_SX1 TABLE     275,781    6.55
MHOADM     MHOMH_2010 MH_TABLES           TABLES_SX1 TABLE     272,709    6.48
MHOADM     MHOMH_2011 MH_TABLES           TABLES_SX1 TABLE     265,037    6.30
          -------------------------------------------------------------
Here TABLES_SX5 is the table accessed by the same sql_id, d6w4psbcwgt54, issued from TOAD and reported above.
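The %Total column ties straight back to the header figure; a quick check with the first row's numbers:

```python
total_direct_reads = 4_209_804        # header of the segments section
tables_sx5_2011_reads = 349_197       # MHOMH_2011 / TABLES_SX5 row

pct = tables_sx5_2011_reads / total_direct_reads * 100
print(f"{pct:.2f}%")                  # ~8.29, matching the reported %Total
```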
Finally, it became clear that the job had been competing for resources with an abnormal, unusual SELECT issued from TOAD. The ADDM report that accompanies this AWR report also lists the TOAD sql_id d6w4psbcwgt54 among the Top SQL Statements to be tuned:
Recommendation 1: SQL Tuning
Estimated benefit is .46 active sessions, 28.18% of total activity.
-------------------------------------------------------------------
Action
   Run SQL Tuning Advisor on the SELECT statement with SQL_ID
   "d6w4psbcwgt54".
   Related Object
      SQL statement with SQL_ID d6w4psbcwgt54.
Rationale
   The SQL spent 100% of its database time on CPU, I/O and Cluster waits.
   This part of database time may be improved by the SQL Tuning Advisor.
Rationale
   Database time for this SQL was divided as follows: 100% for SQL
   execution, 0% for parsing, 0% for PL/SQL execution and 0% for Java
   execution.
Rationale
   SQL statement with SQL_ID "d6w4psbcwgt54" was executed 1 times and had
   an average elapsed time of 811 seconds.
Rationale
   At least one execution of the statement ran in parallel.

Recommendation 2: SQL Tuning
Estimated benefit is .19 active sessions, 11.68% of total activity.
Action
   Look at the "Top SQL Statements" finding for SQL statements consuming
   significant I/O on this segment. For example, the SELECT statement
   with SQL_ID "d6w4psbcwgt54" is responsible for 100% of "User I/O" and
   "Cluster" waits for this segment.
Rationale
   The I/O usage statistics for the object are: 9 full object scans,
   277587 physical reads, 0 physical writes and 275781 direct reads.
The ADDM report lends additional credit to my suspicion when it recommends investigating processes running on the host that do not belong to this instance and are consuming a significant amount of virtual memory. ADDM had identified the TOAD query, hadn't it?
Recommendation 1: Host Configuration
Estimated benefit is 1.62 active sessions, 100% of total activity.
------------------------------------------------------------------
Action
   Host operating system was experiencing significant paging but no
   particular root cause could be detected. Investigate processes that
   do not belong to this instance running on the host that are consuming
   significant amount of virtual memory. Also consider adding more
   physical memory to the host.