We received a complaint from a customer about a weekly scheduled overnight Oracle job that took 14 hours to complete during its last execution. The same job had been running smoothly in about 14 minutes for the past several months. So what went wrong this time?
A 30-minute AWR report, collected while that last job execution was in flight, looks like this:
              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     16178 21-May-12 07:00:09       292       2.9
  End Snap:     16179 21-May-12 07:30:16       294       2.9
   Elapsed:               30.11 (mins)
   DB Time:               48.68 (mins)

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:    10,752M    10,752M  Std Block Size:        8K
           Shared Pool Size:     7,744M     7,744M      Log Buffer:   53,972K

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                1.6                0.8       0.01       0.01
       DB CPU(s):                0.9                0.4       0.01       0.00
       Redo size:           36,781.3           18,374.5
   Logical reads:           86,607.6           43,265.8
   Block changes:              172.5               86.2
  Physical reads:            2,590.0            1,293.9
 Physical writes:               26.0               13.0
      User calls:              223.3              111.5
          Parses:               92.0               46.0
     Hard parses:                2.8                1.4
W/A MB processed:                1.6                0.8
          Logons:                0.2                0.1
        Executes:              108.3               54.1
       Rollbacks:                0.0                0.0
    Transactions:                2.0

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:   99.69    In-memory Sort %:  100.00
            Library Hit   %:   93.95        Soft Parse %:   96.91
         Execute to Parse %:   15.02         Latch Hit %:   97.14
Parse CPU to Parse Elapsd %:   82.80     % Non-Parse CPU:   97.99

 Shared Pool Statistics        Begin    End
                              ------ ------
             Memory Usage %:   76.47  75.97
    % SQL with executions>1:   98.62  97.70
  % Memory for SQL w/exec>1:   95.92  94.74

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                            1,561          53.4
db file sequential read             275,799         691      3   23.7 User I/O
direct path read                     32,996         560     17   19.2 User I/O
db file parallel read                   720         124    173    4.3 User I/O
db file scattered read                8,488          20      2     .7 User I/O
So what can we spot from this?
The direct path read wait event looks suspect: 32,996 waits at a catastrophic average of 17 ms per wait. That represents about 9 minutes: (32,996 * 17/1000)/60. The db file sequential read waits, in contrast, may well be a necessary step for the smooth accomplishment of the desired work in this situation. This is why I was quite confident that direct path read was the wait event on which effort should be concentrated here.
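As a quick sanity check, the 9-minute figure and the 19.2% contribution can both be reproduced from the report's own numbers (waits and average wait from the Top 5 section, DB Time from the snapshot header):

```python
waits, avg_wait_ms = 32_996, 17        # direct path read, from the Top 5 above
db_time_s = 48.68 * 60                 # DB Time from the snapshot header

wait_s = waits * avg_wait_ms / 1000    # ~561 s, i.e. ~9.3 minutes
pct_db_time = wait_s / db_time_s * 100 # ~19.2%, matching the report

print(f"{wait_s / 60:.1f} min, {pct_db_time:.1f}% of DB time")
```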
Before embarking on this direct path read wait event, a few words about CPU consumption. The AWR elapsed time is 30.11 minutes (1,806.6 seconds) and the DB CPU time reported in the Top 5 events is 1,561 seconds. Dividing CPU time by elapsed time (1,561/1,806.6) gives about 0.86 CPUs busy on average. Since this server has 32 CPU cores, the database server was not CPU bound: (0.86/32) = about 2.7% CPU utilisation.
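The same back-of-the-envelope check in a few lines (the 32 cores come from the customer's server configuration mentioned above; average active CPUs is DB CPU time divided by elapsed time):

```python
elapsed_s = 30.11 * 60        # AWR snapshot interval: 30.11 minutes
db_cpu_s = 1_561              # DB CPU from the Top 5 events
n_cores = 32

avg_active_cpus = db_cpu_s / elapsed_s        # ~0.86 CPUs busy on average
cpu_utilisation = avg_active_cpus / n_cores   # ~2.7% of the host's CPU

print(f"{avg_active_cpus:.2f} CPUs -> {cpu_utilisation:.1%} utilisation")
```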
Since it was confirmed that neither the PL/SQL code nor the data volume had changed, I asked for the previous week's AWR report, taken at the same time the job was running. The corresponding Top 5 Timed Foreground Events are shown below:
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                            1,082          54.0
db file sequential read             717,957         936      1   46.7 User I/O
direct path read                        371           6     15   0.28 User I/O
db file parallel read                 3,347           4    173   0.18 User I/O
db file scattered read                2,852           3      2   0.14 User I/O
From this it is easy to see that, during a normal job execution, the direct path read contribution to the total database time is negligible: 0.28% (about 0.09 minutes), while it contributed 19.2% (about 9 minutes) during the last job execution. So what made the direct path read wait event grow by such a magnitude that it wrecked the performance of the overnight job?
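Putting the two snapshots side by side makes the regression obvious (a small sketch using the direct path read figures from the two Top 5 sections):

```python
def wait_minutes(waits, avg_wait_ms):
    """Total time spent in a wait event, in minutes."""
    return waits * avg_wait_ms / 1000 / 60

normal = wait_minutes(371, 15)       # previous week's run: ~0.09 min
slow = wait_minutes(32_996, 17)      # last (14-hour) run:  ~9.3 min

print(f"normal: {normal:.2f} min, slow: {slow:.2f} min, "
      f"ratio: {slow / normal:.0f}x")
```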
My suspicions were well founded!
The parts of the AWR report that correlate with the direct path read event, and which are worth investigating, are SQL ordered by Reads, Segments by Physical Reads and (in 11gR1 or later) Segments by Direct Physical Reads.
Drilling down into the SQL ordered by Reads section, I found this:
Physical Reads  Executions  Reads per Exec  %Total  Elapsed Time (s)  %CPU  %IO  SQL Id         Module
--------------  ----------  --------------  ------  ----------------  ----  ---  -------------  ------------
     4,164,466           1    4,164,466.00    89.0             811.1  21.6 82.9  d6w4psbcwgt54  TOAD 9.7.2.5
Without even looking at the SQL text behind it, what kind of overnight batch job would report a huge amount of physical reads coming from TOAD? Was this SQL issued from TOAD part of the job? The answer, of course, was no.
So, clearly, someone was running a huge query from TOAD that was doing an enormous amount of direct path reads, consuming CPU and I/O resources normally dedicated to the db file sequential reads of the job.
The Segments by Direct Physical Reads section of the AWR report also shows correlated information:
-> Total Direct Physical Reads: 4,209,804
-> Captured Segments account for 51.7% of Total

           Tablespace                      Subobject  Obj.       Direct
Owner      Name       Object Name         Name       Type        Reads  %Total
---------- ---------- ------------------- ---------- ----- ----------- -------
MHOADM     MHOMH_2011 MH_TABLES           TABLES_SX5 TABLE     349,197    8.29
MHOADM     MHOMH_2010 MH_TABLES           TABLES_SX5 TABLE     345,069    8.20
MHOADM     MHOMH_2009 MH_TABLES           TABLES_SX1 TABLE     275,781    6.55
MHOADM     MHOMH_2010 MH_TABLES           TABLES_SX1 TABLE     272,709    6.48
MHOADM     MHOMH_2011 MH_TABLES           TABLES_SX1 TABLE     265,037    6.30
          -------------------------------------------------------------
Here TABLES_SX5 is the table accessed by the same sql_id, d6w4psbcwgt54, issued from TOAD and reported above.
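The %Total column ties straight back to the header figure; a quick check with the first row's numbers:

```python
total_direct_reads = 4_209_804        # header of the segments section
tables_sx5_2011_reads = 349_197       # MHOMH_2011 / TABLES_SX5 row

pct = tables_sx5_2011_reads / total_direct_reads * 100
print(f"{pct:.2f}%")                  # ~8.29, matching the reported %Total
```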
Finally, it became clear that the job had been competing for resources with an abnormal, unusual SELECT issued from TOAD. The ADDM report that accompanies this AWR report also lists the TOAD sql_id d6w4psbcwgt54 among the Top SQL Statements to be tuned:
Recommendation 1: SQL Tuning
Estimated benefit is .46 active sessions, 28.18% of total activity.
-------------------------------------------------------------------
Action
   Run SQL Tuning Advisor on the SELECT statement with SQL_ID
   "d6w4psbcwgt54".
   Related Object
      SQL statement with SQL_ID d6w4psbcwgt54.
Rationale
   The SQL spent 100% of its database time on CPU, I/O and Cluster waits.
   This part of database time may be improved by the SQL Tuning Advisor.
Rationale
   Database time for this SQL was divided as follows: 100% for SQL
   execution, 0% for parsing, 0% for PL/SQL execution and 0% for Java
   execution.
Rationale
   SQL statement with SQL_ID "d6w4psbcwgt54" was executed 1 times and had
   an average elapsed time of 811 seconds.
Rationale
   At least one execution of the statement ran in parallel.

Recommendation 2: SQL Tuning
Estimated benefit is .19 active sessions, 11.68% of total activity.
Action
   Look at the "Top SQL Statements" finding for SQL statements consuming
   significant I/O on this segment. For example, the SELECT statement
   with SQL_ID "d6w4psbcwgt54" is responsible for 100% of "User I/O" and
   "Cluster" waits for this segment.
Rationale
   The I/O usage statistics for the object are: 9 full object scans,
   277587 physical reads, 0 physical writes and 275781 direct reads.
The ADDM report lends additional credit to my suspicion when it recommends investigating processes running on the host that do not belong to this instance and are consuming a significant amount of virtual memory. ADDM had identified the TOAD query, hadn't it?
Recommendation 1: Host Configuration
Estimated benefit is 1.62 active sessions, 100% of total activity.
------------------------------------------------------------------
Action
   Host operating system was experiencing significant paging but no
   particular root cause could be detected. Investigate processes that
   do not belong to this instance running on the host that are consuming
   significant amount of virtual memory. Also consider adding more
   physical memory to the host.