Mohamed Houri’s Oracle Notes

March 7, 2020

SQL Monitor flaw

Filed under: SQL Monitor — hourim @ 8:13 am

Observe the following real-life real time SQL Monitor report and try to figure out why the Duration is not equal to the Elapsed time

 Global Information
------------------------------
 Status              :  DONE (ERROR)              
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  4s                        
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY                   
 Program             :  sqlplus.exe               

Global Stats
===================================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  | Buffer |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |  Gets  | Returned Bytes |
===================================================================================
|    1981 |     249 |     1687 |        0.00 |       45 |     8M |           92MB |
===================================================================================

In the following series of articles about the SQL monitor Duration and Elapsed Time:

I’ve explained why we should always rely on the Duration information as it represents the wall clock time of the query. The Elapsed time doesn’t include time spent by Idle session fetching data. And, when a SQL statement runs parallelly, the Elapsed time includes execution time of both the Query Coordinator and all its parallel server slaves.

So, If I tell you that there is no parsing effect in the above query and that there are no fetch calls, you would normally infer that it has really spent 4 seconds to complete; right?

What if I tell you that this query took approximatively 1981 seconds? Will you believe me?

Indeed, this is a parallel query using DOP 8 which took about 1981 seconds and where the Duration (4 seconds) represents only the time taken by the Query Coordinator. This is due to a flaw I have already explained in this article but where I’ve not been so smart, at that time, to point out the Duration flaw as well.

DONE (ERROR)

In contrast to what the status is saying, this query completes without error. And that is the first clue that should always put you on the right path when trouble shooting performance issue via SQL monitor reports. When this 12cR2 SQL query has been handled to me for investigation I immediately launched it again and here’s below the different phases reported by the SQL Monitor:

Global Information
------------------------------
 Status              :  EXECUTING                 
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  119s                      
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY
 Program             :  sqlplus.exe               

Global Stats
===================================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  | Buffer |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |  Gets  | Returned Bytes |
===================================================================================
|     126 |      43 |       82 |        0.00 |     1.98 |   459K |           84MB |
===================================================================================
Parallel Execution Details (DOP=8 , Servers Allocated=16)

../..

Global Information
------------------------------
 Status              :  EXECUTING                 
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  1026s                     
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY
 Program             :  sqlplus.exe               

Global Stats
==========================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Returned Bytes |
==========================================================================
|    1049 |     146 |      879 |        0.00 |       24 |           88MB |
==========================================================================
Parallel Execution Details (DOP=8 , Servers Allocated=16)

So far so good right?

In fact, if you look closely there is already something odd? Isn’t it?

Have you spotted out that the Last Refresh Time column is not changing?

Anyway, while the query was still running I launched the following query to see how the workload is distributed among the different parallel server slaves

@ getTimeSqlmon.sql
col exec_start format a25
col process    format a10
col status     format a15
select
    -- sql_id
    sql_exec_id   
   ,to_char(sql_exec_start,'dd/mm/yyyy hh24:mi:ss') exec_start
   ,to_char(sysdate,'dd/mm/yyyy hh24:mi:ss') current_time   
   ,process_name process
   ,round((sysdate-sql_exec_start) * 60 * 24,2) "minutes"
   ,round(elapsed_time/1e6,3) "elaps(sec)"
   ,round(cpu_time/1e6,3) "cpu(sec)"
   ,sql_plan_hash_value   "plan hash value"
   ,status
from gv$sql_monitor
where
  sql_id = '&sql_id'
order by exec_start,action;

SQL> @GetTimeSQLMon2
Enter value for sql_id: 94dp7vscw26sf

SQL_EXEC_ID EXEC_START            PROCE    mm elaps(sec)   cpu(sec) SQL_PLAN_HASH_VALUE STATUS
----------- --------------------- ----- ----- ---------- ---------- ------------------- ------
   50331648 21/01/2020 13:50:26   p002  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p003  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p008  16.35      3.364       .478        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p000  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00f  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p007  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00d  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p005  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p004  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00c  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p001  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p009  16.35    978.915    136.371        524852868 EXECUTING ---------->>>>>>
   50331648 21/01/2020 13:50:26   p00e  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p006  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   ora   16.35       .152      4.125        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00b  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00a  16.35          0          0        524852868 EXECUTIN
                                  
 17 rows selected.   
SQL> @GetTimeSQLMon2
Enter value for sql_id: 94dp7vscw26sf

SQL_EXEC_ID EXEC_START           PROCESS  minutes elaps(sec)   cpu(sec) plan hash value STATUS   
----------- -------------------- -------- ------- ---------- ---------- --------------- ---------
   50331648 21/01/2020 13:50:26  p002       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p003       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p008       30.28      3.364       .478       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p000       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00f       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p007       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00d       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p005       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p004       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00c       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p001       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p009       30.28   1813.916    225.829       524852868 EXECUTING ---------->>>>>>
   50331648 21/01/2020 13:50:26  p00e       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p006       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  ora        30.28       .152      4.125       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00b       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00a       30.28          0          0       524852868 EXECUTING
                                             
17 rows selected.

As you can see there is only a single parallel slave which is working (p009). All other slaves are waiting.

I managed to display, roughly, the number of minutes spent by each process not doing anything i.e. waiting. But if you read this article you will know that when a process waits for more than 30 minutes, its status will change to DONE (ERROR). Spot below how this is true:

SQL> @GetTimeSQLMon2
Enter value for sql_id: 94dp7vscw26sf

SQL_EXEC_ID EXEC_START            PROCESS  minutes elaps(sec)   cpu(sec)  STATUS
----------- --------------------- ------- -------- ---------- ----------  ------------
   50331648 21/01/2020 13:50:26   p002       30.58          0          0  DONE (ERROR) ---------->>>>>> IN ERROR after 30 min waiting
   50331648 21/01/2020 13:50:26   p003       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p008       30.58      3.364       .478  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p000       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00f       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p007       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00d       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p005       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p004       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00c       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p001       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p009       30.58   1833.977    228.496  EXECUTING    ---------->>>>>>
   50331648 21/01/2020 13:50:26   p00e       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p006       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   ora        30.58       .152      4.125  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00b       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00a       30.58          0          0  DONE (ERROR)
                                          
17 rows selected.

And so on until the p009 process ends successfully

SQL> @GetTimeSQLMon2
Enter value for sql_id: 94dp7vscw26sf

SQL_EXEC_ID EXEC_START            PROCESS       minutes elaps(sec)   cpu(sec) STATUS
----------- --------------------- ---------- ---------- ---------- ---------- ---------------
   50331648 21/01/2020 13:50:26   p002            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p003            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p008            42.88      3.364       .478 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p000            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00f            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p007            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00d            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p005            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p004            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00c            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p001            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p009            42.88   1942.557    244.777 DONE (ALL ROWS) ---------->>>>>>
   50331648 21/01/2020 13:50:26   p00e            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p006            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   ora             42.88       .152      4.125 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00b            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00a            42.88          0          0 DONE (ERROR)

17 rows selected.

A couple of minutes before the p009 PX slave finishes its work, the Duration was almost reporting a correct time as the following proves:

Global Information
------------------------------
 Status              :  EXECUTING                 
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  1283s   ------>>>>>   spot the Duration here                        
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY
 Program             :  sqlplus.exe               

Global Stats
========================================================================
| Elapsed |   Cpu   |    IO    | Application | Buffer |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Gets  | Returned Bytes |
========================================================================
|    1309 |     174 |     1106 |        0.00 |     5M |           88MB |
========================================================================
Parallel Execution Details (DOP=8 , Servers Allocated=16)

Unfortunately, at the end of the query, the Duration is wrongly updated to show 4 seconds instead of the wall clock time which should be around 1981 seconds

Global Information
------------------------------
 Status              :  DONE (ERROR)              
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  4s   ------>>>>>   and spot the Duration at the end of the query                   
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY
 Program             :  sqlplus.exe               

Global Stats
=====================================================================
| Elapsed |   Cpu   |    IO    | Cluster  | Buffer |    Offload     |
| Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Returned Bytes |
=====================================================================
|    1981 |     249 |     1687 |       45 |     8M |           92MB |
=====================================================================
Parallel Execution Details (DOP=8 , Servers Allocated=16)

Those 4 seconds represent either the time difference between the start of the query and the last refresh time or the time spent by the query coordinator which is 3.364+0.478 = 3.842 ~ 4 seconds

In passing here how the workload has been distributed when I fixed the issue

SQL> @getTimeSQLMon2
Enter value for sql_id: 2hrxjvurz5c0p

SQL_EXEC_ID EXEC_START            PROCESS minutes elaps(sec)   cpu(sec)  STATUS
----------- --------------------- ------- ------- ---------- ----------  ---------------
   50331648 21/01/2020 16:09:40   p002      10.88       .403        .52  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p009      10.88     38.332      5.065  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p000      10.88       .425       .533  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p008      10.88     38.138      4.623  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p003      10.88       .468       .628  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00b      10.88     38.345      4.993  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00d      10.88     38.206      4.362  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p005      10.88       .453       .613  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00f      10.88     38.279      5.089  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p007      10.88       .429       .537  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00c      10.88     38.211      5.443  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p004      10.88       .474       .632  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   ora       10.88       2.12      1.507  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00e      10.88     38.348      6.344  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p006      10.88       .413       .613  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p001      10.88       .432       .589  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00a      10.88     38.624      5.481  DONE (ALL ROWS)

17 rows selected.

Summary

When you retrospectively analyze a parallel query performance issue through its corresponding SQL monitoring showing a DONE (ERROR) status, then you should log in your brain the followings:

• There is a probability that the time reported by the Duration is wrong as it either reports the QC time or the time difference between the start of the query and the last time it has been refreshed
• Don’t trust the DONE (ERROR) status. The query might have completed successfully, and this status value is simply indicating that the QC and one or many of its parallel slaves have been waiting for more than 30 minutes

2 Comments »

  1. Hi Mohamed,

    I have few doubts/questions regarding your statement : “or the time spent by the query coordinator which is 3.364+0.478 = 3.842 ~ 4 seconds”

    1. QC’s process name is “ora” — whose ‘elaps(sec)’ is .152 (not 3.364)

    2. In any case, should we really add up ‘elaps(sec)’ and ‘cpu(sec)’ ? IMHO, CPU time should be part of Elapsed time.

    Comment by Mathavan — November 25, 2022 @ 12:05 pm | Reply

  2. Mathavan

    Yes, you are right the QC program’s name is ora.

    Here’s below a select to check whether we really add up elaps and cpu

    SELECT 
            t.report_id
    	  , x1.sql_id
    	 -- , x1.sql_exec_id
    	  , trunc(x1.elapsed_time/1e6,2) elapsed_time
          , trunc(x1.cpu_time/1e6,2)     cpu_time
          , duration
    FROM dba_hist_reports t    
       , xmltable('/report_repository_summary/sql'    
           PASSING xmlparse(document t.report_summary)    
           COLUMNS    
             sql_id                path '@sql_id'     
           , sql_exec_start        path '@sql_exec_start'    
           , sql_exec_id           path '@sql_exec_id'      
           , status                path 'status'    
           , duration              path 'stats/stat[@name="duration"]'  
           , elapsed_time          path 'stats/stat[@name="elapsed_time"]'  
           , cpu_time              path 'stats/stat[@name="cpu_time"]'  
         ) x1 
    where 
       t.component_name = 'sqlmonitor'
    order by x1.sql_id 
    fetch first 20 rows only
    /	
    
    REPORT_ID SQL_ID           ELAPSED_TIME   CPU_TIME DURATION
    --------- ---------------- ------------ ---------- --------------------
         1800 05s9358mm6vrr              59      56.06 59
         1827 0axfcmd9xqxk6              13      12.84 14
         1801 0axfcmd9xqxk6              12      11.87 12
         1818 0c7jw8n8q9407               0        .01 .027112
         1816 0c7jw8n8q9407               0        .27 .413551
         1817 0c7jw8n8q9407               0        .01 .067402
         1882 3h0gzpkgbx46b               0        .04 .08953
         1778 awm8fgc8hw28q               0          0 .006554
         1779 awm8fgc8hw28q               0          0 .006123
         1780 awm8fgc8hw28q               0          0 .005465
         1781 awm8fgc8hw28q               0          0 .005703
         1783 awm8fgc8hw28q               0          0 .006827
         1784 awm8fgc8hw28q               0          0 .007141
         1782 awm8fgc8hw28q               0          0 .006523
         1785 awm8fgc8hw28q               0          0 .006208
         1777 awm8fgc8hw28q               0          0 .008942
         1786 awm8fgc8hw28q               0          0 .00595
         1774 awm8fgc8hw28q               0          0 .006409
         1775 awm8fgc8hw28q               0          0 .005587
         1776 awm8fgc8hw28q               0          0 .00641
    
    20 rows selected.
    

    It looks like you are right. I have to look deeper at this next time

    Best regards
    Mohamed

    Comment by hourim — November 25, 2022 @ 1:09 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Create a free website or blog at WordPress.com.

Tony's Oracle Tips

Tony Hasler's light hearted approach to learning about Oracle

Richard Foote's Oracle Blog

Focusing Specifically On Oracle Indexes, Database Administration and Some Great Music

Hatem Mahmoud's blog

Just another blog : Databases, Linux and other stuffs

Mohamed Houri’s Oracle Notes

Qui se conçoit bien s’énonce clairement

Oracle Diagnostician

Performance troubleshooting as exact science

Raheel's Blog

Things I have learnt as Oracle DBA

Coskan's Approach to Oracle

What I learned about Oracle

So Many Oracle Manuals, So Little Time

“Books to the ceiling, Books to the sky, My pile of books is a mile high. How I love them! How I need them! I'll have a long beard by the time I read them”—Lobel, Arnold. Whiskers and Rhymes. William Morrow & Co, 1988.

Carlos Sierra's Tools and Tips

Tools and Tips for Oracle Performance and SQL Tuning

Oracle Scratchpad

Just another Oracle weblog

OraStory

Dominic Brooks on Oracle Performance, Tuning, Data Quality & Sensible Design ... (Now with added Sets Appeal)