SQL Monitor | Mohamed Houri’s Oracle Notes

March 7, 2020

SQL Monitor flaw

Filed under: SQL Monitor — hourim @ 8:13 am

Observe the following real-life real time SQL Monitor report and try to figure out why the Duration is not equal to the Elapsed time

 Global Information
------------------------------
 Status              :  DONE (ERROR)              
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  4s                        
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY                   
 Program             :  sqlplus.exe               

Global Stats
===================================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  | Buffer |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |  Gets  | Returned Bytes |
===================================================================================
|    1981 |     249 |     1687 |        0.00 |       45 |     8M |           92MB |
===================================================================================

In the following series of articles about the SQL monitor Duration and Elapsed Time:

I’ve explained why we should always rely on the Duration information as it represents the wall clock time of the query. The Elapsed time doesn’t include time spent by Idle session fetching data. And, when a SQL statement runs parallelly, the Elapsed time includes execution time of both the Query Coordinator and all its parallel server slaves.

So, If I tell you that there is no parsing effect in the above query and that there are no fetch calls, you would normally infer that it has really spent 4 seconds to complete; right?

What if I tell you that this query took approximatively 1981 seconds? Will you believe me?

Indeed, this is a parallel query using DOP 8 which took about 1981 seconds and where the Duration (4 seconds) represents only the time taken by the Query Coordinator. This is due to a flaw I have already explained in this article but where I’ve not been so smart, at that time, to point out the Duration flaw as well.

DONE (ERROR)

In contrast to what the status is saying, this query completes without error. And that is the first clue that should always put you on the right path when trouble shooting performance issue via SQL monitor reports. When this 12cR2 SQL query has been handled to me for investigation I immediately launched it again and here’s below the different phases reported by the SQL Monitor:

Global Information
------------------------------
 Status              :  EXECUTING                 
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  119s                      
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY
 Program             :  sqlplus.exe               

Global Stats
===================================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  | Buffer |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |  Gets  | Returned Bytes |
===================================================================================
|     126 |      43 |       82 |        0.00 |     1.98 |   459K |           84MB |
===================================================================================
Parallel Execution Details (DOP=8 , Servers Allocated=16)

../..

Global Information
------------------------------
 Status              :  EXECUTING                 
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  1026s                     
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY
 Program             :  sqlplus.exe               

Global Stats
==========================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Returned Bytes |
==========================================================================
|    1049 |     146 |      879 |        0.00 |       24 |           88MB |
==========================================================================
Parallel Execution Details (DOP=8 , Servers Allocated=16)

So far so good right?

In fact, if you look closely there is already something odd? Isn’t it?

Have you spotted out that the Last Refresh Time column is not changing?

Anyway, while the query was still running I launched the following query to see how the workload is distributed among the different parallel server slaves

@ getTimeSqlmon.sql
col exec_start format a25
col process    format a10
col status     format a15
select
    -- sql_id
    sql_exec_id   
   ,to_char(sql_exec_start,'dd/mm/yyyy hh24:mi:ss') exec_start
   ,to_char(sysdate,'dd/mm/yyyy hh24:mi:ss') current_time   
   ,process_name process
   ,round((sysdate-sql_exec_start) * 60 * 24,2) "minutes"
   ,round(elapsed_time/1e6,3) "elaps(sec)"
   ,round(cpu_time/1e6,3) "cpu(sec)"
   ,sql_plan_hash_value   "plan hash value"
   ,status
from gv$sql_monitor
where
  sql_id = '&sql_id'
order by exec_start,action;

SQL> @GetTimeSQLMon2
Enter value for sql_id: 94dp7vscw26sf

SQL_EXEC_ID EXEC_START            PROCE    mm elaps(sec)   cpu(sec) SQL_PLAN_HASH_VALUE STATUS
----------- --------------------- ----- ----- ---------- ---------- ------------------- ------
   50331648 21/01/2020 13:50:26   p002  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p003  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p008  16.35      3.364       .478        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p000  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00f  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p007  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00d  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p005  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p004  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00c  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p001  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p009  16.35    978.915    136.371        524852868 EXECUTING ---------->>>>>>
   50331648 21/01/2020 13:50:26   p00e  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p006  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   ora   16.35       .152      4.125        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00b  16.35          0          0        524852868 EXECUTING
   50331648 21/01/2020 13:50:26   p00a  16.35          0          0        524852868 EXECUTIN
                                  
 17 rows selected.

SQL> @GetTimeSQLMon2
Enter value for sql_id: 94dp7vscw26sf

SQL_EXEC_ID EXEC_START           PROCESS  minutes elaps(sec)   cpu(sec) plan hash value STATUS   
----------- -------------------- -------- ------- ---------- ---------- --------------- ---------
   50331648 21/01/2020 13:50:26  p002       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p003       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p008       30.28      3.364       .478       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p000       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00f       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p007       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00d       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p005       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p004       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00c       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p001       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p009       30.28   1813.916    225.829       524852868 EXECUTING ---------->>>>>>
   50331648 21/01/2020 13:50:26  p00e       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p006       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  ora        30.28       .152      4.125       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00b       30.28          0          0       524852868 EXECUTING
   50331648 21/01/2020 13:50:26  p00a       30.28          0          0       524852868 EXECUTING
                                             
17 rows selected.

As you can see there is only a single parallel slave which is working (p009). All other slaves are waiting.

I managed to display, roughly, the number of minutes spent by each process not doing anything i.e. waiting. But if you read this article you will know that when a process waits for more than 30 minutes, its status will change to DONE (ERROR). Spot below how this is true:

SQL> @GetTimeSQLMon2
Enter value for sql_id: 94dp7vscw26sf

SQL_EXEC_ID EXEC_START            PROCESS  minutes elaps(sec)   cpu(sec)  STATUS
----------- --------------------- ------- -------- ---------- ----------  ------------
   50331648 21/01/2020 13:50:26   p002       30.58          0          0  DONE (ERROR) ---------->>>>>> IN ERROR after 30 min waiting
   50331648 21/01/2020 13:50:26   p003       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p008       30.58      3.364       .478  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p000       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00f       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p007       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00d       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p005       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p004       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00c       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p001       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p009       30.58   1833.977    228.496  EXECUTING    ---------->>>>>>
   50331648 21/01/2020 13:50:26   p00e       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p006       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   ora        30.58       .152      4.125  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00b       30.58          0          0  DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00a       30.58          0          0  DONE (ERROR)
                                          
17 rows selected.

And so on until the p009 process ends successfully

SQL> @GetTimeSQLMon2
Enter value for sql_id: 94dp7vscw26sf

SQL_EXEC_ID EXEC_START            PROCESS       minutes elaps(sec)   cpu(sec) STATUS
----------- --------------------- ---------- ---------- ---------- ---------- ---------------
   50331648 21/01/2020 13:50:26   p002            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p003            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p008            42.88      3.364       .478 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p000            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00f            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p007            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00d            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p005            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p004            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00c            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p001            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p009            42.88   1942.557    244.777 DONE (ALL ROWS) ---------->>>>>>
   50331648 21/01/2020 13:50:26   p00e            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p006            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   ora             42.88       .152      4.125 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00b            42.88          0          0 DONE (ERROR)
   50331648 21/01/2020 13:50:26   p00a            42.88          0          0 DONE (ERROR)

17 rows selected.

A couple of minutes before the p009 PX slave finishes its work, the Duration was almost reporting a correct time as the following proves:

Global Information
------------------------------
 Status              :  EXECUTING                 
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  1283s   ------>>>>>   spot the Duration here                        
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY
 Program             :  sqlplus.exe               

Global Stats
========================================================================
| Elapsed |   Cpu   |    IO    | Application | Buffer |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Gets  | Returned Bytes |
========================================================================
|    1309 |     174 |     1106 |        0.00 |     5M |           88MB |
========================================================================
Parallel Execution Details (DOP=8 , Servers Allocated=16)

Unfortunately, at the end of the query, the Duration is wrongly updated to show 4 seconds instead of the wall clock time which should be around 1981 seconds

Global Information
------------------------------
 Status              :  DONE (ERROR)              
 Instance ID         :  3                         
 Session             :  XXXX (1016:28171) 
 SQL ID              :  94dp7vscw26sf             
 SQL Execution ID    :  50331648                  
 Execution Started   :  01/21/2020 13:50:26       
 First Refresh Time  :  01/21/2020 13:50:26       
 Last Refresh Time   :  01/21/2020 13:50:30       
 Duration            :  4s   ------>>>>>   and spot the Duration at the end of the query                   
 Module/Action       :  SQL*Plus/-                
 Service             :  YYYYY
 Program             :  sqlplus.exe               

Global Stats
=====================================================================
| Elapsed |   Cpu   |    IO    | Cluster  | Buffer |    Offload     |
| Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Returned Bytes |
=====================================================================
|    1981 |     249 |     1687 |       45 |     8M |           92MB |
=====================================================================
Parallel Execution Details (DOP=8 , Servers Allocated=16)

Those 4 seconds represent either the time difference between the start of the query and the last refresh time or the time spent by the query coordinator which is 3.364+0.478 = 3.842 ~ 4 seconds

In passing here how the workload has been distributed when I fixed the issue

SQL> @getTimeSQLMon2
Enter value for sql_id: 2hrxjvurz5c0p

SQL_EXEC_ID EXEC_START            PROCESS minutes elaps(sec)   cpu(sec)  STATUS
----------- --------------------- ------- ------- ---------- ----------  ---------------
   50331648 21/01/2020 16:09:40   p002      10.88       .403        .52  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p009      10.88     38.332      5.065  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p000      10.88       .425       .533  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p008      10.88     38.138      4.623  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p003      10.88       .468       .628  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00b      10.88     38.345      4.993  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00d      10.88     38.206      4.362  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p005      10.88       .453       .613  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00f      10.88     38.279      5.089  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p007      10.88       .429       .537  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00c      10.88     38.211      5.443  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p004      10.88       .474       .632  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   ora       10.88       2.12      1.507  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00e      10.88     38.348      6.344  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p006      10.88       .413       .613  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p001      10.88       .432       .589  DONE (ALL ROWS)
   50331648 21/01/2020 16:09:40   p00a      10.88     38.624      5.481  DONE (ALL ROWS)

17 rows selected.

Summary

When you retrospectively analyze a parallel query performance issue through its corresponding SQL monitoring showing a DONE (ERROR) status, then you should log in your brain the followings:

• There is a probability that the time reported by the Duration is wrong as it either reports the QC time or the time difference between the start of the query and the last time it has been refreshed
• Don’t trust the DONE (ERROR) status. The query might have completed successfully, and this status value is simply indicating that the QC and one or many of its parallel slaves have been waiting for more than 30 minutes

Comments (2)

June 28, 2018

SQL Monitor and parallel processing

Filed under: SQL Monitor — hourim @ 6:25 pm

Very often, people get confused about the actual execution time taken by a parallel query. I’ve seen this asked a few times so that I have always thought it deserves a simple blog post. This is then the last article of the series about SQL monitor Duration and Elapsed time information where we’re going to become acquainted with how to get wall clock time of queries executed using parallelism.

SQL monitor and Parallelism

In the first article we learned that SQL monitor Elapsed time doesn’t sum up time spent outside the database by idle sessions. In the second one we outlined that the SQL monitor begins monitoring the query only when this one reaches its execution phase. In this article we will see that,for parallel query, the wall clock time, very often, approaches the Duration information of the SQL monitor report rather than that of the Elapsed time.

Having said that, here’s below a query which I will run using parallelism (t1 being a simple copy of all_objects table):

SQL> select /*+ parallel(4) */ 
        object_type
	  , count(1)
    from t1
    group by object_type;

43 rows selected.

Elapsed: 00:00:09.91

As you can see it took almost 10 seconds wall clock time for the above query to complete using the following parallel execution plan:

SQL_ID  08ka9ybgk43v1, child number 0
-------------------------------------
----------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |       |        |      |            |
|   1 |  PX COORDINATOR          |          |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)    | :TQ10001 |    43 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY         |          |    43 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE           |          |    43 |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH        | :TQ10000 |    43 |  Q1,00 | P->P | HASH       |
|   6 |       HASH GROUP BY      |          |    43 |  Q1,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR |          | 23131 |  Q1,00 | PCWC |            |
|*  8 |         TABLE ACCESS FULL| T1       | 23131 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access(:Z>=:Z AND :Z<=:Z)

Note
-----
   - Degree of Parallelism is 4 because of hint

If you want to, or are planning to analyze the above parallel query through the SQL monitor report then pay attention to the following:

SELECT DBMS_SQLTUNE.report_sql_monitor(
  sql_id       => '08ka9ybgk43v1',
  type         => 'TEXT',
  report_level => 'ALL') AS report
FROM dual; 

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)     
 Instance ID         :  1                   
 Session             :  XXXX(296:2203)   
 SQL ID              :  08ka9ybgk43v1       
 SQL Execution ID    :  16777216            
 Execution Started   :  06/28/2018 07:30:52 
 First Refresh Time  :  06/28/2018 07:30:54 
 Last Refresh Time   :  06/28/2018 07:31:02 
 Duration            :  10s                 
 Module/Action       :  SQL*Plus/-          
 Service             :  YYYYY
 Program             :  sqlplus.exe         
 Fetch Calls         :  4                   

Global Stats
=========================================================
| Elapsed |   Cpu   |    IO    | Application |  Other   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |
=========================================================
|      32 |    0.91 |       29 |        1.74 |     0.08 |
=========================================================

It is, again, the Duration of the Global Information which gives the wall clock time taken by the query executed via parallelism. The Elapsed Time of the Global Stats section represents time taken by the Query Coordinator and all its parallel slaves as clearly shown in the same SQL monitor report:

Parallel Execution Details (DOP=4 , Servers Allocated=8)
=================================================================================
|      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    | Application |
|                |       |         | Time(s) | Time(s) | Waits(s) |  Waits(s)   |
=================================================================================
| PX Coordinator | QC    |         |    2.26 |    0.01 |     0.50 |        1.74 |
| p000           | Set 1 |       1 |    0.00 |    0.00 |          |             |
| p001           | Set 1 |       2 |    0.00 |    0.00 |          |             |
| p002           | Set 1 |       3 |    0.00 |    0.00 |          |             |
| p003           | Set 1 |       4 |    0.00 |    0.00 |          |             |
| p004           | Set 2 |       1 |    7.38 |    0.20 |     7.17 |             |
| p005           | Set 2 |       2 |    7.22 |    0.22 |     6.97 |             |
| p006           | Set 2 |       3 |    7.57 |    0.27 |     7.28 |             |
| p007           | Set 2 |       4 |    7.32 |    0.21 |     7.10 |             |
=================================================================================

Based on the information provided by the gv$sql_monitor view I’ve developed the following scripts showing how the query execution time has been distributed between the Query Coordinator and its parallel slaves

-- this gives execution time consumed by parallel slaves
SQL> @PXElaps
Enter value for sql_id: 08ka9ybgk43v1
Enter value for exec_id: 16777216

SQL_ID        SQL_EXEC_ID        SID PROCE    ELAPSED
------------- ----------- ---------- ----- ----------
08ka9ybgk43v1    16777216        254 p003           0
08ka9ybgk43v1    16777216        193 p000           0
08ka9ybgk43v1    16777216        214 p001           0
08ka9ybgk43v1    16777216        234 p002           0
08ka9ybgk43v1    16777216        299 p005        7.22
08ka9ybgk43v1    16777216          2 p007        7.32
08ka9ybgk43v1    16777216        277 p004        7.38
08ka9ybgk43v1    16777216        318 p006        7.57
                                           ----------
Total Slaves                                    29.49

8 rows selected.

-- this gives execution time consumed by Query Coordinator
SQL> @QCElaps
Enter value for sql_id: 08ka9ybgk43v1
Enter value for exec_id: 16777216

SQL_ID           QC_TIME        DOP PX_SERVERS_REQUESTED PX_SERVERS_ALLOCATED
------------- ---------- ---------- -------------------- --------------------
08ka9ybgk43v1       2.26          4                    8                    8

-- this gives execution time consumed by Query Coordinator and its PX slaves
SQL> @QCPXElaps
Enter value for sql_id: 08ka9ybgk43v1
Enter value for sql_exec_id: 16777216

SQL_ID                           SID PROCE      ELAPS
------------------------- ---------- ----- ----------
08ka9ybgk43v1                    254 p003           0
08ka9ybgk43v1                      2 p007        7.32
08ka9ybgk43v1                    214 p001           0
08ka9ybgk43v1                    299 p005        7.22
08ka9ybgk43v1                    296 QC          2.26
08ka9ybgk43v1                    193 p000           0
08ka9ybgk43v1                    277 p004        7.38
08ka9ybgk43v1                    318 p006        7.57
08ka9ybgk43v1                    234 p002           0
                                           ----------
Elapsed time(s)                                 31.75

9 rows selected.

The devil is in the details
Before you start crying ‘yes, but what about…‘, make sure you understand the following statistics of AWR historical execution of a particular parallel query taken from a running system:

SQL> @HistStats
Enter value for sql_id: a8yny7n0x1gj1
Enter value for from_date: 01062018

SNAP_BEGIN                PLAN_HASH_VALUE    EXECS END_OF_FETCH  AVG_ETIME    AVG_PIO    AVG_LIO   AVG_ROWS
------------------------- --------------- -------- ------------ ---------- ---------- ---------- ----------
20-JUN-18 08.00.58.250 PM       765437158        1            0        665    4708199    4789407          0
25-JUN-18 08.00.43.224 PM       765437158        1            0        773    5446986    5545989          0

At first glance many persons will think that this query took 665 seconds. In effect, we should not consider the above average elapsed time without the fundamental information which is: avg_px

agv_px = px_servers_execs_delta/decode (executions_delta, 0, 1, executions_delta)

This metric indicates whether the query runs parallel or not. If it is not null then the AVG_ETIME represents the cumulated execution time of taken by the Query Coordinator and all its parallel slaves used to run the query concurrently.

Here’s below the same AWR historical execution statistics augmented with the avg_px metric:

@HistStats
Enter value for sql_id:  a8yny7n0x1gj1
Enter value for from_date: 01052018

SNAP_BEGIN                PLAN_HASH_VALUE EXECS END_OF_FETCH  AVG_ETIME AVG_PX   AVG_LIO 
------------------------- --------------- ----- ------------ ---------- ------  --------
20-JUN-18 08.00.58.250 PM      765437158      1            0      665      8     4789407    
25-JUN-18 08.00.43.224 PM      765437158      1            0      773      8     5545989

Thanks to the avg_px supplementary information we know now that this query has used parallelism during its execution and that the AVG_ETIME of 665 seconds (or 773) corresponds to the Elapsed time taken by all parallel servers(8) including the query coordinator. If you want to measure the actual wall clock time taken by this query you have then to use the following approximative arithmetic:

 ag_wc_time = trunc(avg_etime/decode(avg_px,0,1,avg_px) ,2) avg_px_time

Which when implemented into the above Histstats script gives this:

@HistStats
Enter value for sql_id:a8yny7n0x1gj1
Enter value for from_date: 01052018

SNAP_BEGIN                PLAN_HASH_VALUE EXECS END_OF_FETCH  AVG_ETIME AVG_PX AVG_WC_TIME  
------------------------- --------------- ----- ------------ ---------- ------ ----------- 
20-JUN-18 08.00.58.250 PM      765437158      1            0      665      8        166.25     
25-JUN-18 08.00.43.224 PM      765437158      1            0      773      8        193.25

Finally we are able to say that, approximatively, this query took about 193 seconds wall clock time during it execution of June the 25th at around 8PM.

Since the above query used parallelism it has been monitored as the following proves:

SQL> select
       sql_exec_id
       ,to_char(sql_exec_start, 'dd/mm/yyyy hh24:mi:ss') exec_date
    from
      gv$sql_monitor
    where
      sql_id = 'a8yny7n0x1gj1';

SQL_EXEC_ID EXEC_DATE
----------- -------------------
 16777225    25/06/2018 20:01:09

SELECT DBMS_SQLTUNE.report_sql_monitor(
  sql_id       => 'a8yny7n0x1gj1',
  sql_exec_id  =>  16777225,
  type         => 'TEXT',
  report_level => 'ALL') AS report
FROM dual;

Global Information
------------------------------
 Status              :  DONE (ERROR)             
 Instance ID         :  1                     
 SQL ID              :  a8yny7n0x1gj1  
 SQL Execution ID    :  16777225                   
 Execution Started   :  06/25/2018 20:01:09   
 First Refresh Time  :  06/25/2018 20:01:09  
 Last Refresh Time   :  06/25/2018 20:04:20   
 Duration            :  191s                  
                 
Global Stats
================================
| Elapsed |   Cpu   |  IO      |
| Time(s) | Time(s) | Waits(s) |
================================
|  776    |  236    |   538    | 
================================

Notice how using AWR historical tables(HistStats.sql) we are able to derive, almost reliably, the actual time taken by a query ran in parallel.

Summary
SQL monitor feature monitors performance of SQL statements while they execute. It is started automatically when a SQL statement runs parallel or serial but has consumed at least 5 seconds either on CPU or on I/O wait. The parsed phased of SQL statements is not monitored. In order to have an idea about the exact duration of the SQL statement execution then we have better to rely on the Duration information rather than on the Elapsed Time of the Global Stats section of the report. This is because Elapsed Time doesn’t include time spent by idle session fetching data for example. And when a SQL statement runs parallel, the Elapsed Time includes execution time of both the Query Coordinator and all its parallel server slaves while the Duration represents the wall clock time of the query that runs both serial and parallel.

SQL scritps

@PxElaps
compute sum label 'Total Slaves time' of elapsed on report
break   on report
select
          sql_id
		 ,sql_exec_id
		 ,sid		 
         , process_name 		 
         , round(elapsed_time/1e6,2) elapsed
       --  , round(cpu_time/1e6) cpu
         --, round(concurrency_wait_time/1e6,2) conc
       --  , round(user_io_wait_time/1e6) IO_Waits
   from gv$sql_monitor
   where sql_id = '&sql_id'   
   and sql_exec_id = '&exec_id'
   and sql_text is  null
   order by round(elapsed_time/1e6,2);

@QCElaps
select
           sql_id
		 , round(elapsed_time/1e6,2) QC_time
         , px_maxdop DOP
         , px_servers_requested
         , px_servers_allocated        
         , round(cpu_time/1e6) cpu
         --, round(concurrency_wait_time/1e6,2) conc
         , round(user_io_wait_time/1e6) IO_Waits
   from gv$sql_monitor
   where sql_id = '&sql_id'
   and sql_exec_id = '&exec_id'
   and sql_text is not null;

@QCPXElaps
column label format a25
col sql_id format a25
compute sum label 'Elapsed time(s)' of elaps on report
break   on report
select
          sql_id	
         ,sid          		 
         ,decode(process_name, 'ora', 'QC', process_name) process
         ,round(elapsed_time/1e6,2) elaps
from gv$sql_monitor
where sql_id    = '&sql_id'   
and sql_exec_id = '&exec_id';

Comments (1)

June 21, 2018

SQL Monitor and parsing

Filed under: SQL Monitor — hourim @ 6:32 pm

In part I of this short series on SQL Monitoring Duration and Elapsed time information we saw that in the case of a query fetching data to the client, we have better to rely on the former information rather than on the latter. Indeed, the Elapsed time of the SQL monitor Global Stats section doesn’t keep count of the time spent passing the data to the client. Piotr Wrzosek, Tanel Poder and Robert have emphasized that, while Duration represents the wall clock time, Elapsed time in SQL monitor refers to the DB time measured for non-idle sessions that are either waiting or actively working on CPU. This is why time spent outside the database by idle sessions is not measured by the SQL monitor Elapsed time information.

In part 2 we will be trying to demonstrate that, all things being equal, the SQL monitor kicks in only during the query execution phase. Any preliminary steps occurring between the launch of the query and its execution phase will not be monitored. Typically, time spent in the hard parse phase, will appear neither in the Duration nor in the Elapsed time of the SQL monitor report.

Here’s below a quick demonstration

Parsing

I started looking around on how to produce a sufficiently long hard parse situation within a single query. Very quickly few ideas germinated in my mind like extreme Adaptive Cursor Sharing, or several SQL Plan Management baselined execution plans which Oracle has to reproduce and cost before selecting the best costed plan. But I finally ended up by using a simple join between two tables of 3 million rows each but with an imense IN-LIST predicate:

create table t1
as select
    rownum n1
   ,trunc((rownum-1)/3) n2
   ,mod(rownum,10) n3
from dual
connect by level <= 3e6;


create table t2
as select
    2*rownum n1
   ,trunc(rownum,5) n2
   ,lpad('x',5) vc
from dual
connect by level <= 3e6;

alter table t1 add constraint t1_pk primary key (n1);
alter table t2 add constraint t2_pk primary key (n1);

select /*+ monitor */
     t1.n1
    ,t1.n2
    ,t1.n3
from
     t1
where
     t1.n1 in
           (select t2.n2
           from t2
           where
                (
                 (
                  (
                   (
                          (t2.n1 in (33183, 31917, 31405)
                           or t2.n1 in (33383, 33917, 38405)
                           or t2.n1 in (150, 234, 456)
                           or t2.n1 in (309, 33456, 675)
                           or t2.n1 in (10, 2345, 42367)
                           or 
                           or
                           etc..
                    )
                   )
                  )
                 )
		   )
	     );

Just before launching this query I actived the 10046 trace event as shown below:


SQL> set timing on

SQL> @46on 12

SQL>@MyLongParsingQuery

        N1         N2         N3
---------- ---------- ----------
         5          1          5
        75         24          5
       117         38          7
       228         75          8
     16728       5575          8

Elapsed: 00:00:14.39

SQL> @46off

As you can see the query took 14 seconds wall clock time to complete. And you have to believe me when I say that the same query when executed without tracing lasted 14 seconds too.

Since the query execution phase is super-fast (less than 5 seconds) and is run serially, it will not be automatically monitored. This is why I forced it to be monitored using the hint /*+ monitor */. Here is below the corresponding SQL monitor report (reduced to the Global Information and Global Stats sections):

Global Information
------------------------------
 Status              :  DONE ( ALL ROWS)             
 Instance ID         :  1                     
 SQL ID              :  1w1jpyxgk0asj                    
 Execution Started   :  06/08/2018 12:16:31   
 First Refresh Time  :  06/08/2018 12:16:31  
 Last Refresh Time   :  06/08/2018 12:16:32   
 Duration            :  1s                  
                 
Global Stats
=================================================
| Elapsed |   Cpu   |  IO      | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  |
=================================================
|  0.08   |  0.02   |   0.06   |   2   |   542  |
=================================================

It is interesting to note that both Duration and Elapsed time are not showing the correct time of 14 seconds taken by the query.

However the TKPROF of the query 10046 trace file generated above is clearly showing where most of those 14 seconds have been spent:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1     12.89      14.22          0          0          3           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.02       0.08         40        542          0           4
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4     12.91      14.30         40        542          3           4

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  PGA memory operation                         1179        0.00          0.01
  acknowledge over PGA limit                    117        0.10          1.30
  SQL*Net message to client                       2        0.00          0.00
  db file sequential read                        40        0.01          0.06
  SQL*Net message from client                     2        0.00          0.01
********************************************************************************

Thanks to the above TKPROF lines we now have a clear image of the query execution time distribution:

 elapsed time    = 14.30 seconds (of which 14.22 for parsing)
 cpu consumption = 12.91 seconds (of which 12.89 for parsing)
 wait events     = 0.01 + 1.30 + 0.00 + 0.06 + 0.01 = 1.38 seconds
 elapsed -cpu    = 14.30 -12.91 = 1.39 seconds

Summary

In this second part of the series about Duration and Elapsed time information displayed by the SQL monitor report, we have demonstrated the following two points:

• SQL monitor monitors only the execution phase of the query
• Any preliminary steps preceding the execution phase will not be considered by the SQL monitor. Typically, time taken by a long hard parsing situation, will not show up neither in the Duration nor in the Elapsed time.

Comments (5)

June 9, 2018

ASH, SQL Monitor and Extended trace

Filed under: ASH,SQL Monitor — hourim @ 3:21 pm

Without any analogy with Sergio Leone’s famous movie “The Good, the Bad and the Ugly” which has brought up my teenage years, I would rather define the above three Oracle features as “The Good, the Good and the Good” as they made my consultancy days very often. This article has been inspired by the last Nenad Noveljic blog post about negative unaccounted-for where he mentioned that the SQL Monitor Global Stats section might show a wrong value in the Elapsed time. In Nenad’s case this misleading piece of information was due to unaccounted-for time. In fact, I know three other situations where this time discrepancy might occur: when fetching data, during a long hard parsing query and when using parallel execution. In this article I will start by demonstrating the fetching data case. Two other articles for the parse and parallel cases will follow to complete the series.

Fetching data

I executed a query which takes few minutes to complete but spent a considerable amount of time fetching a large number of columns with quite a large number of rows. While the query was running I took several snapshots of its corresponding 12cR2 SQL monitor which I am presenting below:

Global Information
------------------------------
 Status              :  EXECUTING             
 Instance ID         :  1                     
 SQL ID              :  g6px76dmjv1jy                    
 Execution Started   :  06/05/2018 09:54:19   
 First Refresh Time  :  06/05/2018 09:54:25  
 Last Refresh Time   :  06/05/2018 09:54:29   
 Duration            :  11s                  
                 
Global Stats
=========================================
| Elapsed |   Cpu   |  IO      | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets  |
=========================================
|    10   |    8.36 |     1.34 |   2M   |
=========================================

SQL Plan Monitoring Details (Plan Hash Value=2278376452)
===========================================================================
| Id    |                 Operation |       Name       | Execs |   Rows   |
|       |                           |                  |       | (Actual) |
===========================================================================
|     0 | SELECT STATEMENT          |                  |     1 |          |
|     1 |   NESTED LOOPS OUTER      |                  |     1 |          |
|     2 |    NESTED LOOPS           |                  |     1 |          |
|     3 |     FILTER                |                  |     1 |          |

As you can see, both Elapsed time and Duration are showing the same time of 10 seconds. But spot as well that at this stage of the query execution, data was not yet sent to the client (Actual Rows column is empty).

At exactly the same moment, ASH is also showing approximately the same elapsed time as shown below:

column  cnt   format 9999
compute sum label 'Total Elapsed' of cnt on report

break   on report

select 
   sql_plan_line_id
  ,count(1) cnt
from
   gv$active_session_history
 where
    sample_time between to_date('&datefrom', 'mm/dd/yyyy hh24:mi:ss')
				and     to_date('&dateto', 'mm/dd/yyyy hh24:mi:ss')
and sql_id = '&sql_id'
group by sql_plan_line_id
order by 2 desc;

SQL> @ashplanlineId
Enter value for datefrom:  06/05/2018 09:54:19
Enter value for dateto:  06/05/2018 09:54:29
Enter value for sql_id: g6px76dmjv1jy


SQL_PLAN_LINE_ID   CNT
---------------- -----
              10    7
              13    1
              11    1
                 -----
Total Elapsed       9

However, as soon as Oracle has started sending data to the client, a discrepancy between Elapsed time in Global Stats section and Duration began to be displayed as shown below:

Global Information
------------------------------
 Status              :  EXECUTING             
 Instance ID         :  1                     
 SQL ID              :  g6px76dmjv1jy                    
 Execution Started   :  06/05/2018 09:54:19   
 First Refresh Time  :  06/05/2018 09:54:25  
 Last Refresh Time   :  06/05/2018 09:56:18   
 Duration            :  120s                  
 Fetch Calls         :  375            
    
Global Stats
=================================================
| Elapsed |   Cpu   |  IO      | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  |
=================================================
|    32   |    11   |     21   |   375 |   3M   |
=================================================

SQL Plan Monitoring Details (Plan Hash Value=2278376452)
===========================================================================
| Id    |                 Operation |       Name       | Start |   Rows   |
|       |                           |                  | Active| (Actual) |
===========================================================================
| ->  0 | SELECT STATEMENT          |                  |   +30 |     5611 |
| ->  1 |   NESTED LOOPS OUTER      |                  |   +30 |     5611 |
| ->  2 |    NESTED LOOPS           |                  |   +30 |     5611 |
| ->  3 |     FILTER                |                  |   +30 |          |

ASH, in accordance with SQL monitor Elapsed time, started also to report a wrong elapsed time as shown below:

SQL> @ashplanlineId
Enter value for datefrom:  06/05/2018 09:54:19
Enter value for dateto:  06/05/2018 09:56:18
Enter value for sql_id: g6px76dmjv1jy


SQL_PLAN_LINE_ID   CNT
---------------- -----
              13    25
              10     7
              4      1
              4      1
                 -----
Total Elapsed       34

And so on until the query has finished fetching the entire set of data in about 527 seconds while both ASH and Elapsed time of the SQL monitor section are showing roughly 34 seconds of elapsed time as shown below respectively:

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)             
 Instance ID         :  1                     
 SQL ID              :  g6px76dmjv1jy                    
 Execution Started   :  06/05/2018 09:54:19   
 First Refresh Time  :  06/05/2018 09:54:25  
 Last Refresh Time   :  06/05/2018 10:03:06   
 Duration            :  527s                  
 Fetch Calls         :  2085            
    
Global Stats
=================================================
| Elapsed |   Cpu   |  IO      | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  |
=================================================
|    34   |    12   |     22   |  2085 |   3M   |
=================================================

SQL> @ashplanlineId
Enter value for datefrom:  06/05/2018 09:54:19
Enter value for dateto:  06/05/2018 09:56:18
Enter value for sql_id: g6px76dmjv1jy


SQL_PLAN_LINE_ID   CNT
---------------- -----
              13    26
              10     7
               4     1
              11     1
               5     1
                 -----
Total Elapsed       36

Using Nenad query, I can also report the same wrong elapsed time from ASH.

column  event format a30
column  cnt   format 9999
compute sum label 'Total Elapsed' of cnt on report

break   on report
select 
   session_state, event
  ,count(1) cnt
from
   gv$active_session_history
 where
   sql_id = '&sql_id'
 and sql_exec_id = (select max(sql_exec_id) 
                    from gv$active_session_history
					where sql_id = '&sql_id'
					)
group by event, session_state
;
SESSION EVENT                            CNT
------- ------------------------------ -----
WAITING direct path read                  26
ON CPU                                    11
                                       -----
Total Elapsed                             37

TKPROF

In a second attempt to make this time difference crystal clear I traced another execution of the same query using the 10046 trace events. I reproduced below the corresponding SQL monitor report and TKPROF summary respectively:

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)             
 Instance ID         :  1                     
 SQL ID              :  g6px76dmjv1jy                    
 Execution Started   :  06/05/2018 09:27:23   
 First Refresh Time  :  06/05/2018 09:27:30  
 Last Refresh Time   :  06/05/2018 09:39:21   
 Duration            :  718s                  
 Fetch Calls         :  2085            
    
Global Stats
=================================================
| Elapsed |   Cpu   |  IO      | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  |
=================================================
|    198  |    15   |     183  |  2085 |   3M   |
=================================================


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.21       0.21          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     2085     15.18     198.03     647106    2918455          0       31246
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     2087     15.39     198.24     647106    2918455          0       31246


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    2085        0.00          0.00
  SQL*Net more data to client                     2        0.00          0.00
  Disk file operations I/O                       90        0.00          0.00
  db file sequential read                     30313        0.09        158.42
  reliable message                                1        0.00          0.00  
  enq: KO – fast object checkpoint                1        0.00          0.00
  direct path read                             1220        0.09         24.76
  SQL*Net message from client                  2085        0.93        519.45
********************************************************************************

As you can see, the Duration of the SQL monitor is showing that the query took 718 seconds while the Elapsed time is showing only 198 seconds (15 sec for CPU and 183 sec for IO waits).
However, the TKPROF file comforts the reliability of the SQL monitor Duration information since it shows that the total elapsed time of the query is:

158.42 + 24.76     = 183.18 seconds for I/O waits
158.42 + 519.45    = 702.63 seconds for I/O waits and  SQL*Net message from client
702.63 + 15.39(cpu)= 718.02 seconds

Naturally, ASH can’t display the correct piece of information since a large part of the query has been spent on the idle SQL*Net message from client wait event:

SQL> @ashplanlineId
Enter value for datefrom:  06/05/2018 09:27:23
Enter value for dateto:  06/05/2018 09:39:21
Enter value for sql_id: g6px76dmjv1jy

SQL_PLAN_LINE_ID   CNT
---------------- -----
              12   153
              13    28
              10     9
              11     6
              14     1
                 -----
Total Elapsed      197

Summary

We have clearly shown that when a query is fetching a large amount of data to the client, this transfer time is not taken into account by the Elapsed time of the SQL monitor Global Stats section. This is why, as mentioned by Nenad in his article, you should always rely on the time reported by the Duration in the Global information of the SQL Monitor instead of the Elapsed time of the Global stats section.

Comments (10)

Mohamed Houri’s Oracle Notes

March 7, 2020

SQL Monitor flaw

DONE (ERROR)

Summary

June 28, 2018

SQL Monitor and parallel processing

June 21, 2018

SQL Monitor and parsing

June 9, 2018

ASH, SQL Monitor and Extended trace

Follow Blog via Email

Categories

Blog Stats

Blogs I Follow

Archives