Observe the following real-life real time SQL Monitor report and try to figure out why the Duration is not equal to the Elapsed time
Global Information ------------------------------ Status : DONE (ERROR) Instance ID : 3 Session : XXXX (1016:28171) SQL ID : 94dp7vscw26sf SQL Execution ID : 50331648 Execution Started : 01/21/2020 13:50:26 First Refresh Time : 01/21/2020 13:50:26 Last Refresh Time : 01/21/2020 13:50:30 Duration : 4s Module/Action : SQL*Plus/- Service : YYYYY Program : sqlplus.exe Global Stats =================================================================================== | Elapsed | Cpu | IO | Application | Cluster | Buffer | Offload | | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Gets | Returned Bytes | =================================================================================== | 1981 | 249 | 1687 | 0.00 | 45 | 8M | 92MB | ===================================================================================
In the following series of articles about the SQL monitor Duration and Elapsed Time:
I’ve explained why we should always rely on the Duration information as it represents the wall clock time of the query. The Elapsed time doesn’t include time spent by Idle session fetching data. And, when a SQL statement runs parallelly, the Elapsed time includes execution time of both the Query Coordinator and all its parallel server slaves.
So, If I tell you that there is no parsing effect in the above query and that there are no fetch calls, you would normally infer that it has really spent 4 seconds to complete; right?
What if I tell you that this query took approximatively 1981 seconds? Will you believe me?
Indeed, this is a parallel query using DOP 8 which took about 1981 seconds and where the Duration (4 seconds) represents only the time taken by the Query Coordinator. This is due to a flaw I have already explained in this article but where I’ve not been so smart, at that time, to point out the Duration flaw as well.
DONE (ERROR)
In contrast to what the status is saying, this query completes without error. And that is the first clue that should always put you on the right path when trouble shooting performance issue via SQL monitor reports. When this 12cR2 SQL query has been handled to me for investigation I immediately launched it again and here’s below the different phases reported by the SQL Monitor:
Global Information ------------------------------ Status : EXECUTING Instance ID : 3 Session : XXXX (1016:28171) SQL ID : 94dp7vscw26sf SQL Execution ID : 50331648 Execution Started : 01/21/2020 13:50:26 First Refresh Time : 01/21/2020 13:50:26 Last Refresh Time : 01/21/2020 13:50:30 Duration : 119s Module/Action : SQL*Plus/- Service : YYYYY Program : sqlplus.exe Global Stats =================================================================================== | Elapsed | Cpu | IO | Application | Cluster | Buffer | Offload | | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Gets | Returned Bytes | =================================================================================== | 126 | 43 | 82 | 0.00 | 1.98 | 459K | 84MB | =================================================================================== Parallel Execution Details (DOP=8 , Servers Allocated=16) ../.. Global Information ------------------------------ Status : EXECUTING Instance ID : 3 Session : XXXX (1016:28171) SQL ID : 94dp7vscw26sf SQL Execution ID : 50331648 Execution Started : 01/21/2020 13:50:26 First Refresh Time : 01/21/2020 13:50:26 Last Refresh Time : 01/21/2020 13:50:30 Duration : 1026s Module/Action : SQL*Plus/- Service : YYYYY Program : sqlplus.exe Global Stats ========================================================================== | Elapsed | Cpu | IO | Application | Cluster | Offload | | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Returned Bytes | ========================================================================== | 1049 | 146 | 879 | 0.00 | 24 | 88MB | ========================================================================== Parallel Execution Details (DOP=8 , Servers Allocated=16)
So far so good right?
In fact, if you look closely there is already something odd? Isn’t it?
Have you spotted out that the Last Refresh Time column is not changing?
Anyway, while the query was still running I launched the following query to see how the workload is distributed among the different parallel server slaves
@ getTimeSqlmon.sql col exec_start format a25 col process format a10 col status format a15 select -- sql_id sql_exec_id ,to_char(sql_exec_start,'dd/mm/yyyy hh24:mi:ss') exec_start ,to_char(sysdate,'dd/mm/yyyy hh24:mi:ss') current_time ,process_name process ,round((sysdate-sql_exec_start) * 60 * 24,2) "minutes" ,round(elapsed_time/1e6,3) "elaps(sec)" ,round(cpu_time/1e6,3) "cpu(sec)" ,sql_plan_hash_value "plan hash value" ,status from gv$sql_monitor where sql_id = '&sql_id' order by exec_start,action; SQL> @GetTimeSQLMon2 Enter value for sql_id: 94dp7vscw26sf SQL_EXEC_ID EXEC_START PROCE mm elaps(sec) cpu(sec) SQL_PLAN_HASH_VALUE STATUS ----------- --------------------- ----- ----- ---------- ---------- ------------------- ------ 50331648 21/01/2020 13:50:26 p002 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p003 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p008 16.35 3.364 .478 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p000 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00f 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p007 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00d 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p005 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p004 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00c 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p001 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p009 16.35 978.915 136.371 524852868 EXECUTING ---------->>>>>> 50331648 21/01/2020 13:50:26 p00e 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p006 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 ora 16.35 .152 4.125 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00b 16.35 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00a 16.35 0 0 524852868 EXECUTIN 17 rows selected.
SQL> @GetTimeSQLMon2 Enter value for sql_id: 94dp7vscw26sf SQL_EXEC_ID EXEC_START PROCESS minutes elaps(sec) cpu(sec) plan hash value STATUS ----------- -------------------- -------- ------- ---------- ---------- --------------- --------- 50331648 21/01/2020 13:50:26 p002 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p003 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p008 30.28 3.364 .478 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p000 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00f 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p007 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00d 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p005 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p004 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00c 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p001 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p009 30.28 1813.916 225.829 524852868 EXECUTING ---------->>>>>> 50331648 21/01/2020 13:50:26 p00e 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p006 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 ora 30.28 .152 4.125 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00b 30.28 0 0 524852868 EXECUTING 50331648 21/01/2020 13:50:26 p00a 30.28 0 0 524852868 EXECUTING 17 rows selected.
As you can see there is only a single parallel slave which is working (p009). All other slaves are waiting.
I managed to display, roughly, the number of minutes spent by each process not doing anything i.e. waiting. But if you read this article you will know that when a process waits for more than 30 minutes, its status will change to DONE (ERROR). Spot below how this is true:
SQL> @GetTimeSQLMon2 Enter value for sql_id: 94dp7vscw26sf SQL_EXEC_ID EXEC_START PROCESS minutes elaps(sec) cpu(sec) STATUS ----------- --------------------- ------- -------- ---------- ---------- ------------ 50331648 21/01/2020 13:50:26 p002 30.58 0 0 DONE (ERROR) ---------->>>>>> IN ERROR after 30 min waiting 50331648 21/01/2020 13:50:26 p003 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p008 30.58 3.364 .478 DONE (ERROR) 50331648 21/01/2020 13:50:26 p000 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00f 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p007 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00d 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p005 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p004 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00c 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p001 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p009 30.58 1833.977 228.496 EXECUTING ---------->>>>>> 50331648 21/01/2020 13:50:26 p00e 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p006 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 ora 30.58 .152 4.125 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00b 30.58 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00a 30.58 0 0 DONE (ERROR) 17 rows selected.
And so on until the p009 process ends successfully
SQL> @GetTimeSQLMon2 Enter value for sql_id: 94dp7vscw26sf SQL_EXEC_ID EXEC_START PROCESS minutes elaps(sec) cpu(sec) STATUS ----------- --------------------- ---------- ---------- ---------- ---------- --------------- 50331648 21/01/2020 13:50:26 p002 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p003 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p008 42.88 3.364 .478 DONE (ERROR) 50331648 21/01/2020 13:50:26 p000 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00f 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p007 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00d 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p005 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p004 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00c 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p001 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p009 42.88 1942.557 244.777 DONE (ALL ROWS) ---------->>>>>> 50331648 21/01/2020 13:50:26 p00e 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p006 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 ora 42.88 .152 4.125 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00b 42.88 0 0 DONE (ERROR) 50331648 21/01/2020 13:50:26 p00a 42.88 0 0 DONE (ERROR) 17 rows selected.
A couple of minutes before the p009 PX slave finishes its work, the Duration was almost reporting a correct time as the following proves:
Global Information ------------------------------ Status : EXECUTING Instance ID : 3 Session : XXXX (1016:28171) SQL ID : 94dp7vscw26sf SQL Execution ID : 50331648 Execution Started : 01/21/2020 13:50:26 First Refresh Time : 01/21/2020 13:50:26 Last Refresh Time : 01/21/2020 13:50:30 Duration : 1283s ------>>>>> spot the Duration here Module/Action : SQL*Plus/- Service : YYYYY Program : sqlplus.exe Global Stats ======================================================================== | Elapsed | Cpu | IO | Application | Buffer | Offload | | Time(s) | Time(s) | Waits(s) | Waits(s) | Gets | Returned Bytes | ======================================================================== | 1309 | 174 | 1106 | 0.00 | 5M | 88MB | ======================================================================== Parallel Execution Details (DOP=8 , Servers Allocated=16)
Unfortunately, at the end of the query, the Duration is wrongly updated to show 4 seconds instead of the wall clock time which should be around 1981 seconds
Global Information ------------------------------ Status : DONE (ERROR) Instance ID : 3 Session : XXXX (1016:28171) SQL ID : 94dp7vscw26sf SQL Execution ID : 50331648 Execution Started : 01/21/2020 13:50:26 First Refresh Time : 01/21/2020 13:50:26 Last Refresh Time : 01/21/2020 13:50:30 Duration : 4s ------>>>>> and spot the Duration at the end of the query Module/Action : SQL*Plus/- Service : YYYYY Program : sqlplus.exe Global Stats ===================================================================== | Elapsed | Cpu | IO | Cluster | Buffer | Offload | | Time(s) | Time(s) | Waits(s) | Waits(s) | Gets | Returned Bytes | ===================================================================== | 1981 | 249 | 1687 | 45 | 8M | 92MB | ===================================================================== Parallel Execution Details (DOP=8 , Servers Allocated=16)
Those 4 seconds represent either the time difference between the start of the query and the last refresh time or the time spent by the query coordinator which is 3.364+0.478 = 3.842 ~ 4 seconds
In passing here how the workload has been distributed when I fixed the issue
SQL> @getTimeSQLMon2 Enter value for sql_id: 2hrxjvurz5c0p SQL_EXEC_ID EXEC_START PROCESS minutes elaps(sec) cpu(sec) STATUS ----------- --------------------- ------- ------- ---------- ---------- --------------- 50331648 21/01/2020 16:09:40 p002 10.88 .403 .52 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p009 10.88 38.332 5.065 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p000 10.88 .425 .533 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p008 10.88 38.138 4.623 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p003 10.88 .468 .628 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p00b 10.88 38.345 4.993 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p00d 10.88 38.206 4.362 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p005 10.88 .453 .613 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p00f 10.88 38.279 5.089 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p007 10.88 .429 .537 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p00c 10.88 38.211 5.443 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p004 10.88 .474 .632 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 ora 10.88 2.12 1.507 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p00e 10.88 38.348 6.344 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p006 10.88 .413 .613 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p001 10.88 .432 .589 DONE (ALL ROWS) 50331648 21/01/2020 16:09:40 p00a 10.88 38.624 5.481 DONE (ALL ROWS) 17 rows selected.
Summary
When you retrospectively analyze a parallel query performance issue through its corresponding SQL monitoring showing a DONE (ERROR) status, then you should log in your brain the followings:
• There is a probability that the time reported by the Duration is wrong as it either reports the QC time or the time difference between the start of the query and the last time it has been refreshed
• Don’t trust the DONE (ERROR) status. The query might have completed successfully, and this status value is simply indicating that the QC and one or many of its parallel slaves have been waiting for more than 30 minutes