This is a small note about a situation I have encountered and which I thought it is worth sharing with you. There was an insert/select executing in parallel DOP 16 on a 11.2.0.3 Oracle database for which the end user was complaining about the exceptional time it was taking without completing. Since the job was still running I tried getting its Real Time SQL monitoring report:
Global Information ------------------------------ Status : DONE (ERROR) Instance ID : 1 Session : XXXXX (392:229) SQL ID : bbccngk0nn2z2 SQL Execution ID : 16777216 Execution Started : 06/22/2015 11:57:06 First Refresh Time : 06/22/2015 11:57:06 Last Refresh Time : 06/22/2015 11:57:46 Duration : 40s Global Stats ================================================================================================= | Elapsed | Cpu | IO | Concurrency | Other | Buffer | Read | Read | Write | Write | | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Gets | Reqs | Bytes | Reqs | Bytes | ================================================================================================= | 15315 | 15220 | 54 | 0.38 | 40 | 2G | 8601 | 2GB | 5485 | 1GB |
The insert/select according to the above report summary is DONE with (ERROR).
So why the end user is still complaining about the not ending batch job? And why he didn’t receive an error?
After having ruled out the resumable time out hypothesis I came back to the v$sql_monitor and issued the following two selects:
SQL> SELECT 2 sql_id, 3 process_name, 4 status 5 FROM v$sql_monitor 6 WHERE sql_id = 'bbccngk0nn2z2' 7 AND status ='EXECUTING' 8 ORDER BY process_name ; SQL_ID PROCE STATUS ------------- ----- ------------ bbccngk0nn2z2 p000 EXECUTING bbccngk0nn2z2 p001 EXECUTING bbccngk0nn2z2 p002 EXECUTING bbccngk0nn2z2 p003 EXECUTING bbccngk0nn2z2 p004 EXECUTING bbccngk0nn2z2 p005 EXECUTING bbccngk0nn2z2 p006 EXECUTING bbccngk0nn2z2 p007 EXECUTING bbccngk0nn2z2 p008 EXECUTING bbccngk0nn2z2 p009 EXECUTING bbccngk0nn2z2 p010 EXECUTING bbccngk0nn2z2 p011 EXECUTING bbccngk0nn2z2 p012 EXECUTING bbccngk0nn2z2 p013 EXECUTING bbccngk0nn2z2 p014 EXECUTING bbccngk0nn2z2 p015 EXECUTING bbccngk0nn2z2 p019 EXECUTING bbccngk0nn2z2 p031 EXECUTING SQL> SELECT 2 sql_id, 3 process_name, 4 status 5 FROM v$sql_monitor 6 WHERE sql_id = 'bbccngk0nn2z2' 7 AND status ='DONE (ERROR)' 8 ORDER BY process_name ; SQL_ID PROCE STATUS ------------- ----- ------------------- bbccngk0nn2z2 ora DONE (ERROR) bbccngk0nn2z2 p016 DONE (ERROR) bbccngk0nn2z2 p017 DONE (ERROR) bbccngk0nn2z2 p018 DONE (ERROR) bbccngk0nn2z2 p020 DONE (ERROR) bbccngk0nn2z2 p021 DONE (ERROR) bccngk0nn2z2 p022 DONE (ERROR) bbccngk0nn2z2 p023 DONE (ERROR) bbccngk0nn2z2 p024 DONE (ERROR) bbccngk0nn2z2 p025 DONE (ERROR) bbccngk0nn2z2 p026 DONE (ERROR) bbccngk0nn2z2 p027 DONE (ERROR) bbccngk0nn2z2 p028 DONE (ERROR) bbccngk0nn2z2 p029 DONE (ERROR) bbccngk0nn2z2 p030 DONE (ERROR)
Among the 32 parallel servers half are executing and half are in error! How could this be possible? I have already been confronted to a parallel process that ends in its entirety when a single parallel server is in error. For example I have encountered several times the following error which is due to a parallel broadcast distribution of a high data row source exploding henceforth the TEMP tablespace:
ERROR at line 1: ORA-12801: error signaled in parallel query server P013 ORA-01652: unable to extend temp segment by 128 in tablespace TEMP
A simple select against v$active_session_history confirmed that the insert/select is still running and it is consuming CPU
SQL> select sql_id, count(1) 2 from gv$active_session_history 3 where sample_time between to_date('22062015 12:30:00', 'ddmmyyyy hh24:mi:ss') 4 and to_date('22062015 13:00:00', 'ddmmyyyy hh24:mi:ss') 5 group by sql_id 6 order by 2 desc; SQL_ID COUNT(1) ------------- ---------- bbccngk0nn2z2 2545 4 0uuczutvk6jqj 1 8f1sjvfxuup9w 1 SQL> select decode(event,null, 'on cpu', event), count(1) 2 from gv$active_session_history 3 where sample_time between to_date('22062015 12:30:00', 'ddmmyyyy hh24:mi:ss') 4 and to_date('22062015 13:00:00', 'ddmmyyyy hh24:mi:ss') 5 and sql_id = 'bbccngk0nn2z2' 6 group by event 7 order by 2 desc; DECODE(EVENT,NULL,'ONCPU',EVENT) COUNT(1) -------------------------------- --------- on cpu 5439 db file sequential read 3 SQL> / DECODE(EVENT,NULL,'ONCPU',EVENT) COUNT(1) --------------------------------- --------- on cpu 5460 db file sequential read 3 SQL> / DECODE(EVENT,NULL,'ONCPU',EVENT) COUNT(1) -------------------------------- --------- on cpu 5470 db file sequential read 3
And after a while
SQL> / DECODE(EVENT,NULL,'ONCPU',EVENT) COUNT(1) ---------------------------------- --------- on cpu 15152 db file sequential read 9
While the parallel insert is still running I took several SQL monitoring reports of which the two followings ones:
Parallel Execution Details (DOP=16 , Servers Allocated=32) ============================================================================================ | Name | Type | Server# | Elapsed |Buffer | Read | Wait Events | | | | | Time(s) | Gets | Bytes | (sample #) | ============================================================================================ | PX Coordinator | QC | | 0.48 | 2531 | 16384 | | | p000 | Set 1 | 1 | 1049 | 128M | 63MB | direct path read (1) | | p001 | Set 1 | 2 | 1518 | 222M | 61MB | | | p002 | Set 1 | 3 | 893 | 109M | 59MB | | | p003 | Set 1 | 4 | 1411 | 194M | 62MB | direct path read (1) | | p004 | Set 1 | 5 | 460 | 64M | 62MB | direct path read (1) | | p005 | Set 1 | 6 | 771 | 87M | 322MB | direct path read (1) | | | | | | | | direct path read temp (5) | | p006 | Set 1 | 7 | 654 | 67M | 62MB | direct path read (1) | | p007 | Set 1 | 8 | 179 | 24M | 55MB | direct path read (1) | | p008 | Set 1 | 9 | 1638 | 235M | 70MB | | | p009 | Set 1 | 10 | 360 | 46M | 54MB | direct path read (1) | | p010 | Set 1 | 11 | 1920 | 294M | 337MB | direct path read temp (6) | --> 1920s | p011 | Set 1 | 12 | 289 | 30M | 69MB | | | p012 | Set 1 | 13 | 839 | 98M | 66MB | direct path read (1) | | p013 | Set 1 | 14 | 524 | 63M | 55MB | | | p014 | Set 1 | 15 | 1776 | 263M | 69MB | | | p015 | Set 1 | 16 | 1016 | 130M | 61MB | direct path read (1) | | p016 | Set 2 | 1 | 0.22 | 1166 | 3MB | | | p017 | Set 2 | 2 | 1.36 | 6867 | 51MB | | | p018 | Set 2 | 3 | 1.02 | 1298 | 36MB | | | p019 | Set 2 | 4 | 6.71 | 2313 | 129MB | direct path read temp (2) | | p020 | Set 2 | 5 | 0.40 | 978 | 16MB | | | p021 | Set 2 | 6 | 1.32 | 8639 | 41MB | direct path read temp (1) | | p022 | Set 2 | 7 | 0.18 | 896 | 2MB | | | p023 | Set 2 | 8 | 0.23 | 469 | 9MB | | --> 0.23s | p024 | Set 2 | 9 | 0.52 | 3635 | 19MB | | --> 0.52s | p025 | Set 2 | 10 | 0.33 | 1163 | 3MB | | | p026 | Set 2 | 11 | 0.65 | 260 | 31MB | db file sequential read (1) | | p027 | Set 2 | 12 | 0.21 | 1099 | 6MB | | | p028 | Set 2 | 13 | 0.58 | 497 | 20MB | | | p029 | Set 2 | 14 | 1.43 | 4278 | 54MB | | | p030 | Set 2 | 15 | 0.30 | 3481 | 8MB | | | p031 | Set 2 | 16 | 2.86 | 517 | 91MB | | ============================================================================================
Parallel Execution Details (DOP=16 , Servers Allocated=32) ============================================================================================= | Name | Type | Server# | Elapsed | Buffer | Read | Wait Events | | | | | Time(s) | Gets | Bytes | (sample #) | ============================================================================================= | PX Coordinator | QC | | 0.48 | 2531 | 16384 | | | p000 | Set 1 | 1 | 1730 | 202M | 63MB | direct path read (1) | | p001 | Set 1 | 2 | 2416 | 351M | 61MB | | | p002 | Set 1 | 3 | 1094 | 133M | 59MB | | | p003 | Set 1 | 4 | 2528 | 348M | 64MB | direct path read (1) | | p004 | Set 1 | 5 | 965 | 129M | 63MB | direct path read (1) | | p005 | Set 1 | 6 | 1089 | 129M | 322MB | direct path read (1) | | | | | | | | direct path read temp (5) | | p006 | Set 1 | 7 | 1459 | 165M | 62MB | direct path read (1) | | p007 | Set 1 | 8 | 221 | 30M | 55MB | direct path read (1) | | p008 | Set 1 | 9 | 2640 | 357M | 70MB | | | p009 | Set 1 | 10 | 952 | 115M | 54MB | direct path read (1) | | p010 | Set 1 | 11 | 3117 | 471M | 337MB | direct path read temp (6) | --> 3117s | p011 | Set 1 | 12 | 400 | 42M | 69MB | | | p012 | Set 1 | 13 | 1621 | 195M | 66MB | direct path read (1) | | p013 | Set 1 | 14 | 1126 | 132M | 55MB | | | p014 | Set 1 | 15 | 2662 | 370M | 72MB | | | p015 | Set 1 | 16 | 1194 | 147M | 61MB | direct path read (1) | | p016 | Set 2 | 1 | 0.22 | 1166 | 3MB | | | p017 | Set 2 | 2 | 1.36 | 6867 | 51MB | | | p018 | Set 2 | 3 | 1.02 | 1298 | 36MB | | | p019 | Set 2 | 4 | 6.72 | 2313 | 131MB | direct path read temp (2) | | p020 | Set 2 | 5 | 0.40 | 978 | 16MB | | | p021 | Set 2 | 6 | 1.32 | 8639 | 41MB | direct path read temp (1) | | p022 | Set 2 | 7 | 0.18 | 896 | 2MB | | | p023 | Set 2 | 8 | 0.23 | 469 | 9MB | | --> 0.23s | p024 | Set 2 | 9 | 0.52 | 3635 | 19MB | | --> 0.52s | p025 | Set 2 | 10 | 0.33 | 1163 | 3MB | | | p026 | Set 2 | 11 | 0.65 | 260 | 31MB | db file sequential read (1) | | p027 | Set 2 | 12 | 0.21 | 1099 | 6MB | | | p028 | Set 2 | 13 | 0.58 | 497 | 20MB | | | p029 | Set 2 | 14 | 1.43 | 4278 | 54MB | | | p030 | Set 2 | 15 | 0.30 | 3481 | 8MB | | | p031 | Set 2 | 16 | 2.89 | 517 | 92MB | | =============================================================================================
If you look carefully to the above reports you will notice that the elapsed time of the parallel servers mentioned being in ERROR (p16-p30) is not increasing in contrast to the elapsed time of the parallel servers mentioned being in EXECUTION (p0-p15) which is continuously increasing.
Thanks to Randolf Geist (again) I knew that there is a bug in Real time SQL monitoring report which occurs when a parallel server is not working for more than 30 minutes. In such a case the Real time SQL monitoring will starts showing those parallel severs in in ERROR confusing the situation.
As far as I was able to reproduce the issue I started the process again at 16h03 and I kept executing the following select from time to time having no rows for each execution
SELECT sql_id, process_name, status FROM v$sql_monitor WHERE sql_id = '5np4u0m0h69jx' –- changed a little bit the sql_id AND status ='DONE (ERROR)' ORDER BY process_name ; no rows selected
Until at around 16h37 i.e. after 30 minutes (and a little bit more) of execution the above select started showing processes in error:
SQL> SELECT 2 sql_id, 3 process_name, 4 status 5 FROM v$sql_monitor 6 WHERE sql_id = '5np4u0m0h69jx' 7 AND status ='DONE (ERROR)' 8 ORDER BY process_name ; SQL_ID PROCE STATUS ------------- ----- --------------- 5np4u0m0h69jx ora DONE (ERROR) 5np4u0m0h69jx p016 DONE (ERROR) 5np4u0m0h69jx p017 DONE (ERROR) 5np4u0m0h69jx p018 DONE (ERROR) 5np4u0m0h69jx p020 DONE (ERROR) 5np4u0m0h69jx p021 DONE (ERROR) 5np4u0m0h69jx p022 DONE (ERROR) 5np4u0m0h69jx p023 DONE (ERROR) 5np4u0m0h69jx p024 DONE (ERROR) 5np4u0m0h69jx p025 DONE (ERROR) 5np4u0m0h69jx p026 DONE (ERROR) 5np4u0m0h69jx p027 DONE (ERROR) 5np4u0m0h69jx p028 DONE (ERROR) 5np4u0m0h69jx p029 DONE (ERROR) 5np4u0m0h69jx p030 DONE (ERROR)
At the very beginning of the process several parallel servers was not running while several others were busy. And when the first parallel server (p10 in this case) reaches more than 1800 seconds (1861 seconds in this case) the real time Sql monitoring started showing the not working parallel servers in ERROR.
Bottom line: don’t be confused (as I have been) by that DONE (ERROR) status, your SQL statement might still be running consuming time and energy despite this wrong real time SQL monitoring reporting status
[…] Part 2: where I have explained the oddity shown by the SQL monitoring report when monitoring non active parallel server for more than 30 minutes. […]
Pingback by Don’t pre-empt the CBO from doing its work | Mohamed Houri’s Oracle Notes — July 2, 2015 @ 2:03 pm |
[…] Houri about what happens when a parallel slave is still assigned to an executing statement but isn’t given any work to do for a long time; and an item from Stefan Koehler about _px_trace and tracking down why the […]
Pingback by PQ Index anomaly | Oracle Scratchpad — July 15, 2015 @ 7:35 pm |