It is well known that any record found in dba_hist_active_sess_history has necessarily transited through v$active_session_history first. If so, how should we interpret the following cut-and-paste from a running production system?
ASH first
SQL> select event, count(1)
     from   gv$active_session_history
     where  sample_time between to_date('06072015 18:30:00', 'ddmmyyyy hh24:mi:ss')
                        and     to_date('06072015 19:30:00', 'ddmmyyyy hh24:mi:ss')
     group  by event
     order  by 2 desc;

EVENT                                                              COUNT(1)
---------------------------------------------------------------- ----------
                                                                        372
direct path read                                                        185
log file parallel write                                                  94
Disk file Mirror Read                                                    22
control file sequential read                                             20
control file parallel write                                              18
direct path write temp                                                   16
Streams AQ: qmn coordinator waiting for slave to start                   12
db file parallel read                                                    11
gc cr multi block request                                                 6
enq: KO - fast object checkpoint                                          4
db file sequential read                                                   3
ges inquiry response                                                      3
os thread startup                                                         2
PX Deq: Signal ACK RSG                                                    2
enq: CF - contention                                                      1
PX Deq: Slave Session Stats                                               1
Disk file operations I/O                                                  1
IPC send completion sync                                                  1
reliable message                                                          1
null event                                                                1
enq: CO - master slave det                                                1
db file parallel write                                                    1
gc current block 2-way                                                    1
AWR next
SQL> select event, count(1)
     from   dba_hist_active_sess_history
     where  sample_time between to_date('06072015 18:30:00', 'ddmmyyyy hh24:mi:ss')
                        and     to_date('06072015 19:30:00', 'ddmmyyyy hh24:mi:ss')
     group  by event
     order  by 2 desc;

EVENT                                                              COUNT(1)
---------------------------------------------------------------- ----------
SQL*Net break/reset to client                                         12950
enq: TM - contention                                                  12712
                                                                        624
db file sequential read                                                 386
enq: TX - row lock contention                                           259
SQL*Net message from dblink                                              74
direct path read                                                         62
SQL*Net more data from dblink                                            27
log file parallel write                                                  26
log file sync                                                            15
SQL*Net more data from client                                             9
control file sequential read                                              7
Disk file Mirror Read                                                     6
gc cr grant 2-way                                                         5
db file parallel write                                                    4
read by other session                                                     3
control file parallel write                                               3
Streams AQ: qmn coordinator waiting for slave to start                    3
log file sequential read                                                  2
direct path read temp                                                     2
enq: KO - fast object checkpoint                                          2
gc cr multi block request                                                 1
CSS initialization                                                        1
gc current block 2-way                                                    1
reliable message                                                          1
db file parallel read                                                     1
gc buffer busy acquire                                                    1
ges inquiry response                                                      1
direct path write temp                                                    1
rdbms ipc message                                                         1
os thread startup                                                         1
There are 12,950 samples of SQL*Net break/reset to client and 12,712 samples of enq: TM - contention in AWR that are nowhere to be found in ASH. How can we interpret this situation?
This 11.2.0.4.0 database runs on a RAC infrastructure with two instances. Let's look at the ASH of the two instances separately.
Instance 1 first
SQL> select event, count(1)
     from   v$active_session_history
     where  sample_time between to_date('06072015 18:30:00', 'ddmmyyyy hh24:mi:ss')
                        and     to_date('06072015 19:30:00', 'ddmmyyyy hh24:mi:ss')
     group  by event
     order  by 2 desc;

no rows selected
Instance 2 next
SQL> select event, count(1)
     from   v$active_session_history
     where  sample_time between to_date('06072015 18:30:00', 'ddmmyyyy hh24:mi:ss')
                        and     to_date('06072015 19:30:00', 'ddmmyyyy hh24:mi:ss')
     group  by event
     order  by 2 desc;

EVENT                                                              COUNT(1)
---------------------------------------------------------------- ----------
                                                                        372
direct path read                                                        185
log file parallel write                                                  94
Disk file Mirror Read                                                    22
control file sequential read                                             20
control file parallel write                                              18
direct path write temp                                                   16
Streams AQ: qmn coordinator waiting for slave to start                   12
db file parallel read                                                    11
gc cr multi block request                                                 6
enq: KO - fast object checkpoint                                          4
db file sequential read                                                   3
ges inquiry response                                                      3
os thread startup                                                         2
PX Deq: Signal ACK RSG                                                    2
enq: CF - contention                                                      1
PX Deq: Slave Session Stats                                               1
Disk file operations I/O                                                  1
IPC send completion sync                                                  1
reliable message                                                          1
null event                                                                1
enq: CO - master slave det                                                1
db file parallel write                                                    1
gc current block 2-way                                                    1
Everything sampled in ASH during that time interval comes from the second instance, while the first instance reports no records at all for the same period. This inevitably calls into question either the ASH size of instance 1 or an imbalanced workload between the two instances:
ASH size first
SQL> select inst_id
           ,total_size
     from   gv$ash_info;

   INST_ID TOTAL_SIZE
---------- ----------
         1  100663296
         2  100663296
ASH Activity next
SQL> select inst_id
           ,total_size
           ,awr_flush_emergency_count
     from   gv$ash_info;

   INST_ID TOTAL_SIZE AWR_FLUSH_EMERGENCY_COUNT
---------- ---------- -------------------------
         1  100663296                       136
         2  100663296                         0
The workload is in fact mainly directed at instance 1, and the abnormal 12,950 SQL*Net break/reset to client waits, together with the 12,712 enq: TM - contention waits, dramatically increased the rate of inserts into the ASH buffer of instance 1, producing the 136 emergency flushes recorded in awr_flush_emergency_count. When the buffer fills faster than the normal background flush can cope with, its content is emergency-flushed to AWR and the in-memory samples are then overwritten, which is exactly why these waits survive in dba_hist_active_sess_history but are no longer present in v$active_session_history.
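To see such an imbalance directly, one can compare the per-instance rate at which samples land in the ASH buffer. A minimal sketch, using only the inst_id and sample_time columns of gv$active_session_history (restrict the window as needed; on instance 1 here it would of course return very little, since the buffer has already wrapped):

```sql
-- Sketch: per-minute ASH sample counts per RAC instance.
-- An instance whose rate is much higher wraps its fixed-size
-- ASH buffer sooner and emergency-flushes to AWR more often.
select inst_id
      ,trunc(sample_time, 'MI') as sample_minute
      ,count(*)                 as samples
from   gv$active_session_history
group  by inst_id, trunc(sample_time, 'MI')
order  by sample_minute, inst_id;
```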
This is also confirmed by the difference in the ASH retention period between the two instances.
Instance 1 first, where only about three hours of ASH data are kept
SQL> select min(sample_time), max(sample_time)
     from   v$active_session_history;

MIN(SAMPLE_TIME)                MAX(SAMPLE_TIME)
------------------------------- -------------------------------
08-JUL-15 05.51.20.502 AM       08-JUL-15 08.35.48.233 AM
Instance 2 next, where several days' worth of ASH data are still present
SQL> select min(sample_time), max(sample_time)
     from   v$active_session_history;

MIN(SAMPLE_TIME)                MAX(SAMPLE_TIME)
------------------------------- -------------------------------
25-JUN-15 20.01.43              08-JUL-15 08.37.17.233 AM
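The same retention information can be read for both instances in one shot from gv$ash_info. A sketch, assuming the oldest_sample_time and latest_sample_time columns of that view:

```sql
-- Sketch: in-memory ASH retention window per instance.
-- A short window on one instance betrays a buffer that wraps quickly.
select inst_id
      ,oldest_sample_time
      ,latest_sample_time
from   gv$ash_info
order  by inst_id;
```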
The solution would be one of the following points (listed, I think, in order of priority):
- Fix the SQL*Net break/reset to client issue, which is dramatically filling up the ASH buffer and causing unexpectedly rapid flushes of important and more precise data
- Balance the workload between the two instances
- Increase the ASH size of instance 1 by means of alter system set "_ash_size"=25165824;
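For the last point, note that on RAC the change can be scoped to a single instance with the SID clause. A sketch, where the instance name PROD1 is an assumption to adapt to your environment:

```sql
-- Sketch: enlarge the ASH buffer of instance 1 only.
-- "_ash_size" is a hidden parameter: validate the change with
-- Oracle Support and test it outside production first.
alter system set "_ash_size" = 25165824 scope = spfile sid = 'PROD1';
```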
In the next article I will explain how I identified what is causing these unusual SQL*Net break/reset to client wait events.