Mohamed Houri’s Oracle Notes

May 12, 2014

Disjunctive subquery

Filed under: Trouble shooting — hourim @ 1:29 pm

Here is a SQL query, called from a .Net web service, that was timing out and that I was asked to fix.

SELECT      d.col_id,
            d.col_no,
            d.col_dost_id,
            d.col_r_id,
            d.xxx,
            d.yyy,
            ……..
            d.zzz
      FROM table_mho d
      WHERE (d.COL_UK = 'LRBRE-12052014'
          OR EXISTS (select 1
                     from table_mho d1
                     where d1.col_id = d.col_id
                       and exists (select 1
                                   from table_mho d2
                                    where d2.COL_UK = 'LRBRE-12052014'
                                     and d1.master_col_id = d2.col_id
                                     and d2.col_type = 'M' )
                       and d1.col_type = 'S'
                       )
              )
    order by d.col_id;

Looking carefully at this query, I immediately got a clue about what might be happening here: a disjunctive subquery.

A disjunctive subquery is a subquery that appears in an OR predicate (a disjunction). The above query does indeed have an OR predicate followed by an EXISTS clause:

          OR EXISTS (select 1
                     from table_mho d1
                     where d1.col_id = d.col_id
                       and exists (select 1
                                   from table_mho d2
                                    where d2.COL_UK = 'LRBRE-12052014'
                                     and d1.master_col_id = d2.col_id
                                     and d2.col_type = 'M' )
                       and d1.col_type = 'S'
                       )

I am not going to dig into the details of disjunctive subqueries and the CBO's inability to unnest them in releases prior to 12c. I hope to write a general article in the near future in which disjunctive subqueries will be explained and popularized via a reproducible model. The goal of this brief blog post is just to show how I troubleshot the above web service performance issue by transforming a disjunctive subquery into a UNION ALL statement, giving the CBO an opportunity to choose an optimal plan.

Here is the sub-optimal plan for the original query:

---------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name             | Starts | E-Rows | A-Rows |   A-Time   |
---------------------------------------------------------------------------------------------------
|   1 |  SORT ORDER BY                 |                  |      1 |  46854 |      1 |00:00:10.70 |
|*  2 |   FILTER                       |                  |      1 |        |      1 |00:00:10.70 |
|   3 |    TABLE ACCESS FULL           | TABLE_MHO        |      1 |    937K|    937K|00:00:00.94 |
|   4 |    NESTED LOOPS                |                  |    937K|      1 |      0 |00:00:07.26 |
|*  5 |     TABLE ACCESS BY INDEX ROWID| TABLE_MHO        |    937K|      1 |     60 |00:00:06.65 |
|*  6 |      INDEX UNIQUE SCAN         | COL_MHO_PK       |    937K|      1 |    937K|00:00:04.14 |
|*  7 |     TABLE ACCESS BY INDEX ROWID| TABLE_MHO        |     60 |      1 |      0 |00:00:00.01 |
|*  8 |      INDEX UNIQUE SCAN         | COL_MHO_UK       |     60 |      1 |     60 |00:00:00.01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("D"."COL_UK"='LRBRE-12052014' OR  IS NOT NULL))
   5 - filter(("D1"."MASTER_COL_ID" IS NOT NULL AND "D1"."COL_TYPE"='S'))
   6 - access("D1"."COL_ID"=:B1)
   7 - filter(("D2"."COL_TYPE"='M' AND "D1"."MASTER_COL_ID"="D2"."COL_ID"))
   8 - access("D2"."COL_UK"='LRBRE-12052014')

Statistics
---------------------------------------------------
          0  recursive calls
          0  db block gets
    3771234  consistent gets
      22748  physical reads
          0  redo size
       1168  bytes sent via SQL*Net to client
        244  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          1  rows processed

Note the appearance of the FILTER operation (n°2), which is inefficient here. One of its dramatic consequences is the NESTED LOOPS operation (n°4), which was started 937,000 times without producing any rows, yet generated almost 4 million buffer gets. Because of this disjunctive subquery, Oracle is unable to merge the subquery with the rest of the query and consider another, optimal access path.

There is a simple technique to rewrite the above query and get rid of the disjunctive subquery: use a UNION ALL, as I did for my original query (bear in mind that in my actual case the COL_UK column is NOT NULL):

SELECT ww.*
FROM
(SELECT     d.col_id,
            d.col_no,
            d.col_dost_id,
            d.col_r_id,
            d.xxx,
            d.yyy,
            ……..
            d.zzz
      FROM table_mho d
      WHERE d.COL_UK = 'LRBRE-12052014'
UNION ALL
SELECT      d.col_id,
            d.col_no,
            d.col_dost_id,
            d.col_r_id,
            d.xxx,
            d.yyy,
            ……..
            d.zzz
      FROM table_mho d
      WHERE d.COL_UK != 'LRBRE-12052014'
      AND EXISTS (select 1
                     from table_mho d1
                     where d1.col_id = d.col_id
                       and exists (select 1
                                   from table_mho d2
                                    where d2.COL_UK = 'LRBRE-12052014'
                                     and d1.master_col_id = d2.col_id
                                     and d2.col_type = 'M' )
                       and d1.col_type = 'S'
                       )
) ww
 order by ww.col_id;

And here is the corresponding new, optimal plan:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name                  | Starts | E-Rows | A-Rows |   A-Time   |
------------------------------------------------------------------------------------------------------------
|   1 |  SORT ORDER BY                     |                       |      1 |      2 |      1 |00:00:00.01 |
|   2 |   VIEW                             |                       |      1 |      2 |      1 |00:00:00.01 |
|   3 |    UNION-ALL                       |                       |      1 |        |      1 |00:00:00.01 |
|   4 |     TABLE ACCESS BY INDEX ROWID    | TABLE_MHO             |      1 |      1 |      1 |00:00:00.01 |
|*  5 |      INDEX UNIQUE SCAN             | COL_MHO_UK            |      1 |      1 |      1 |00:00:00.01 |
|   6 |     NESTED LOOPS                   |                       |      1 |      1 |      0 |00:00:00.01 |
|   7 |      VIEW                          | VW_SQ_1               |      1 |      1 |      0 |00:00:00.01 |
|   8 |       HASH UNIQUE                  |                       |      1 |      1 |      0 |00:00:00.01 |
|   9 |        NESTED LOOPS                |                       |      1 |      1 |      0 |00:00:00.01 |
|* 10 |         TABLE ACCESS BY INDEX ROWID| TABLE_MHO             |      1 |      1 |      0 |00:00:00.01 |
|* 11 |          INDEX UNIQUE SCAN         | COL_MHO_UK            |      1 |      1 |      1 |00:00:00.01 |
|* 12 |         TABLE ACCESS BY INDEX ROWID| TABLE_MHO             |      0 |      1 |      0 |00:00:00.01 |
|* 13 |          INDEX RANGE SCAN          | COL_COL_MHO_FK_I      |      0 |     62 |      0 |00:00:00.01 |
|* 14 |      TABLE ACCESS BY INDEX ROWID   | TABLE_MHO             |      0 |      1 |      0 |00:00:00.01 |
|* 15 |       INDEX UNIQUE SCAN            | COL_MHO_PK            |      0 |      1 |      0 |00:00:00.01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("D"."COL_UK"='LRBRE-12052014')
  10 - filter("D2"."COL_TYPE"='M')
  11 - access("D2"."COL_UK"='LRBRE-12052014')
  12 - filter("D1"."COL_TYPE"='S')
  13 - access("D1"."MASTER_COL_ID"="D2"."COL_ID")
       filter("D1"."MASTER_COL_ID" IS NOT NULL)
  14 - filter("D"."COL_UK"<>'LRBRE-12052014')
  15 - access("COL_ID"="D"."COL_ID")

Statistics
------------------------------------------------------
          0  recursive calls
          0  db block gets
          8  consistent gets
          0  physical reads
          0  redo size
       1168  bytes sent via SQL*Net to client
        403  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          1  rows processed

I went from a massive 3,771,234 consistent gets to just 8 logical I/Os. The subquery has been transformed into an in-line view (VW_SQ_1), while the not-always-efficient FILTER operation disappeared, giving way to fast operations accurately estimated by the CBO, without any statistics having been re-gathered in between.

Bottom Line: Troubleshooting a query performance problem can sometimes be achieved by reformulating the query so that you give the CBO a way to circumvent a transformation it cannot do with the original query.
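As a generic illustration of the rewrite pattern (a sketch using hypothetical tables t and t2, not the original production code): the two UNION ALL branches must be mutually exclusive, which is why the NOT NULL property of the disjunct column matters.

```sql
-- Original form: the OR around EXISTS prevents subquery unnesting prior to 12c.
SELECT t.*
  FROM t
 WHERE t.uk_col = :val
    OR EXISTS (SELECT 1 FROM t2 WHERE t2.id = t.id);

-- Rewritten form: two disjoint branches glued together with UNION ALL.
-- Valid only because uk_col is NOT NULL; a nullable column would also need
-- an "OR t.uk_col IS NULL" (or an lnnvl() guard) in the second branch,
-- since "uk_col != :val" silently discards NULL rows.
SELECT t.* FROM t WHERE t.uk_col = :val
UNION ALL
SELECT t.* FROM t WHERE t.uk_col != :val
   AND EXISTS (SELECT 1 FROM t2 WHERE t2.id = t.id);
```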


May 7, 2014

ORA-04030 when parsing XML messages

Filed under: Trouble shooting — hourim @ 2:56 pm

Two days ago I was asked to look at the following error, which occurred in a PRE-PRODUCTION database just a few weeks before going live:

ORA-04030: out of process memory when trying to allocate 4032 bytes (qmxdGetElemsBy,qmemNextBuf:alloc)

The aim of this article is to explain how (with the help of my colleagues) I managed to circumvent this ORA-04030 error.

First, I was aware that ORA-04031 is linked to an issue with the SGA, while ORA-04030 is linked to abnormal consumption of PGA memory. Unfortunately, having no prior acquaintance with this application made the issue hard to solve. All I was told was that when the PRE-PRODUCTION application was triggered to parse 10 XML messages of 3MB each, it crashed with the above error while parsing the 6th message.

I nevertheless asked the customer whether simultaneously treating 10 x 3MB XML messages could actually happen in their real-life PRODUCTION application, or whether they were just stressing the database with extreme situations. To my disappointment, the answer was YES: this kind of XML load is very plausible once the application goes live.

I asked (again) for those 10 XML messages so that I could inject them into a DEVELOPMENT database. Once they handed me a copy of the desired messages, I parsed them in the DEVELOPMENT instance which, fortunately (yes, fortunately), crashed with the same error.

There is an ORA-04030 diagnostic tool developed by Oracle Support which, if you feed it the corresponding alert log file, will generate a set of recommendations for your system that, if implemented, might solve the issue.

This tool, when provided with the corresponding trace file, suggested to us, among many other things, to:

  1. Set the hidden parameter _realfree_heap_pagesize_hint
  2. Use BULK COLLECT with the LIMIT clause, with a link to Steven Feuerstein's article published in Oracle Magazine showing how to do that
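For reference, the BULK COLLECT pattern the tool pointed at caps PGA consumption by fetching rows in fixed-size chunks rather than all at once. This is a minimal sketch with hypothetical names (the xml_staging table and msg column are made up for illustration), not the application's actual code:

```sql
DECLARE
  CURSOR c IS SELECT msg FROM xml_staging;       -- hypothetical staging table
  TYPE t_msgs IS TABLE OF xml_staging.msg%TYPE;
  l_msgs t_msgs;
BEGIN
  OPEN c;
  LOOP
    -- LIMIT bounds the PGA used per fetch to at most 100 rows' worth
    FETCH c BULK COLLECT INTO l_msgs LIMIT 100;
    EXIT WHEN l_msgs.COUNT = 0;
    FOR i IN 1 .. l_msgs.COUNT LOOP
      NULL; -- process l_msgs(i) here
    END LOOP;
  END LOOP;
  CLOSE c;
END;
/
```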

Taking into account my ignorance of the application's business logic and of how exactly the code parses the XML messages, I was looking for a tactical fix that would be safe, require a minimum of testing, and have a good chance of no side effects. Since I was working in a development database, I instrumented (added extra logging to) the PL/SQL code until I had clearly isolated the piece of code causing the error:


BEGIN
   -- New parser
       lx_parser := xmlparser.newParser;

   -- Parse XML file
       xmlparser.parseClob(lx_parser,pil_xml_file);

 
   -- Get the parsed document
       lxd_docxml := xmlparser.getDocument(lx_parser);

   -- get all elements
         lxd_element := xmldom.GETDOCUMENTELEMENT(lxd_docxml);

   FOR li_metadata_id IN 1..pit_metadata.COUNT
   LOOP
     IF pit_metadata(li_metadata_id).MXTA_FATHER_ID IS NULL
     THEN
       gi_num_row_table := 1;
       P_PARSE_XML_ELEMENTS(lxd_element
                           ,NULL
                           ,pit_metadata
                           ,li_metadata_id
                           ,pit_metadata(li_metadata_id).TAG_PATH
                            );
     END IF;
   END LOOP;
   …./….
 END;

The above piece of code is called in a loop and is traumatizing the PGA memory.

With the help of a colleague, we decided on a tactical fix: free the PGA memory each time an XML document is processed, using the Oracle xmldom.freeDocument API. In other words, I added the following piece of code at the end of the above stored procedure:


     END IF;

   END LOOP;

   …./….

xmldom.freedocument(lxd_docxml); -- added this

END;

And guess what?

The ORA-04030 ceased to disturb us.

Bottom Line: When troubleshooting an issue in a real-life production application, first try to reproduce it in a TEST environment. Then instrument the PL/SQL code in that TEST environment until you narrow the error down close to its calling program. Then look for a tactical fix that does not need a lot of non-regression testing and that has limited side effects.
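More generally, every document and parser obtained from the xmlparser/xmldom PL/SQL APIs should be released once processed. A minimal sketch of the symmetric allocate/free pattern (variable names are hypothetical, and :some_clob stands in for whatever CLOB holds the message):

```sql
DECLARE
  lx_parser  xmlparser.Parser;
  lxd_docxml xmldom.DOMDocument;
BEGIN
  lx_parser := xmlparser.newParser;
  xmlparser.parseClob(lx_parser, :some_clob);
  lxd_docxml := xmlparser.getDocument(lx_parser);
  -- ... walk the DOM here ...
  xmldom.freeDocument(lxd_docxml);  -- release the DOM from the PGA
  xmlparser.freeParser(lx_parser);  -- release the parser as well
END;
/
```

Freeing both the document and the parser inside the processing loop keeps PGA consumption flat regardless of how many messages are treated.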

April 22, 2014

Trouble shooting a performance issue

Filed under: Trouble shooting — hourim @ 2:34 pm

Here is a case of troubleshooting a performance issue, taken from a real-life production system, that I wanted to share with you. An Oracle stored procedure is continuously activated via a scheduled job; its goal is to parse incoming XML messages before accepting them into, or rejecting them from, a production application. 80% of the XML messages are treated in an acceptable execution time, but 20% of them, when injected into the queue, hang and slow down the database, with a side effect on the execution time of the normal messages. My strategy was to ask for one of those problematic XML messages from the PRODUCTION instance and execute it in the TEST database with a 10046 trace. So I did and, among many observations, I stopped at the following select, for which I wanted to share the method I followed to troubleshoot it:

SELECT *
FROM
 ( SELECT WD.WAGD_ID,WH.WAGH_ID,WD.WAG_ID,WH.ARRIVAL_DATE,WH.DEPARTURE_DATE,
  WD.CHECK_REPERAGE ,ROW_NUMBER() OVER (PARTITION BY WH.WAGD_ID ORDER BY
  WH.ARRIVAL_DATE) RN FROM SRI_ORDER_DETAILS WD ,SRI_ORDER_HISTORIC WH
  WHERE WD.FROM_COM_ID = :B1 AND WD.WAGD_ID = WH.WAGD_ID AND WH.PDES_ID IS
  NULL ) WHERE RN = 1 

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute   3124      0.28       0.27          0          0          0           0
Fetch     3124     31.79      32.05          0   10212356          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     6250     32.07      32.33          0   10212356          0           0

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 151     (recursive depth: 1)
Number of plan statistics captured: 2

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  VIEW  (cr=3269 pr=0 pw=0 time=14483 us cost=709 size=1245 card=15)
         0          0          0   WINDOW SORT PUSHED RANK (cr=3269 pr=0 pw=0 time=14480 us cost=709 size=660 card=15)
         0          0          0    MERGE JOIN  (cr=3269 pr=0 pw=0 time=14466 us cost=708 size=660 card=15)
         0          0          0     TABLE ACCESS BY INDEX ROWID SRI_ORDER_DETAILS (cr=3269 pr=0 pw=0 time=14464 us cost=532
     15139      15139      15139       INDEX FULL SCAN SRI_WAGD_PK (cr=2822 pr=0 pw=0 time=2305 us cost=80 size=0
         0          0          0     SORT JOIN (cr=0 pr=0 pw=0 time=0 us cost=176 size=320528 card=12328)
         0          0          0      TABLE ACCESS FULL SRI_ORDER_HISTORIC (cr=0 pr=0 pw=0 time=0 us cost=67 size=320528
********************************************************************************

Notice that the above select underwent 3,124 executions, which necessitated more than ten million logical reads, only to finally produce 0 records. There is surely a way to reduce this enormous waste of time and energy spent on millions of unnecessary logical I/Os. My experience in troubleshooting performance issues shows that this kind of excessive logical I/O is very often linked to two situations:

  •  Use of an imprecise index to access a table by rowid, throwing away the majority of the rows
  •  The first operation of the plan being the one that manipulates the largest number of rows in the entire row source execution plan

Coincidentally, the first operation in my row source plan is an INDEX FULL SCAN, and it is the operation generating the maximum number of rows (15,139). So, in order to tune this select, I first needed a more precise access path to the SRI_ORDER_DETAILS table. Here is how I managed to solve this issue:

1. Get a real time execution plan from memory
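A typical way to pull such a plan from the cursor cache (assuming the statement was run with the gather_plan_statistics hint or with statistics_level set to ALL, and that you know its sql_id) is DBMS_XPLAN.DISPLAY_CURSOR:

```sql
-- &sql_id is the identifier of the cursor of interest; NULL means "all children"
SELECT *
  FROM TABLE(dbms_xplan.display_cursor('&sql_id', NULL, 'ALLSTATS LAST'));
```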

The execution plan confirming the row source statistics from the TKPROF file is given below:

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                     |      1 |        |      1 |00:00:00.03 |    3469 |
|*  1 |  VIEW                          |                     |      1 |     13 |      1 |00:00:00.03 |    3469 |
|*  2 |   WINDOW SORT PUSHED RANK      |                     |      1 |     13 |      1 |00:00:00.03 |    3469 |
|   3 |    MERGE JOIN                  |                     |      1 |     13 |      1 |00:00:00.03 |    3469 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| SRI_ORDER_DETAILS   |      1 |     13 |      1 |00:00:00.02 |    3269 |
|   5 |      INDEX FULL SCAN           | SRI_WAGD_PK         |      1 |  15139 |  15139 |00:00:00.01 |    2822 |
|*  6 |     SORT JOIN                  |                     |      1 |  13410 |      1 |00:00:00.01 |     200 |
|*  7 |      TABLE ACCESS FULL         | SRI_ORDER_HISTORIC  |      1 |  13410 |  13410 |00:00:00.01 |     200 |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("RN"=1)
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "WH"."WAGD_ID" ORDER BY "WH"."ARRIVAL_DATE")<=1)
   4 - filter("WD"."FROM_COM_ID"=3813574)
   6 - access("WD"."WAGD_ID"="WH"."WAGD_ID")
       filter("WD"."WAGD_ID"="WH"."WAGD_ID")
   7 - filter("WH"."PDES_ID" IS NULL)

We identify a lot of data (15,139 rows) by full scanning the primary key index SRI_WAGD_PK. The rows identified via the single-column (wagd_id) PK index (operation n°5) are sent back to their parent operation (n°4), which wastes a lot of effort throwing away the entire set of generated rows (except the one row matching the bind variable I used during my test) by applying filter n°4 (FROM_COM_ID). Notice also how the CBO has made a complete mess of the cardinality estimate for the SORT JOIN operation n°6.

Now that the symptoms have been clearly identified, here is the solution.

2. Create a new precise index

The solution that immediately came to my mind was to create an index on the FROM_COM_ID column. I was just wondering how best to design it:

  •  Using a single column index
  •  Or a composite index
  •  And, in the case of a composite index, whether FROM_COM_ID should be placed at the beginning of the index or not

Since the single-column PK index already covers WAGD_ID, I tended toward a composite index starting with FROM_COM_ID, something resembling this:

SQL> create index mho_test on SRI_ORDER_DETAILS(FROM_COM_ID, wagd_id);

Index created.

SQL> exec dbms_stats.gather_table_stats(user, 'SRI_ORDER_DETAILS', cascade => true, no_invalidate=> false);

PL/SQL procedure successfully completed.

When I ran the query again, the new row source execution plan looked like this:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                           |      1 |        |      1 |00:00:00.01 |       7 |
|*  1 |  VIEW                           |                           |      1 |     13 |      1 |00:00:00.01 |       7 |
|*  2 |   WINDOW SORT PUSHED RANK       |                           |      1 |     13 |      1 |00:00:00.01 |       7 |
|   3 |    NESTED LOOPS                 |                           |      1 |        |      1 |00:00:00.01 |       7 |
|   4 |     NESTED LOOPS                |                           |      1 |     13 |      1 |00:00:00.01 |       6 |
|   5 |      TABLE ACCESS BY INDEX ROWID| SRI_ORDER_DETAILS         |      1 |     13 |      1 |00:00:00.01 |       3 |
|*  6 |       INDEX RANGE SCAN          | MHO_TEST                  |      1 |     13 |      1 |00:00:00.01 |       2 |
|*  7 |      INDEX RANGE SCAN           | SRI_WAGH_ARRIVAL_DATE_UK  |      1 |      1 |      1 |00:00:00.01 |       3 |
|*  8 |     TABLE ACCESS BY INDEX ROWID | SRI_ORDER_HISTORIC        |      1 |      1 |      1 |00:00:00.01 |       1 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("RN"=1)
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "WH"."WAGD_ID" ORDER BY "WH"."ARRIVAL_DATE")<=1)
   6 - access("WD"."FROM_COM_ID"=3813574)
   7 - access("WD"."WAGD_ID"="WH"."WAGD_ID")
   8 - filter("WH"."PDES_ID" IS NULL)

Thanks to this new precise index (MHO_TEST), we have drastically reduced the total number of logical reads (from 3,469 to 7), and the CBO's estimations are almost acceptable, which made it prefer a more convenient NESTED LOOPS over the previously mis-estimated SORT JOIN operation.

3. Conclusion

When troubleshooting a performance issue in a real-life production system, look carefully at INDEX access operations that generate a lot of rows and whose parent table access operation discards the majority of those rows. In such a case there is very likely a way to create a precise index that will not only avoid that wasted effort but may also lead the CBO to better estimations and hence an optimal plan.

March 21, 2013

ORA-02431: Cannot disable constraint

Filed under: Trouble shooting — hourim @ 7:21 pm

Recently a question came up on the OTN forum which prompted me to write a small blog article that I can refer to, instead of creating a new test case each time I see people asking how to troubleshoot the same error as the one mentioned by the Original Poster (OP). The OP was struggling with an existing constraint which, despite seeing it via a select against the user_constraints view, he was nevertheless unable to disable, because of error ORA-02431: cannot disable constraint FK_Batch_Products - no such constraint.
Here below is the select against the OP's user_constraints view:

CONSTRAINT_NAME             CONSTRAINT_TYPE      STATUS
--------------------------- ------------------ -----------
FK_Product_SourceSpecies    R                  ENABLED
FK_Product_CreatePerson     R                  ENABLED
FK_Product_ModifyPerson     R                  ENABLED
FK_Product_ExpressionSystem R                  ENABLED
FK_Product_Localisation     R                  ENABLED
FK_Batch_Products           R                  ENABLED

Have you already spotted the obvious?

Well, if not, then let me tell you one thing: each time I see lowercase letters in Oracle object names, I am a hundred percent sure that the owner of those objects will have trouble identifying them, and I will not be surprised when he is faced with this kind of nonexistent-object error.

Below I have modeled the problem and have shown the solution to the OP.

SQL> create table t (id number, vc varchar2(10));

Table created.

SQL> alter table t add constraint t_pk primary key (id);

Table altered.

SQL> alter table t add constraint "t_lower_case" check (vc != 'NONE');

Table altered.

SQL> select table_name, constraint_name
  2  from user_constraints
  3  where table_name = 'T';

TABLE_NAME                     CONSTRAINT_NAME
------------------------------ ------------------------------
T                              T_PK
T                              t_lower_case

SQL> alter table t drop constraint t_lower_case;
alter table t drop constraint t_lower_case
                              *
ERROR at line 1:
ORA-02443: Cannot drop constraint  - nonexistent constraint

SQL> alter table t drop constraint "t_lower_case";

Table altered.

Bottom Line: be careful when creating Oracle objects; give them correct, standard names without enclosing those names in double quotes, i.e. " ".

May 25, 2012

Tuning a batch job using AWR

Filed under: Trouble shooting — hourim @ 10:16 am
Tags:

We received a complaint from a customer about a weekly scheduled overnight Oracle job which took 14 hours to complete during its last execution. The same job had been running smoothly in about 14 minutes for the past several months. So what went wrong this time?

A 30-minute AWR report collected during the last job execution looks like this:

        Snap Id      Snap Time      Sessions Curs/Sess
           --------- ------------------- -------- ---------
Begin Snap:     16178 21-May-12 07:00:09       292       2.9
  End Snap:     16179 21-May-12 07:30:16       294       2.9
   Elapsed:               30.11 (mins)
   DB Time:               48.68 (mins)

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:    10,752M    10,752M  Std Block Size:         8K
           Shared Pool Size:     7,744M     7,744M      Log Buffer:    53,972K

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                1.6                0.8       0.01       0.01
       DB CPU(s):                0.9                0.4       0.01       0.00
       Redo size:           36,781.3           18,374.5
   Logical reads:           86,607.6           43,265.8
   Block changes:              172.5               86.2
  Physical reads:            2,590.0            1,293.9
 Physical writes:               26.0               13.0
      User calls:              223.3              111.5
          Parses:               92.0               46.0
     Hard parses:                2.8                1.4
W/A MB processed:                1.6                0.8
          Logons:                0.2                0.1
        Executes:              108.3               54.1
       Rollbacks:                0.0                0.0
    Transactions:                2.0

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:   99.69    In-memory Sort %:  100.00
            Library Hit   %:   93.95        Soft Parse %:   96.91
         Execute to Parse %:   15.02         Latch Hit %:   97.14
Parse CPU to Parse Elapsd %:   82.80     % Non-Parse CPU:   97.99

 Shared Pool Statistics        Begin    End
                              ------  ------
             Memory Usage %:   76.47   75.97
    % SQL with executions>1:   98.62   97.70
  % Memory for SQL w/exec>1:   95.92   94.74

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                            1,561          53.4
db file sequential read             275,799         691      3   23.7 User I/O
direct path read                     32,996         560     17   19.2 User I/O
db file parallel read                   720         124    173    4.3 User I/O
db file scattered read                8,488          20      2     .7 User I/O

So what to spot from that?

The direct path read wait event looks suspect: 32,996 waits at a catastrophic 17 ms per wait, which represents about 9 minutes ((32,996 * 17/1000)/60). The db file sequential read wait event may, in this situation, be a necessary step in the smooth accomplishment of the desired work. This is why I was quite confident that direct path read was the wait event on which effort should be concentrated in this particular case.

Before embarking on this direct path read wait event, I have to say a few words about the CPU consumption. The AWR elapsed time is 30.11 minutes (1,806.6 seconds) and the DB CPU time reported in the Top 5 events is 1,561 seconds. Dividing these two metrics gives (1,561/1,806.6) about 0.86 of a CPU at 100% utilisation. Since this server was running 32 CPU cores, the database server was not CPU bound (0.86/32, roughly 2.7% CPU utilisation).

Since it was confirmed to me that neither the PL/SQL code nor the data volume had changed, I asked for the previous week's AWR report, taken at the same time the job was running. The corresponding Top 5 Timed Foreground Events are shown below:


 Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                            1,082           54.0
db file sequential read               717,957       936      1   46.7  User I/O
direct path read                          371         6     15    0.28 User I/O
db file parallel read                   3,347         4    173    0.18 User I/O
db file scattered read                  2,852         3      2    0.14 User I/O

from which it can easily be seen that, in a normal job execution, the direct path read contribution to the total execution time is negligible: 0.28% (0.09 minutes), whereas it contributed 19% (9 minutes) of the total execution time during the last job execution. So what made the direct path read wait event increase by such a high magnitude that it altered the performance of the overnight job?

My suspicions still held!

The parts of the AWR report that correlate with the direct path read event and are worth investigating are SQL ordered by Reads, Segments by Physical Reads and (in 11gR1 or higher) Segments by Direct Physical Reads.

When I drilled down into SQL ordered by Reads I found this:

Physical Reads  Executions  Reads per Exec  %Total  Elapsed Time (s)  %CPU   %IO  SQL Id         Module
--------------  ----------  --------------  ------  ----------------  ----  ----  -------------  ------------
     4,164,466           1    4,164,466.00    89.0             811.1  21.6  82.9  d6w4psbcwgt54  TOAD 9.7.2.5
 

Without even looking at the SQL text behind this: what kind of overnight batch job reports a huge amount of physical reads coming from TOAD? Was this SQL issued from TOAD part of the job? The answer was, of course, no.

So, clearly, someone was running a huge query from TOAD which was doing a massive amount of direct path reads, consuming I/O and CPU resources that were normally dedicated to the db file sequential reads of the overnight job.

The Segments by Direct Physical Reads section of the AWR report also shows correlated information:

-> Total Direct Physical Reads:       4,209,804
-> Captured Segments account for   51.7% of Total
            Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
MHOADM     MHOMH_2011 MH_TABLES            TABLES_SX5 TABLE      349,197    8.29
MHOADM     MHOMH_2010 MH_TABLES            TABLES_SX5 TABLE      345,069    8.20
MHOADM     MHOMH_2009 MH_TABLES            TABLES_SX1 TABLE      275,781    6.55
MHOADM     MHOMH_2010 MH_TABLES            TABLES_SX1 TABLE      272,709    6.48
MHOADM     MHOMH_2011 MH_TABLES            TABLES_SX1 TABLE      265,037    6.30
          -------------------------------------------------------------

where TABLES_SX5 is the table used by the same sql_id d6w4psbcwgt54 issued from TOAD and reported above.
Finally, it became clear that the job had been competing for resources with an abnormal and unusual select issued from TOAD. The ADDM report that goes with the corresponding AWR report also presents the TOAD sql_id d6w4psbcwgt54 as a Top SQL Statement to be tuned:

 Recommendation 1: SQL Tuning
   Estimated benefit is .46 active sessions, 28.18% of total activity.
   -------------------------------------------------------------------
   Action
      Run SQL Tuning Advisor on the SELECT statement with SQL_ID
      "d6w4psbcwgt54".
      Related Object
         SQL statement with SQL_ID d6w4psbcwgt54.     
   Rationale
      The SQL spent 100% of its database time on CPU, I/O and Cluster waits.
      This part of database time may be improved by the SQL Tuning Advisor.
   Rationale
      Database time for this SQL was divided as follows: 100% for SQL
      execution, 0% for parsing, 0% for PL/SQL execution and 0% for Java
      execution.
   Rationale
      SQL statement with SQL_ID "d6w4psbcwgt54" was executed 1 times and had
      an average elapsed time of 811 seconds.
   Rationale
      At least one execution of the statement ran in parallel.

   Recommendation 2: SQL Tuning
   Estimated benefit is .19 active sessions, 11.68% of total activity.

   Action
      Look at the "Top SQL Statements" finding for SQL statements consuming
      significant I/O on this segment. For example, the SELECT statement with
      SQL_ID "d6w4psbcwgt54" is responsible for 100% of "User I/O" and
      "Cluster" waits for this segment.
   Rationale
      The I/O usage statistics for the object are: 9 full object scans, 277587
      physical reads, 0 physical writes and 275781 direct reads.
 

The ADDM report lends additional credit to my suspicion when it recommends investigating processes that do not belong to this instance running on the host and consuming a significant amount of virtual memory. ADDM had identified the TOAD query; hadn't it?

 Recommendation 1: Host Configuration
   Estimated benefit is 1.62 active sessions, 100% of total activity.
   ------------------------------------------------------------------
   Action
      Host operating system was experiencing significant paging but no
      particular root cause could be detected. Investigate processes that do
      not belong to this instance running on the host that are consuming
      significant amount of virtual memory. Also consider adding more physical
      memory to the host.

February 25, 2012

Tuning an insert/select via dblink

Filed under: Trouble shooting — hourim @ 11:40 am

Recently an interesting question came up on a French Oracle forum about a performance problem during an insert/select operation over a database link. The poster was suffering from the very long time (several hours without any response) the insert was taking. This is why he came to the forum and posted a question along the lines of: "what do you think if I commit every 1000 rows? Will my insert speed up?" That is not the right way to troubleshoot an Oracle performance problem. I always remember Cary Millsap's mantra: "Why guess when you can know?" Fortunately, Oracle gives us a lot of tools and trace events to instrument what is going on inside the database. Below are the steps of the discussion we had (the original poster, others and I) on that forum.

I first asked the poster to trace his insert/select using event 10046:


alter session set events '10046 trace name context forever, level 12';

INSERT /*+ append */ INTO my_table
(
champs_1,
champs_2,
champs_3,
champs_4,
champs_5,
champs_6,
champs_7
)
SELECT * FROM la_table_cible@mon_dblink ;

commit ;

alter session set events '10046 trace name context off';

I am not going to stress that I asked the poster to ensure that his direct path load would not be silently ignored by Oracle in case the table "my_table" has triggers or foreign key integrity constraints on it. I also asked whether there would be subsequent deletes (after the insert) on this table, since future direct path loads will not reuse any space freed by such deletes.

The generated trace file, profiled via the OraSRP profiler, shows that 98% of the insert/select response time is spent waiting on the event SQL*Net message from dblink.

Up to this point, we had persuaded the poster that his initial guess of committing every 1000 records was not the solution; instead he should focus on reducing the volume of data transferred from the remote database to the local one. The poster sent us the explain plan and statistics of the select part when executed on the remote database:

Statistics
----------------------------------------------
212        recursive calls
3          db block gets
678993     consistent gets
483744     physical reads
592        redo size
124692665  bytes sent via SQL*Net TO client
1391305    bytes received via SQL*Net FROM client
126437     SQL*Net roundtrips TO/FROM client
17         sorts (memory)
0          sorts (disk)
1896535    rows processed

He is selecting 1,896,535 records via 126,437 SQL*Net roundtrips, meaning that he is using the default arraysize of 15 (1,896,535/126,437). He did 0.358 (678,993/1,896,535) logical reads per selected row, which seems quite acceptable. But he is sending a very large amount of data, 124MB, over a database link and making hundreds of thousands of SQL*Net roundtrips. Note also that he is doing a fairly large amount of physical reads (483,744) compared to the total number of logical reads (678,993).
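The roundtrip count can be reconstructed from the fetch size alone; a small Python sketch shows why bumping the arraysize (rather than committing more often) is what shrinks the SQL*Net chatter:

```python
import math

rows = 1_896_535        # rows processed
roundtrips = 126_437    # SQL*Net roundtrips to/from client

# Effective rows fetched per roundtrip: the SQL*Plus default arraysize of 15
print(round(rows / roundtrips))

# Roundtrips needed for a few candidate arraysize settings
for arraysize in (15, 100, 500, 5000):
    print(arraysize, math.ceil(rows / arraysize))
```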

I’ve then asked the poster to stop selecting all the rows from the distant data base and to start selecting only necessary rows. I asked also to post the explain plan of this query in attempt to help him tuning the select part of his problematic insert. When he said that he can’t intervene into the distant data base, I have then asked him to manage his insert with a PL/SQL stored procedure as shown below:


CREATE TYPE distant_typ AS OBJECT (champs_1 NUMBER, champs_2 NUMBER, champs_3 NUMBER, ... champs_7 DATE);
/

CREATE TYPE distant_typ_tab AS TABLE OF distant_typ;
/

DECLARE
   -- collection variable of the SQL type created above
   l_tab distant_typ_tab;

   CURSOR get_data_from_db IS
      SELECT distant_typ(champs_1,
                         champs_2,
                         champs_3,
                         champs_4,
                         champs_5,
                         champs_6,
                         champs_7)
      FROM la_table_cible@mon_dblink;
BEGIN
   OPEN get_data_from_db;
   LOOP
      FETCH get_data_from_db BULK COLLECT INTO l_tab LIMIT 100;
      EXIT WHEN l_tab.COUNT = 0;  -- %NOTFOUND here would drop the last partial batch

      -- no APPEND hint: a direct path insert inside a loop would raise
      -- ORA-12838 on the second iteration without an intermediate commit
      INSERT INTO my_table
         (champs_1, champs_2, champs_3, champs_4, champs_5, champs_6, champs_7)
      SELECT champs_1, champs_2, champs_3, champs_4, champs_5, champs_6, champs_7
      FROM TABLE(l_tab);
   END LOOP;
   CLOSE get_data_from_db;
   COMMIT;
END;
/

And, magically, the poster came back and said: "fantastic, my insert/select now completes in a very acceptable time".

June 16, 2011

Bind variable, shared pool and cursor sharing parameter

Filed under: Trouble shooting — hourim @ 6:23 pm

I was asked to investigate a real-life production problem: the shared pool (library cache) of the SGA was filling up to the point that the only remaining solution was to stop and restart the database. I immediately started wondering whether I had already been confronted with a similar case, until I realized that I had read about something close to this situation in Chapter 4, Memory Structures, of Tom Kyte's book Expert Oracle Database Architecture: 9i and 10g Programming Techniques and Solutions.

So my first step on the troubleshooting path of this memory problem was to issue the following select against the v$sql dynamic view:

SELECT sql_text
      ,executions
FROM v$sql
WHERE sql_text LIKE '%Package_Name.P_UPDATE%'
AND executions <= 2
;

which returned the following picture:

sql_text                                           executions
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
…/…
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1

Thousands of calls of the same statement

BEGIN Package_Name.P_UPDATE_procedure(p_in_p1 ..) END;

each executed only once. It is clear that this application is not using bind variables and is dramatically filling up the library cache of the shared pool. This absolutely needs to be fixed.

Then I asked myself: how could they fill up the shared pool when all they are doing is calling a stored PL/SQL procedure? We all know that when using static SQL in PL/SQL we should not encounter issues related to bind variables: PL/SQL itself takes care of our code and uses bind variables behind the scenes. The only situation where we have to look carefully at the use of bind variables within PL/SQL is when we use dynamic SQL in stored procedures or functions (which is, in fact, one more reason to avoid dynamic SQL).

So how did they arrive at a shared pool filled with thousands of identical, non re-executed PL/SQL calls?

And here is where the problem resides: it is true that when using static SQL in PL/SQL stored objects you do not have to care about using bind variables inside those stored programs, but you do have to care about the third party that calls your PL/SQL stored procedure; it should pass its input parameters as bind variables, otherwise you end up with a shared pool (library cache) full of one-off calls to your stored PL/SQL objects, as was the case here.
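A toy model (plain Python, not Oracle internals) of a library cache keyed by statement text makes the failure mode obvious: literal-laden calls each create a brand-new cursor entry, while a single bound call is simply re-executed:

```python
from collections import Counter

# Hypothetical model: the "shared pool" maps exact statement text
# to the number of times that text was executed.
shared_pool = Counter()

def run(sql_text):
    shared_pool[sql_text] += 1   # first call creates a new entry, later calls reuse it

# Caller embeds literals: every call produces a distinct statement text
for cust_id in range(1000):
    run(f"BEGIN Package_Name.P_UPDATE_procedure({cust_id}); END;")

# Caller passes a bind variable: one entry whose execution count climbs
for _ in range(1000):
    run("BEGIN Package_Name.P_UPDATE_procedure(:p_in_p1); END;")

print(len(shared_pool))   # 1001 distinct "cursors": 1000 one-offs plus 1 shared
print(shared_pool["BEGIN Package_Name.P_UPDATE_procedure(:p_in_p1); END;"])  # 1000
```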

Knowing this, I started figuring out the solution. I identified the application responsible for those thousands of PL/SQL calls: a non-Oracle application developed in a language named SNAP, which was calling our PL/SQL stored objects without using bind variables. I explained to them that they needed to use the OCI interface to prepare their statement and implement it correctly using bind variables via OCIStmtPrepare2 and OCIBindByPos (see Christian Antognini's Troubleshooting Oracle Performance, http://www.amazon.fr/Troubleshooting-Oracle-Performance-Christian-Antognini/dp/1590599179).

Although they agreed to do that, they asked me for a workaround in the meantime. The only dirty workaround I was able to suggest was to alter the session through which they connect (and via which they call the stored PL/SQL procedure) so that the cursor_sharing parameter is set to FORCE:

alter session set cursor_sharing=force;

Which they did immediately.

However, the same SQL against v$sql executed after this change still showed the same picture of thousands of non re-executed calls:

sql_text                                          executions
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
BEGIN Package_Name.P_UPDATE_procedure(p_in_p1          1
…/…

What went wrong?

I verified that they had really altered their session to set the cursor_sharing parameter to FORCE. So what was going on?

The problem in this case is that, even with cursor_sharing set to FORCE, if you keep calls of the form:

BEGIN
Package_Name.P_UPDATE_procedure(p_in_p1,…);
END;

Oracle will not perform bind variable substitution. In order for this substitution to occur, we need to change the above code as follows (see http://apress.com/book/view/9781590596364):

   CALL Package_Name.P_UPDATE_procedure(p_in_p1,…);

So they made the change, after which I re-ran the following query against v$sql:

SELECT
    substr(sql_text,1,55)
   ,executions
FROM v$sql
WHERE sql_text LIKE '%Package_Name.P_UPDATE%'
AND executions <= 2
;

which now gives the following picture:

sql_text                                                 executions
------------------------------------------------------- ----------
CALL Package_Name.P_UPDATE_procedure (:"SYS_B_00",TO        897
CALL Package_Name.P_UPDATE_procedure (:"SYS_B_00",TO       8380
CALL Package_Name.P_UPDATE_procedure (:"SYS_B_00",TO         24
CALL Package_Name.P_UPDATE_procedure (:"SYS_B_00",TO      18671

Wow! See how high the number of executions has risen: up to 18,671. The shared pool looks much healthier now.

It is also known that setting cursor_sharing to FORCE may have unexpected side effects, for example when there exists a function-based index with literals in its definition (substr(col_nam, 4, 2), say). In this case you may notice that a query which was using the function-based index stops using it once cursor_sharing is set to FORCE. This is simply because Oracle will change the query predicate from substr(col_nam, 4, 2) to something like substr(col_nam, :SYS_B_0, :SYS_B_1), which no longer matches the index expression, and hence the index will not be used.
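The effect can be illustrated with a crude textual stand-in for FORCE's literal replacement (a hypothetical sketch; Oracle actually matches the predicate against the index's expression internally, but the intuition is the same):

```python
import re

# Expression the function-based index was created on
fbi_expression = "substr(col_nam,4,2)"

def force_literals(predicate):
    """Crude stand-in for cursor_sharing=force: numeric literals
    become system-generated bind placeholders."""
    counter = iter(range(100))
    return re.sub(r"\b\d+\b", lambda m: f":SYS_B_{next(counter)}", predicate)

rewritten = force_literals("substr(col_nam,4,2) = :b1")
print(rewritten)   # substr(col_nam,:SYS_B_0,:SYS_B_1) = :b1

# The rewritten predicate no longer contains the index expression
assert fbi_expression not in rewritten
```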

The bottom line is that setting cursor_sharing to FORCE has to be a temporary fix; the definitive solution is to use bind variables correctly when calling PL/SQL stored objects.
