Mohamed Houri’s Oracle Notes

February 24, 2015

Parallel Query, Broadcast distribution and TEMP space

Filed under: Parallel Processing — hourim @ 6:24 pm

When it comes to parallel operations there are fundamentally two methods of distributing data between the individual parallel (PX) server sets: hash-hash and broadcast. Suppose you are going to run a parallel HASH JOIN between two full table scans with a Degree of Parallelism (DOP) of 2. Since the HASH JOIN operation needs two distinct sets of PX servers, there will actually always be 2*DOP = 4 slaves working in parallel: two slaves per PX set, (PX1, PX2) for the first set and (PX3, PX4) for the second set. Each table in the join is read in parallel by one of the PX server sets. When a PX server set has finished collecting its data set, it needs to distribute it to the PX server set performing the subsequent parallel HASH JOIN. This distribution is commonly done using either a hash-hash or a broadcast distribution. With the former (hash-hash), the result sets gathered by the parallel scans of both the build and the probe tables are hash-distributed to the parallel server set responsible for the hash join. With a broadcast distribution, instead of distributing rows from both result sets, Oracle sends the smaller result set in its entirety to every parallel server slave of the set responsible for the subsequent parallel hash join.
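If you want to see this row traffic for yourself, one quick way (a sketch, to be run in the same session immediately after the parallel statement completes) is to query v$pq_tqstat, which shows how many rows each producer slave sent and each consumer slave received per table queue:

   select dfo_number, tq_id, server_type, process, num_rows
   from   v$pq_tqstat
   order  by dfo_number, tq_id, server_type desc, process;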

I bolded the words smaller result set to emphasize that smaller is relative to the other row source in the join. Would you consider a build table of 78 million rows to be a small data set? It might be considered smaller in the eyes of the Oracle optimizer when the probe table is estimated to generate 770 million rows. This is exactly what happened to me.

If this concept of parallel data distribution is still not clear to you and you want to understand what I encountered and how I managed to pull myself out of this nightmare, then continuing to read this article might be worth the effort.

I was asked to create a big table based on a join of two other big tables; doing this serially was practically impossible to complete in an acceptable execution time, so I decided to create it in parallel. After having enabled parallel DML and forced parallel DDL, I launched the following create table statement with a DOP of 8 for both DML and DDL.
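For reference, enabling parallel DML and forcing parallel DDL at the session level amounts to something along these lines (a sketch; the DOP of 8 matches the hints used below):

   alter session enable parallel dml;
   alter session force parallel ddl parallel 8;
   alter session force parallel dml parallel 8;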

SQL> create /*+ parallel(8) */
       table table_of_dates
       tablespace dtbs
       pctfree 0
    as
    select
        /*+ parallel(8)
          full(t)
          full(tr)
      */
       t.tr_tabl_id
     , t.table1_date_time
     , t.writing_date
     , min(tr.arrivl_date)
     , max(tr.arrivl_date)
   from
      table1 t
   left outer join table2 tr
   on t.tr_tabl_id = tr.tr_tabl_id
   join table3
   on t.order_id = table3.order_id
   and tr.status not in ('CANCELED')
   where t.writing_date <= to_date('17.06.2011', 'dd.mm.yyyy')
   and table3.order_type = 'Broadcast'
   group by
         t.tr_tabl_id
       , t.table1_date_time
       , t.writing_date;

create /*+ parallel(8) */
*
ERROR at line 1:
ORA-12801: error signaled in parallel query server P013
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP

As you can see, it failed with an ORA-01652 error. Below is the corresponding Real Time SQL Monitoring report, showing that the HASH JOIN operation at line 7 reached 67GB of TEMP, which is beyond the size of the physical TEMP tablespace (64GB), hence the ORA-01652 error.

Parallel Execution Details (DOP=8, Servers Allocated=16)
SQL Plan Monitoring Details (Plan Hash Value=3645515647)
=====================================================================================
| Id |              Operation              |        Name | Execs |   Rows   | Temp  |
|    |                                     |             |       | (Actual) | (Max) |
=====================================================================================
|  0 | CREATE TABLE STATEMENT              |             |    17 |          |       |
|  1 |   PX COORDINATOR                    |             |    17 |          |       |
|  2 |    PX SEND QC (RANDOM)              | :TQ10003    |       |          |       |
|  3 |     LOAD AS SELECT                  |             |       |          |       |
|  4 |      HASH GROUP BY                  |             |       |          |       |
|  5 |       PX RECEIVE                    |             |       |          |       |
|  6 |        PX SEND HASH                 | :TQ10002    |     8 |          |       |
|  7 |         HASH JOIN                   |             |     8 |        0 |   67G |
|  8 |          PX RECEIVE                 |             |     8 |     626M |       |
|  9 |           PX SEND BROADCAST         | :TQ10001    |     8 |     626M |       |
| 10 |            HASH JOIN                |             |     8 |      78M |       |
| 11 |             PX RECEIVE              |             |     8 |     372K |       |
| 12 |              PX SEND BROADCAST      | :TQ10000    |     8 |     372K |       |
| 13 |               PX BLOCK ITERATOR     |             |     8 |    46481 |       |
| 14 |                INDEX FAST FULL SCAN | IDX_TABLE3_3|   182 |    46481 |       |
| 15 |             PX BLOCK ITERATOR       |             |     8 |      88M |       |
| 16 |              TABLE ACCESS FULL      | TABLE1      |   120 |      88M |       |
| 17 |          PX BLOCK ITERATOR          |             |     8 |     717M |       |
| 18 |           TABLE ACCESS FULL         | TABLE2      |   233 |     717M |       |
=====================================================================================
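
As an aside, a quick way to check how big TEMP actually is and who is consuming it while such a statement runs is a couple of queries along these lines (a sketch; the 8K block size is an assumption):

   select tablespace_name, round(sum(bytes)/1024/1024/1024) size_gb
   from   dba_temp_files
   group  by tablespace_name;

   select s.sid, s.sql_id, u.tablespace
        , round(u.blocks*8192/1024/1024/1024, 1) used_gb
   from   v$tempseg_usage u
   join   v$session s on s.saddr = u.session_addr
   order  by u.blocks desc;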

I spent a few minutes puzzling over the above execution plan and finally decided to try a second create table with a reduced degree of parallelism (4 instead of 8). Here is what I got:

SQL> create /*+ parallel(4) */
       table table_of_dates
       tablespace dtbs
       pctfree 0
    as
    select
        /*+ parallel(4)
         full(t)
         full(tr)
      */
       t.tr_tabl_id
     , t.table1_date_time
     , t.writing_date
     , min(tr.arrivl_date)
     , max(tr.arrivl_date)
   from
      table1 t
   left outer join table2 tr
   on t.tr_tabl_id = tr.tr_tabl_id
   join table3
   on t.order_id = table3.order_id
   and tr.status not in ('CANCELED')
   where t.writing_date <= to_date('17.06.2011', 'dd.mm.yyyy')
   and table3.order_type = 'Broadcast'
   group by
         t.tr_tabl_id
       , t.table1_date_time
       , t.writing_date;

Table created.
Elapsed: 00:31:42.29

This time the table was successfully created, in roughly 32 minutes.
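
The Real Time SQL Monitoring reports shown in this article can be pulled for a given sql_id with something like the following (a sketch; it requires the Tuning Pack licence, and &sql_id is a placeholder):

   select dbms_sqltune.report_sql_monitor(
            sql_id       => '&sql_id',
            type         => 'TEXT',
            report_level => 'ALL') report
   from   dual;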

Before moving on to the next issue in the pipe, I wanted to understand why reducing the Degree of Parallelism (DOP) from 8 to 4 made the create statement succeed. The obvious thing to attempt was to compare the DOP 8 execution plan with the DOP 4 one. The first plan has already been shown above. The second one is presented below (reduced to the information that is vital to the aim of this article):

Parallel Execution Details (DOP=4 , Servers Allocated=8)
SQL Plan Monitoring Details (Plan Hash Value=326881411)
=============================================================================================
| Id |                 Operation                  |       Name     | Execs |   Rows   |Temp |
|    |                                            |                |       | (Actual) |(Max)|
=============================================================================================
|  0 | CREATE TABLE STATEMENT                     |                |     9 |        4 |     |
|  1 |   PX COORDINATOR                           |                |     9 |        4 |     |
|  2 |    PX SEND QC (RANDOM)                     | :TQ10003       |     4 |        4 |     |
|  3 |     LOAD AS SELECT                         |                |     4 |        4 |     |
|  4 |      HASH GROUP BY                         |                |     4 |      75M |     |
|  5 |       PX RECEIVE                           |                |     4 |     168M |     |
|  6 |        PX SEND HASH                        | :TQ10002       |     4 |     168M |     |
|  7 |         HASH JOIN                          |                |     4 |     168M | 34G |
|  8 |          PX RECEIVE                        |                |     4 |     313M |     |
|  9 |           PX SEND BROADCAST                | :TQ10001       |     4 |     313M |     |
| 10 |            HASH JOIN                       |                |     4 |      78M |     |
| 11 |             BUFFER SORT                    |                |     4 |     186K |     |
| 12 |              PX RECEIVE                    |                |     4 |     186K |     |
| 13 |               PX SEND BROADCAST            | :TQ10000       |     1 |     186K |     |
| 14 |                TABLE ACCESS BY INDEX ROWID | TABLE3         |     1 |    46481 |     |
| 15 |                 INDEX RANGE SCAN           | IDX_ORDER_TYPE |     1 |    46481 |     |
| 16 |             PX BLOCK ITERATOR              |                |     4 |      88M |     |
| 17 |              TABLE ACCESS FULL             | TABLE1         |   115 |      88M |     |
| 18 |          PX BLOCK ITERATOR                 |                |     4 |     770M |     |
| 19 |           TABLE ACCESS FULL                | TABLE2         |   256 |     770M |     |
=============================================================================================

They don’t share exactly the same execution plan (they have two different plan hash values). The irritating question was: why, when the degree of parallelism is halved from 8 to 4, does the same SQL statement need roughly half the amount of TEMP space (34GB) and complete successfully?

The answer resides in the parallel distribution method (PQ Distrib) used by the parallel server sets to distribute their collected rows to the subsequent parallel server set (the one doing the hash join).

In the DOP 4 execution plan above we can see four PX parallel server sets, each one responsible for filling up one of the virtual TQ tables: TQ10000, TQ10001, TQ10002 and TQ10003. Here is how to read the above execution plan:

  1. The first set of slaves, PX1, reads TABLE3 and broadcasts its data set to the second set of slaves, PX2, through the TQ10000 virtual table.
  2. The PX2 set reads TABLE1, hash joins it with the data set it has received from PX1 (TQ10000) and broadcasts its result to the next parallel server set, PX3, via the second virtual table, TQ10001.
  3. The PX3 set probes TABLE2 via a parallel full scan and hash joins it with the build result set (TQ10001) it has received from the PX2 set. This operation ends up filling the third virtual TQ table, TQ10002, and sending it to the next and last PX server set, PX4, using a hash distribution.
  4. Finally, the PX4 set of slaves receives the TQ10002 data, applies the HASH GROUP BY, fills the last virtual table (TQ10003) and sends it to the query coordinator (QC), which ends up creating the table_of_dates table.

That is a simple way of reading a parallel execution plan. By listing the above parallel operations I wanted to emphasize that the data produced by the PX2 set of slaves (HASH JOIN operation at line 10) is broadcast (PX SEND BROADCAST operation at line 9) to the next parallel set, PX3. This means that each slave of the PX2 set passes every row it has produced to every slave of the PX3 set. A typical reason for choosing this distribution method is that the data set of the first row source (78M rows in this case) is ”smaller” than the second row source it is to be joined with (770M rows).
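
As an aside, the TQ, IN-OUT and PQ Distrib columns printed by dbms_xplan for a parallel cursor show this same information directly in the plan; a sketch (&sql_id is a placeholder):

   select * from table(dbms_xplan.display_cursor('&sql_id', null, 'TYPICAL'));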

To make the picture clearer, let's zoom in on the hash join operation in the DOP 4 plan:

Parallel Execution Details (DOP=4 , Servers Allocated=8)
======================================================================
| Id |        Operation    |       Name     | Execs |   Rows   |Temp |
|    |                     |                |       | (Actual) |(Max)|
======================================================================
|  7 |HASH JOIN            |                |     4 |     168M | 34G |
|  8 | PX RECEIVE          |                |     4 |     313M |     |
|  9 |  PX SEND BROADCAST  | :TQ10001       |     4 |     313M |     |
| 10 |   HASH JOIN         |                |     4 |      78M |     |
| 18 | PX BLOCK ITERATOR   |                |     4 |     770M |     |
| 19 |  TABLE ACCESS FULL  | TABLE2         |   256 |     770M |     |
======================================================================

The ”smaller” result set produced by the PX2 set of slaves at line 10 is 78M rows. Since Oracle decided to broadcast those 78M rows towards the PX3 set, they have been duplicated as many times as there are slaves in the PX3 set to receive the broadcast data. Since we have a DOP of 4, this means that the HASH JOIN operation at line 7 has received 4*78M ≈ 313M rows (operation at line 8), which had to be built and hashed in memory in order to probe the result set produced by the second parallel server set coming from operations 18-19.

Double the DOP, keep the same parallel distribution method, and the same HASH JOIN operation will have to build and hash a table of 8*78M ≈ 626M rows, which ultimately required more than 67GB of TEMP space (instead of the 34GB of the DOP 4 run), as shown below:

Parallel Execution Details (DOP=8, Servers Allocated=16)
===================================================================
| Id |     Operation     |        Name | Execs |   Rows   | Temp  |
|    |                   |             |       | (Actual) | (Max) |
===================================================================
|  7 |HASH JOIN          |             |     8 |        0 |   67G |
|  8 | PX RECEIVE        |             |     8 |     626M |       |
|  9 |  PX SEND BROADCAST| :TQ10001    |     8 |     626M |       |
| 10 |   HASH JOIN       |             |     8 |      78M |       |
| 17 | PX BLOCK ITERATOR |             |     8 |     717M |       |
| 18 |  TABLE ACCESS FULL| TABLE2      |   233 |     717M |       |
===================================================================

Now that I knew where the initial ORA-01652 error was coming from, changing the parallel distribution method from BROADCAST to HASH might do the job without requiring more than 64GB of TEMP space. A simple way to accomplish this is to hint the parallel select with the desired parallel distribution method. From the outline of the successful DOP 4 execution plan I took the appropriate hint (pq_distribute(alias hash hash)), adapted it and issued the following create table:
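
If you are wondering where such fully qualified hints come from, the Outline Data section of the cursor can be dumped with something along these lines (a sketch; &sql_id is a placeholder):

   select * from table(dbms_xplan.display_cursor('&sql_id', null, 'ADVANCED'));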

SQL> create /*+ parallel(8) */
       table table_of_dates
       tablespace dtbs
       pctfree 0
    as
    select
        /*+ parallel(8)
          full(t)
          full(tr)
	  pq_distribute(@"SEL$E07F6F7C" "T"@"SEL$2" hash hash)
          px_join_filter(@"SEL$E07F6F7C" "T"@"SEL$2")
          pq_distribute(@"SEL$E07F6F7C" "TR"@"SEL$1" hash hash)
          px_join_filter(@"SEL$E07F6F7C" "TR"@"SEL$1")
      */
       t.tr_tabl_id
     , t.table1_date_time
     , t.writing_date
     , min(tr.arrivl_date)
     , max(tr.arrivl_date)
   from
      table1 t
   left outer join table2 tr
   on t.tr_tabl_id = tr.tr_tabl_id
   join table3
   on t.order_id = table3.order_id
   and tr.status not in ('CANCELED')
   where t.writing_date <= to_date('17.06.2011', 'dd.mm.yyyy')
   and table3.order_type = 'Broadcast'
   group by
         t.tr_tabl_id
       , t.table1_date_time
       , t.writing_date;	   

Table created.
Elapsed: 00:12:46.33

And the job was done in less than 13 minutes, instead of the roughly 32 minutes it took with DOP 4.

The new execution plan is:

SQL Plan Monitoring Details (Plan Hash Value=5257928)
====================================================================================
| Id |              Operation              |        Name |Execs |   Rows   | Temp  |
|    |                                     |             |      | (Actual) | (Max) |
====================================================================================
|  0 | CREATE TABLE STATEMENT              |             |   17 |        8 |       |
|  1 |   PX COORDINATOR                    |             |   17 |        8 |       |
|  2 |    PX SEND QC (RANDOM)              | :TQ10004    |    8 |        8 |       |
|  3 |     LOAD AS SELECT                  |             |    8 |        8 |       |
|  4 |      HASH GROUP BY                  |             |    8 |      75M |   10G |
|  5 |       HASH JOIN                     |             |    8 |     168M |       |
|  6 |        JOIN FILTER CREATE           | :BF0000     |    8 |      78M |       |
|  7 |         PX RECEIVE                  |             |    8 |      78M |       |
|  8 |          PX SEND HASH               | :TQ10002    |    8 |      78M |       |
|  9 |           HASH JOIN BUFFERED        |             |    8 |      78M |    3G |
| 10 |            JOIN FILTER CREATE       | :BF0001     |    8 |    46481 |       |
| 11 |             PX RECEIVE              |             |    8 |    46481 |       |
| 12 |              PX SEND HASH           | :TQ10000    |    8 |    46481 |       |
| 13 |               PX BLOCK ITERATOR     |             |    8 |    46481 |       |
| 14 |                INDEX FAST FULL SCAN | IDX_TABLE3_3|  181 |    46481 |       |
| 15 |            PX RECEIVE               |             |    8 |      88M |       |
| 16 |             PX SEND HASH            | :TQ10001    |    8 |      88M |       |
| 17 |              JOIN FILTER USE        | :BF0001     |    8 |      88M |       |
| 18 |               PX BLOCK ITERATOR     |             |    8 |      88M |       |
| 19 |                TABLE ACCESS FULL    | TABLE1      |  121 |      88M |       |
| 20 |        PX RECEIVE                   |             |    8 |     770M |       |
| 21 |         PX SEND HASH                | :TQ10003    |    8 |     770M |       |
| 22 |          JOIN FILTER USE            | :BF0000     |    8 |     770M |       |
| 23 |           PX BLOCK ITERATOR         |             |    8 |     770M |       |
| 24 |            TABLE ACCESS FULL        | TABLE2      |  245 |     770M |       |
====================================================================================

Changing the parallel distribution method from broadcast to hash-hash not only reduced the TEMP space usage but also more than halved the execution time of the create table.

Conclusion
Watch carefully the degree of parallelism you are going to use. With a broadcast distribution of a relatively big ”smaller result set” you might end up devouring a huge amount of TEMP space and failing with ORA-01652: unable to extend temp segment by 128 in tablespace TEMP. Fortunately, with the arrival of the 12c adaptive distribution methods and the new HYBRID HASH parallel distribution method, the STATISTICS COLLECTOR operation placed below the distribution operation will switch dynamically from a BROADCAST to a HASH distribution whenever the number of rows to be distributed exceeds a threshold of 2*DOP. With my initial DOP of 8 I would have had a threshold of 16, which is far below the 78 million rows I had to distribute further up.
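
For illustration only (this is a generic 12c plan shape, not taken from this system), the adaptive method typically shows up as a STATISTICS COLLECTOR sitting right under the HYBRID HASH distribution:

   | PX SEND HYBRID HASH     | :TQ10001 |
   |  STATISTICS COLLECTOR   |          |
   |   PX BLOCK ITERATOR     |          |
   |    TABLE ACCESS FULL    | T1       |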

11 Comments »

  1. You say “Carefully watch the DOP”, but Oracle recommend you don’t specify any degree at all.
    You should just be using the plain vanilla /*+ PARALLEL */.

    Secondly, why are you telling the optimiser to do full scans?
    Surely you know you should just leave it to its own devices. Ensure all relevant indexes are present & stats. are gathered.

    Then, all you need is:

    create 
           table table_of_dates
           PARALLEL
           tablespace dtbs
           pctfree 0
        as
        select /*+ PARALLEL */
           t.tr_tabl_id
         , t.table1_date_time
         , t.writing_date
         , min(tr.arrivl_date)
         , max(tr.arrivl_date)
       from  .....
    

    I would consider table compression too with such a large amount of data.

    Comment by Martin Rose — February 24, 2015 @ 7:25 pm | Reply

    • Hi Martin

      If we lived in an ideal world your suggestions might well be the right ones. But in the real world you very often face situations where, despite knowing that statistics should be adequately gathered, you are nevertheless asked to solve the issue without changing the statistics collection policy, because gathering stats for large tables can be very expensive and going from histogram to no histogram needs several days of testing before being implemented in Production.

      And by the way, I first started by following exactly what you suggested in your comment, i.e. letting the Optimizer choose the right plan. Unfortunately the create table was still running 10 hours after I had launched it, when I decided to kill the underlying session. The corresponding execution plan showed that the Optimizer had opted for a costly nested loop with an outer row source of 100 million rows driving an inner table and index access executed 100 million times.

      As for your compression suggestion, the OLTP option needs an extra licence that not everyone is ready to pay for. Basic compression would have been okay provided the subsequent inserts are done via direct path and provided direct path is not silently ignored.

      Finally, as for the DOP, I totally agree with you that we should leave the task of determining the adequate parallel degree to Oracle, but this holds provided that the I/O calibration statistics are not missing and that parallel_degree_policy = AUTO, neither of which was the case in my create table environment.
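
      A quick way to check both prerequisites (a sketch, run from SQL*Plus):

         -- is there any I/O calibration data?
         select * from dba_rsrc_io_calibrate;

         -- what is the current policy?
         show parameter parallel_degree_policy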

      Comment by hourim — February 25, 2015 @ 11:34 am | Reply

      • You haven’t said what indexes are on the tables.

        I would expect to see the following at the very minimum:

        CREATE INDEX TABLE1_I1 ON TABLE1 (TR_TABL_ID, WRITING_DATE)
        COMPRESS 1;

        CREATE INDEX TABLE1_I2 ON TABLE1 (ORDER_ID);

        CREATE INDEX TABLE2_I1 ON TABLE2 (TR_TABL_ID);

        CREATE INDEX TABLE3_I1 ON TABLE3 (ORDER_ID);

        Comment by Martin Rose — February 25, 2015 @ 10:41 pm

  2. Here is how to calibrate your I/O.

    http://docs.oracle.com/cd/E11882_01/server.112/e41573/iodesign.htm#PFGRF94384

    And you set PARALLEL_DEGREE_POLICY = AUTO in the init.ora (or for your session, if you don’t want to do it globally, ALTER SESSION SET PARALLEL_DEGREE_POLICY = AUTO).
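
    For completeness, a minimal sketch of that documented procedure (the parameter values below are purely illustrative):

       set serveroutput on
       declare
         l_max_iops  pls_integer;
         l_max_mbps  pls_integer;
         l_latency   pls_integer;
       begin
         dbms_resource_manager.calibrate_io(
           num_physical_disks => 10,   -- illustrative value
           max_latency        => 20,   -- illustrative value (ms)
           max_iops           => l_max_iops,
           max_mbps           => l_max_mbps,
           actual_latency     => l_latency);
         dbms_output.put_line('max_iops='||l_max_iops||
                              ' max_mbps='||l_max_mbps||
                              ' latency='||l_latency);
       end;
       /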

    Comment by Martin Rose — February 25, 2015 @ 10:52 pm | Reply

    • There’s ADAPTIVE too.

      https://docs.oracle.com/database/121/REFRN/refrn10310.htm#REFRN10310

      Comment by Martin Rose — February 25, 2015 @ 11:14 pm | Reply

      • Interestingly, it says,

        “Automatic degree of parallelism will be enabled regardless of the value of PARALLEL_DEGREE_POLICY if a PARALLEL hint is used at the SQL statement level.”

        (So it appears you don’t need to set the parameter in the init.ora, or for the session, and that the parallel hint itself IS enough).

        Comment by Martin Rose — February 25, 2015 @ 11:20 pm

  3. Not really associated with this question, but I found this & thought I’d mention it.

    https://blogs.oracle.com/datawarehousing/entry/new_way_to_enable_parallel

    Comment by Martin Rose — March 7, 2015 @ 7:15 pm | Reply

    • Martin

      Yes thanks for that.

      I saw that tweet two weeks ago and followed the author, which allowed me to find a nice Oracle white paper about 12c Parallel Processing. That's why twitter rocks.

      Best regards

      Comment by hourim — March 8, 2015 @ 8:03 am | Reply

  4. […] I’ve added the pq_distribute (tab1 hash hash) hint above because several refreshes crashed because of the broadcast distribution that ended up by overconsuming TEMP space raising the now classical error: […]

    Pingback by Parallel refreshing a materialized view | Mohamed Houri’s Oracle Notes — April 18, 2015 @ 4:58 pm | Reply

  5. Mohamed,

    Thank you for this note. Although I am commenting so late, this note has been one of my favourites because I have learned a lot from it. Would you know what could be the reason if multiple parallel queries/DMLs start using BROADCAST distribution, or how I can find out the reason? I am asking because I am trying to troubleshoot an application batch job's performance: while it managed to run in acceptable time last week, it has regressed during this week's run and I can see that multiple parallel DMLs have switched to using BROADCAST distribution, despite having good enough estimates for the row sources.

    Comment by Narendra — December 5, 2023 @ 9:07 am | Reply

