When it comes to parallel operations, there are fundamentally two methods of distributing data between the parallel (PX) server sets: hash-hash and broadcast distribution. Suppose you are going to run a parallel HASH JOIN between two full table scans with a Degree of Parallelism (DOP) of 2. Since the HASH JOIN operation needs two distinct sets of PX servers, there will actually always be 2*DOP = 4 slaves working in parallel, two slaves per PX set: (PX1, PX2) for the first set and (PX3, PX4) for the second. Each table in the join is read in parallel by one of the PX server sets. When a PX server set has finished collecting its data set, it needs to distribute it to the subsequent parallel HASH JOIN operation. This distribution is commonly done using either a hash-hash or a broadcast method. With the former (hash-hash), the result sets gathered by the parallel scans of both the build and the probe tables are hashed on the join key and sent to the parallel server set responsible for the hash join operation. With a broadcast distribution, instead of redistributing rows from both result sets, Oracle sends the smaller result set in its entirety to every parallel server slave of the set responsible for the subsequent parallel hash join.
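To make the two methods concrete, here is how each can be forced with the standard pq_distribute hint; the tables t1 and t2 and the DOP are hypothetical, and leading(t1) makes t1 the build row source:

-- hash-hash: rows of both row sources are hashed on the join key
-- and redistributed to the consumer slaves
select /*+ parallel(2) leading(t1) use_hash(t2)
           pq_distribute(t2 hash hash) */
       t1.id, t2.val
from   t1, t2
where  t1.id = t2.id;

-- broadcast: every row of the build row source (t1) is sent to every
-- consumer slave, while t2 rows are not redistributed at all
select /*+ parallel(2) leading(t1) use_hash(t2)
           pq_distribute(t2 broadcast none) */
       t1.id, t2.val
from   t1, t2
where  t1.id = t2.id;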
I emphasized the words smaller result set because smaller is relative to the other row source in the join. Would you consider a build table of 78 million rows a small data set? It might be considered smaller in the eyes of the Oracle optimizer when the probe table is estimated to generate 770 million rows. This is exactly what happened to me.
If this concept of parallel data distribution is still not clear to you and you want to understand what I encountered and how I managed to pull myself out of this nightmare, then reading on might be worth the effort.
I was asked to create a big table based on a join of two other big tables, such that doing it serially was practically impossible to complete in an acceptable execution time. I decided, therefore, to create it in parallel. After enabling parallel DML and forcing parallel DDL (the session setup is sketched below), I launched a create table statement with a DOP of 8 for both DML and DDL.
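The session settings follow the standard pattern (shown here without any degree attached to the FORCE clause):

SQL> alter session enable parallel dml;

Session altered.

SQL> alter session force parallel ddl;

Session altered.

The create table statement itself was: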
SQL> create /*+ parallel(8) */ table table_of_dates
     tablespace dtbs
     pctfree 0
     as
     select /*+ parallel(8) full(t) full(tr) */
            t.tr_tabl_id
          , t.table1_date_time
          , t.writing_date
          , min(tr.arrivl_date)
          , max(tr.arrivl_date)
     from table1 t
     left outer join table2 tr
       on t.tr_tabl_id = tr.tr_tabl_id
     join table3
       on t.order_id = table3.order_id
      and tr.status not in ('CANCELED')
     where t.writing_date <= to_date('17.06.2011', 'dd.mm.yyyy')
       and table3.order_type = 'Broadcast'
     group by t.tr_tabl_id
            , t.table1_date_time
            , t.writing_date;

create /*+ parallel(8) */
*
ERROR at line 1:
ORA-12801: error signaled in parallel query server P013
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP
As you can see, it failed with an ORA-01652 error. Below is the corresponding Real-Time SQL Monitoring report, showing that the HASH JOIN operation at line 7 reached 67GB of TEMP, which is beyond the size of the physical TEMP tablespace (64GB), hence the ORA-01652 error.
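The report below was pulled the usual way with DBMS_SQLTUNE; the sql_id value is a placeholder:

SQL> select dbms_sqltune.report_sql_monitor(sql_id => '&sql_id', type => 'TEXT')
     from   dual;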
Parallel Execution Details (DOP=8, Servers Allocated=16)
SQL Plan Monitoring Details (Plan Hash Value=3645515647)
========================================================================================
| Id |             Operation              |     Name     | Execs |   Rows   | Temp  |
|    |                                    |              |       | (Actual) | (Max) |
========================================================================================
|  0 | CREATE TABLE STATEMENT             |              |    17 |          |       |
|  1 |  PX COORDINATOR                    |              |    17 |          |       |
|  2 |   PX SEND QC (RANDOM)              | :TQ10003     |       |          |       |
|  3 |    LOAD AS SELECT                  |              |       |          |       |
|  4 |     HASH GROUP BY                  |              |       |          |       |
|  5 |      PX RECEIVE                    |              |       |          |       |
|  6 |       PX SEND HASH                 | :TQ10002     |     8 |          |       |
|  7 |        HASH JOIN                   |              |     8 |        0 |   67G |
|  8 |         PX RECEIVE                 |              |     8 |     626M |       |
|  9 |          PX SEND BROADCAST         | :TQ10001     |     8 |     626M |       |
| 10 |           HASH JOIN                |              |     8 |      78M |       |
| 11 |            PX RECEIVE              |              |     8 |     372K |       |
| 12 |             PX SEND BROADCAST      | :TQ10000     |     8 |     372K |       |
| 13 |              PX BLOCK ITERATOR     |              |     8 |    46481 |       |
| 14 |               INDEX FAST FULL SCAN | IDX_TABLE3_3 |   182 |    46481 |       |
| 15 |            PX BLOCK ITERATOR       |              |     8 |      88M |       |
| 16 |             TABLE ACCESS FULL      | TABLE1       |   120 |      88M |       |
| 17 |         PX BLOCK ITERATOR          |              |     8 |     717M |       |
| 18 |          TABLE ACCESS FULL         | TABLE2       |   233 |     717M |       |
========================================================================================
I was stumped for a few minutes looking at the above execution plan and finally decided to try a second create table with a reduced degree of parallelism (4 instead of 8). Here is what I got:
SQL> create /*+ parallel(4) */ table table_of_dates
     tablespace dtbs
     pctfree 0
     as
     select /*+ parallel(4) full(t) full(tr) */
            t.tr_tabl_id
          , t.table1_date_time
          , t.writing_date
          , min(tr.arrivl_date)
          , max(tr.arrivl_date)
     from table1 t
     left outer join table2 tr
       on t.tr_tabl_id = tr.tr_tabl_id
     join table3
       on t.order_id = table3.order_id
      and tr.status not in ('CANCELED')
     where t.writing_date <= to_date('17.06.2011', 'dd.mm.yyyy')
       and table3.order_type = 'Broadcast'
     group by t.tr_tabl_id
            , t.table1_date_time
            , t.writing_date;

Table created.

Elapsed: 00:31:42.29
This time the table was successfully created, in approximately 32 minutes.
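As a quick sanity check that the statement really ran at the reduced DOP, v$pq_sesstat can be queried in the same session right after the create; the 'Allocation Height' statistic reports the effective DOP of the last parallel operation:

SQL> select statistic, last_query
     from   v$pq_sesstat
     where  statistic in ('Allocation Height', 'DML Parallelized');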
Before moving on to the next issue in the pipe, I wanted to understand why reducing the Degree of Parallelism (DOP) from 8 to 4 made the create statement succeed. The obvious first step was to compare the DOP 8 execution plan with the DOP 4 one. The first plan has already been shown above. The second one is presented below (reduced to the information that is vital to the aim of this article):
Parallel Execution Details (DOP=4, Servers Allocated=8)
SQL Plan Monitoring Details (Plan Hash Value=326881411)
=============================================================================================
| Id |               Operation                  |      Name      | Execs |   Rows   | Temp  |
|    |                                          |                |       | (Actual) | (Max) |
=============================================================================================
|  0 | CREATE TABLE STATEMENT                   |                |     9 |        4 |       |
|  1 |  PX COORDINATOR                          |                |     9 |        4 |       |
|  2 |   PX SEND QC (RANDOM)                    | :TQ10003       |     4 |        4 |       |
|  3 |    LOAD AS SELECT                        |                |     4 |        4 |       |
|  4 |     HASH GROUP BY                        |                |     4 |      75M |       |
|  5 |      PX RECEIVE                          |                |     4 |     168M |       |
|  6 |       PX SEND HASH                       | :TQ10002       |     4 |     168M |       |
|  7 |        HASH JOIN                         |                |     4 |     168M |   34G |
|  8 |         PX RECEIVE                       |                |     4 |     313M |       |
|  9 |          PX SEND BROADCAST               | :TQ10001       |     4 |     313M |       |
| 10 |           HASH JOIN                      |                |     4 |      78M |       |
| 11 |            BUFFER SORT                   |                |     4 |     186K |       |
| 12 |             PX RECEIVE                   |                |     4 |     186K |       |
| 13 |              PX SEND BROADCAST           | :TQ10000       |     1 |     186K |       |
| 14 |               TABLE ACCESS BY INDEX ROWID| TABLE3         |     1 |    46481 |       |
| 15 |                INDEX RANGE SCAN          | IDX_ORDER_TYPE |     1 |    46481 |       |
| 16 |            PX BLOCK ITERATOR             |                |     4 |      88M |       |
| 17 |             TABLE ACCESS FULL            | TABLE1         |   115 |      88M |       |
| 18 |         PX BLOCK ITERATOR                |                |     4 |     770M |       |
| 19 |          TABLE ACCESS FULL               | TABLE2         |   256 |     770M |       |
=============================================================================================
They do not share exactly the same execution plan (they have two different plan hash values). The irritating question was: why, when the degree of parallelism was halved from 8 to 4, did the same SQL statement need almost half the TEMP space (34GB) and complete successfully?
The answer lies in the parallel distribution method (PQ Distrib) used by the parallel server sets to distribute their collected rows to the subsequent parallel server set (the one doing the hash join).
In the DOP 4 execution plan above we can see four producer/consumer stages, which I will call PX1 to PX4 (Oracle actually allocates two PX server sets of four slaves each and reuses them across the stages), each stage being responsible for filling one of the virtual TQ tables: TQ10000, TQ10001, TQ10002 and TQ10003. Here is how to read the above execution plan:
- The first set of slaves (PX1) reads TABLE3 and broadcasts its data set to the second set (PX2) through the TQ10000 virtual table.
- PX2 reads TABLE1, hash joins it with the data set it has received from PX1 (TQ10000), and broadcasts its result to the next set, PX3, via the second virtual table, TQ10001.
- PX3 probes TABLE2 via a parallel full scan and hash joins it with the build result set (TQ10001) it has received from PX2. This operation fills the third virtual table, TQ10002, which is sent to the next and last set, PX4, using a hash distribution.
- Finally, PX4 receives the TQ10002 data, performs the hash group by, fills the last virtual table (TQ10003) and sends it to the query coordinator (QC), which finishes the job by creating the table_of_dates table.
That is a simple way of reading a parallel execution plan. By spelling out the above parallel operations I wanted to emphasize that the data produced by the PX2 set of slaves (the HASH JOIN operation at line 10) is broadcast (the PX SEND BROADCAST operation at line 9) to the next set, PX3. This means that each slave of the PX2 set passes every row it has produced to every slave of the PX3 set. The typical reason for choosing this distribution method is that the data set of the first row source (78M rows in this case) is "smaller" than the second row source it is to be joined with (770M rows).
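The distribution method chosen for each PX SEND operation is named in the PQ Distrib column, which standard DBMS_XPLAN usage displays for the last statement executed in the session:

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'BASIC +PARALLEL'));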
To make the picture clear, let's zoom in on the hash join operation in the DOP 4 plan:
Parallel Execution Details (DOP=4, Servers Allocated=8)
======================================================================
| Id |      Operation       |   Name   | Execs |   Rows   | Temp  |
|    |                      |          |       | (Actual) | (Max) |
======================================================================
|  7 | HASH JOIN            |          |     4 |     168M |   34G |
|  8 |  PX RECEIVE          |          |     4 |     313M |       |
|  9 |   PX SEND BROADCAST  | :TQ10001 |     4 |     313M |       |
| 10 |    HASH JOIN         |          |     4 |      78M |       |
| 18 |  PX BLOCK ITERATOR   |          |     4 |     770M |       |
| 19 |   TABLE ACCESS FULL  | TABLE2   |   256 |     770M |       |
======================================================================
The "smaller" result set produced by the PX2 set of slaves at line 10 is 78M rows. Since Oracle decided to broadcast those 78M rows towards the PX3 set, the data was duplicated as many times as there are slaves in the PX3 set to receive it. With a DOP of 4, this means that the HASH JOIN operation at line 7 received 4*78M ≈ 313M rows (operation at line 8), which had to be built and hashed in memory before probing the result set produced by the second parallel server set (operations 18-19).
Double the DOP, keep the same parallel distribution method, and the same HASH JOIN operation has to build and hash a table of 8*78M ≈ 626M rows, which ultimately required more than 67GB of TEMP space (instead of 34GB at DOP 4), as shown below:
Parallel Execution Details (DOP=8, Servers Allocated=16)
======================================================================
| Id |      Operation       |   Name   | Execs |   Rows   | Temp  |
|    |                      |          |       | (Actual) | (Max) |
======================================================================
|  7 | HASH JOIN            |          |     8 |        0 |   67G |
|  8 |  PX RECEIVE          |          |     8 |     626M |       |
|  9 |   PX SEND BROADCAST  | :TQ10001 |     8 |     626M |       |
| 10 |    HASH JOIN         |          |     8 |      78M |       |
| 17 |  PX BLOCK ITERATOR   |          |     8 |     717M |       |
| 18 |   TABLE ACCESS FULL  | TABLE2   |   233 |     717M |       |
======================================================================
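The row multiplication can also be checked empirically. Right after running a parallel statement, v$pq_tqstat (populated only in the session that executed it) shows how many rows each slave produced into and consumed from every table queue; for a broadcast TQ, the consumer row total equals the producer row total multiplied by the number of consumer slaves:

SQL> select dfo_number, tq_id, server_type, process, num_rows
     from   v$pq_tqstat
     order  by dfo_number, tq_id, server_type, process;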
Now that I knew where the initial ORA-01652 error was coming from, changing the parallel distribution method from broadcast to hash-hash might do the job without requiring more than 64GB of TEMP space. A simple way to accomplish this is to hint the parallel select with the desired distribution method. From the outline of the successful DOP 4 execution plan I took the appropriate hint (pq_distribute(alias hash hash)) and adapted it, as shown next.
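The outline section of a cursor can be dumped with the OUTLINE format modifier of DBMS_XPLAN; the sql_id and child number are placeholders:

SQL> select * from table(dbms_xplan.display_cursor('&sql_id', 0, 'BASIC +OUTLINE'));

With the pq_distribute and px_join_filter hints lifted from that outline, the create table became: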
SQL> create /*+ parallel(8) */ table table_of_dates
     tablespace dtbs
     pctfree 0
     as
     select /*+ parallel(8) full(t) full(tr)
                pq_distribute(@"SEL$E07F6F7C" "T"@"SEL$2" hash hash)
                px_join_filter(@"SEL$E07F6F7C" "T"@"SEL$2")
                pq_distribute(@"SEL$E07F6F7C" "TR"@"SEL$1" hash hash)
                px_join_filter(@"SEL$E07F6F7C" "TR"@"SEL$1") */
            t.tr_tabl_id
          , t.table1_date_time
          , t.writing_date
          , min(tr.arrivl_date)
          , max(tr.arrivl_date)
     from table1 t
     left outer join table2 tr
       on t.tr_tabl_id = tr.tr_tabl_id
     join table3
       on t.order_id = table3.order_id
      and tr.status not in ('CANCELED')
     where t.writing_date <= to_date('17.06.2011', 'dd.mm.yyyy')
       and table3.order_type = 'Broadcast'
     group by t.tr_tabl_id
            , t.table1_date_time
            , t.writing_date;

Table created.

Elapsed: 00:12:46.33
And the job was done in less than 13 minutes instead of the initial 32 minutes with DOP 4.
The new execution plan is:
SQL Plan Monitoring Details (Plan Hash Value=5257928)
========================================================================================
| Id |             Operation              |     Name     | Execs |   Rows   | Temp  |
|    |                                    |              |       | (Actual) | (Max) |
========================================================================================
|  0 | CREATE TABLE STATEMENT             |              |    17 |        8 |       |
|  1 |  PX COORDINATOR                    |              |    17 |        8 |       |
|  2 |   PX SEND QC (RANDOM)              | :TQ10004     |     8 |        8 |       |
|  3 |    LOAD AS SELECT                  |              |     8 |        8 |       |
|  4 |     HASH GROUP BY                  |              |     8 |      75M |   10G |
|  5 |      HASH JOIN                     |              |     8 |     168M |       |
|  6 |       JOIN FILTER CREATE           | :BF0000      |     8 |      78M |       |
|  7 |        PX RECEIVE                  |              |     8 |      78M |       |
|  8 |         PX SEND HASH               | :TQ10002     |     8 |      78M |       |
|  9 |          HASH JOIN BUFFERED        |              |     8 |      78M |    3G |
| 10 |           JOIN FILTER CREATE       | :BF0001      |     8 |    46481 |       |
| 11 |            PX RECEIVE              |              |     8 |    46481 |       |
| 12 |             PX SEND HASH           | :TQ10000     |     8 |    46481 |       |
| 13 |              PX BLOCK ITERATOR     |              |     8 |    46481 |       |
| 14 |               INDEX FAST FULL SCAN | IDX_TABLE3_3 |   181 |    46481 |       |
| 15 |           PX RECEIVE               |              |     8 |      88M |       |
| 16 |            PX SEND HASH            | :TQ10001     |     8 |      88M |       |
| 17 |             JOIN FILTER USE        | :BF0001      |     8 |      88M |       |
| 18 |              PX BLOCK ITERATOR     |              |     8 |      88M |       |
| 19 |               TABLE ACCESS FULL    | TABLE1       |   121 |      88M |       |
| 20 |       PX RECEIVE                   |              |     8 |     770M |       |
| 21 |        PX SEND HASH                | :TQ10003     |     8 |     770M |       |
| 22 |         JOIN FILTER USE            | :BF0000      |     8 |     770M |       |
| 23 |          PX BLOCK ITERATOR         |              |     8 |     770M |       |
| 24 |           TABLE ACCESS FULL        | TABLE2       |   245 |     770M |       |
========================================================================================
Changing the parallel distribution method from broadcast to hash-hash not only reduced the TEMP space usage, it also cut the execution time of the create table by more than half.
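When experimenting with such statements, it is also worth watching live TEMP consumption from another session. A minimal sketch against v$tempseg_usage, assuming the usual 8K block size, could look like this:

SQL> select sql_id, tablespace, segtype,
            round(sum(blocks) * 8 / 1024 / 1024, 1) as gb
     from   v$tempseg_usage
     group  by sql_id, tablespace, segtype
     order  by gb desc;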
Conclusion
Watch carefully the degree of parallelism you are going to use. With a broadcast distribution of a relatively big "smaller" result set, you might end up devouring a huge amount of TEMP space and failing with ORA-01652: unable to extend temp segment by 128 in tablespace TEMP. Fortunately, with the arrival of the 12c Adaptive Distribution Methods and the new HYBRID HASH parallel distribution, the STATISTICS COLLECTOR operation placed below the distribution operation dynamically switches from a BROADCAST to a HASH distribution whenever the number of rows to be distributed exceeds a threshold of 2*DOP. With my initial DOP of 8 the threshold would have been 16 rows, far below the 78 million rows that had to be distributed further up.
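For illustration only (this fragment is schematic, not taken from the system above), the 12c adaptive plan shape places the statistics collector directly below the hybrid hash send:

---------------------------------------------
| Id | Operation                 | Name     |
---------------------------------------------
| .. |  PX SEND HYBRID HASH      | :TQ10001 |
| .. |   STATISTICS COLLECTOR    |          |
| .. |    PX BLOCK ITERATOR      |          |
| .. |     TABLE ACCESS FULL     | T_SMALL  |
---------------------------------------------

At run time the collector counts rows up to the 2*DOP threshold, and the send operation above it then settles on a broadcast or a hash distribution accordingly.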