Mohamed Houri’s Oracle Notes

March 5, 2018

SEMI JOIN in modern relational databases

Filed under: Oracle — hourim @ 8:17 pm

A semi join returns rows only from the first table provided at least one matching row is found in the second table. The difference between a conventional join and a semi join is that rows from the first table are returned once at most even when the join column of the second table contains duplicate rows. There is no ANSI SQL way to invoke a semi join. The following syntax doesn’t exist:

  SQL>  select 
          t1.* 
        from t1 
          semi join t2 
        on t1.n1= t2.n1

In almost all modern relational databases, the semi join logical operation is triggered via the predicate IN, ANY, SOME, EXISTS and DISTINCT. MS-SQL Server has an extra particularity which makes it using a semi join during an INTERSECT operation as we will see later in this article.

The following queries illustrate these diferent semi-join cases in a 12.2.0.1 Oracle release:

SQL> create table t1
     as
        select
            rownum n1,
            mod(rownum,5) n2,
            trunc( (rownum - 1 / 3) ) n3
        from
            dual
        connect by
            level <= 1e4; SQL> create table t2
     as
        select
            rownum n1,
            mod(rownum,3) n2,
            trunc( (rownum - 1 / 5) ) n3
        from
            dual
        connect by
        level <= 5;

1. Using IN, ANY and SOME

SQL> select 
       t1.n1
     from
       t1
     where
       t1.n1 in (select t2.n1 from t2);

SQL> select 
       t1.n1
     from
       t1
     where
      t1.n1 = any (select t2.n1 from t2);

SQL> select 
       t1.n1
     from
       t1
     where
    t1.n1 = some (select t2.n1 from t2);
   
Plan hash value: 1275841967
----------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |      5 |
|*  1 |  HASH JOIN RIGHT SEMI|      |      1 |      5 |      5 |
|   2 |   TABLE ACCESS FULL  | T2   |      1 |      5 |      5 |
|   3 |   TABLE ACCESS FULL  | T1   |      1 |  10000 |  10000 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."N1"="T2"."N1")

2. Using EXISTS

select 
   t1.n1
 from
   t1
 where
   exists (select null from t2 where t2.n1 = t1.n1);

Plan hash value: 1275841967
----------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |      5 |
|*  1 |  HASH JOIN RIGHT SEMI|      |      1 |      5 |      5 |
|   2 |   TABLE ACCESS FULL  | T2   |      1 |      5 |      5 |
|   3 |   TABLE ACCESS FULL  | T1   |      1 |  10000 |  10000 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."N1"="T1"."N1")

3. Using DISTINCT in INNER join and the effect of the 12cR1 Partial join

select 
  distinct
   t1.n1
 from
   t1
 inner join t2
 on (t1.n1 = t2.n1);

Plan hash value: 3703458891
---------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      5 |
|   1 |  HASH UNIQUE        |      |      1 |      5 |      5 |
|*  2 |   HASH JOIN SEMI    |      |      1 |      5 |      5 |
|   3 |    TABLE ACCESS FULL| T2   |      1 |      5 |      5 |
|   4 |    TABLE ACCESS FULL| T1   |      1 |  10000 |    256 | –- only 256 rows
---------------------------------------------------------------

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      DB_VERSION('12.2.0.1')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$58A6D7F6")
      MERGE(@"SEL$1" >"SEL$2")
      OUTLINE(@"SEL$2")
      OUTLINE(@"SEL$1")
      FULL(@"SEL$58A6D7F6" "T2"@"SEL$1")
      FULL(@"SEL$58A6D7F6" "T1"@"SEL$1")
      LEADING(@"SEL$58A6D7F6" "T2"@"SEL$1" "T1"@"SEL$1")
      USE_HASH(@"SEL$58A6D7F6" "T1"@"SEL$1")
      USE_HASH_AGGREGATION(@"SEL$58A6D7F6")
      PARTIAL_JOIN(@"SEL$58A6D7F6" "T1"@"SEL$1")
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   2 – access("T1"."N1"="T2"."N1")

The apparition of the semi join in the above execution plan is the result of the partial join transformation introduced by Oracle as from 12cR1. Earlier versions were not able to use this partial join transformation scanning, as such, the entire content of t1 table is scanned instead of tiny 256 rows as shown below:

select 
   /*+ NO_PARTIAL_JOIN(@"SEL$58A6D7F6" "T1"@"SEL$1") */
  distinct
   t1.n1
 from
   t1
 inner join t2
 on (t1.n1 = t2.n1);

Plan hash value: 975306333

---------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      5 |
|   1 |  HASH UNIQUE        |      |      1 |      5 |      5 |
|*  2 |   HASH JOIN         |      |      1 |      5 |      5 |
|   3 |    TABLE ACCESS FULL| T2   |      1 |      5 |      5 |
|   4 |    TABLE ACCESS FULL| T1   |      1 |  10000 |  10000 | -- 10,000 rows
---------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")

So far all the semi join logical operations have been executed via a HASH JOIN physical operation. This doesn’t mean that a semi joins can’t be used in a NESTED LOOPS physical operation. Here’s below the demonstratation:

SQL> create index t2_idx on t2(n1);
SQL> select 
       t1.n1
     from
       t1
     where
      t1.n1 in (select t2.n1 from t2);

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      5 |
|   1 |  NESTED LOOPS SEMI |        |      1 |      5 |      5 |
|   2 |   TABLE ACCESS FULL| T1     |      1 |  10000 |  10000 |
|*  3 |   INDEX RANGE SCAN | T2_IDX |  10000 |      1 |      5 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."N1"="T2"."N1")

4. SEMI join and uniqueness
I started this article by specifying that a semi join guaranties at most one row will be returned (per join column) from the left table even if there are duplicate rows in the join column of the second table. An example being worth a thousand words so here we go:

SQL> -- insert first a duplicate row in the rigth table
SQL> insert into t2 values(5,42,42);
SQL> commit;

-- first we will start using a conventional join
SQL> select 
   t1.n1
 from
   t1
 inner join t2
 on (t1.n1 = t2.n1);
        N1
----------
         1
         2
         3
         4
         5
         5
6 rows selected

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      6 |
|   1 |  NESTED LOOPS      |        |      1 |      5 |      6 |
|   2 |   TABLE ACCESS FULL| T1     |      1 |  10000 |  10000 |
|*  3 |   INDEX RANGE SCAN | T2_IDX |  10000 |      1 |      6 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 – access("T1"."N1"="T2"."N1")

-- and now a semi join
SQL> select 
       t1.n1
     from
       t1
     where
      t1.n1 in (select t2.n1 from t2);

        N1
----------
         1
         2
         3
         4
         5

5 rows selected
----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      5 |
|   1 |  NESTED LOOPS SEMI |        |      1 |      5 |      5 |
|   2 |   TABLE ACCESS FULL| T1     |      1 |  10000 |  10000 |
|*  3 |   INDEX RANGE SCAN | T2_IDX |  10000 |      1 |      5 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."N1"="T2"."N1")

It is for this reason that the semi join has been implemented: do not duplicate rows from the outer table when they multiple join with rows from the inner table.

Since now we know the semi join main goal we can infer that, as long as, Oracle is sure that the right table has no duplicate values in the join column, it will fall back to a simple inner join. Let’s check:

SQL> drop index t2_idx;
SQL> create unique index t2_uq_idx on t2(n1);
SQL> select
       t1.n1
     from
       t1
     where
       exists (select null from t2 where t2.n1 = t1.n1);

        N1
     ------
         1
         2
         3
         4
         5
-------------------------------------------------------------------
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |      1 |        |      5 |
|   1 |  NESTED LOOPS      |           |      1 |      5 |      5 |
|   2 |   TABLE ACCESS FULL| T1        |      1 |  10000 |  10000 |
|*  3 |   INDEX UNIQUE SCAN| T2_UQ_IDX |  10000 |      1 |      5 |
-------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."N1"="T1"."N1")

Notice that we have no semi join anymore. Thanks to the unique index on t2(n1) table, Oracle knows that there are no duplicate rows in the join column of the outer table. It can then reliably use an inner join because rows from the outer table will at most join exactly once with rows from the inner table.

There exists another way to make Oracle falling to an inner join instead of a semi join even when there are duplicate rows in the join column of the outer table. In fact whenever Oracle uses an operation that eliminates duplicate rows from the outer table (for example sort unique or group by) it can confidently use an inner join in place of a semi join. Here’s below one case demontrated:

SQL> drop index t2_uq_idx;
SQL> insert into t2 values(5,42,42);

SQL> create unique index t1_uq_idx on t1(n1);

SQL> select
       t1.n1
     from
       t1
     where
       exists (select null from t2 where t2.n1 = t1.n1);

        N1
     ------
         1
         2
         3
         4
         5
--------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |      1 |        |      5 |
|   1 |  NESTED LOOPS       |           |      1 |      5 |      5 |
|   2 |   SORT UNIQUE       |           |      1 |      5 |      5 |
|   3 |    TABLE ACCESS FULL| T2        |      1 |      5 |      6 |
|*  4 |   INDEX UNIQUE SCAN | T1_UQ_IDX |      5 |      1 |      5 | -- 5 rows
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T2"."N1"="T1"."N1")

As you can see Oracle didn’t use a semi join because it has scanned all rows from the outer table, ordered them and eliminated duplicate rows via the SORT UNIQUE operation. As such when joining with the t1 table Oracle knows that it can join t2 rows at most exactly once. Notice in passing that while the semi join has full scanned the t1 table (10,000 rows) the inner join touched only 5 distinct index keys.

5. SEMI join and foreign key integrity

Suppose now that we want to select rows from a child table matching their parent rows in the parent table.

–- drop and recreate table t1 and t2
SQL> create index t1_idx on t1(n1);
SQL> alter table t1 add constraint t1_uq unique (n1);
SQL> alter table t2 add constraint t2_t1_fk foreign key (n1) references t1(n1);

The setup is now done. But let’s disable first the foreign key, the unique constraint and try a semi join query:

SQL> alter table t1 disable constraint t1_uq;
SQL> alter table t2 disable constraint t2_t1_fk;

SQL> select
        t2.n1
     from
        t2
     where
       exists (select null from t1 where t1.n1 = t2.n1);

-----------------------------------------------------------------------
|   Id  | Operation               | Name   | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------------
|     0 | SELECT STATEMENT        |        |      1 |        |      5 |
|- *  1 |  HASH JOIN SEMI         |        |      1 |      5 |      5 |
|     2 |   NESTED LOOPS SEMI     |        |      1 |      5 |      5 |
|-    3 |    STATISTICS COLLECTOR |        |      1 |        |      5 |
|     4 |     TABLE ACCESS FULL   | T2     |      1 |      5 |      5 |
|  *  5 |    INDEX RANGE SCAN     | T1_IDX |      5 |  10000 |      5 |
|-    6 |   INDEX FAST FULL SCAN  | T1_IDX |      0 |  10000 |      0 |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."N1"="T2"."N1")
   5 - access("T1"."N1"="T2"."N1")

Note
-----
   - this is an adaptive plan (rows marked '-' are inactive)

As expected a classical semi join operation.

Let’s now enable the unique key of the parent table t1;

SQL> alter table t1 enable constraint t1_uq;

SQL> select
        t2.n1
     from
        t2
     where
       exists (select null from t1 where t1.n1 = t2.n1);

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      5 |
|   1 |  NESTED LOOPS      |        |      1 |      5 |      5 |
|   2 |   TABLE ACCESS FULL| T2     |      1 |      5 |      5 |
|*  3 |   INDEX RANGE SCAN | T1_IDX |      5 |      1 |      5 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."N1"="T2"."N1")

The semi join has gone because Oracle knows that there is a unique constraint on the join column from the outer table t1, it can then safely use an inner join.

Finally let’s enable the foreign key and re-execute the semi join query:

SQL> alter table t2 enable constraint t2_t1_fk;

SQL> select
        t2.n1
     from
        t2
     where
       exists (select null from t1 where t1.n1 = t2.n1);

-------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |
-------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      5 |
|*  1 |  TABLE ACCESS FULL| T2   |      1 |      5 |      5 |
-------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("T2"."N1" IS NOT NULL)

Oracle can use a trusted foreign key relationship to remove a logical semi-join and accesses only the t2 table.

5. SEMI join and INTERSECT

In SQL Server an intersect set operation can also be done via a semi-join as the following proves:

create table t1(n1 int, n2 int,n3 int) ;
create table t2(n1 int, n2 int, n3 int) ;

insert into t1(n1,n2,n3)
select top 10000
  row_number() over (order by a.name)
 ,row_number() over (order by a.name)%5
 ,(row_number() over (order by a.name) -1) /3
from sys.all_objects a cross join sys.all_objects;

insert into t2(n1,n2,n3)
select top 5
  row_number() over (order by a.name )
 ,row_number() over (order by a.name)%3
 ,(row_number() over (order by a.name)-1)/5
from sys.all_objects a cross join sys.all_objects;

select t2.n1 from t2 intersect select t1.n1 from t1;


But neither Oracle nor PostgreSQL use a semi join when evaluating an intersect set operation as the followings prove respectively:

–- Oracle 12cR2
SQL> select t2.n1 from t2 intersect select t1.n1 from t1;

---------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      5 |
|   1 |  INTERSECTION       |      |      1 |        |      5 |
|   2 |   SORT UNIQUE       |      |      1 |      6 |      6 |
|   3 |    TABLE ACCESS FULL| T2   |      1 |      6 |      6 |
|   4 |   SORT UNIQUE       |      |      1 |  10000 |  10000 |
|   5 |    TABLE ACCESS FULL| T1   |      1 |  10000 |  10000 |
---------------------------------------------------------------
-- PostgreSQL 10.1
create table t1 as
  select 
     generate_series(1, 1e4) n1
	,generate_series(1, 1e4)%5 n2
	,(generate_series(1, 1e4)-1)/3 n3;
	
create table t2 as
  select 
     generate_series(1, 5) n1
	,generate_series(1, 5)%3 n2
	,(generate_series(1, 5)-1)/5 n3;
	
select t2.n1 from t2 intersect select t1.n1 from t1;


postgres=# explain analyze select t2.n1 from t2 intersect select t1.n1 from t1;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 HashSetOp Intersect  (cost=0.00..347.00 rows=200 width=36) (actual time=13.186..13.189 rows=5 loops=1)
   ->  Append  (cost=0.00..316.90 rows=12040 width=36) (actual time=0.363..9.571 rows=10005 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..55.90 rows=2040 width=36) (actual time=0.362..0.369 rows=5 loops=1)
               ->  Seq Scan on t2  (cost=0.00..30.40 rows=2040 width=4) (actual time=0.050..0.053 rows=5 loops=1)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..261.00 rows=10000 width=9) (actual time=0.031..7.743 rows=10000 loops=1)
               ->  Seq Scan on t1  (cost=0.00..161.00 rows=10000 width=5) (actual time=0.029..4.478 rows=10000 loops=1)
 Planning time: 0.555 ms
 Execution time: 13.911 ms

6. SUMMARY

We have seen in this article how a semi join logical operation is triggered when using IN, ANY, SOME, EXISTS and DISTINCT in several modern relational databases. We’ve outlined as well that, in contrast to Oracle and PostgreSQL, SQL-Server can use a semi join during an INTERSECT set operation. We’ve demonstrated that while a regular joins can duplicate rows, semi joins do not produce any duplicates. They are defined to return (or not) rows only from one table of the join. We’ve also shown that, as long as the CBO is sure that the right table has no duplicate values in the join column, it will fall back to a simple inner join where extra transformations become possible and where the left table might not be fully scanned.

Advertisements

2 Comments »

  1. Mohamed,

    Nice article.

    Regarding the SQL-Server comment, though, Oracle has been able to handle conversion from intersect (and minus) to semi join (and anti join) since 10.1.x.x but the feature is not enabled unless you change the value of parameter _convert_set_to_join to true. It’s been around for such a long time that it still puzzles me that the default hasn’t changed.

    I did a quick google search because someone was blogging about it recently – but the first hit I found was in CBO-F, page 261 😉
    The earliest I found on the web was from Riyaj Shamsudeen: https://orainternals.wordpress.com/2010/05/01/query-transformation-part-1/
    And an interesting limitation from Stefan Koehler: http://www.soocs.de/public/research/170723_sjc_set_join_conversion_limitation_incl_workaround.txt

    Comment by Jonathan Lewis — March 6, 2018 @ 8:51 am | Reply

  2. Jonathan

    Thanks for your comment. I have indeed read a tweet from Nenad Noveljic about a concern he has with a wrong cardinality which he worked arroud using distinct. http://nenadnoveljic.com/blog/cardinality-estimation-minus-operator/

    But I haven’t been so smart to link this parameter with the absence of SEMI JOIN during an INTERSECT set operation.

    Nice spot on as always

    Best regards
    Mohamed

    Comment by hourim — March 6, 2018 @ 7:09 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create a free website or blog at WordPress.com.

Tony's Oracle Tips

Tony Hasler's light hearted approach to learning about Oracle

Richard Foote's Oracle Blog

Focusing Specifically On Oracle Indexes, Database Administration and Some Great Music

Hatem Mahmoud Oracle's blog

Just another Oracle blog : Database topics and techniques

Mohamed Houri’s Oracle Notes

Qui se conçoit bien s’énonce clairement

Oracle Diagnostician

Performance troubleshooting as exact science

Raheel's Blog

Things I have learnt as Oracle DBA

Coskan's Approach to Oracle

What I learned about Oracle

So Many Oracle Manuals, So Little Time

“Books to the ceiling, Books to the sky, My pile of books is a mile high. How I love them! How I need them! I'll have a long beard by the time I read them”—Lobel, Arnold. Whiskers and Rhymes. William Morrow & Co, 1988.

EU Careers info

Your career in the European Union

Carlos Sierra's Tools and Tips

Tools and Tips for Oracle Performance and SQL Tuning

Oracle Scratchpad

Just another Oracle weblog

OraStory

Dominic Brooks on Oracle Performance, Tuning, Data Quality & Sensible Design ... (Now with added Sets Appeal)

%d bloggers like this: