An interesting academic situation happened last week which I, honestly believe, is worth a blog article as far as experienced DBA have spent time trying to solve it without success. An overnight job was running for hours in the night from 01/04 to 02/04. The on call DBA spent all the night killing and re-launching the job (sql_id) several attempts without any success. When I arrived at work the next day I was asked to help. As far as this job was still running, I generated the Real Time SQL monitoring report (RTSM) for the corresponding sql_id which showed the classical NESTED LOOP having a huge outer data row set driving an inner data set in which at least 50 different operations have been started 519K times while one operation has been executed 2M times. The corresponding execution plan contains 213 operations. The underlying query uses 628 user bind variables and 48 system generated bind variables (thanks to cursor sharing set to FORCE)
SQL Plan Monitoring Details (Plan Hash Value=1511784243) Global Information ------------------------------ Status : EXECUTING Instance ID : 2 Session : xxxxx (350:9211) SQL ID : dmh5vhkcm877v SQL Execution ID : 33554436 Execution Started : 04/02/2015 07:52:03 First Refresh Time : 04/02/2015 07:52:47 Last Refresh Time : 04/02/2015 10:04:28 Duration : 7947s Module/Action : wwwww Service : zzzzz Program : wwwww DOP Downgrade : 100% Global Stats =================================================================================== | Elapsed | Cpu | IO | Application | Concurrency | Buffer | Read | Read | | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Gets | Reqs | Bytes | ==================================================================================== | 7900 | 7839 | 20 | 0.00 | 0.82 | 243M | 5946 | 659MB | ====================================================================================
The above 7,839 seconds spent consuming CPU with almost no user wait time represents the classical wrong NESTED LOOP operation starting many times several inner operations as mentioned above.
The job was running without any sign of improvement, the client was waiting for its critical report and I have a query with 700 bind variables honored via an execution plan of 213 operations to figure out how to make this report finishing smoothly as soon as possible.
I was dissecting the execution plan when the end user send me an e-mail saying that the same job ran successfully yesterday within 6 min. With that information in mind I have managed to get the RTSM of the yesterday successful job. The first capital information was that the yesterday query and the today not ending one used the same plan_hash_value (same execution plan). Comparing the 628 input bind variable values of both runs, I found that the yesterday job ran for a one month period (monthly job) while the current job is running for a one day interval (daily job).Of course the end user has not supplied any information about the kind of job they are currently running compared to the previous one. All what I have been told is the yesterday job completed in 6 minutes. It is only until I’ve found the difference in the input bind variable values that the end user said “the current run is for the daily job while the previous one was for the monthly job”.
And the sun starts rising. I was able to figure out that the two set of bind variables are not doing the same amount of work and sharing the same execution plan is probably not a good idea. This is why I have suggested the DBA to do the following:
- Kill the not ending session
- Purge the sql_id from the shared pool
- Ask the end user to re-launch the job
- Cross fingers 🙂
And you know what? The job completed within a couple of hundreds of seconds:
SQL Plan Monitoring Details (Plan Hash Value=2729107228) Global Information ------------------------------ Status : DONE (ALL ROWS) Instance ID : 2 Session : xxxxx (1063:62091) SQL ID : dmh5vhkcm877v SQL Execution ID : 33554437 Execution Started : 04/02/2015 10:43:17 First Refresh Time : 04/02/2015 10:43:20 Last Refresh Time : 04/02/2015 10:47:38 Duration : 261s Module/Action : wwwww Service : zzzzz Program : wwwww Fetch Calls : 57790 Global Stats ============================================================================== | Elapsed | Cpu | IO | Concurrency | Fetch | Buffer | Read | Read | | Time(s) | Time(s) | Waits(s) | Waits(s) | Calls | Gets | Reqs | Bytes | ============================================================================== | 134 | 107 | 7.10 | 17 | 57790 | 18M | 7857 | 402MB | ==============================================================================
This is the dark side of using bind variable: when sharing resource we share also execution plan. The current daily job was running using the plan optimized for the monthly job. The solution was to force the CBO compiling a new execution plan for the new input bind variable. The new plan (2729107228) is still showing 200 operations and several operations started 578K times. I have the intention to study both execution plans to know exactly from where the enhancement is coming. The clue here might be that in the first shared monthly execution plan the query, for a reason I am unable to figure out, run serially
DOP Downgrade : 100%
While the new hard parsed execution has been executed in parallel:
Parallel Execution Details (DOP=4 , Servers Allocated=20)
Bottom Line: when you have the intention to run a critical report once per day (and once per month) then it is worth to let the CBO compiling a new execution plan for each execution. All what you will have is one hard parse for one execution. This will never hurt from a memory and CPU point of view