Having returned to work after some well-spent holidays, I started the New Year with quite an interesting issue. A vast index calculation process that usually takes 5 to 20 minutes to run started to behave strangely, executing in 7 to 30 hours(!). This, of course, was unacceptable, so the issue quickly became the highest priority. The process, written in PL/SQL, runs in an Oracle database. The main part of this PL/SQL procedure is a vast query that calculates and inserts the indexes.
What Happened?
In the last quarter of last year I was working on an index calculation process for national price indexes. I managed to achieve a performance that was better than initially expected (5 to 20 minutes, depending on the amount of data). After successful tests, the process was to be added to our internal application so that the appropriate team could launch it through a user-friendly UI. Then came a surprising turn: the process would take ages to run.
Road To Solution
At first, I did not believe it.
I thought it was not possible that the performance had changed so dramatically. Maybe the operation team had done something wrong, or there was an issue with the server. Therefore, I ran the process myself to check the time. As I do not have access to the application, I tested the process from SQL Developer. (This information will be crucial later.) To my great satisfaction, it took only a few minutes, so I told my team lead about it and he tested it from the application as well. My assumption was correct; the execution took only a few minutes. If only that had been the end …
It happened again.
I was shocked to hear that the performance issue had happened again. It is important to mention that my index calculation process is part of a huge weekly aggregation process that creates a new schema for each week, so the operation team can work in a separate schema containing only the data related to that period. After the performance issue occurred again in a new schema, I tested the process once more - from SQL Developer. As in the first case, it took only a few minutes for me. Then my team lead ran it through the application and it also finished in a few minutes.
As this was already the second time, I had to come up with an explanation instead of just closing the ticket again. I discovered a pattern:
- A new schema is created
- The process is launched through the application
- Performance issue happens
- I launch the process from SQL Developer
- No performance issue
- The process is launched again from the application
- No performance issue
This could have gone on and on like an infinite loop; however, I did not leave it at that.
Investigation
First, I checked whether the Oracle optimizer version was the same in both cases. It was - why would it be different? After giving it a lot of thought, the most rational idea I had was to check whether the explain plans differed. I monitored the process in the Sessions window of SQL Developer and exported the explain plan for each case. Having run some tests in different scenarios, I discovered that the explain plan was much worse when the procedure was called from the application than when it was called from SQL Developer. However, after executing the procedure from SQL Developer, the process ran at normal speed through the application as well. Monitoring this case, I noticed that even though I was running the process through the application, the explain plan was good. This is because the good explain plan had already been generated, so when the application called the procedure, the good plan was used.
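If you prefer querying the plans instead of the Sessions window, one possible way (a sketch; the LIKE filter is only a placeholder for part of the real statement text) is to look the statement up in the cursor cache and ask for the plan that was actually used:

-- find the SQL_ID and plan hash of the cached statement
SELECT sql_id, child_number, plan_hash_value
FROM   v$sql
WHERE  sql_text LIKE '%part of the index calculation query%';

-- display the plan that was actually used for that cursor
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL, 'TYPICAL'));

Comparing the plan_hash_value of the application run with that of the SQL Developer run makes the difference obvious.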
The Root Cause
The conclusion of my investigation was that we get a different explain plan depending on the executor: the application or SQL Developer. However, once the “good” explain plan has been generated, we can call the procedure from any tool and the “good” plan will be used, meaning the performance of the query will be good.
The Solution
Knowing the root cause, I consulted the DBA team, who suggested that we could force the good explain plan. This would definitely have fixed the issue; however, I was so keen to find another solution that I kept thinking. The main reason for the “bad” explain plan was the badly estimated number of rows in a few tables. It occurred to me that I could try gathering the stats before running the query. I had no idea whether it would work, but I gave it a shot. Luckily, it worked! After gathering the stats, the optimizer could estimate properly and create the good explain plan, irrespective of the executor, be it the application, SQL Developer or anything else.
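In practice this simply means adding a stats-gathering step at the beginning of the procedure, before the big query runs. A minimal sketch (the procedure, schema and table names are placeholders, not the real ones from the project):

CREATE OR REPLACE PROCEDURE calculate_indexes IS
BEGIN
  -- refresh the statistics of the tables whose row counts the optimizer misestimated
  -- (placeholder schema and table names)
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'WEEKLY_SCHEMA', tabname => 'PRICE_DATA');

  -- the vast index calculation query comes here
  NULL;
END;
/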
Probably this will remain carved into my mind forever, so next time it will be less “fun”.
You can read more about Oracle stats here and here.
You can gather stats at different levels (schema, table, index, etc.).
An example of gathering the stats for a table:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'SCHEMA',
    tabname => 'TABLE'
    -- , partname => 'PARTITION' -- you can even specify the partition if you wish so
  );
END;
/
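A similar call exists one level up; a sketch of gathering the stats for a whole schema (again with a placeholder schema name):

BEGIN
  DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'SCHEMA');
END;
/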
Explanation
I was really happy to have found the root cause and the solution; however, there was still one question haunting me: why?
My colleague in the DBA team told me that the behaviour of the Oracle optimizer can sometimes change based on the tool we use to connect to the database. There had been issues that only happened through an ODBC connection, but not in SQL*Plus, for example.
I tried to find a more satisfying answer, but I could not find a better one. At least not yet.
Looks like you might have a different session setting for one of the query optimizer parameters, which influences how Oracle goes about hard parsing and preparing a new query plan (on a query whose plan has aged out of the cache). See what is set for SHOW PARAMETER OPTIMIZER_MODE, for example, in your SQL Developer session versus the session in which your application executes.
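One way to compare the optimizer environment of the two live sessions (a sketch; it assumes access to the V$ views, and the two SIDs are placeholders):

SELECT sid, name, value
FROM   v$ses_optimizer_env
WHERE  sid IN (:sqldev_sid, :app_sid)
ORDER  BY name, sid;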
Btw, got to your article from reading your Oracle to PostgreSQL comparison. I recently went through a migration of ORCL -> PostgreSQL on a fairly large application; it was fun and worked out well!
Thanks a lot; tomorrow at work I will check out your idea. It sounds familiar, but I am not sure I checked it back then. I remember I was googling a lot. :)
Great to hear about the migration! We will probably go through it during the second half of the year. Someone suggested that the optimizer in Postgres is not as smart. Did you have any bad experiences with that?
Yes, perhaps the PostgreSQL optimizer is less sophisticated, from what I can gather, but honestly this hasn't been an issue on my project, which deals with quite a bit of data (4 TB) and lots and lots of distinct queries, mostly OLTP (so well-indexed and optimized joins to begin with; perhaps we weren't pushing the optimizer).
But imo that simplicity might actually be an advantage :) Over the years I found the Oracle query planner, although very powerful with a ton of features and configurations, also quite finicky: environmental settings (perhaps even your case in this article) or other factors contribute to unpredictable behaviour, such as the high water mark in tables that are now empty, etc. To the point where our dev team really had to take it upon themselves to always gather stats and analyze tables after any significant operation on the tables in the PL/SQL code, or to straight up add query hints on the join order, etc., not leaving it up to chance for the plan optimizer to screw things up and take the system down. This was on Oracle versions 10 and 11, so quite a few years back, but I remember many, many occasions where things (unpredictably) went awry. Here, after 1+ year of operation on PostgreSQL, not a single such instance :)
Wow, the Oracle part of your comment reads as if I had written it myself. I have only worked with Oracle for a little over 5 years, but my experience is the same. Especially in my current job, where we have a huge amount of data, talking about TBs as well. In many cases Oracle's optimizer is just "unpredictable" and depends on other settings and circumstances. As you rightly said, the same thing is described in this article.
In our processes we also make sure to gather the stats if we think it is necessary (it wasn't the case here, as there was no sign of necessity :) ), but more importantly, in almost all DML statements concerning big tables we have to use hints to force good performance. And of course, tables are partitioned and indexed.
And yes, I also have experience with very poor performance with regard to tables that are empty but are expected to hold a lot of data.
Thanks for your input about the Postgres optimizer; probably there will be no issue then. Oracle has trained us for this. :)
Thanks again, all the best.