An alternate title for this post: “is it a bug or a feature”? I was setting up a 21c database on Exadata when I ran into a GROUP BY bug with an existing released query. Fortunately I was able to reproduce the issue with a very compact test case — and best of all, no data in the tables!
The bug does not show itself in 19c, and there’s a good reason for that. There are a couple of workarounds for the issue until a bug report is filed and the issue fixed in a future RU for 21c.
Test Environment
My 21c (innovation release!) database is at RU 21.5 from January 2022. The environment where the query worked fine is 19c (19.14 from January 2022). No fix controls have been enabled with DBMS_OPTIM_BUNDLE in either environment, and while I considered doing that as part of my investigation, it turns out that it was related to an already enabled fix control.
The setup to reproduce the issue was distilled into a few tables, a query with MAX and GROUP BY, and no test data is required.
create table t1 (c1 number);
create table t2 (c1 number);
create table t3 (c1 number, c2 number);
create table t4 (c1 number, c2 varchar2(1));
select t1.c1,
max (CASE
WHEN t3.c2 IS NULL THEN NULL -- t3.c2 is NUMBER
WHEN t4.c2 IS NULL THEN 1 -- t4.c2 is VARCHAR2
ELSE 0
END)
from
t1
inner join t2 on t1.c1=t2.c1
left outer join t3 on t1.c1=t3.c1
left outer join t4 on t3.c1=t4.c1
group by t1.c1;
Error at Command Line : 6 Column : 2
Error report -
SQL Error: ORA-00932: inconsistent datatypes: expected CHAR got NUMBER
I’ve seen ORA-00932 messages in previous database versions from very complex aggregate queries using GROUP BY GROUPING SETS, but never from anything this simple. The query did not generate the error under 19c.
Workaround #1: Datatype Adjustments
The error message implies that I have a type mismatch, since my MAX with the CASE statement evaluates columns with two different datatypes. When I changed the result of the CASE statement to return a character constant instead of a numeric in lines 4 and 5, the error message is gone:
select t1.c1,
max (CASE
WHEN t3.c2 IS NULL THEN NULL -- t3.c2 is NUMBER
WHEN t4.c2 IS NULL THEN '1' -- t4.c2 is VARCHAR2
ELSE '0'
END)
from
t1
inner join t2 on t1.c1=t2.c1
left outer join t3 on t1.c1=t3.c1
left outer join t4 on t3.c1=t4.c1
group by t1.c1;
no rows selected
Yes, it’s a possible workaround, but the statement worked before in 19c as expected. It’s a stretch to think that the optimizer in 21c is enforcing some obscure rule of ANSI SQL regarding datatype conversion.
Workaround #2: Fix Control Analysis
Since I could solve the issue by setting OPTIMIZER_FEATURES_ENABLE to ‘19.1.0’, I suspected that there was a more specific parameter or fix control that might be behind this anomaly. Using a tool such as Mauro Pagano’s Pathfinder or something similar, I could automatically enable and disable most hidden parameters and fix controls under 21c and see which of those changes might bypass an optimizer code path and allow my query to parse with no errors.
After only a couple of hours, the culprit was identified. Here is the entry in V$SYSTEM_FIX_CONTROL:
select * from v$system_fix_control where bugno='30235691';

As I expected, the issue is definitely related to aggregates, and by disabling this fix control the query parses. In this case I disabled the fix control at the hint level, but it can also be disabled system wide:
select /*+ OPT_PARAM('_fix_control' '30235691:0') */ t1.c1,
. . .
no rows selected
Problem solved? Perhaps. Notice that this optimizer fix control was enabled by default starting with 20c (the version of Oracle Database that most of us barely knew) and is now causing errors with existing code.
Next Steps
I’ll be filing a bug report shortly. Further regression testing will determine if the fix from 20c was merely a query rewrite optimization or fixed an actual optimizer bug (ORA- error or incorrect results). Stay tuned!
Addendum 1, 2/16/2022
Is Oracle Database 21c self-healing? Well, not in the way that Oracle Autonomous does. What happened the next day is that I ran the problematic query again, and even with the fix control enabled, it was working with no ORA-00932.
select t1.c1,
max (CASE
WHEN t3.c2 IS NULL THEN NULL -- t3.c2 is NUMBER
WHEN t4.c2 IS NULL THEN 1 -- t4.c2 is VARCHAR2
ELSE 0
END)
from
t1
inner join t2 on t1.c1=t2.c1
left outer join t3 on t1.c1=t3.c1
left outer join t4 on t3.c1=t4.c1
group by t1.c1;

Nothing had changed since the day before, since this was a copy of a database that no one else was using. Well, some things had changed… during the default maintenance window, 3 jobs ran. But nothing like auto-index creation.

It was the stats collection job that “solved” the problem. With or without data, once statistics were collected on the table, the optimizer parse phase worked and generated the plan I expected.

To verify this behavior, I dropped and recreated the tables, and the ORA-00932 appeared again. When I gathered statistics immediately, the ORA-00932 error disappeared again.

There is still a new (or old) bug somewhere, but knowing that good statistics can bring out the best in the Oracle optimizer, I’ll treat that as a truce for now. But I’m still going to file the bug report!