
Oracle’s Autonomous AI Database runs on Exadata with built-in AI functionality. For vector processing in the database, you can choose your LLM and add VECTOR columns, in several different dimension formats, to any database table alongside your existing business data. But how much storage do these vector datatypes occupy in the database, and how well does Exadata Hybrid Columnar Compression (HCC) work with them? I’ll present one comparison across HCC levels on an Autonomous Database table containing both scalar and vector datatypes.
Choosing an Existing ONNX Model
Oracle provides a number of existing ONNX-format machine learning models that you can use for in-database vector processing. A loaded ONNX model can be used to create VECTOR column data alongside your existing business data. For this scenario, I’m loading the model all_MiniLM_L12_v2.onnx using a code snippet similar to this.
-- loading the ONNX model
declare
  onnx_model_file varchar2(100) := 'all_MiniLM_L12_v2.onnx';
  modname         varchar2(500) := 'ALL_MINILM_L12_V2';
begin
  dbms_vector.load_onnx_model(
     model_name => modname
    ,directory  => 'DATA_PUMP_DIR'
    ,file_name  => onnx_model_file);
end;
/
More details on loading and using this ONNX model for similarity searches are in my previous post on creating and using vectors in Oracle Database 23ai. Once loaded, I can use the model to create vectors for existing table data and run similarity searches against text prompts in a chatbot.
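To give an idea of what that looks like, here is a minimal similarity-search sketch; the prompt text and the five-row limit are purely illustrative, and it assumes the MY_PARTS table created in the next section.
-- minimal similarity-search sketch (illustrative prompt; assumes the
-- MY_PARTS table created in the next section)
select p_partkey, p_name, p_color, p_type
from my_parts
order by vector_distance(
           p_vec
          ,vector_embedding(all_minilm_l12_v2 using 'green metallic container' as data)
          ,cosine)
fetch first 5 rows only;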
Source Data
I’m using the PART table in the SSB schema (provided with every Autonomous 26ai Database), which has 2,000,000 rows of fabricated part data from a parts distributor. The SSB schema tables are also useful for testing Exadata offloading and in-memory performance. I’m creating a table with a subset of the rows and columns from SSB.PART, along with a vector column built from several of those columns. The full CREATE TABLE and INSERT statements for populating my temporary MY_PARTS table are as follows:
drop table if exists my_parts purge;

create table my_parts
(
   p_partkey    number
  ,p_name       varchar2(22)
  ,p_mfgr       varchar2(6)
  ,p_brand1     varchar2(9)
  ,p_color      varchar2(11)
  ,p_type       varchar2(25)
  ,p_size       number
  ,p_container  varchar2(10)
  ,p_vec        vector(384,float32)
)
--nocompress
--column store compress for query low
--column store compress for query high
;

insert /*+ append enable_parallel_dml parallel(2) */
into my_parts
  (p_partkey, p_name, p_mfgr, p_brand1, p_color,
   p_type, p_size, p_container
  ,p_vec
  )
select p_partkey, p_name, p_mfgr, p_brand1, p_color,
       p_type, p_size, p_container
      ,vector_embedding(
         all_minilm_l12_v2 using
         (p_name || ' ' || p_mfgr || ' ' || p_brand1 || ' '
          || p_color || ' ' || p_type || ' ' || p_size || ' '
          || p_container) as data)
from ssb.part
fetch first 50000 rows only
;
commit;
I ran this script a total of nine times: three compression levels (no compression, HCC query low, HCC query high) against three sets of table columns (only the scalar columns, only the vector column, and all columns). The vector-related lines are the P_VEC column definition in the CREATE TABLE and the VECTOR_EMBEDDING call in the INSERT. The goal was to get an idea of how much a vector column would or would not compress at the Exadata HCC query low and query high compression levels.
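The “Total Bytes” figures below reflect the size of the table segment after each load. My exact measurement query isn’t shown, but a minimal sketch along these lines, assuming the size is read from USER_SEGMENTS after each run, reports the same information.
-- minimal sketch: report the size of the MY_PARTS segment after each load
-- (assumption: sizes read from USER_SEGMENTS; not necessarily the exact
--  query used for the figures below)
select segment_name, segment_type, bytes
from user_segments
where segment_name = 'MY_PARTS';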
Compression Results
The compression results across all three column sets were mostly as expected.
| Column Type(s) | Compression Type | Total Bytes | Size Reduction |
|---|---|---|---|
| NUMBER/VARCHAR | NOCOMPRESS | 5,177,344 | |
| NUMBER/VARCHAR | HCC QUERY LOW | 1,638,400 | 68.4% |
| NUMBER/VARCHAR | HCC QUERY HIGH | 1,048,576 | 79.7% |
| VECTOR(384,FLOAT32) | NOCOMPRESS | 104,005,632 | |
| VECTOR(384,FLOAT32) | HCC QUERY LOW | 84,082,688 | 19.2% |
| VECTOR(384,FLOAT32) | HCC QUERY HIGH | 77,791,232 | 25.2% |
| BOTH | NOCOMPRESS | 104,005,632 | |
| BOTH | HCC QUERY LOW | 79,888,384 | 23.2% |
| BOTH | HCC QUERY HIGH | 78,839,808 | 24.2% |
The dense VECTOR column compressed surprisingly well given the mix of data in the columns, though not nearly as well as the NUMBER and VARCHAR2 columns, which held repetitive and highly compressible content. Mixing both column types increased the overall compression ratio with QUERY LOW, driven primarily by the non-VECTOR columns, while with QUERY HIGH the ratio actually decreased slightly.
Because I’m going to create vector indexes for this table in a subsequent post, I explicitly created the P_VEC column as VECTOR(384,FLOAT32) to match the format of the vectors generated by the ALL_MINILM_L12_V2 ONNX model. That means each vector value is 384 * 4 = 1,536 bytes plus overhead, so 50,000 rows hold roughly 76.8 MB of raw vector data, which is close to the compressed segment sizes above. At most about 640 vectors will fit in an HCC compression unit (CU), and unless the part data contains a very high amount of duplication, the number of identical vectors in the dataset will be low, limiting how much compression a FLOAT32 VECTOR column with that many dimensions can achieve.
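As a quick sanity check that the stored vectors really are 384-dimension FLOAT32 values, something like the following can be run against the table; this is illustrative only, using the 23ai vector dimension functions, and is not part of the compression test itself.
-- sanity check: confirm dimension count and storage format of a stored vector
-- (illustrative only; not part of the compression measurements)
select vector_dimension_count(p_vec) as dims
      ,vector_dimension_format(p_vec) as fmt
from my_parts
fetch first 1 row only;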
Conclusions and Next Steps
Augmenting business data with vector columns for similarity searches has many uses, but don’t expect Exadata HCC compression to save much storage on dense embedding columns. However, will Exadata smart scans kick in when these VECTOR columns appear in a WHERE clause? And will either type of vector index (HNSW or IVF) benefit from Exadata features? I’ll explore those two questions in subsequent posts.
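As a preview of the first question, the kind of statement I’ll be testing looks something like this; it is an illustrative sketch only, with an arbitrary prompt and distance threshold, and whether any of it offloads to Exadata storage is exactly the open question.
-- illustrative only: a VECTOR column used in a WHERE clause
-- (arbitrary prompt and threshold; smart scan behavior to be tested)
select p_partkey, p_name
from my_parts
where vector_distance(
        p_vec
       ,vector_embedding(all_minilm_l12_v2 using 'small polished copper drum' as data)
       ,cosine) < 0.5;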