Although I work for VAST Data, these notes are my own personal notes and are not authoritative. They may be wrong.

VAST implements vector database functionality as part of its structured interface, VAST DataBase. This vector query capability falls under the VAST DataBase (and VAST DataEngine?) branding.

This vector database is actually implemented in two parts:

  1. Vectors are stored as a vector column type within VAST DataBase
  2. Vectors are searched through a VAST SDK that provides an ADBC driver, which offloads queries to a VAST Query Engine running inside the VAST cluster.

Developer experience

Writes/Inserts

Vectors are inserted by building a PyArrow table (pyarrow.table) and passing it to the VAST table handle’s insert() method.[1] For example,

import datetime

import pyarrow

# Assumed to exist already (see the flow below): `columns` is a pyarrow.schema
# and `table` is the VAST table handle created from that schema.
arrow_table = pyarrow.table(
    schema=columns,
    data=[
        [  # index column of int64
            1,
            2,
            3,
        ],
        [  # vector column with five-element lists of float32
            [0.732, 0.914, 0.059, 0.427, 0.106],
            [0.839, 0.245, 0.601, 0.913, 0.758],
            [0.317, 0.564, 0.129, 0.987, 0.405],
        ],
        [  # timestamp column of type timestamp
            datetime.datetime(2013,  2, 17,  8,  6),
            datetime.datetime(2016, 11,  3, 19, 42),
            datetime.datetime(2019,  7, 28, 14, 15),
        ],
    ])
# insert() belongs to the VAST table handle, not to PyArrow
table.insert(arrow_table)

The flow is as follows:

  1. vastdb.connect instantiates a session
  2. The session spawns a transaction context
  3. The transaction retrieves/creates a bucket object which scopes the table
  4. The bucket provides access to a schema namespace
  5. The schema is used to retrieve/create a table handle. PyArrow is used to define the schema if a new table is being created.
  6. PyArrow tables (pyarrow.table) are constructed with the payload and inserted via the VAST table handle’s insert() method
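The six steps above can be sketched as one function. This is a minimal sketch, not the SDK's canonical usage: the method names (transaction, bucket, create_schema, create_table, insert) are how I recall the vastdb Python SDK and should be treated as assumptions, and insert_vectors and its parameters are hypothetical names of my own. The session argument is whatever vastdb.connect returned in step 1.

```python
def insert_vectors(session, bucket_name, schema_name, table_name,
                   columns, arrow_table):
    """Walk steps 2-6 of the flow for an already-connected session.

    `columns` is a pyarrow.schema and `arrow_table` is a PyArrow table
    that matches it. Method names on the session objects are assumed.
    """
    # Step 2: all work happens inside a transaction context.
    with session.transaction() as tx:
        # Step 3: the bucket scopes everything below it.
        bucket = tx.bucket(bucket_name)
        # Step 4: create the schema namespace (this sketch always creates;
        # retrieving an existing one would use a different call).
        schema = bucket.create_schema(schema_name)
        # Step 5: create the table handle from the PyArrow schema.
        table = schema.create_table(table_name, columns)
        # Step 6: hand the PyArrow payload to the VAST table handle.
        table.insert(arrow_table)
```

The function only relies on the objects it is given, so it can be exercised with stand-in objects before pointing it at a real cluster. Against a real cluster, session would come from vastdb.connect with an endpoint and credentials.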

Reads/Selects

Vector similarity is queried via the SQL dialect exposed by VAST’s Arrow Database Connectivity (ADBC) driver.[2] A vector query looks something like this:[1]

SELECT * FROM some_table
WHERE some_column > some_criteria
ORDER BY
  array_distance(vector_column, [0.123, 0.456, 0.789]::FLOAT[3])
LIMIT 2
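To make the shape of that query concrete, here is a small sketch that assembles it from Python values. The helper names (vector_literal, nearest_query) and their parameters are hypothetical and not part of any VAST SDK; the sketch only produces the SQL text and leaves execution to whatever ADBC connection is in use. Identifiers and the WHERE clause are interpolated as-is, so it is only suitable for trusted input.

```python
def vector_literal(vec):
    """Render a sequence of numbers as a fixed-size FLOAT array literal."""
    body = ", ".join(repr(float(x)) for x in vec)
    return f"[{body}]::FLOAT[{len(vec)}]"


def nearest_query(table, vector_column, query_vec, limit,
                  where=None, distance_fn="array_distance"):
    """Build a filtered nearest-neighbor query as a SQL string."""
    lines = [f"SELECT * FROM {table}"]
    if where:
        # Optional categorical or bounded filter.
        lines.append(f"WHERE {where}")
    # Rank the surviving rows by distance to the query vector.
    lines.append(
        f"ORDER BY {distance_fn}({vector_column}, {vector_literal(query_vec)})"
    )
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)
```

For example, nearest_query("some_table", "vector_column", [0.123, 0.456, 0.789], 2, where="some_column > 10") produces a query of the same form as the one above.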

The following similarity functions are provided and offloaded to the VAST query engine:

  • Cosine distance (array_cosine_distance)
  • Euclidean distance (array_distance)
  • Negative inner product (array_negative_inner_product)

By integrating these similarity functions into SQL, queries can include both similarity and categorical or bounded criteria.
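As a reference for what these functions compute and how a combined query behaves, here is a brute-force sketch in plain Python. It assumes the functions follow their usual definitions (cosine distance being one minus cosine similarity); I have not confirmed VAST's exact semantics. The row layout and the filtered_top_k helper are hypothetical, and this is not how the VAST query engine evaluates queries.

```python
import math


def euclidean_distance(a, b):
    """Usual meaning of array_distance: L2 distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Assumed meaning of array_cosine_distance: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):
    """Negated dot product, so that smaller still means more similar."""
    return -sum(x * y for x, y in zip(a, b))


def filtered_top_k(rows, query, k, predicate, distance=euclidean_distance):
    """Mimic WHERE ... ORDER BY distance(...) LIMIT k by brute force.

    Each row is a dict with a "vec" key holding its vector (hypothetical).
    """
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(key=lambda row: distance(row["vec"], query))
    return candidates[:k]
```

All three functions are arranged so that a smaller value means a closer match, which is why each one can be used directly in an ascending ORDER BY.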

Architecture

The VAST vector database uses hierarchical clustering instead of the common HNSW approach. This allows the vector index to reside entirely on persistent storage rather than in DRAM without compromising query performance. To avoid paying the cost of clustering on every insert, the VAST solution lands new vectors in the same write buffers used to handle file I/O and row inserts, and clustering happens asynchronously afterwards.[3]
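The general idea of a clustered index fed by a write buffer can be illustrated with a toy. This is emphatically not VAST's implementation: it uses a single flat level of fixed centroids rather than a hierarchy, everything lives in memory, flush() stands in for the asynchronous clustering pass, and having queries also scan the write buffer is my own assumption. All class, method, and parameter names are hypothetical.

```python
import math


def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ClusteredIndex:
    """Toy single-level clustered index with a write buffer."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        self.clusters = [[] for _ in self.centroids]  # clustered vectors
        self.write_buffer = []                        # not yet clustered

    def insert(self, key, vec):
        # Cheap write path: no clustering work happens here.
        self.write_buffer.append((key, list(vec)))

    def flush(self):
        # Stand-in for the asynchronous pass: move each buffered vector
        # into the cluster of its nearest centroid.
        for key, vec in self.write_buffer:
            nearest = min(range(len(self.centroids)),
                          key=lambda i: _dist(vec, self.centroids[i]))
            self.clusters[nearest].append((key, vec))
        self.write_buffer = []

    def search(self, query, k, n_probe=1):
        # Probe only the n_probe clusters closest to the query, plus the
        # (assumed) brute-force scan of anything still in the buffer.
        order = sorted(range(len(self.centroids)),
                       key=lambda i: _dist(query, self.centroids[i]))
        candidates = list(self.write_buffer)
        for i in order[:n_probe]:
            candidates.extend(self.clusters[i])
        candidates.sort(key=lambda kv: _dist(kv[1], query))
        return [key for key, _ in candidates[:k]]
```

The point of the toy is the trade-off: a search only reads the clusters near the query, so the bulk of the index never needs to be resident in memory, and inserts stay cheap because the clustering work is deferred.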

Footnotes

  1. https://vast-data.github.io/data-platform-field-docs/vast_vectordb/quickstart.html

  2. https://vast-data.github.io/data-platform-field-docs/vast_vectordb/overview/overview.html#getting-started

  3. The Architecture Behind Our 11× Vector Benchmark - VAST Data