ThalamusDB is an approximate processing engine supporting SQL queries extended with semantic operators on multimodal data. Find the full ThalamusDB documentation here: https://itrummer.github.io/thalamusdb/.
To get a first impression of ThalamusDB, try it on Google Colab here. Execute the code cell, enter your OpenAI API key when asked, then enter your queries in the ThalamusDB console.
Install ThalamusDB using pip:
ThalamusDB can use language models from various providers, including OpenAI and Google. Store the access key of the provider you plan to use in an environment variable. For instance, if using OpenAI, set the OPENAI_API_KEY environment variable using the following command on Linux platforms:
Now you can run the ThalamusDB console using the following command:
For instance, try out the example database in this repository:
The cars database contains a single table with the following schema:
The description column contains a text description of images, and the pic column contains the path to the associated image file. Run the following command in the ThalamusDB console to see the picture paths:
You will see relative paths of JPEG images, located in the images sub-folder. Now, you can try semantic queries such as the following:
After less than a minute, ThalamusDB should produce the correct answer (1). You may try more complex queries that require a certain degree of commonsense knowledge to evaluate, e.g.:
ThalamusDB supports other semantic operators beyond simple filters and performs semantic analysis on audio files as well as text. Consult the ThalamusDB documentation for more details.
ThalamusDB operates on a standard DuckDB database and supports semantic operators on three types of unstructured data: text, images, and audio files.
To represent images, create a column of SQL type text in your table and store paths to images. ThalamusDB automatically recognizes the most common image file formats (PNG, JPEG, JPG) and treats table cells containing paths to such files as images. Similarly, to represent audio data, include paths to audio files (WAV or MP3 files) in a text column.
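The recognition rule described above can be pictured as a check on the file extension of each text cell. The following is a minimal sketch, not ThalamusDB's actual implementation: the function name `guess_modality` is hypothetical, and the case-insensitive comparison is an assumption.

```python
from pathlib import Path

# Extensions named in the text above.
IMAGE_EXTENSIONS = {".png", ".jpeg", ".jpg"}
AUDIO_EXTENSIONS = {".wav", ".mp3"}


def guess_modality(cell: str) -> str:
    """Classify a text cell as 'image', 'audio', or 'text' by its extension.

    Hypothetical helper; comparing extensions case-insensitively is an assumption.
    """
    suffix = Path(cell.strip()).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return "text"


for value in ["images/car1.jpeg", "clips/engine.mp3", "A red sports car"]:
    print(value, "->", guess_modality(value))
```

Cells that do not end in one of the listed extensions are treated as plain text in this sketch.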
ThalamusDB supports SQL queries with semantic filter predicates. Specifically, ThalamusDB supports two types of semantic filters (both must appear in the SQL WHERE clause):
| Operator | Semantics |
| --- | --- |
| NLfilter([Column], [Condition]) | Filters rows based on a condition in natural language |
| NLjoin([Column in Table 1], [Column in Table 2], [Condition]) | Filters row pairs using the join condition in natural language |
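To make the operator syntax concrete, the sketch below assembles query strings that follow the two patterns in the table. The helper names (`sql_literal`, `nl_filter`, `nl_join`) are hypothetical, as are the table names `cars`, `reviews`, and the column `reviews.body`; passing the condition as a single-quoted SQL string is also an assumption.

```python
def sql_literal(text: str) -> str:
    """Render a Python string as a single-quoted SQL literal (doubling quotes)."""
    return "'" + text.replace("'", "''") + "'"


def nl_filter(column: str, condition: str) -> str:
    """Build an NLfilter predicate for the WHERE clause."""
    return f"NLfilter({column}, {sql_literal(condition)})"


def nl_join(left_column: str, right_column: str, condition: str) -> str:
    """Build an NLjoin predicate for the WHERE clause."""
    return f"NLjoin({left_column}, {right_column}, {sql_literal(condition)})"


# Hypothetical queries; table and column names are illustrative only.
filter_query = (
    "SELECT COUNT(*) FROM cars WHERE "
    + nl_filter("pic", "the car in the picture is red")
    + ";"
)
join_query = (
    "SELECT * FROM cars, reviews WHERE "
    + nl_join("cars.pic", "reviews.body", "the review describes the pictured car")
    + ";"
)
print(filter_query)
print(join_query)
```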
ThalamusDB works with models from various providers. Users specify which models to use for which data types in a model configuration file. The configuration file also lets users set model parameters per operator (e.g., the temperature parameter or reasoning_effort). You can find an example configuration file in this repository at config/models.json.
The model configuration file contains a dictionary with a single field, models, that stores a list of model configurations. Each list entry is a dictionary with three fields:
- modalities: a list of data modalities the model can process (a subset of "text", "image", and "audio").
- priority: if multiple models can be used to serve a request, ThalamusDB prefers the ones with higher priority.
- kwargs: describes the parameter settings used for each semantic operator (parameters include the model ID).
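The role of the modalities and priority fields can be illustrated with a small selection routine. This is a sketch of the rule as stated above, not ThalamusDB's own code: the function `pick_model`, the `"model"` key, and the model names are hypothetical.

```python
def pick_model(models: list[dict], modality: str) -> dict:
    """Return the highest-priority model entry that supports the modality."""
    candidates = [m for m in models if modality in m["modalities"]]
    if not candidates:
        raise ValueError(f"no configured model supports modality {modality!r}")
    return max(candidates, key=lambda m: m["priority"])


# Hypothetical entries; only the three documented fields are used.
models = [
    {"modalities": ["text"], "priority": 5,
     "kwargs": {"filter": {"model": "model-a"}, "join": {"model": "model-a"}}},
    {"modalities": ["text", "image"], "priority": 10,
     "kwargs": {"filter": {"model": "model-b"}, "join": {"model": "model-b"}}},
]

chosen = pick_model(models, "text")
print(chosen["kwargs"]["filter"]["model"])
```

Both entries can process text here, so the entry with the higher priority value is chosen.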
The kwargs field is a dictionary that contains two fields: filter and join. Each field contains the settings (mapping from parameter names to values) that are used when calling the language model for the corresponding semantic operator (semantic filter or join). The following entry is an example model configuration, setting up both semantic operators to use the GPT-5 Mini model:
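As an illustration of that structure, the sketch below builds a configuration as a Python dictionary and serializes it to JSON. The parameter key `"model"`, the identifier `"gpt-5-mini"`, the `reasoning_effort` value, and the priority value are assumptions made for illustration; consult config/models.json for the exact names.

```python
import json

# Settings passed to the language model for each operator (assumed key names).
operator_settings = {"model": "gpt-5-mini", "reasoning_effort": "minimal"}

model_entry = {
    "modalities": ["text", "image"],  # subset of "text", "image", "audio"
    "priority": 10,                   # higher values are preferred
    "kwargs": {
        "filter": dict(operator_settings),  # used for semantic filters
        "join": dict(operator_settings),    # used for semantic joins
    },
}

# The top-level dictionary has a single field, "models".
config = {"models": [model_entry]}
config_json = json.dumps(config, indent=2)
print(config_json)
```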
ThalamusDB is designed for approximate processing. During query processing, ThalamusDB periodically displays approximate results. These results are calculated based on evaluating semantic operators on a subset of the data. When displaying approximate results, ThalamusDB distinguishes two query types:
- Aggregation queries: These produce a single result row with one or more numerical aggregates. For such queries, ThalamusDB displays lower and upper bounds for the possible values of each aggregate.
- Retrieval queries: All other queries are considered retrieval queries, producing possibly multiple result rows with possibly non-numeric cells. For such queries, ThalamusDB displays rows that appear in all possible results.
In both cases, ThalamusDB obtains possible results by replacing the values of unevaluated semantic predicates with True or False. To give users a sense of how far the current approximation is from an exact result, ThalamusDB calculates an error bound. Once the error reaches zero, the result is exact.
- For aggregation queries, the error is the sum of differences between lower and upper aggregates, summing over all query aggregates.
- For retrieval queries, denoting by max_rows the maximal number of rows in any possible result and by intersection_rows the number of rows that appear in all possible results, the error is calculated as max_rows/intersection_rows - 1 (0 if max_rows=intersection_rows=0).
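The two error definitions can be worked through on a small example. The sketch below is an illustration under stated assumptions, not ThalamusDB's code: the function names are hypothetical, the bounds are shown for a COUNT aggregate only, and returning infinity when intersection_rows is 0 but max_rows is positive is an assumption, since the text only defines the case where both are 0.

```python
from typing import Optional, Sequence


def count_bounds(predicate_values: Sequence[Optional[bool]]) -> tuple[int, int]:
    """Bounds on COUNT(*) when some predicates are unevaluated (None).

    Lower bound: set all unevaluated predicates to False.
    Upper bound: set all unevaluated predicates to True.
    """
    lower = sum(1 for v in predicate_values if v is True)
    unknown = sum(1 for v in predicate_values if v is None)
    return lower, lower + unknown


def aggregation_error(bounds: Sequence[tuple[float, float]]) -> float:
    """Sum of (upper - lower) over all aggregates in the query."""
    return sum(upper - lower for lower, upper in bounds)


def retrieval_error(max_rows: int, intersection_rows: int) -> float:
    """max_rows / intersection_rows - 1, with 0 when both counts are 0."""
    if max_rows == 0 and intersection_rows == 0:
        return 0.0
    if intersection_rows == 0:
        return float("inf")  # assumption: this case is not defined in the text
    return max_rows / intersection_rows - 1


# Five rows: two satisfy the filter, one does not, two are not yet evaluated.
bounds = count_bounds([True, False, None, None, True])
print(bounds)                       # (2, 4)
print(aggregation_error([bounds]))  # 2
print(retrieval_error(4, 2))        # 1.0
```

Once every predicate is evaluated, the lower and upper bounds coincide and both error measures drop to zero.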
You can configure stopping criteria for query execution. If any of the stopping criteria are satisfied, ThalamusDB terminates execution with the current approximate query result.
The following properties are available to define stopping criteria:
| Property | Semantics | Default |
| --- | --- | --- |
| max_seconds | Maximal number of seconds for query execution | 600 |
| max_calls | Maximal number of calls to the LLM | 100 |
| max_tokens | Maximal number of input and output tokens | 1000000 |
| max_error | Terminate once the error is below this threshold | 0.0 |
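How these properties combine can be sketched as a single check that fires as soon as any one criterion is met. The class `StoppingCriteria` and the function `should_stop` are hypothetical, and the exact comparisons (`>=` for the budgets, `<=` for the error, so that an exact result with error 0 stops execution under the default threshold) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class StoppingCriteria:
    """Defaults taken from the table above."""
    max_seconds: float = 600
    max_calls: int = 100
    max_tokens: int = 1_000_000
    max_error: float = 0.0


def should_stop(criteria: StoppingCriteria, seconds: float, calls: int,
                tokens: int, error: float) -> bool:
    """Return True if any stopping criterion is satisfied (assumed comparisons)."""
    return (
        seconds >= criteria.max_seconds
        or calls >= criteria.max_calls
        or tokens >= criteria.max_tokens
        or error <= criteria.max_error
    )


criteria = StoppingCriteria(max_calls=50)
print(should_stop(criteria, seconds=12.0, calls=20, tokens=4_000, error=0.5))
print(should_stop(criteria, seconds=30.0, calls=50, tokens=9_000, error=0.2))
```

In the second call, the LLM call budget is exhausted, so execution would end with the current approximate result even though the error is still above the threshold.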
You can set each of these properties using the following command: