Encrypting vector embeddings prior to data ingestion (Redpanda, Cyborg)


Enterprise AI adoption faces a critical security gap. Organizations are streaming sensitive data like transaction logs, customer interactions, and proprietary metrics into vector databases for RAG and semantic search.

But here's the problem: traditional vector databases operate on vector embeddings in plaintext, creating a honeypot of concentrated organizational knowledge. A single breach can expose years of business intelligence, customer data, and trade secrets.

The stakes are especially high in regulated industries. Financial institutions processing millions of transactions, healthcare systems analyzing patient data, and government agencies handling classified information all need real-time AI capabilities. Yet current solutions force them to choose between innovation and compliance. Stream processing for AI often means exposing vectors that can be inverted to reconstruct original sensitive content.

Cyborg has partnered with Redpanda to solve this with a streaming pipeline that encrypts vectors before they're stored, enabling semantic search and RAG applications on encrypted data. No more plaintext embeddings sitting in databases waiting to be breached.

In this post, you'll learn how to add CyborgDB to your Redpanda Connect pipeline, enabling semantic search and RAG applications while keeping your vectors encrypted. We'll also highlight example use cases, security best practices, and how to deploy this powerful duo in production.

The technologies

Redpanda Connect

Think of Redpanda Connect as your Swiss Army knife for streaming data. It's a lightweight, Apache Kafka®-compatible streaming platform that moves data between systems without the operational overhead of traditional Kafka deployments. Teams love it because it starts in seconds (not minutes), uses 10x less memory, and comes with 300+ built-in connectors.

For AI workloads, Redpanda Connect shines at ingesting high-volume event streams — transaction logs, sensor data, or user interactions — and routing them to downstream processors.

CyborgDB

CyborgDB is the first vector encryption proxy that keeps your embeddings encrypted during search operations. While traditional vector databases need embeddings in plaintext to perform similarity searches (creating a security nightmare), CyborgDB uses cryptographic techniques, including forward-privacy SHA3 hashing and AES-256 symmetric encryption, to search directly on encrypted vectors. (You can read more about CyborgDB’s encryption schemes.)

Rather than storing vectors directly, CyborgDB transforms your existing database infrastructure (PostgreSQL, Redis, or other supported backends) into an encrypted vector store. This means leveraging your existing database investments and operational expertise while adding encrypted vector search capabilities. Your vectors are encrypted client-side before being persisted to your chosen backend, ensuring they remain protected at rest, in transit, and during use.

Vector embeddings aren't just random numbers — they're compressed representations of your data that can be inverted to reconstruct the original content. In regulated industries like healthcare and finance, exposed embeddings mean compliance violations and breach notifications. CyborgDB eliminates this risk.
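To make the inversion risk concrete, here is a toy sketch (illustrative only — it uses a made-up bag-of-words "embedding" over a tiny vocabulary rather than a real model): an attacker who obtains a stored plaintext vector can match it against embeddings of candidate text and recover the underlying content.

```python
import math

# Toy "embedding": bag-of-words counts over a tiny fixed vocabulary.
# Real embeddings are far higher-dimensional, but the leak works the same way.
VOCAB = ["patient", "diagnosed", "diabetes", "invoice", "overdue", "payment"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# A vector leaked from a plaintext vector database...
leaked = embed("patient diagnosed with diabetes")

# ...can be matched against embeddings of candidate sentences,
# revealing the sensitive content the vector was derived from.
candidates = [
    "invoice payment overdue",
    "patient diagnosed with diabetes",
    "payment received",
]
best = max(candidates, key=lambda c: cosine(embed(c), leaked))
print(best)
```

Encrypting the stored vectors breaks exactly this attack: without the index key, the attacker has nothing meaningful to compare candidate embeddings against.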

Note that while CyborgDB provides end-to-end encryption for stored vectors, the data flowing through Redpanda Connect itself follows standard streaming security practices. The encryption Cyborg provides kicks in when data is transformed into vectors and stored in CyborgDB.

CyborgDB + Redpanda = speed and security

The CyborgDB output in Redpanda Connect is available in both Cloud and Self-Managed deployments. Redpanda Connect handles the real-time data ingestion and transformation, while CyborgDB provides the encrypted vector storage and search. Together, they create a streaming AI pipeline that's both blazing fast and cryptographically secure.

Financial firms use this for real-time fraud detection, healthcare systems for patient monitoring, and retailers for instant personalization — all without exposing sensitive data in vector form.

How to add CyborgDB to your pipeline

Step 1. Install CyborgDB

First, set up CyborgDB using Docker:

```shell
# Pull and run CyborgDB with PostgreSQL backend
docker run -d -p 8000:8000 \
  -e CYBORGDB_DB_TYPE=postgres \
  -e CYBORGDB_CONNECTION_STRING="host=postgres port=5432 dbname=cyborgdb user=cyborgdb password=secure_password" \
  -e CYBORGDB_API_KEY="your_cyborgdb_api_key" \
  cyborginc/cyborgdb-service:latest

# Or with Redis backend
docker run -d -p 8000:8000 \
  -e CYBORGDB_DB_TYPE=redis \
  -e CYBORGDB_CONNECTION_STRING="host=redis,port=6379,db=0" \
  -e CYBORGDB_API_KEY="your_cyborgdb_api_key" \
  cyborginc/cyborgdb-service:latest
```

Generate a secure encryption key for your index:

```shell
# Quick start: Generate a 32-byte key and encode as base64
export CYBORGDB_INDEX_KEY=$(openssl rand -base64 32)
echo "Save this key securely: $CYBORGDB_INDEX_KEY"
```

Important: For production deployments, we strongly recommend using a Key Management Service (KMS) instead of storing raw keys. Redpanda Connect supports integration with AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, and other KMS providers. See Redpanda's secrets management documentation for configuration details.
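If you generate keys from a provisioning script rather than the shell, the same 32-byte base64 key can be produced with Python's standard library — a small sketch, equivalent to the openssl command above:

```python
import base64
import secrets

# Generate 32 cryptographically secure random bytes and base64-encode them,
# matching the format expected for CYBORGDB_INDEX_KEY.
key = base64.b64encode(secrets.token_bytes(32)).decode("ascii")

# The decoded key must be exactly 32 bytes long.
assert len(base64.b64decode(key)) == 32
print(key)
```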

Step 2. Configure the CyborgDB output

Add CyborgDB to your existing Redpanda Connect pipeline. Here's a complete example showing how to stream data with embeddings into encrypted storage:

```yaml
# Your existing input and processors remain unchanged
input:
  kafka:
    addresses: ["localhost:9092"]
    topics: ["your_topic"]
    consumer_group: "your_consumer_group"

pipeline:
  processors:
    # Your existing processors...
    - label: "generate_embedding"
      # Your embedding generation logic

# Add CyborgDB as the output for encrypted vector storage
output:
  cyborgdb:
    host: "localhost:8000"                   # CyborgDB service endpoint
    api_key: "${CYBORGDB_API_KEY}"           # Your CyborgDB API key
    index_name: "production_vectors"         # Name for your encrypted index
    index_key: "${CYBORGDB_INDEX_KEY}"       # Base64-encoded 32-byte encryption key
    create_if_missing: true                  # Auto-create index on first write
    operation: "upsert"                      # upsert or delete

    # Map your document ID
    id: "${! json(\"id\") }"                 # Extract ID from your message

    # Map your embedding vector
    vector_mapping: "root = this.embedding"  # Path to embedding array

    # Optional: Include metadata for filtering
    metadata_mapping: |
      root = {
        "timestamp": this.timestamp,
        "category": this.category,
        "user_id": this.user_id
      }

    # Batching for optimal performance
    batching:
      count: 100     # Batch size
      period: "1s"   # Max wait time
```

Configuration options explained

Essential settings:

  • host: Your CyborgDB service endpoint
  • api_key: Authentication key from cyborg.co
  • index_name: Unique name for your encrypted vector collection
  • index_key: Base64-encoded 32-byte encryption key (use the Redpanda Secrets Guide)

Data mappings:

  • id: Unique identifier for each vector (required)
  • vector_mapping: Bloblang expression to extract the embedding array
  • metadata_mapping: Optional metadata for filtering during search

Performance tuning:

  • batching.count: Number of vectors to batch (100-500 recommended)
  • batching.period: Maximum time to wait before sending a partial batch
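The mappings above are Bloblang and run inside Redpanda Connect, but it can help to sanity-check what they extract from your message shape. This Python sketch mirrors the same extraction on a sample message (the field names and values are illustrative, not part of the connector API):

```python
import json

# A sample message as it might arrive on the input topic (illustrative fields).
message = json.loads("""
{
  "id": "evt-1042",
  "embedding": [0.12, -0.03, 0.88],
  "timestamp": "2025-01-15T10:00:00Z",
  "category": "transactions",
  "user_id": "u-77"
}
""")

# What the output's mappings would pull out of this message:
doc_id = message["id"]           # id: ${! json("id") }
vector = message["embedding"]    # vector_mapping: root = this.embedding
metadata = {                     # metadata_mapping
    "timestamp": message["timestamp"],
    "category": message["category"],
    "user_id": message["user_id"],
}

print(doc_id, len(vector), metadata["category"])
```

If a field referenced by a mapping is missing from the message, the write for that document will fail, so it pays to validate your producer's schema against these paths before going to production.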

Use case examples

Fraud detection pipeline:

```yaml
output:
  cyborgdb:
    host: "${CYBORGDB_HOST}"
    api_key: "${CYBORGDB_API_KEY}"
    index_name: "fraud_patterns"
    index_key: "${FRAUD_INDEX_KEY}"
    operation: "upsert"
    id: "${! json(\"transaction_id\") }"
    vector_mapping: "root = this.transaction_embedding"
    metadata_mapping: |
      root = {
        "amount": this.amount,
        "merchant_category": this.merchant_category,
        "risk_score": this.risk_score
      }
```

RAG document pipeline:

```yaml
output:
  cyborgdb:
    host: "${CYBORGDB_HOST}"
    api_key: "${CYBORGDB_API_KEY}"
    index_name: "knowledge_base"
    index_key: "${KB_INDEX_KEY}"
    operation: "upsert"
    id: "${! json(\"doc_id\") }"
    vector_mapping: "root = this.content_embedding"
    metadata_mapping: |
      root = {
        "source": this.source,
        "department": this.department,
        "last_updated": this.timestamp,
        "access_level": this.access_level
      }
```

Performance and security considerations

Performance metrics

Cyborg sees these numbers in production deployments:

  • Throughput: 50,000+ vectors/second with proper batching
  • Encryption overhead: <1% latency increase vs. plaintext storage
  • Search latency: Sub-10ms for similarity search on millions of encrypted vectors
  • Index size: ~1.2x the size of an unencrypted index with the same configuration parameters

Security best practices

Key management:

  • Generate unique 32-byte keys for each index
  • Store keys in secure key management systems (AWS KMS, HashiCorp Vault, Azure Key Vault)
  • Never commit keys to version control
  • Implement key rotation policies for long-lived indexes
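As a concrete example of the first two points, the index key can live in a secrets manager rather than on disk. The sketch below uses AWS Secrets Manager via the AWS CLI (the secret name is made up; your Redpanda Connect deployment may instead resolve the secret natively through its KMS integration):

```shell
# Store a freshly generated index key in AWS Secrets Manager (one-time setup).
aws secretsmanager create-secret \
  --name prod/cyborgdb/index-key \
  --secret-string "$(openssl rand -base64 32)"

# Fetch it at deploy time instead of keeping the raw key in an env file.
export CYBORGDB_INDEX_KEY=$(aws secretsmanager get-secret-value \
  --secret-id prod/cyborgdb/index-key \
  --query SecretString --output text)
```

With this setup, rotating the key becomes an update to the secret plus a re-encryption of the index, and the raw key never needs to appear in version control or shell history.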

Network security:

  • Use TLS for all connections to CyborgDB
  • Deploy CyborgDB within your VPC/private network
  • Implement API key rotation schedules

Compliance benefits:

  • Vectors remain encrypted at rest in the database
  • Vectors remain encrypted during search operations
  • No plaintext exposure in logs, caches, or memory dumps
  • Meets HIPAA, GDPR, and SOC 2 requirements for data encryption

Production deployment with Docker Compose

For production environments, you can deploy both services together:

```yaml
version: '3.8'

services:
  cyborgdb:
    image: cyborginc/cyborgdb-service:latest
    ports:
      - "8000:8000"
    environment:
      - CYBORGDB_DB_TYPE=postgres
      - CYBORGDB_CONNECTION_STRING=host=postgres port=5432 dbname=cyborgdb user=cyborgdb password=${DB_PASSWORD}
      - CYBORGDB_API_KEY=${CYBORGDB_API_KEY}
      - SSL_CERT_PATH=/certs/server.crt  # For HTTPS
      - SSL_KEY_PATH=/certs/server.key
    volumes:
      - ./certs:/certs
    depends_on:
      - postgres

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=cyborgdb
      - POSTGRES_USER=cyborgdb
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redpanda-connect:
    image: redpandadata/connect:latest
    volumes:
      - ./pipeline.yaml:/pipeline.yaml
    command: run /pipeline.yaml
    environment:
      - CYBORGDB_API_KEY=${CYBORGDB_API_KEY}
      - CYBORGDB_INDEX_KEY=${CYBORGDB_INDEX_KEY}
    depends_on:
      - cyborgdb

volumes:
  postgres_data:
```

Build secure pipelines with CyborgDB in Redpanda Connect

Cyborg and Redpanda have created a streaming pipeline that solves a critical enterprise need: real-time AI that keeps your vectors encrypted even during search operations. By adding CyborgDB to your Redpanda Connect pipeline, you can finally deploy AI in regulated environments without compromising on security or performance.

The integration is straightforward: add the CyborgDB output to your existing pipeline configuration, generate an encryption key, and your vectors are automatically encrypted before storage and stay encrypted in use. Your compliance team gets the security they need, your engineering team gets a simple integration, and your data scientists get the real-time AI capabilities they want.

Ready to secure your streaming AI pipeline? Join the Redpanda Community Slack to discuss your use case, or get your CyborgDB API key to start building. Questions about compliance or enterprise features? Contact the Cyborg team.
