Semantic search with Drupal and Typesense

Typesense search results with InstantSearch

Following insights from a Search API Typesense maintainer and this previous post about Typesense full text + faceted search, this one will compare lexical, semantic and hybrid search before showing how to configure Search API Typesense for RAG (Retrieval-Augmented Generation).

Typesense is a fast alternative to Solr (~10x faster) and a cost-effective, open-source alternative to Algolia, with both self-hosted and cloud options. It also covers the ground of Pinecone: Typesense functions as both a search engine and a vector database.

The API integrates with many languages and frameworks like LangChain, Symfony, Laravel, and Drupal.

The project has a public roadmap and aims to provide a new release every 3 months.

Unlike Solr, it doesn't come with a dashboard/GUI out of the box, but a contributed project provides this functionality.

The main difference with other Search API implementations is that it fully skips Drupal for querying (similar to using the Solarium library for Solr). This means Search API is only used for backend indexing, not frontend querying. While this prevents using Views out of the box, the benefits are improved performance and the ability to easily implement soft-decoupled or fully decoupled UIs.

Typesense use cases

We already explored facets and full-text (keyword/lexical) search in the previous post. Now let's look at how Typesense delivers semantic and hybrid search.

Keyword search:

  • Searches for exact word matches, with typo tolerance
  • Finds documents containing specific terms and proper nouns

Semantic search:

  • Uses embeddings to understand meaning and context
  • Surfaces synonyms, related concepts and similarities

Both searches run on the same query:

  • Keyword search returns results ranked by lexical relevance
  • Vector search returns results ranked by semantic similarity

Benefits

  • Keyword search misses synonyms and related concepts → Semantic search fills this gap
  • Semantic search can be "too creative" and can miss exact matches → Keyword search ensures precision
  • Vector search struggles with proper nouns, acronyms or very specific terms → Keyword search handles these well
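Hybrid search gets the best of both by merging the two ranked lists. Typesense documents that it does this with weighted rank fusion; the sketch below illustrates the idea only (the weighting scheme and function name are illustrative, not Typesense's actual implementation):

```python
def rank_fusion(keyword_ids, vector_ids, alpha=0.3):
    """Merge two ranked result lists using weighted reciprocal ranks.

    alpha weighs the vector (semantic) ranking; (1 - alpha) weighs the
    keyword ranking. Documents appearing in both lists accumulate score.
    """
    scores = {}
    for rank, doc_id in enumerate(keyword_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) * (1 / rank)
    for rank, doc_id in enumerate(vector_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha * (1 / rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ids: "doc3" ranks low lexically but first semantically,
# so it climbs in the merged ordering.
merged = rank_fusion(["doc1", "doc2", "doc3"], ["doc3", "doc4"])
```

A document that only matches semantically (like "doc4" above) still surfaces, while exact lexical matches keep their head start.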

Keyword, semantic, hybrid search demo

Under the hood, a query for e.g. "firefox" is sent with the following parameters (a single word is used here for easier comparison).

Keyword

{
	"searches": [
		{
			"collection": "my-collection",
			"exclude_fields": "embedding",
			"facet_by": "by",
			"highlight_full_fields": "text",
			"max_facet_values": 20,
			"page": 1,
			"per_page": 15,
			"q": "firefox",
			"query_by": "text"
		}
	]
}

Semantic

{
	"searches": [
		{
			"collection": "my-collection",
			"exclude_fields": "embedding",
			"facet_by": "by",
			"highlight_full_fields": "embedding",
			"max_facet_values": 20,
			"page": 1,
			"per_page": 15,
			"q": "firefox",
			"query_by": "embedding",
			"vector_query": "embedding:([], k:200)"
		}
	]
}

Hybrid

{
	"searches": [
		{
			"collection": "my-collection",
			"exclude_fields": "embedding",
			"facet_by": "by",
			"highlight_full_fields": "text,embedding",
			"max_facet_values": 20,
			"page": 1,
			"per_page": 15,
			"q": "firefox",
			"query_by": "text,embedding",
			"vector_query": "embedding:([], k:200)"
		}
	]
}
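The three payloads differ only in query_by, highlight_full_fields and the presence of vector_query. A small helper can derive each variant from a shared base and POST it to Typesense's multi_search endpoint; a sketch, where the host and API key are placeholders for your own instance:

```python
import json
import urllib.request

# Parameters shared by the keyword, semantic and hybrid searches above.
BASE = {
    "collection": "my-collection",
    "exclude_fields": "embedding",
    "facet_by": "by",
    "max_facet_values": 20,
    "page": 1,
    "per_page": 15,
}

def build_search(q: str, mode: str) -> dict:
    """Return the per-mode search parameters shown above."""
    fields = {
        "keyword": "text",
        "semantic": "embedding",
        "hybrid": "text,embedding",
    }[mode]
    search = {**BASE, "q": q, "query_by": fields, "highlight_full_fields": fields}
    if mode in ("semantic", "hybrid"):
        search["vector_query"] = "embedding:([], k:200)"
    return search

def multi_search(q: str, mode: str, host="http://localhost:8108", api_key="xyz"):
    """POST the payload to Typesense's /multi_search endpoint."""
    body = json.dumps({"searches": [build_search(q, mode)]}).encode()
    req = urllib.request.Request(
        f"{host}/multi_search",
        data=body,
        headers={"X-TYPESENSE-API-KEY": api_key, "Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(req))
```

In practice the InstantSearch adapter builds these payloads for you; the helper just makes the diff between the three modes explicit.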

How does it work with Drupal?

Indexing

  • Drupal stores content in its relational database
  • Content is sent to an embedding LLM to create vector representations
  • Data is extracted and processed
  • Vector embeddings are stored in the Typesense vector database
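Before embedding, each document is split into overlapping chunks (the module exposes chunk size and overlap settings, configured later in this post). A minimal sketch of sliding-window chunking, with illustrative defaults:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so sentences cut at a boundary keep context."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Each chunk is then sent to the embedding model and its vector is
# stored alongside the document in Typesense.
```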

Querying

  • User submits a search query
  • Vector search is performed against Typesense to find relevant context
  • This provides context to a RAG LLM
  • LLM generates a response combining the query and context
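The querying steps above can be sketched as a prompt-assembly function: the hits returned by the vector search become the context handed to the RAG LLM together with the system prompt. The message shape and the `hits` structure below are illustrative, not the module's internals:

```python
def build_rag_prompt(system_prompt: str, query: str, hits: list[dict]) -> list[dict]:
    """Assemble chat messages from retrieved context.

    `hits` mimics documents returned by a Typesense vector search,
    each carrying the matched chunk under a "text" key (assumption).
    """
    context = "\n\n".join(hit["text"] for hit in hits)
    return [
        {"role": "system", "content": f"{system_prompt}\n\nContext:\n{context}"},
        {"role": "user", "content": query},
    ]
```

Because the system prompt forbids answering outside the provided context, the quality of the vector search step directly bounds the quality of the generated answer.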

Drupal integration

Search API Typesense integrates with the Search API and AI ecosystems. It is already fully functional for stop words, scoped keys, synonyms and curation, and at the time of writing further features are on the roadmap.

Configuration

Install and enable the Search API Typesense and the AI modules.

AI

Go to the Provider settings (/admin/config/ai/providers) and add a new API key.

Typesense

  1. Make sure that you have Typesense installed and create a new Typesense Search API Server (see the previous blog post)
  2. Configure the server: go to the Conversation models tab
  • Id: typesense
  • Model: pick your favourite one from the list
  • System prompt: copy the prompt below
You are an assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic.
  • Max bytes: depends on your model; for example, 16385 for gpt-3.5-turbo
  3. Create a new content Search API Index in /admin/config/search/search-api that uses this server (for more details, see also the previous blog post), then in the Schema tab, under AI features:
  • Check Enable embedding
  • Check the fields to be used for embedding (e.g. text and title)
  • Select the LLM embedding model; it can be a Typesense or an external provider model. Let's use ts: all-MiniLM-L12-v2
  • Add fields to prepend to all chunks (e.g. a taxonomy term)
  • Set the chunk size to 1000 and the chunk overlap size to 10
  4. Re-index
  5. You should now be able to use the Converse tab of your index as a starter, and then build on top of this with InstantSearch widgets
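Under the hood, the Converse tab issues a conversational (RAG) search, which per the Typesense documentation adds conversation parameters on top of a regular semantic search. A hedged sketch of such a request body; the collection name and question are examples, and the model id matches the one configured above:

```python
# Extra parameters for Typesense conversational (RAG) search, layered
# on top of an ordinary semantic search (values are examples).
conversation_params = {
    "conversation": True,
    "conversation_model_id": "typesense",  # the Id set in the Conversation models tab
}

search = {
    "collection": "my-collection",
    "q": "how do I install Firefox?",
    "query_by": "embedding",
}
```

Subsequent turns of the same conversation pass back the conversation id Typesense returns, so the model keeps the dialogue context.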

Typesense documentation

Photo from A Chosen Soul on Unsplash