Draft: Feature: Signals to update vector index on page publish #30

Open
Morsey187 wants to merge 2 commits into main from feature/update_index_on_publish_signal

Morsey187 (Collaborator) commented Dec 22, 2023

Adds a new env setting, WAGTAIL_VECTOR_INDEX_UPDATE_ON_PUBLISH, which, when enabled, connects every page model that uses VectorIndexedMixin to Wagtail's page_published signal.
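
For reference, a minimal sketch of how a flag like this could gate signal registration. It is not the code in this PR. To stay runnable without Django, it uses a small stand-in `Signal` class; in a real project the signal would be Wagtail's `page_published`. The names `flag_enabled`, `register_publish_handlers`, `rebuild_index_for`, `BlogPage` and `PlainPage` are hypothetical, and the way the flag value is parsed is an assumption.

```python
class Signal:
    """Stand-in for a Django-style signal (illustration only)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver, sender=None):
        self._receivers.append((receiver, sender))

    def send(self, sender, **kwargs):
        # Call receivers registered for this sender (or for any sender).
        for receiver, wanted in self._receivers:
            if wanted is None or wanted is sender:
                receiver(sender=sender, **kwargs)


page_published = Signal()  # stand-in for Wagtail's page_published signal
rebuilt = []  # records which page classes triggered a rebuild


class VectorIndexedMixin:
    """Stand-in for the mixin named in the description."""


class BlogPage(VectorIndexedMixin):
    pass


class PlainPage:
    pass


def flag_enabled(environ):
    # Assumption: the flag is treated as on for "1", "true" or "yes".
    value = environ.get("WAGTAIL_VECTOR_INDEX_UPDATE_ON_PUBLISH", "")
    return value.lower() in {"1", "true", "yes"}


def rebuild_index_for(sender, **kwargs):
    # Placeholder for the (currently full) index rebuild.
    rebuilt.append(sender.__name__)


def register_publish_handlers(page_models, enabled):
    """Connect the handler for every indexed page model, only if enabled."""
    if not enabled:
        return
    for model in page_models:
        if issubclass(model, VectorIndexedMixin):
            page_published.connect(rebuild_index_for, sender=model)


env = {"WAGTAIL_VECTOR_INDEX_UPDATE_ON_PUBLISH": "true"}
register_publish_handlers([BlogPage, PlainPage], enabled=flag_enabled(env))
page_published.send(BlogPage)   # handled: BlogPage uses the mixin
page_published.send(PlainPage)  # ignored: no handler was connected
```

Connecting per model with `sender=` keeps pages without the mixin out of the handler entirely, rather than filtering inside the handler.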

  • Issue: Signals run as part of the request cycle, and updating indexes can be time-consuming. We should add support for a task queue and consider whether we want to allow these signals to be used without one at all.
  • Issue: This currently requires rebuilding the whole index instead of updating it. To update in place, we'd need to figure out:
    • Which indexes a model is in (so we can update the right indexes)
    • A way to remove documents from an index that match a given set of metadata (the object id and content type ID in this case)
    • An easier way to generate embeddings on a per-document level, instead of at the rebuild index stage
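
To illustrate the first issue, here is a rough sketch of a publish handler that only enqueues work and leaves the slow update to a background worker. It uses the standard library's `queue` and `threading` as a stand-in for a real task queue; `on_page_published` and `update_index` are hypothetical names, and nothing here reflects an API of this package.

```python
import queue
import threading

tasks = queue.Queue()
processed = []  # records the page ids the worker has handled


def update_index(page_id):
    # Placeholder for the slow index update.
    processed.append(page_id)


def worker():
    # Runs outside the request cycle and drains the queue in order.
    while True:
        page_id = tasks.get()
        if page_id is None:  # sentinel: stop the worker
            tasks.task_done()
            break
        update_index(page_id)
        tasks.task_done()


def on_page_published(page_id):
    # Signal handler: enqueue and return immediately.
    tasks.put(page_id)


thread = threading.Thread(target=worker, daemon=True)
thread.start()

on_page_published(42)
on_page_published(7)

tasks.put(None)  # ask the worker to stop once the queue is drained
tasks.join()
thread.join()
```

With this shape the request only pays for a `put`, so whether to allow the signals without a queue becomes a question of what the fallback does when no worker is running.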

Morsey187 changed the title from "Feature: Signals to update vector index on page publish" to "Draft: Feature: Signals to update vector index on page publish" on Dec 22, 2023

tomusher (Member) commented Jan 8, 2024

Discussed this with Ben separately but copying some of those comments here for reference.

This implementation has raised a few potentially challenging points that we might need to address before we can finalise this. Right now, it does a full index rebuild on save, which, as Ben identified, can be very slow.

Ideally, we would only update the current page when saving, but to do this we need to:

  • Identify which indexes the page is part of
  • Add an abstraction for upserting pages. While we can upsert documents right now, a re-indexed page may return a different set of documents, so we need to identify which documents belong to a page, delete them, and then insert the new documents.
  • Have an easier way to generate documents on a per-page level. At the moment, it's only doable when the whole index is rebuilt.
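
The three points above could fit together roughly as follows. This is a sketch against a hypothetical in-memory index, not this package's API: `InMemoryIndex`, `delete_where`, `indexes_for`, `documents_for` and `upsert_page` are illustrative names. Embedding generation is left out, and splitting a page by paragraph is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    content: str
    metadata: dict


@dataclass
class BlogPage:
    pk: int
    body: str


@dataclass
class EventPage:
    pk: int
    body: str


@dataclass
class InMemoryIndex:
    models: tuple  # page classes this index covers
    documents: list = field(default_factory=list)

    def delete_where(self, **match):
        """Remove documents whose metadata contains every pair in match."""
        self.documents = [
            d for d in self.documents
            if not all(d.metadata.get(k) == v for k, v in match.items())
        ]

    def add(self, docs):
        self.documents.extend(docs)


def indexes_for(page, indexes):
    # Point 1: find the indexes this page belongs to.
    return [i for i in indexes if isinstance(page, i.models)]


def documents_for(page, content_type_id):
    # Point 3: build documents for one page (here, one per paragraph).
    meta = {"object_id": page.pk, "content_type_id": content_type_id}
    return [Document(p, dict(meta)) for p in page.body.split("\n\n") if p]


def upsert_page(page, content_type_id, indexes):
    # Point 2: delete the page's old documents, then insert the new set.
    for index in indexes_for(page, indexes):
        index.delete_where(object_id=page.pk, content_type_id=content_type_id)
        index.add(documents_for(page, content_type_id))


blog_index = InMemoryIndex(models=(BlogPage,))
event_index = InMemoryIndex(models=(EventPage,))
all_indexes = [blog_index, event_index]

page = BlogPage(pk=1, body="a\n\nb\n\nc")
upsert_page(page, content_type_id=5, indexes=all_indexes)
first_count = len(blog_index.documents)

page.body = "x\n\ny"  # re-publishing yields a different set of documents
upsert_page(page, content_type_id=5, indexes=all_indexes)
```

Deleting by object ID and content type ID before inserting means a page that shrinks from three documents to two leaves no stale document behind, which a plain per-document upsert would not guarantee.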
