Add async search and find_similar APIs by tomusher · Pull Request #90 · wagtail/wagtail-vector-index

tomusher · 2024-09-26T11:51:39Z

This PR adds async methods afind_similar and asearch to the public API exposed by a Vector Index.

This involved creating async variants of several methods further up the chain. I've also broken up the bulk_generate_documents methods into smaller functions, both to reduce the size of that function body and so that we can reuse as much as possible between the async and non-async versions of the function.


emilytoppm

Generally looking good, but I have a few questions, particularly on transactions/atomicity

emilytoppm · 2024-10-03T13:28:51Z

-        documents.delete()
+    # Asynchronous document generation methods

+    async def _acreate_new_documents(


Hmm - it feels like both this and the sync version above should be atomic, because of the deletion + replacement. Should this be wrapped in a transaction and sync_to_async? You could always work out the embeddings outside the transaction.
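One way to read this suggestion, as a framework-free sketch rather than the code in this PR: compute the embeddings first, then run the delete and the inserts inside one short transaction on a worker thread. The standard library's `sqlite3` and `asyncio.to_thread` stand in here for Django's `transaction.atomic()` and `sync_to_async`; `init_db`, `fake_embed`, `replace_documents`, `arebuild` and the `documents` table are all hypothetical names used only for illustration.

```python
import asyncio
import sqlite3


def init_db(db_path):
    # Hypothetical schema: one row per chunk of an indexed object.
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents "
            "(object_key TEXT, content TEXT NOT NULL, embedding REAL)"
        )
    conn.close()


def fake_embed(chunks):
    # Stand-in for embedding_backend.embed(); returns one float per chunk.
    return [float(len(chunk)) for chunk in chunks]


def replace_documents(db_path, object_key, rows):
    # Synchronous step: the delete and the inserts commit together or not at all.
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("DELETE FROM documents WHERE object_key = ?", (object_key,))
            conn.executemany(
                "INSERT INTO documents (object_key, content, embedding) "
                "VALUES (?, ?, ?)",
                rows,
            )
    finally:
        conn.close()


async def arebuild(db_path, object_key, chunks):
    # The slow part (embedding) runs before any transaction is opened.
    vectors = fake_embed(chunks)
    rows = [(object_key, chunk, vec) for chunk, vec in zip(chunks, vectors)]
    # Only the short transactional step is handed to a worker thread.
    await asyncio.to_thread(replace_documents, db_path, object_key, rows)
```

If an insert fails part-way through, the rollback also undoes the delete, so the previous documents for that object remain in place.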

emilytoppm · 2024-10-03T13:34:52Z

            chain(*[obj["chunks"] for obj in objects_to_rebuild.values()])
        )
-
        embedding_vectors = list(embedding_backend.embed(all_chunks))


How long does this take typically? Would it make sense to move it outside the transaction, or do it asynchronously?

emilytoppm · 2024-10-03T14:07:41Z

+            for idx, returned_embedding in documents:
+                all_keys = self._keys_for_instance(objects_by_key[object_key])
+                chunk = all_chunks[idx]
+                await Document.objects.acreate(


Do we want to be creating them one at a time rather than in bulk? This feels like it'll move work onto a sync thread for every acreate call, when we could just use bulk_create once.
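The cost being described can be illustrated with a small standard-library sketch. It makes no claim about how Django implements `acreate`; it only counts how many times work is handed to a worker thread under each approach. `FakeStore`, `create_one_at_a_time` and `create_in_bulk` are hypothetical names, and `asyncio.to_thread` stands in for whatever sync-to-async bridge the ORM uses.

```python
import asyncio


class FakeStore:
    """Hypothetical stand-in for a synchronous ORM manager."""

    def __init__(self):
        self.rows = []
        self.calls = 0  # number of synchronous save calls received

    def save(self, rows):
        self.calls += 1
        self.rows.extend(rows)


async def create_one_at_a_time(store, rows):
    # One thread hand-off per row, similar in shape to awaiting a create call in a loop.
    for row in rows:
        await asyncio.to_thread(store.save, [row])


async def create_in_bulk(store, rows):
    # A single thread hand-off for the whole batch.
    await asyncio.to_thread(store.save, list(rows))
```

Both functions store the same rows; the difference is that the loop pays the thread hand-off once per row, while the bulk version pays it once in total.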

emilytoppm · 2024-10-03T14:09:28Z

+
+        yield from self._create_new_documents(object, chunks, embedding_backend)
+
+    async def ato_documents(


We're losing the transaction here - would it be possible to restructure so that we generate the document instances asynchronously, then save them later? It just feels like we're risking some inconsistent state.

tomusher · 2024-12-06T16:55:07Z

Thanks for the feedback @emilytoppm, and sorry for the delay in updating this. I've just reworked this to split the generation of documents and the saving of documents into two stages, so we can keep the transaction small and ensure it's usable in an async context.

tomusher added 9 commits September 26, 2024 09:41

Typing fixes

e0b9d51

Fix for_keys when using Postgres

d1f0024

Revert Document manager to be derived from QuerySet

652ef60

Added a workaround for typing issues with django-types

Add as_manager on DocumentQuerySet which casts returned type to Docum…

cd4448a

…entManager

Add afind_similar and other async supporting async methods

b65345a

Add/regroup async tests

c8518ff

Add async search API

ef7784f

Add type signature for async methods to DocumentManager

b81f039

Add afrom_document to FromDocumentOperator protocol

66c9d24

tomusher mentioned this pull request Sep 26, 2024

Async support across all public APIs #85

Open

emilytoppm suggested changes Oct 3, 2024


tomusher added 2 commits November 22, 2024 11:41

Use transaction for document generation from models

84d4296

Move preparation of documents and saving of documents to separate stages

5f96824

tomusher wants to merge 11 commits into main from feature/async-public-api
