Add async search and find_similar APIs #90
Added a workaround for typing issues with django-types
emilytoppm left a comment
Generally looking good, but I have a few questions, particularly on transactions/atomicity
documents.delete()

# Asynchronous document generation methods

async def _acreate_new_documents(
Hmm - this kind of feels like this and the sync version above should be atomic, due to the deletion + replacement? Should this be wrapped in a transaction and sync_to_async? You could always work out the embeddings outside the transaction
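To illustrate the suggestion above, here is a minimal sketch (not code from this PR) of an atomic delete-and-replace where the embeddings are computed before the transaction opens. It uses the standard library's `sqlite3` as a stand-in for Django's transaction handling, and `fake_embed`, `replace_documents`, and the `document` table are hypothetical names made up for the example.

```python
import sqlite3


def fake_embed(chunks):
    # Hypothetical stand-in for an embedding backend: one tiny vector per chunk.
    return [[float(len(chunk))] for chunk in chunks]


def replace_documents(conn, object_key, chunks):
    # Slow work happens first, outside any transaction.
    vectors = fake_embed(chunks)
    rows = [(object_key, chunk, repr(vec)) for chunk, vec in zip(chunks, vectors)]

    # "with conn" commits on success and rolls back if an exception is raised,
    # so the delete and the inserts succeed or fail together.
    with conn:
        conn.execute("DELETE FROM document WHERE object_key = ?", (object_key,))
        conn.executemany(
            "INSERT INTO document (object_key, content, vector) VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (object_key TEXT, content TEXT, vector TEXT)")
replace_documents(conn, "page:1", ["old a", "old b"])
replace_documents(conn, "page:1", ["new a"])
```

Because the embedding step runs before the `with conn:` block, a failure there never reaches the delete, and the transaction itself stays short. In Django the equivalent shape would presumably be a `transaction.atomic()` block around the delete and the create, called through `sync_to_async` from async code.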
chain(*[obj["chunks"] for obj in objects_to_rebuild.values()])
)

embedding_vectors = list(embedding_backend.embed(all_chunks))
How long does this take typically? Would it make sense to move it outside the transaction, or do it asynchronously?
for idx, returned_embedding in documents:
    all_keys = self._keys_for_instance(objects_by_key[object_key])
    chunk = all_chunks[idx]
    await Document.objects.acreate(
Do we want to be creating them one at a time rather than in bulk? This feels like it'll move a lot of work onto a sync thread, once per acreate call, when we could just use bulk_create once
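The cost difference can be sketched without Django at all. The example below is an illustration under stated assumptions: `FakeTable` is a hypothetical in-memory stand-in for a model manager, and `asyncio.to_thread` plays the role of the thread hop that each async ORM call makes.

```python
import asyncio


class FakeTable:
    """Hypothetical in-memory stand-in for a model manager; counts blocking calls."""

    def __init__(self):
        self.rows = []
        self.calls = 0

    def create(self, row):
        self.calls += 1
        self.rows.append(row)

    def bulk_create(self, rows):
        self.calls += 1
        self.rows.extend(rows)


async def save_one_by_one(table, rows):
    # One hop to a worker thread per row, similar in spirit to awaiting
    # acreate() inside a loop.
    for row in rows:
        await asyncio.to_thread(table.create, row)


async def save_in_bulk(table, rows):
    # A single hop to a worker thread for the whole batch.
    await asyncio.to_thread(table.bulk_create, list(rows))


rows = [{"content": f"chunk {i}"} for i in range(5)]
slow, fast = FakeTable(), FakeTable()
asyncio.run(save_one_by_one(slow, rows))
asyncio.run(save_in_bulk(fast, rows))
```

Both tables end up with the same rows, but the per-row version makes five blocking calls where the bulk version makes one. On Django 4.1 and later, `abulk_create` is available on querysets for the same purpose.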
yield from self._create_new_documents(object, chunks, embedding_backend)

async def ato_documents(
We're losing the transaction here - would it be possible to restructure so we generate the document instances async, then save them later? Just feels like we're risking some inconsistent state
Thanks for the feedback @emilytoppm and sorry for the delay in updating this. I've just done a bit of a rework to split the generation of documents and the saving of documents into two stages, so we can keep the transaction small and ensure it's usable in an async context.
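For readers following along, the two-stage shape described above might look roughly like the sketch below. It is not the code in this PR: `Doc`, `fake_aembed`, `generate_documents`, `save_documents`, and `rebuild` are hypothetical names, and `sqlite3` plus `asyncio.to_thread` stand in for the Django ORM and `sync_to_async`.

```python
import asyncio
import sqlite3
from dataclasses import dataclass


@dataclass
class Doc:
    object_key: str
    content: str
    vector: str


async def fake_aembed(chunk):
    # Hypothetical async embedding call; a real backend would do network I/O here.
    await asyncio.sleep(0)
    return repr([float(len(chunk))])


async def generate_documents(object_key, chunks):
    # Stage 1: build unsaved documents. No database access, no transaction.
    vectors = await asyncio.gather(*(fake_aembed(c) for c in chunks))
    return [Doc(object_key, c, v) for c, v in zip(chunks, vectors)]


def save_documents(conn, object_key, docs):
    # Stage 2: a short transaction that only deletes and inserts.
    with conn:
        conn.execute("DELETE FROM document WHERE object_key = ?", (object_key,))
        conn.executemany(
            "INSERT INTO document VALUES (?, ?, ?)",
            [(d.object_key, d.content, d.vector) for d in docs],
        )


async def rebuild(conn, object_key, chunks):
    docs = await generate_documents(object_key, chunks)
    # One hop to a worker thread for the blocking, transactional part.
    await asyncio.to_thread(save_documents, conn, object_key, docs)
    return docs


# check_same_thread=False lets the worker thread reuse this connection.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE document (object_key TEXT, content TEXT, vector TEXT)")
asyncio.run(rebuild(conn, "page:1", ["old a", "old b"]))
asyncio.run(rebuild(conn, "page:1", ["new a", "new b", "new c"]))
```

Keeping stage 1 free of database access means the same `save_documents` function can be called directly from sync code and through a thread hop from async code, which is the reuse the rework is aiming for.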
This PR adds async methods afind_similar and asearch to the public API exposed by a Vector Index. This involved creating various async variations of methods further up the chain. I've also broken up the bulk_generate_documents methods into smaller functions, both to reduce the size of that function body and so we can reuse as much as possible between the async and non-async versions of the function.