Quick and simple catalog of your home library,
based on a Telegram bot with AI recognition
of covers and annotations
A Telegram bot that photographs the covers and annotations of your books and creates a catalog of your home library. Share the catalog with your friends. No manual input: AI does everything for you, and entering one book takes no more than 10 seconds. If necessary, you can always download your collection to a file. @home_library_ai_bot
Just start chatting with @home_library_ai_bot in Telegram. We use the Telegram ID to identify the user and store their books. Your Telegram ID is permanent and does not change when you change your mobile number or Telegram nickname.
Categories are not stored in the database as a separate entity. They are collected each time from the user's stored books. When a user adds a new category, it is kept in a user variable and applied to the new book. Once at least one book with this category has been saved, the category starts appearing in the category selection built from all of the user's books.
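Given the `books` table described below, such an on-the-fly category list amounts to a single query. A minimal sketch (the bot's actual query is not shown in this document and may differ):

```sql
SELECT DISTINCT category
FROM books
WHERE user_id = $1 AND category <> ''
ORDER BY category;
```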
We store all your photos in S3 file storage under unique anonymous identifiers. So if someone knows the photo identifier, they can see it. Access to other people's photos is unlikely, but try not to photograph things that you would not want exposed to public view.
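For illustration, an anonymous identifier can be generated from a random UUID. This is a hypothetical naming scheme, not the bot's actual one:

```python
import uuid

def anonymous_key(kind: str) -> str:
    """Build a random, unguessable S3 object key for a photo.
    Only someone who knows this identifier can fetch the file."""
    return f"{kind}/{uuid.uuid4().hex}.jpg"
```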
For each book we need two photos:
- Photo of the book's cover - to extract the cover picture and store it in the library database
- Photo of the book's first page with the annotation - to extract textual information about the book from it
Photograph the book cover on a plain surface, for example, on a desk. Use a desk lamp or the phone's flash to illuminate the book. Avoid mirrored or glass surfaces. Try not to use tables with a patterned surface, such as wooden ones. Hold the phone at a right angle to the table to avoid trapezoidal distortion.
We use the U-2-Net Salient Object Detection AI model from the Python rembg library to remove the background. Next we use the OpenCV library to find the cover's rectangle and align it into a vertical rectangle.
| Source photo | Extracted cover |
|---|---|
| ![]() | ![]() |
Step by step example:
| Source photo | → | Photo without background | → | Convert to grayscale and add some blur | → |
|---|---|---|---|---|---|
| ![]() | → | ![]() | → | ![]() | → |

| → | Highlight all edges | → | Dilate edges | → | Erode edges | → |
|---|---|---|---|---|---|---|
| → | ![]() | → | ![]() | → | ![]() | → |

| → | Find all contours | → | Select the largest rectangular contour | → | Do the perspective transformation |
|---|---|---|---|---|---|
| → | ![]() | → | ![]() | → | ![]() |
Take a picture of the first annotation page of the book in the same way. Inevitably, the fingers holding the book will get into the photo. As long as they don't obscure the text, that's okay.
We use an external LLM to OCR the book's first annotation page into text and extract some important fields about the book from it.
We extract these fields:

| Field | Value |
|---|---|
| title | Book title |
| authors | Authors of the book |
| authors_full_names | Full names of the book's authors |
| pages | Page count of the book |
| publisher | Name of the book's publishing organization |
| year | Year the book was published |
| isbn | International Standard Book Number, if present |
| annotation | Full text of the annotation, extracted from the page |
| brief | Brief summary of the annotation |
I found the following prompt for processing a photo with the AI model to work best:
The photo contains a page with the book's annotation.
Your response should only consist of the [book] section of an ini file without any introductory or concluding phrases.
The section must always contain 9 parameters. If a parameter is missing, its value should be empty.
Provide the parameter values in the same language as the photographed page.
If the parameter value contains newlines or equal signs, replace them with spaces.
Output only plain text of ini file content. Do not output markdown.
Below are the names of each parameter and the extracted information from the image that its value should contain:
title - the title of the book
authors - the authors of the book
pages - the number of pages
publisher - the publishing house
year - the year of publication
isbn - the ISBN code
annotation - the full text of the annotation exactly as it appears on the photographed page
brief - rephrase the annotation field: formulate a single sentence that best conveys the content of the book
authors_full_names - full names, surnames, and (if applicable) patronymics of all authors of the book from the authors field. The page likely contains mentions of their full names. Find and provide them in this field
Despite all efforts, models regularly:
- leave markdown formatting in the response
- forget to remove newlines from long field values
- forget to include the section title [book] in the response

All of these cases must be handled when processing the response.
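A response parser that tolerates these quirks might look like the sketch below (an illustration, not the bot's actual code):

```python
FIELDS = ["title", "authors", "pages", "publisher", "year",
          "isbn", "annotation", "brief", "authors_full_names"]

def parse_book_response(text: str) -> dict:
    """Parse the model's ini-style answer, tolerating markdown
    fences, a missing [book] header and stray newlines in values."""
    # Drop markdown code fences the model sometimes leaves in
    lines = [l for l in text.splitlines()
             if not l.strip().startswith("```")]
    book = {f: "" for f in FIELDS}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("["):  # skip the section header
            continue
        if "=" in line:
            key, _, value = line.partition("=")
            key = key.strip().lower()
            if key in book:
                book[key] = value.strip()
                current = key
                continue
        if current:                           # continuation of a long value
            book[current] = (book[current] + " " + line).strip()
    return book
```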
During the experiments, it turned out that the model's result depends on the order in which the extracted parameters are listed. Derived parameters such as `brief` and `authors_full_names` should come after all the other parameters:

- If we try to extract `brief` before the full `annotation`, we could not get the model to return the truly full text of the book's annotation in the `annotation` field.
- When the model constructs `authors_full_names`, it can add its own knowledge and append a middle name or decipher initials, even when there is no such information on the source page.
An interesting observation:

- In 4 out of 5 cases the AI model returns the author's name as `Dale Carnegie`, but in 1 case it returns `Carnegie Dale`.
We tested 28 LLMs that were relevant in May 2025, asking each of these GPT vision models to process the same 3 annotation pages.
For this task I decided to write the Python script without any manual coding, only by using GPT models. I used the Claude 3.7 Sonnet model and started from simple prompts: generate the code, find and analyze the errors, add a few lines to the prompt, try again, and so on.
The resulting prompt, which generated a fully functional working script, is:
1. Read GPT_URL, GPT_API_TOKEN, GPT_MODEL from environment variables.
2. Use AsyncOpenAI for the GPT client.
3. Read the models list from models.txt. Sort it in ascending order.
4. Read the prompt from prompt.txt.
5. Send all *.jpg pictures to each model with the prompt that was read.
6. The response from each model must be the text of an ini file with one section [book] and values of 9 parameters: title, authors, pages, publisher, year, isbn, annotation, brief, authors_full_names.
7. Create an Excel file with a sheet for each of the 9 parameters. The rows of each sheet must be models, the columns pictures, and the cells the parameter values read from the model.
8. If a model gives no response, or the response is not a valid ini file, or a parameter cannot be found in the response, put a "-" character in the appropriate cell.
9. Remove markdown code-block markers from the start and end of the model's result.
10. If you find a line without an equals character, ignore it.
11. Pay attention: for the .txt, .jpg and resulting Excel files you must use not the current system folder but the folder of your Python program.
12. Write the response of each model for each picture to a .txt file in a 'debug' subfolder, with a file name containing the model and picture names. Note that model names can include characters that are not allowed in file names. Create the 'debug' subfolder if it does not exist.
13. Add a Boolean ONLYONE parameter to the program; when set, the program runs fully but processes only the first model and only the first picture.
14. Output information on the screen about the model and file currently being processed.
15. Measure the response time (number of seconds) for each model and each picture and store it in a 'time' sheet in the resulting Excel file.
The generated script can be seen in compare_models.py.
Then I counted the number of right answers for each of the 3 pages and each of the 9 parameters. For each right and complete answer a model got +1 point; the maximum is 27 points, which means such a model can be used for page recognition. Only 11 of the 28 models never made a mistake. I also measured the elapsed time (in seconds) of each request and summed it per model, and calculated the cost of using each model (USD per 1000 requests).
The final results of this comparison are here:
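The scoring step can be expressed as a small function. The data layout here (nested dicts keyed by model, page and field) is illustrative, not the one used in compare_models.py:

```python
def score_models(results: dict, expected: dict) -> dict:
    """Score each model: +1 per (page, field) answer that exactly
    matches the manually checked reference; the maximum here is
    3 pages x 9 fields = 27 points.

    results[model][page][field] holds the parsed model answer,
    expected[page][field] holds the reference value."""
    scores = {}
    for model, pages in results.items():
        scores[model] = sum(
            1
            for page, fields in pages.items()
            for field, value in fields.items()
            if value == expected[page][field]
        )
    return scores
```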
Summary:
- We found the two best models to use: `Gemini Flash 1.5` and `Gemini Flash 2.0` by Google. They are the cheapest and are always right.
- `Gemini Flash 2.5` is still at the preview stage and has errors. The normal version once returned an answer not in the book's language, and the thinking version hallucinated an ISBN field, although there is no such field in a book published in 1983.
- `Gemini Pro 1.5` is a good model, but a little expensive. Earlier I used its cheaper sibling `Gemini Pro Vision Preview`, but Google shut it down in May 2025.
- `Claude 3 Opus`, `Claude 3.5 Sonnet` and `Claude 3.7 Sonnet` by Anthropic are good but fantastically expensive models.
- The open-source `Gemma 3` by Google showed the worst result, as did `Llama 3` by Meta.
- Surprisingly, `Llama 4 Maverick` got a good result. It is a little slow, but you can run it on your own server and stop paying for GPT usage! A good option if you don't have money but do have one free NVIDIA H100 GPU card.
- `GPT-4 Turbo` by OpenAI failed this test, but all other OpenAI models work fine.
- Of OpenAI's models I recommend using only `GPT-4o Mini`. It is comparable in price to `Gemini Pro 1.5`. The models `GPT-4o` and `o4` by OpenAI are too expensive.
Resume: in production I will use the `Gemini Flash 1.5` model.
Here is the first handwritten version of this algorithm.
- `wait_for_command` - waiting for one of the global bot commands
- `select_lang` - waiting for the user to select one of the languages
- `select_category` - waiting for the user to select a category or enter a new one
- `wait_for_cover_photo` - waiting for the user to send a photo of the book cover
- `wait_reaction_on_cover` - waiting for the user's reaction to the extracted book cover
- `wait_for_brief_photo` - waiting for the user to send a photo of the annotation page
- `wait_reaction_on_brief` - waiting for the user's reaction to the extracted book information
There are 6 global bot commands which can be executed from any bot state. You don't need to ask @BotFather to add these commands to the main menu button; they are added automatically by the bot itself:

- `add` - Add book
- `find` - Search
- `edit` - Edit book
- `cat` - Categories
- `export` - Export
- `lang` - Language
The bot also processes the /start command, used on each user's first run.
Common data:
- `locale: str` - language preferred by the user's selection
- `inline: int or array of int` - ID(s) of the last message(s) with an inline keyboard. Used to remember and remove these keyboards when they are no longer needed
Category selection:
- `action: str` - name of the action to run after the user selects a category
- `can_add: bool` - whether the user can enter the name of a new category when selecting one
- `category: str` - name of the category selected by the user
Book data:
- `photo_filename: str` - relative path on S3 storage to the original photo of the book cover
- `cover_filename: str` - relative path on S3 storage to the extracted photo of the book cover
- `brief_filename: str` - relative path on S3 storage to the original photo of the annotation page
- `title: str` - book's title
- `authors: str` - authors of the book
- `authors_full_names: str` - full names of the book's authors
- `pages: str` - page count of the book
- `publisher: str` - the publishing house
- `year: str` - year of the book's publication
- `isbn: str` - ISBN
- `brief: str` - single sentence that best conveys the content of the book
- `annotation: str` - full text of the book's annotation
- `book_id: int` - ID of the added/edited book
- `user_id: int` - ID of the current Telegram user (the library owner)
Field selection:
- `field: str` - name of the selected book field to edit
| Name | Description | Usage | Example |
|---|---|---|---|
| REGISTRY_HOST | Hostname of Container Registry to push docker image | CI/CD | 192.168.110.157 |
| REGISTRY_USERNAME | Login of Container Registry | CI/CD | 76ee9321-2... |
| REGISTRY_PASSWORD | Password of Container Registry | CI/CD | 1287a999-f... |
| KUBECONFIG | YAML text config of the production Kubernetes cluster to deploy the docker container | CI/CD | apiVersion: v1clusters:- cluster:... |
| TELEGRAM_TOKEN | String token for the production telegram-bot @home_library_ai_bot | Production | 25461226:Fjkld876ww2... |
| POSTGRES_HOST | Hostname or IP-address of Postgres database server | Production | 127.0.0.1 |
| POSTGRES_PORT | IP-Port of Postgres database server | Production | 5432 |
| POSTGRES_DATABASE | Database name of Postgres database | Production | homelib |
| POSTGRES_USERNAME | Login of Postgres database | Production | user |
| POSTGRES_PASSWORD | Password of Postgres database | Production | my_super_password |
| AWS_ENDPOINT_URL | URL of S3 storage | Production | https://s3.ru-7.storage.selcloud.ru |
| AWS_BUCKET_NAME | Bucket name in S3 storage | Production | homelibrary |
| AWS_ACCESS_KEY_ID | Access key to S3 storage | Production | e8793d292328x... |
| AWS_SECRET_ACCESS_KEY | Secret key to S3 storage | Production | a8632409c821... |
| AWS_EXTERNAL_URL | HTTPS external URL to access the content of S3 storage | Production | https://cdn.homelib.pro/ |
| GPT_URL | URL for access to the GPT API | Production | https://api.vsegpt.ru/v1 |
| GPT_API_TOKEN | Secret token for GPT API | Production | sk-f3-wm-15a2432133... |
| GPT_MODEL | GPT model name | Production | vis-google/gemini-flash-1.5 |
In production, secrets are delivered to the application via environment variables. If you use Visual Studio Code for development, you can add them to the .vscode/launch.json file:
"configurations": [
{
"env": {
"TELEGRAM_TOKEN": "25461226:Fjkld876ww2...",
"POSTGRES_HOST": "127.0.0.1",
"POSTGRES_PORT": "5432",
"POSTGRES_DATABASE": "homelib",
"POSTGRES_USERNAME": "user",
"POSTGRES_PASSWORD": "my_super_password",
"AWS_ACCESS_KEY_ID": "e8793d292328x...",
"AWS_SECRET_ACCESS_KEY": "a8632409c821...",
"AWS_ENDPOINT_URL": "https://s3.ru-7.storage.selcloud.ru",
"AWS_BUCKET_NAME": "homelibrary",
"GPT_URL": "https://api.vsegpt.ru/v1",
"GPT_API_TOKEN": "sk-f3-wm-15a2432133...",
"GPT_MODEL": "vis-google/gemini-flash-1.5"
}
}
]
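A sketch of how these variables might be collected and validated at startup. This is a hypothetical helper for illustration, not the bot's actual modules\environment.py:

```python
import os

def read_settings(env=os.environ) -> dict:
    """Collect the configuration listed above from environment
    variables, failing fast if any required variable is missing."""
    required = ["TELEGRAM_TOKEN", "POSTGRES_HOST", "POSTGRES_DATABASE",
                "POSTGRES_USERNAME", "POSTGRES_PASSWORD",
                "AWS_ENDPOINT_URL", "AWS_BUCKET_NAME",
                "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
                "GPT_URL", "GPT_API_TOKEN", "GPT_MODEL"]
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    settings = {name: env[name] for name in required}
    # POSTGRES_PORT is numeric and has a conventional default
    settings["POSTGRES_PORT"] = int(env.get("POSTGRES_PORT", "5432"))
    return settings
```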
There are two application tables in the PostgreSQL database:

- Table `logs` for logging each start of the bot:
CREATE TABLE IF NOT EXISTS logs (
user_id BIGINT,
nickname TEXT,
username TEXT,
datetime TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)

- Table `books` for storing information about all books added by users:
CREATE TABLE IF NOT EXISTS books (
user_id BIGINT,
book_id BIGINT,
category TEXT,
photo_filename TEXT,
cover_filename TEXT,
brief_filename TEXT,
title TEXT,
authors TEXT,
authors_full_names TEXT,
pages TEXT,
publisher TEXT,
year TEXT,
isbn TEXT,
annotation TEXT,
brief TEXT,
datetime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (user_id, book_id)
)

And two internal tables for continuous operation of the bot: one to store the current state of the conversation, and another to store the current information received from the user. These tables are used by the PostgreSQL storage plugin I wrote for the aiogram library.

- Table `aiogram_states` for storing the current state of the bot-user conversation:
CREATE TABLE IF NOT EXISTS aiogram_states(
chat_id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
state TEXT,
PRIMARY KEY (chat_id, user_id)
)

- Table `aiogram_data` for storing information received from the user during the conversation:
CREATE TABLE IF NOT EXISTS aiogram_data(
chat_id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
data JSON,
PRIMARY KEY ("chat_id", "user_id")
)

You don't need to create these tables manually. When the telegram-bot connects to Postgres, it tries to create these tables if they do not exist.
Program files:
- `homelib.py` - core of the telegram-bot
- `modules\environment.py` - prepares environment variables, classes and connections
- `modules\databasecreation.py` - script to create tables in the Postgres database on the first run
- `modules\handle_addbook.py` - handlers for processing bot messages in add-book mode
Deployment scripts:
- `requirements.txt` - Python library dependencies
- `dockerfile` - instructions on how to build the Docker container
- `deployment.yaml` - instructions on how to deploy it to the Kubernetes cluster
- `.gitignore` - hides my Python cache, debug environment variables with secrets, certificates, etc.
- `.github\workflows\` - CI/CD automation instructions for GitHub Actions
Documentation:
- `README.md` - this description
- `images\` - folder with images for this description
System administration and development organization:
- CI/CD with docker container generation and deployment to a Kubernetes cluster
- Using requirements.txt to build a docker container with the Python program
- Using GitHub secrets to deploy the container and to pass database coordinates and API keys into it
- Using Visual Studio Code launch.json to emulate environment variables with GitHub secrets
- Arranging branches: main to deploy to production and develop to store development changes
Cloud storage:
- Pushing files to S3 storage, including in asynchronous mode, and getting links to them

AI models:
- Using the U-2-Net Salient Object Detection AI model as a local library. $0.1 - $1 per image saved!
- Automatically comparing a large number of different GPT models to find the cheapest and most efficient one
- Making requests to GPT AI models, including images, and parsing their responses
- Writing prompts for GPT to get structured data and derived fields
- Writing a prompt in one language to get a response in another language
AI code generation:
- Generating a tiny but fully functional working program with one structured request to a GPT model (see `generate_prompt_sonnet_3.7.txt` and `compare_models.py`)

Image processing:
- Finding contours in an image
- Perspective transformation of an image
International localization:
- Creating a multilanguage application and adding or changing localization without editing source code
- I realized that the best way is to store in program code only shortcuts for string literals; all full strings, including the default English ones, should be stored in locale files
- Localization of plural forms
Telegram development:
- Comparing different Python frameworks for Telegram bots. I chose aiogram for its beautiful structure and thoughtful concept
- Using aiogram FSM storage to remember the state of the conversation and the user data entered in previous steps of the dialogue
- Sending formatted messages in Telegram
- Using inline buttons in Telegram chat
- Timely removal of old, outdated inline buttons
- Putting emoji reactions on messages and deleting messages

Open-source collaboration:
- Wrote my own library that uses a PostgreSQL database to store state and user data for aiogram FSM storage