diff --git a/API_DOCUMENTATION.md b/API_DOCUMENTATION.md new file mode 100644 index 0000000..ab2ffbc --- /dev/null +++ b/API_DOCUMENTATION.md @@ -0,0 +1,578 @@ +# API Documentation + +## Overview + +This is a Flask-based web application that generates advertising ideas using AI-powered image analysis and Unsplash image search. The application processes uploaded images, generates captions, creates ad ideas using OpenAI, and finds relevant images from Unsplash. + +## Table of Contents + +1. [Flask Application Routes](#flask-application-routes) +2. [Core Classes](#core-classes) +3. [Image Processing Components](#image-processing-components) +4. [AI/ML Components](#aiml-components) +5. [External API Integrations](#external-api-integrations) +6. [Installation and Setup](#installation-and-setup) +7. [Usage Examples](#usage-examples) + +--- + +## Flask Application Routes + +### Main Application (`app.py`) + +The main Flask application provides the following REST API endpoints: + +#### `GET /` +**Description**: Default access page +**Returns**: HTML template (`index.html`) +**Usage**: Entry point for the web application + +```python +# Example usage +import requests +response = requests.get('http://localhost:5000/') +``` + +#### `POST /upload` +**Description**: Upload and process an image +**Parameters**: +- `file`: Image file (multipart/form-data) +- `url`: Image URL (optional, string) + +**Returns**: HTML template (`processing.html`) with image name +**Status Codes**: 200 (success), 400 (invalid input) + +```python +# Example usage +import requests + +# Upload from file +with open('image.jpg', 'rb') as f: + files = {'file': f} + response = requests.post('http://localhost:5000/upload', files=files) + +# Upload from URL +data = {'url': 'https://example.com/image.jpg'} +response = requests.post('http://localhost:5000/upload', data=data) +``` + +#### `POST /ideas` +**Description**: Generate ad ideas for an uploaded image +**Parameters**: +- `image`: Image filename (string) + 
+**Returns**: Merged image file containing generated ideas +**Process**: +1. Analyzes the image using AI +2. Generates ad ideas using OpenAI +3. Searches Unsplash for relevant images +4. Merges results into a single image + +```python +# Example usage +import requests + +data = {'image': 'input-20231201-143022.jpg'} +response = requests.post('http://localhost:5000/ideas', data=data) +# Returns the merged ideas image +``` + +#### `GET /static/images/` +**Description**: Retrieve image files from static directory +**Parameters**: +- `filename`: Name of the image file (string) + +**Returns**: Image file +**Usage**: Serves generated images to the frontend + +```python +# Example usage +import requests +response = requests.get('http://localhost:5000/static/images/ideas-20231201-143022.jpg') +``` + +--- + +## Core Classes + +### GenerateIdeas (`ml.py`) + +Main orchestrator class that coordinates image analysis and idea generation. + +#### Constructor +```python +GenerateIdeas(filename="/path/to/image.jpg", results_count=3) +``` + +**Parameters**: +- `filename` (str): Path to the input image file +- `results_count` (int): Number of ideas to generate (default: 3) + +#### Methods + +##### `generate_caption(filename)` +**Description**: Generates a caption for the input image using Azure Computer Vision +**Parameters**: +- `filename` (str): Path to the image file + +**Returns**: `str` - Generated caption + +```python +# Example usage +worker = GenerateIdeas() +caption = worker.generate_caption("/path/to/image.jpg") +print(caption) # "two dogs playing in snow" +``` + +##### `openai_ideas(description)` +**Description**: Generates ad ideas using OpenAI based on image description +**Parameters**: +- `description` (str): Image description/caption + +**Returns**: `list` - List of generated ad ideas + +```python +# Example usage +worker = GenerateIdeas() +ideas = worker.openai_ideas("two dogs playing in snow") +print(ideas) # ["Ad idea 1", "Ad idea 2", "Ad idea 3"] +``` + +##### 
`run()` +**Description**: Main execution method that orchestrates the entire process +**Returns**: `list` - List of prompts for Unsplash search + +```python +# Example usage +worker = GenerateIdeas(filename="/path/to/image.jpg") +prompts = worker.run() +print(prompts) # ["original caption", "idea 1", "idea 2", "idea 3"] +``` + +--- + +## Image Processing Components + +### MergeImages (`merge_images.py`) + +Utility class for combining multiple images into a single image. + +#### Constructor +```python +MergeImages() +``` + +#### Methods + +##### `horizontal(img_list, save_file)` +**Description**: Merges images horizontally into a single image +**Parameters**: +- `img_list` (list): List of image file paths +- `save_file` (str): Output file path + +**Returns**: `None` - Saves the merged image to file + +```python +# Example usage +merger = MergeImages() +image_list = ["image1.jpg", "image2.jpg", "image3.jpg"] +merger.horizontal(image_list, "merged_horizontal.jpg") +``` + +##### `vertical(img_list, save_file)` +**Description**: Merges images vertically into a single image +**Parameters**: +- `img_list` (list): List of image file paths +- `save_file` (str): Output file path + +**Returns**: `None` - Saves the merged image to file + +```python +# Example usage +merger = MergeImages() +image_list = ["image1.jpg", "image2.jpg", "image3.jpg"] +merger.vertical(image_list, "merged_vertical.jpg") +``` + +--- + +## AI/ML Components + +### ImageCaption (`image_caption.py`) + +Azure Computer Vision integration for generating image captions. + +#### Constructor +```python +ImageCaption() +``` + +**Note**: Requires Azure Computer Vision API credentials to be configured. 
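The credentials mentioned in the note could be read from the environment instead of being hard-coded. The following is a minimal sketch, assuming the `AZURE_VISION_KEY` and `AZURE_VISION_ENDPOINT` variables listed under Configuration; `load_azure_credentials` is a hypothetical helper for illustration, and `image_caption.py` may load its credentials differently.

```python
import os


def load_azure_credentials(env=None):
    """Return (key, endpoint) read from environment variables.

    Hypothetical helper, not part of image_caption.py.
    """
    # Allow a plain dict to be passed in so the function is easy to test.
    env = os.environ if env is None else env
    names = ("AZURE_VISION_KEY", "AZURE_VISION_ENDPOINT")
    # Collect every variable that is unset or empty so the error names them all.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["AZURE_VISION_KEY"], env["AZURE_VISION_ENDPOINT"]
```

Failing early with the names of the missing variables tends to be easier to debug than an authentication error returned later by the remote service.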
+ +#### Methods + +##### `generate_captions(filename)` +**Description**: Generates captions for an image using Azure Computer Vision +**Parameters**: +- `filename` (str): Path to the image file + +**Returns**: `list` - List of generated captions with confidence scores + +```python +# Example usage +caption_generator = ImageCaption() +captions = caption_generator.generate_captions("/path/to/image.jpg") +print(captions) # ["two dogs playing in snow"] +``` + +### OpenAICompletion (`openai_completion.py`) + +OpenAI API integration for generating creative ad ideas. + +#### Constructor +```python +OpenAICompletion( + engine="babbage", + temperature=0.77, + max_tokens=50, + top_p=0.95, + best_of=3, + frequency_penalty=0.7, + presence_penalty=0.57 +) +``` + +**Parameters**: +- `engine` (str): OpenAI model to use (default: "babbage") +- `temperature` (float): Sampling temperature; higher values produce more varied output (default: 0.77) +- `max_tokens` (int): Maximum tokens in response (default: 50) +- `top_p` (float): Nucleus sampling parameter (default: 0.95) +- `best_of` (int): Number of candidate completions the API generates server-side before returning the best (default: 3) +- `frequency_penalty` (float): Penalizes tokens in proportion to how often they have already appeared, reducing repetition (default: 0.7) +- `presence_penalty` (float): Penalizes tokens that have already appeared at all, encouraging new topics (default: 0.57) + +**Note**: Requires OpenAI API key to be configured. + +#### Methods + +##### `suggestions(prompt)` +**Description**: Generates creative suggestions based on a prompt +**Parameters**: +- `prompt` (str): Input prompt for idea generation + +**Returns**: `list` - List of generated suggestions + +```python +# Example usage +openai_worker = OpenAICompletion() +suggestions = openai_worker.suggestions("two dogs playing in snow") +print(suggestions) # ["Ad idea 1", "Ad idea 2", "Ad idea 3"] +``` + +--- + +## External API Integrations + +### Unsplash1 (`unsplash1.py`) + +Advanced Unsplash image search using the CLIP model for semantic similarity. 
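The idea behind this kind of semantic search can be sketched without CLIP itself: given a query vector and precomputed photo feature vectors, rank the photos by cosine similarity to the query. The following is a pure-Python illustration with made-up three-dimensional vectors; `cosine_similarity`, `rank_by_cosine`, and the photo IDs are hypothetical, and the real class presumably operates on CLIP embeddings held as tensors or arrays.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, photo_features, results_count=3):
    """Return the IDs of the photos most similar to the query, best first."""
    scored = [
        (cosine_similarity(query, vector), photo_id)
        for photo_id, vector in photo_features.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [photo_id for _score, photo_id in scored[:results_count]]


# Invented feature vectors standing in for precomputed image embeddings.
photo_features = {
    "photo_a": [1.0, 0.0, 0.0],
    "photo_b": [0.7, 0.7, 0.0],
    "photo_c": [0.0, 0.0, 1.0],
}
best = rank_by_cosine([0.9, 0.1, 0.0], photo_features, results_count=2)
print(best)  # ['photo_a', 'photo_b']
```

Because the photo vectors are computed once and only the query is encoded at search time, a lookup reduces to one similarity computation per stored photo.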
+ +#### Constructor +```python +Unsplash1() +``` + +**Features**: +- Uses CLIP (Contrastive Language-Image Pre-training) model +- Semantic image search based on text descriptions +- GPU acceleration support +- Pre-computed feature vectors for fast search + +#### Methods + +##### `encode_search_query(search_query)` +**Description**: Encodes text query using CLIP model +**Parameters**: +- `search_query` (str): Text description to encode + +**Returns**: `torch.Tensor` - Encoded feature vector + +```python +# Example usage +unsplash = Unsplash1() +features = unsplash.encode_search_query("dogs playing in snow") +``` + +##### `find_best_matches(text_features)` +**Description**: Finds best matching images using cosine similarity +**Parameters**: +- `text_features` (torch.Tensor): Encoded text features + +**Returns**: `str` - Photo ID of best match + +```python +# Example usage +unsplash = Unsplash1() +photo_id = unsplash.find_best_matches(features) +``` + +##### `save_photo(photo_id, filename)` +**Description**: Downloads and saves an Unsplash photo +**Parameters**: +- `photo_id` (str): Unsplash photo ID +- `filename` (str): Local file path to save image + +**Returns**: `None` - Saves image to file + +```python +# Example usage +unsplash = Unsplash1() +unsplash.save_photo("abc123", "downloaded_image.jpg") +``` + +##### `search_unslash(search_query)` +**Description**: Performs semantic search on Unsplash +**Parameters**: +- `search_query` (str): Text description to search for + +**Returns**: `str` - Best matching photo ID + +```python +# Example usage +unsplash = Unsplash1() +photo_id = unsplash.search_unslash("dogs playing in snow") +``` + +##### `run(search_query, results_count=3)` +**Description**: Main method to search and return multiple results +**Parameters**: +- `search_query` (list): List of search queries +- `results_count` (int): Number of results to return + +**Returns**: `list` - List of photo IDs + +```python +# Example usage +unsplash = Unsplash1() +queries 
= ["dogs playing", "snow scene", "pet photography"] +photo_ids = unsplash.run(queries, results_count=3) +``` + +### Unsplash3 (`unsplash3.py`) + +Simplified Unsplash image search using CLIP model. + +#### Constructor +```python +Unsplash3() +``` + +**Features**: +- Similar to Unsplash1 but with simplified interface +- Returns multiple results in a single search +- Direct search without query combination + +#### Methods + +##### `run(search_query, results_count=3)` +**Description**: Main method to search and return multiple results +**Parameters**: +- `search_query` (str): Single search query +- `results_count` (int): Number of results to return + +**Returns**: `list` - List of photo IDs + +```python +# Example usage +unsplash = Unsplash3() +photo_ids = unsplash.run("dogs playing in snow", results_count=3) +``` + +--- + +## Installation and Setup + +### Prerequisites + +1. Python 3.7+ +2. CUDA-compatible GPU (optional, for faster CLIP processing) +3. Azure Computer Vision API key +4. OpenAI API key + +### Installation + +1. Clone the repository +2. Install dependencies: +```bash +pip install -r requirements.txt +``` + +3. Configure API keys: + - Set Azure Computer Vision credentials in `image_caption.py` + - Set OpenAI API key in `openai_completion.py` + +4. 
Download required datasets: + - Unsplash dataset with photo IDs and features + - Place in `unsplash-dataset/` directory + +### Configuration + +#### Environment Variables +```bash +export AZURE_VISION_KEY="your_azure_key" +export AZURE_VISION_ENDPOINT="your_azure_endpoint" +export OPENAI_API_KEY="your_openai_key" +``` + +#### File Structure +``` +project/ +├── app.py # Main Flask application +├── ml.py # Core ML orchestration +├── unsplash1.py # Advanced Unsplash search +├── unsplash3.py # Simple Unsplash search +├── merge_images.py # Image merging utilities +├── image_caption.py # Azure Computer Vision +├── openai_completion.py # OpenAI integration +├── ocr.py # OCR functionality +├── requirements.txt # Python dependencies +├── static/ +│ └── images/ # Generated images +└── templates/ # HTML templates +``` + +--- + +## Usage Examples + +### Complete Workflow Example + +```python +# 1. Initialize the main worker +worker = GenerateIdeas(filename="input_image.jpg", results_count=3) + +# 2. Generate ideas +prompts = worker.run() + +# 3. Search Unsplash for relevant images +unsplash = Unsplash1() +photo_ids = unsplash.run(prompts, results_count=3) + +# 4. Download and save images +image_files = [] +for idx, photo_id in enumerate(photo_ids): + filename = f"idea_{idx}.jpg" + unsplash.save_photo(photo_id, filename) + image_files.append(filename) + +# 5. Merge images into single output +merger = MergeImages() +merger.horizontal(image_files, "final_ideas.jpg") +``` + +### Web API Usage + +```python +import requests + +# 1. Upload an image +with open('product_image.jpg', 'rb') as f: + files = {'file': f} + response = requests.post('http://localhost:5000/upload', files=files) + +# 2. Generate ideas +data = {'image': 'input-20231201-143022.jpg'} +response = requests.post('http://localhost:5000/ideas', data=data) + +# 3. 
Download the result +with open('generated_ideas.jpg', 'wb') as f: + f.write(response.content) +``` + +### Custom Integration Example + +```python +# Custom idea generation with different parameters +openai_worker = OpenAICompletion( + engine="davinci", + temperature=0.9, + max_tokens=100 +) + +# Generate ideas for a specific product +suggestions = openai_worker.suggestions("modern smartphone with camera") + +# Use advanced Unsplash search +unsplash = Unsplash1() +photo_ids = unsplash.run(suggestions, results_count=5) + +# Download each photo so the files exist before merging +image_files = [] +for i, photo_id in enumerate(photo_ids): + filename = f"idea_{i}.jpg" + unsplash.save_photo(photo_id, filename) + image_files.append(filename) + +# Create custom image layout +merger = MergeImages() +merger.vertical(image_files, "custom_layout.jpg") +``` + +--- + +## Error Handling + +### Common Issues and Solutions + +1. **API Key Errors** + - Ensure Azure and OpenAI API keys are properly configured + - Check API quotas and billing status + +2. **Image Processing Errors** + - Verify image format (JPG, PNG, BMP supported) + - Check file permissions and disk space + +3. **CLIP Model Loading** + - Ensure sufficient RAM/VRAM for model loading + - Use CPU fallback if GPU unavailable + +4. **Unsplash Dataset** + - Verify dataset files are in correct location + - Check file permissions for CSV and NPY files + +### Status Codes + +- `200`: Success +- `400`: Bad Request (invalid input) +- `500`: Internal Server Error + +--- + +## Performance Considerations + +### Optimization Tips + +1. **GPU Usage**: Enable CUDA for faster CLIP processing +2. **Batch Processing**: Process multiple images in sequence +3. **Caching**: Cache generated captions and ideas +4. **Image Resizing**: Resize large images before processing +5. 
**API Limits**: Respect rate limits for external APIs + +### Resource Requirements + +- **RAM**: Minimum 4GB, Recommended 8GB+ +- **GPU**: Optional but recommended for CLIP processing +- **Storage**: 2GB+ for datasets and generated images +- **Network**: Stable internet for API calls + +--- + +## Contributing + +When contributing to this project: + +1. Follow the existing code structure +2. Add proper error handling +3. Include docstrings for new functions +4. Update this documentation for new features +5. Test with various image formats and sizes + +--- + +## License + +This project is licensed under the MIT License. See LICENSE file for details. diff --git a/OCR_DOCUMENTATION.md b/OCR_DOCUMENTATION.md new file mode 100644 index 0000000..0e440fe --- /dev/null +++ b/OCR_DOCUMENTATION.md @@ -0,0 +1,389 @@ +# OCR and Additional Components Documentation + +## Overview + +This document covers the OCR (Optical Character Recognition) functionality and additional utility components in the codebase. + +--- + +## OCR Component (`ocr.py`) + +### Overview + +The OCR component uses Keras-OCR to extract text from images. This is useful for analyzing images that contain text elements for advertising purposes. 
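Keras-OCR returns, for each image, a list of (word, bounding box) pairs, and the order of those pairs may not follow reading order. A small sketch of turning one such list into a plain string, using hand-written sample predictions instead of a real pipeline call; `predictions_to_text`, the `row_height` heuristic, and the box coordinates are all assumptions made for illustration.

```python
def predictions_to_text(predictions, row_height=20):
    """Join recognized words in approximate reading order.

    Words are grouped into rows by the top edge of their box, then
    ordered left to right inside each row. `row_height` is a rough,
    hypothetical tuning value in pixels.
    """
    def sort_key(item):
        _word, box = item
        top = min(y for _x, y in box)
        left = min(x for x, _y in box)
        return (top // row_height, left)

    ordered = sorted(predictions, key=sort_key)
    return " ".join(word for word, _box in ordered)


# Invented predictions: each box is four (x, y) corner points.
sample_predictions = [
    ("reserves", [[70, 10], [150, 10], [150, 30], [70, 30]]),
    ("join", [[10, 50], [50, 50], [50, 70], [10, 70]]),
    ("army", [[10, 12], [60, 12], [60, 30], [10, 30]]),
]
print(predictions_to_text(sample_predictions))  # army reserves join
```

Bucketing by row is a simple heuristic; text that is rotated or laid out in columns would need a more careful ordering.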
+ +### Dependencies + +```bash +pip install keras-ocr +pip install matplotlib +``` + +### Usage + +#### Basic OCR Processing + +```python +import matplotlib.pyplot as plt +import keras_ocr + +# Initialize the OCR pipeline +pipeline = keras_ocr.pipeline.Pipeline() + +# Read images (can be from URLs or local files) +images = [ + keras_ocr.tools.read(url) for url in [ + 'https://upload.wikimedia.org/wikipedia/commons/b/bd/Army_Reserves_Recruitment_Banner_MOD_45156284.jpg', + 'https://upload.wikimedia.org/wikipedia/commons/b/b4/EUBanana-500x112.jpg' + ] +] + +# Perform OCR recognition +prediction_groups = pipeline.recognize(images) + +# Extract text from predictions +for i, predictions in enumerate(prediction_groups): + text = " ".join([prediction[0] for prediction in predictions]) + print(f"Image {i+1}: {text}") +``` + +#### Visualizing Results + +```python +# Plot the predictions with bounding boxes +fig, axs = plt.subplots(nrows=len(images), figsize=(20, 20)) +for ax, image, predictions in zip(axs, images, prediction_groups): + print(" ".join([predictions[i][0] for i in range(len(predictions))])) + keras_ocr.tools.drawAnnotations(image=image, predictions=predictions, ax=ax) + +plt.show() +``` + +### API Reference + +#### `keras_ocr.pipeline.Pipeline()` + +**Description**: Main OCR pipeline for text recognition +**Returns**: Pipeline object for processing images + +#### `pipeline.recognize(images)` + +**Description**: Performs OCR on a list of images +**Parameters**: +- `images` (list): List of image arrays + +**Returns**: `list` - List of prediction groups, where each group contains (text, bounding_box) tuples + +#### `keras_ocr.tools.read(url_or_path)` + +**Description**: Reads an image from URL or local path +**Parameters**: +- `url_or_path` (str): URL or local file path + +**Returns**: `numpy.ndarray` - Image array + +#### `keras_ocr.tools.drawAnnotations(image, predictions, ax)` + +**Description**: Draws bounding boxes and text on image +**Parameters**: +- 
`image` (numpy.ndarray): Image to annotate +- `predictions` (list): List of (text, bounding_box) tuples +- `ax` (matplotlib.axes.Axes): Matplotlib axes for plotting + +--- + +## Integration with Main Application + +### Potential Integration Points + +The OCR component can be integrated into the main application for: + +1. **Text Analysis**: Extract text from uploaded images to understand content +2. **Brand Detection**: Identify brand names and slogans in images +3. **Content Filtering**: Filter inappropriate or unwanted text content +4. **Enhanced Captioning**: Combine visual and textual analysis for better captions + +### Example Integration + +```python +# In ml.py - Enhanced GenerateIdeas class +import keras_ocr + +class GenerateIdeas: + def __init__(self, filename, results_count=3): + self.filename = filename + self.results_count = results_count + self.ocr_pipeline = keras_ocr.pipeline.Pipeline() + + def extract_text(self, filename): + """Extract text from image using OCR""" + image = keras_ocr.tools.read(filename) + predictions = self.ocr_pipeline.recognize([image]) + + if predictions and predictions[0]: + text = " ".join([pred[0] for pred in predictions[0]]) + return text + return "" + + def enhanced_caption(self, filename): + """Generate enhanced caption using both vision and OCR""" + # Get visual caption + visual_caption = self.generate_caption(filename) + + # Get text content + text_content = self.extract_text(filename) + + # Combine for enhanced description + if text_content: + enhanced_caption = f"{visual_caption} with text: {text_content}" + else: + enhanced_caption = visual_caption + + return enhanced_caption +``` + +--- + +## Additional Utility Functions + +### Image Processing Utilities + +#### Supported Image Formats + +The application supports the following image formats: +- JPEG (.jpg, .jpeg) +- PNG (.png) +- BMP (.bmp) + +#### Image Validation + +```python +import os + +def validate_image_format(filename): + """Validate if image 
format is supported""" + supported_formats = ['.jpg', '.jpeg', '.png', '.bmp'] + ext = os.path.splitext(filename)[1].lower() + return ext in supported_formats +``` + +#### Image Resizing + +```python +from PIL import Image + +def resize_image(input_path, output_path, max_size=(800, 600)): + """Resize image while maintaining aspect ratio""" + with Image.open(input_path) as img: + img.thumbnail(max_size, Image.Resampling.LANCZOS) + img.save(output_path, quality=95) +``` + +### File Management + +#### Timestamp-based Naming + +```python +import time + +def generate_timestamped_filename(prefix="image"): + """Generate timestamped filename""" + timestr = time.strftime("%Y%m%d-%H%M%S") + return f"{prefix}-{timestr}.jpg" +``` + +#### Directory Management + +```python +import os + +def ensure_directory_exists(directory_path): + """Create directory if it doesn't exist""" + if not os.path.isdir(directory_path): + os.makedirs(directory_path) +``` + +--- + +## Error Handling for OCR + +### Common OCR Issues + +1. **Low Quality Images** + - Blurry or low-resolution images may result in poor text recognition + - Solution: Implement image preprocessing (sharpening, contrast adjustment) + +2. **Complex Backgrounds** + - Text on complex backgrounds may be difficult to recognize + - Solution: Use background removal or contrast enhancement + +3. **Multiple Languages** + - Mixed language content may affect recognition accuracy + - Solution: Implement language detection and appropriate model selection + +### Error Handling Example + +```python +def safe_ocr_processing(image_path): + """Safely perform OCR with error handling""" + try: + image = keras_ocr.tools.read(image_path) + predictions = pipeline.recognize([image]) + + if predictions and predictions[0]: + return [pred[0] for pred in predictions[0]] + else: + return [] + + except Exception as e: + print(f"OCR processing failed: {e}") + return [] +``` + +--- + +## Performance Optimization + +### OCR Performance Tips + +1. 
**Image Preprocessing** + - Resize large images before OCR processing + - Apply contrast enhancement for better text recognition + - Remove noise and artifacts + +2. **Batch Processing** + - Process multiple images in a single pipeline call + - Use GPU acceleration if available + +3. **Caching** + - Cache OCR results for frequently processed images + - Store results in database or file system + +### Example Optimized Processing + +```python +def optimized_ocr_batch(image_paths, batch_size=4): + """Process multiple images efficiently""" + results = [] + + for i in range(0, len(image_paths), batch_size): + batch = image_paths[i:i + batch_size] + + # Preprocess images + processed_images = [] + for path in batch: + image = keras_ocr.tools.read(path) + # Apply preprocessing here + processed_images.append(image) + + # Process batch + batch_predictions = pipeline.recognize(processed_images) + results.extend(batch_predictions) + + return results +``` + +--- + +## Testing OCR Functionality + +### Unit Tests + +```python +import os +import tempfile +import unittest + +import keras_ocr +from PIL import Image, ImageDraw + +class OCRTestCase(unittest.TestCase): + def setUp(self): + self.pipeline = keras_ocr.pipeline.Pipeline() + + def test_text_recognition(self): + # Create test image with text + img = Image.new('RGB', (200, 100), color='white') + draw = ImageDraw.Draw(img) + draw.text((10, 40), "Test Text", fill='black') + + # Save temporary file + with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as f: + img.save(f.name) + temp_path = f.name + + try: + # Perform OCR + image = keras_ocr.tools.read(temp_path) + predictions = self.pipeline.recognize([image]) + + # Assert results + self.assertTrue(len(predictions) > 0) + self.assertTrue(len(predictions[0]) > 0) + + finally: + os.unlink(temp_path) +``` + +--- + +## Future Enhancements + +### Potential Improvements + +1. **Multi-language Support** + - Implement language detection + - Use language-specific OCR models + +2. 
**Advanced Text Analysis** + - Sentiment analysis of extracted text + - Brand name detection + - Keyword extraction + +3. **Real-time Processing** + - WebSocket integration for real-time OCR + - Streaming video OCR support + +4. **Custom Model Training** + - Fine-tune OCR models for specific use cases + - Domain-specific text recognition + +--- + +## Troubleshooting + +### Common Issues + +1. **Model Loading Errors** + - Ensure sufficient disk space for model downloads + - Check internet connection for model downloads + - Verify CUDA installation for GPU acceleration + +2. **Memory Issues** + - Reduce batch size for large images + - Implement image downsampling + - Use CPU-only mode if GPU memory is insufficient + +3. **Accuracy Issues** + - Improve image quality before processing + - Use appropriate preprocessing techniques + - Consider model fine-tuning for specific domains + +### Debug Mode + +```python +import logging + +# Enable debug logging +logging.basicConfig(level=logging.DEBUG) + +# Run OCR; libraries that log through `logging` now emit debug output +predictions = pipeline.recognize(images) +``` + +--- + +## License and Attribution + +The OCR functionality uses Keras-OCR, which packages the CRAFT text detector and a CRNN text recognizer described in the following research papers: + +- "Character Region Awareness for Text Detection" by Baek et al. +- "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition" by Shi et al. + +Please refer to the Keras-OCR documentation for licensing information and proper attribution. \ No newline at end of file diff --git a/README.md b/README.md index d2dbee1..770bbff 100644 --- a/README.md +++ b/README.md @@ -1,78 +1,295 @@ -## Ad Ideas Machine +# AI-Powered Ad Idea Generator -# Motivation +## Overview -For businesses, advertisement is a crucial part of creating brand awareness and increasing sales. Every business — small, large, local, or a startup — desires to advertise its products to the targeted audience. 
A large fraction of these advertisements are on digital platforms, such as Facebook, Instagram, Websites, etc, where for effective advertising, businesses not only need to advertise frequently, but also need a lot of variations of their ads. Different ads leave a long lasting impression on the audience while seeing the same ad again and again can get boring. Thus, the advertisements demand creativity. Going into pandemic, the rate of shift to the digital world has only accelerated, creating a huge demand for creative advertising. +This is a Flask-based web application that generates advertising ideas using AI-powered image analysis and Unsplash image search. The application processes uploaded images, generates captions using Azure Computer Vision, creates creative ad ideas using OpenAI, and finds relevant images from Unsplash using CLIP-based semantic search. -Now, developing advertisements, particularly creative ones, can get very expensive. Small businesses don’t usually have dedicated staff for running ads. They end up hiring advertising agencies for their ad campaigns, and these agencies charge a lot of money. Alternatively, they don’t advertise and their business suffers. +## Features -Machine learning has transformed so many things around us: search, recommendations, predications, even music generation and creating real-looking paintings! Could we do something to reimagine the advertising industry as well? +- **Image Upload**: Support for local file uploads and image URLs +- **AI-Powered Analysis**: Azure Computer Vision for image captioning +- **Creative Idea Generation**: OpenAI integration for ad idea creation +- **Semantic Image Search**: CLIP-based Unsplash search for relevant images +- **Image Merging**: Automatic combination of generated ideas into visual layouts +- **OCR Support**: Text extraction from images using Keras-OCR +- **Web Interface**: User-friendly Flask web application -In this project, I wanted to explore exactly that. 
I wanted to explore the possibility of generating ad ideas for users for their ad campaigns. Specifically, I wanted to utilize existing state-of-the-art Machine Learning models in Computer Vision and Natural Language Processing, integrate them in a pipeline and finetune them to provide advertisement ideas for users. +## Quick Start +### Prerequisites -# Implementation Flow +- Python 3.7+ +- Azure Computer Vision API key +- OpenAI API key +- CUDA-compatible GPU (optional, for faster CLIP processing) -There isn’t a good dataset of advertisement images and their text descriptions, so developing an ad suggestion model using it seemed a long and arduous process. But engineering is all about simplifying the process and building stuff from what we know, what we have, and what we can innovate. I approached my project as following: +### Installation -1. I want to understand users requirements, so I’ll take some required advertisement related image input from users as reference. These could be existing ad posters users can find on Google. -2. Understand theme of advertisement from scene understanding of image, understand context from optical character recognition of taglines from image, and obtain a text description of the user requirement -3. In recent times, the capability of language models has evolved exponentially thanks to Transformer based models. Utilising this capability, particularly using a OpenAI’s GPT-3 API that I luckily got hold of as a beta tester, I would generate more ideas from the text description I generated in step 2. -4. After this step, ideally I wanted to generate images using another OpenAI model DALL-E that can generate some really cool images. However, the DALL-E API isn’t released to the public yet, so I decided to search a database of images from the ideas I generated in step 3. -5. Finally, score the best images obtained from search using another captioning based metric, and show the selected images to the user as ad ideas. +1. 
Clone the repository:
+```bash
+git clone
+cd
+```
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
-# The Pipeline
+3. Configure API keys:
+   - Set Azure Computer Vision credentials in `image_caption.py`
+   - Set OpenAI API key in `openai_completion.py`
-Now, I’ll start to delve into the technicalities of implementation. First, here’s a simplified block diagram of the Machine Learning system.
-![image](https://user-images.githubusercontent.com/62667772/111731412-a8fea300-8830-11eb-9778-1fcb3d24a10c.png)
+4. Download required datasets:
+   - Unsplash dataset with photo IDs and features
+   - Place in `unsplash-dataset/` directory
-I have five models in total in the Ad Ideas Machine ML system:
-1. Optical Character Recognition Model for extracting text from images
-2. Image Captioning Model for scene understanding
-3. GPT-3 API for generating ideas from image caption and OCR descriptions of the image
-4. Profanity Filter Model for filtering ideas generated from GPT-3 that may be inappropriate
-5. Unsplash Search Model for searching relevant images from unsplash.com image dataset
+5. Run the application:
+```bash
+python app.py
+```
-As described earlier, I generate a description of requirements from the image input that the user provides, using two computer vision models: Image Captioning and OCR. The text description output is fed into the GPT-3 API for “Idea completion” using its Babbage engine. It extends the text description into multiple dimensions beyond simple imagination. Now why do we have the profanity filter? It turns out that sometimes, though not frequently, the ideas generated by GPT-3 aren’t “appropriate.” So, I pass the output of the GPT-3 model to the Profanity Filter model, which analyses it, and only if the filter model greenlights the GPT-3 output is the generated idea used for image search. If not, GPT-3 is triggered again to generate more ideas, which are again passed through the Profanity Filter model.
+6. Open your browser and navigate to `http://localhost:5000`
-I ideally wanted to use the DALL-E API for ad ideas, and that would have been pretty cool, but in the absence of DALL-E’s API, I used a pre-trained Unsplash search model, which returns the image IDs of unsplash.com images that have the closest word description embedding to the GPT-3 ideas.
+## Usage
-Now comes the training part. Most of the models worked pretty well and/or I didn’t have any scope for finetuning. For example, I don’t have access to the model parameters for GPT-3; I can only use the API provided to me. Similarly, the Unsplash, OCR, and Profanity Filter models can’t be specifically improved for the task at hand.
+### Web Interface
-For image captioning, I first started with a model on GitHub, and I tried to create a small dataset (about 150 images) of ad images and text descriptions, and finetune the pretrained model on it. But this dataset was too small to make a difference on a model trained on the large MS-COCO dataset. Next, while standalone training, done in Google Colab, was fine, when I tried to incorporate the model into my pipeline, it ran into a lot of dependency issues with the other models. This image caption model is about a year old, and in the ML world, a year can be too old. So, I decided to switch gears and use another pretrained image captioning model from Microsoft, which was also accessed as an API. The code for the different models can be found in the GitHub repository of this project.
+1. **Upload Image**: Use the web interface to upload an image or provide an image URL
+2. **Generate Ideas**: Click "Generate Ideas" to process the image
+3. **View Results**: The application will return a merged image containing generated ad ideas
+### API Usage
-# User Interface
+```python
+import requests
-There’s a web of intricate models running in the background, but we have to keep them hidden so as not to overwhelm the user with all the complexities. So, for demo purposes, I decided to provide a simple yet elegant user interface using Python’s Flask library. It allowed me to create a website which people could use to interact with the model running in the backend. ![image](https://user-images.githubusercontent.com/62667772/111731566-f9760080-8830-11eb-8cfe-4116afa403dd.png)
+# Upload an image
+with open('product_image.jpg', 'rb') as f:
+    files = {'file': f}
+    response = requests.post('http://localhost:5000/upload', files=files)
-Here, users could either upload a reference image from their personal computer or provide the URL of an image they find interesting on the internet. Once they hit Upload or Go, the image is rendered on the screen. ![image](https://user-images.githubusercontent.com/62667772/111731576-0135a500-8831-11eb-9998-980037800184.png)
+# Generate ideas
+data = {'image': 'input-20231201-143022.jpg'}
+response = requests.post('http://localhost:5000/ideas', data=data)
+```
-Afterwards, when they hit Generate Ideas, the model processes the image in the backend to generate ideation images, concatenates them into a single image, and uploads it to the web page to display to the users. ![image](https://user-images.githubusercontent.com/62667772/111731592-08f54980-8831-11eb-9486-32c42d5e5759.png)
+### Programmatic Usage
-Showing multiple images in a fancier format was too complicated for Flask, which is why I decided to concatenate the images in the backend before displaying them to the user.
+```python
+from ml import GenerateIdeas
+from unsplash1 import Unsplash1
+from merge_images import MergeImages
+# Initialize components
+worker = GenerateIdeas(filename="input_image.jpg", results_count=3)
+unsplash = Unsplash1()
+merger = MergeImages()
-# Results
+# Generate ideas
+prompts = worker.run()
-The ML system provided interesting suggestions for many images. They varied from simple representations to very intricate images. ![image](https://user-images.githubusercontent.com/62667772/111731620-17436580-8831-11eb-8863-3904fb57a93d.png)
+# Search for relevant images
+photo_ids = unsplash.run(prompts, results_count=3)
-At the same time, the quality of these images is low (ignoring the image scaling issue, which arises from resizing all Unsplash output images to the same size for concatenation), and this happens for two reasons:
-1. Unsplash image dataset: The Unsplash image dataset contains images from the Unsplash website. These are user-captured images; they are raw and unpolished. They were not intended for advertisement purposes per se.
-In addition, the Unsplash ML model does a bag-of-words search, which loses all the semantics and context of the image caption and ideas generated by GPT-3, thus resulting in much lower quality images.
-This brings back my original point that having the DALL-E API would have been pretty neat. One, it captures the semantics and context of the text description, and two, it “generates” images rather than searching existing ones. But for now, Unsplash search was the best I could do.
-2. Image captioning: A major bottleneck of my ML system is how well the image captioning model works on the user-provided images. Sometimes, it fully disappoints, and in those cases, the rest of the models in the pipeline can’t do anything. For example, for the image below, the image captioning model output the caption “diagram.” That’s all.
-To overcome this limitation, a nice add-on would be to take some text inputs from the user as well, which would be strictly considered in the final image search. However, I decided against doing it for the demo so as not to make it too complicated.
+# Download and merge images
+image_files = []
+for idx, photo_id in enumerate(photo_ids):
+    filename = f"idea_{idx}.jpg"
+    unsplash.save_photo(photo_id, filename)
+    image_files.append(filename)
-In conclusion, I developed an ML system that provided advertisement ideas from minimal input from users. It was an end-to-end system consisting of different computer vision models (OCR, Image Captioning) and natural language processing models (GPT-3, Profanity Filter). It was a super fun project with a moderate amount of success, fitting for the complexity of the problem and lack of relevant datasets and models.
+# Create final output
+merger.horizontal(image_files, "final_ideas.jpg")
+```
+## Architecture
-# References
+### Core Components
-1. [hasnainroopawalla/Image-Captioning-Scene-Descriptor: A CNN-LSTM model to generate image captions](https://github.com/hasnainroopawalla/Image-Captioning-Scene-Descriptor)
-2. [Quickstart: Computer Vision client library - Azure Cognitive Services](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts-sdk/client-library?tabs=cli&pivots=programming-language-python)
-3. [faustomorales/keras-ocr: A packaged and flexible version of the CRAFT text detector and Keras CRNN recognition model](https://github.com/faustomorales/keras-ocr#egg=keras-ocr)
-4. [profanity-check · PyPI](https://pypi.org/project/profanity-check/)
-5. [haltakov/natural-language-image-search: Search photos on Unsplash using natural language](https://github.com/haltakov/natural-language-image-search)
-6. [OpenAI GPT-3 Idea Generation](https://beta.openai.com/docs/examples/idea-generation)
+- **`app.py`**: Main Flask application with REST API endpoints
+- **`ml.py`**: Core ML orchestration and idea generation
+- **`unsplash1.py`**: Advanced Unsplash search using CLIP model
+- **`unsplash3.py`**: Simplified Unsplash search implementation
+- **`merge_images.py`**: Image merging utilities
+- **`image_caption.py`**: Azure Computer Vision integration
+- **`openai_completion.py`**: OpenAI integration
+- **`ocr.py`**: OCR functionality using Keras-OCR
+
+### Data Flow
+
+1. **Image Upload** → Image validation and storage
+2. **Caption Generation** → Azure Computer Vision analysis
+3. **Idea Generation** → OpenAI creative suggestions
+4. **Image Search** → CLIP-based Unsplash search
+5. **Image Download** → Unsplash photo retrieval
+6. **Image Merging** → Layout creation and output
+
+## API Documentation
+
+For comprehensive API documentation, see:
+- [API Documentation](API_DOCUMENTATION.md) - Complete API reference
+- [OCR Documentation](OCR_DOCUMENTATION.md) - OCR functionality details
+
+## Configuration
+
+### Environment Variables
+
+```bash
+export AZURE_VISION_KEY="your_azure_key"
+export AZURE_VISION_ENDPOINT="your_azure_endpoint"
+export OPENAI_API_KEY="your_openai_key"
+```
+
+### File Structure
+
+```
+project/
+├── app.py                 # Main Flask application
+├── ml.py                  # Core ML orchestration
+├── unsplash1.py           # Advanced Unsplash search
+├── unsplash3.py           # Simple Unsplash search
+├── merge_images.py        # Image merging utilities
+├── image_caption.py       # Azure Computer Vision
+├── openai_completion.py   # OpenAI integration
+├── ocr.py                 # OCR functionality
+├── requirements.txt       # Python dependencies
+├── API_DOCUMENTATION.md   # Complete API documentation
+├── OCR_DOCUMENTATION.md   # OCR component documentation
+├── static/
+│   └── images/            # Generated images
+└── templates/             # HTML templates
+```
+
+## Dependencies
+
+### Core Dependencies
+
+- **Flask**: Web framework
+- **Pillow**: Image processing
+- **OpenCV**: Computer vision
+- **Keras-OCR**: Text recognition
+- **CLIP**: Semantic image search
+- **PyTorch**: Deep learning framework
+- **Azure Computer Vision**: Image analysis
+- **OpenAI**: Creative text generation
+
+### Installation
+
+```bash
+pip install -r requirements.txt
+```
+
+## Performance Considerations
+
+### Optimization Tips
+
+1. **GPU Usage**: Enable CUDA for faster CLIP processing
+2. **Batch Processing**: Process multiple images in sequence
+3. **Caching**: Cache generated captions and ideas
+4. **Image Resizing**: Resize large images before processing
+5. **API Limits**: Respect rate limits for external APIs
+
+### Resource Requirements
+
+- **RAM**: Minimum 4GB, Recommended 8GB+
+- **GPU**: Optional but recommended for CLIP processing
+- **Storage**: 2GB+ for datasets and generated images
+- **Network**: Stable internet for API calls
+
+## Error Handling
+
+### Common Issues
+
+1. **API Key Errors**: Ensure Azure and OpenAI API keys are properly configured
+2. **Image Processing Errors**: Verify image format and file permissions
+3. **CLIP Model Loading**: Ensure sufficient RAM/VRAM for model loading
+4. **Unsplash Dataset**: Verify dataset files are in the correct location
+
+### Status Codes
+
+- `200`: Success
+- `400`: Bad Request (invalid input)
+- `500`: Internal Server Error
+
+## Contributing
+
+When contributing to this project:
+
+1. Follow the existing code structure
+2. Add proper error handling
+3. Include docstrings for new functions
+4. Update documentation for new features
+5. Test with various image formats and sizes
+
+## Testing
+
+### Unit Tests
+
+```bash
+python -m pytest tests/
+```
+
+### Integration Tests
+
+```bash
+python test_integration.py
+```
+
+## Deployment
+
+### Local Development
+
+```bash
+python app.py
+```
+
+### Production Deployment
+
+```bash
+gunicorn -w 4 -b 0.0.0.0:5000 app:app
+```
+
+### Docker Deployment
+
+```dockerfile
+FROM python:3.9-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install -r requirements.txt
+COPY . .
+EXPOSE 5000
+CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "app:app"]
+```
+
+## License
+
+This project is licensed under the MIT License. See the LICENSE file for details.
+
+## Acknowledgments
+
+- **Azure Computer Vision**: Image analysis and captioning
+- **OpenAI**: Creative text generation
+- **Unsplash**: High-quality stock photography
+- **CLIP**: Semantic image understanding
+- **Keras-OCR**: Text recognition capabilities
+
+## Support
+
+For issues and questions:
+
+1. Check the [API Documentation](API_DOCUMENTATION.md)
+2. Review the [OCR Documentation](OCR_DOCUMENTATION.md)
+3. Open an issue on the project repository
+4. Contact the development team
+
+## Changelog
+
+### Version 1.0.0
+- Initial release with core functionality
+- Flask web application
+- Azure Computer Vision integration
+- OpenAI idea generation
+- Unsplash image search
+- Image merging capabilities
+- OCR functionality
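
Reviewer note on the Configuration section of this diff: the three environment variables it lists could be validated once at startup, along the lines of the sketch below. This is an illustration under assumptions, not the project's code. The helper name `load_api_config` is hypothetical, and the diff does not show how `image_caption.py` or `openai_completion.py` actually read their credentials.

```python
import os

# Variable names are taken from the Configuration section of the diff.
# How the project's modules consume them is an assumption.
REQUIRED_VARS = ("AZURE_VISION_KEY", "AZURE_VISION_ENDPOINT", "OPENAI_API_KEY")


def load_api_config(environ=None):
    """Return the required settings as a dict, or raise if any is missing.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling such a helper before the Flask app starts serving would surface a missing key as one clear error at launch rather than as a failed request later.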