-> Documentation was generated by AI as per prompt: This project is a custom RPA developed to show how RPA can be used even to play games. It uses Selenium libraries, OpenCV, Numpy. Some the limitatons are:
- the images used to match (find needle) must be same resolution which the game is played
- any changes to the UI will make the RPA to fail as it's looking for positions on the screen.
- I've used on the past screenscraper techniques for mainframe or applications which didn't have APIs for interaction, all they had was the UI. Generate a comprehensive readme file for this project demonstrating the benefits and issues/limitations
A Python-based Robotic Process Automation (RPA) framework demonstrating computer vision-driven automation for applications without API access
This framework showcases Robotic Process Automation (RPA) techniques that have been fundamental to enterprise automation for decades. By combining computer vision with browser automation, it demonstrates how to automate applications that lack programmatic interfaces β a common challenge in legacy system integration.
Key Demonstration: Just as enterprises have automated mainframe "green screens" and legacy applications through visual automation, this framework applies the same proven techniques to browser-based applications, proving that RPA principles are universally applicable across any visual interface.
In the real world, not every application has an API:
- Legacy Systems: Mainframe applications from the 1980s-90s still run critical business processes
- Third-Party Applications: Vendor software without automation interfaces
- Dynamic Web Apps: Canvas-based interfaces where traditional DOM selectors fail
- Rapid Prototyping: Faster than waiting for official API development
This framework demonstrates the foundational techniques that power commercial RPA platforms like UiPath, Automation Anywhere, and Blue Prism.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AUTOMATION LAYER β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β Browser β β Computer β β GUI β β
β β Controller β β Vision β β Automation β β
β β (Selenium) β β (OpenCV) β β (PyAutoGUI) β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PROCESSING LAYER β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Template Matching & Image Analysis β β
β β (OpenCV + NumPy) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DECISION LAYER β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Workflow Engine & State Management β β
β β (Pattern Matching, Decision Trees) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β MONITORING LAYER β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Logging, Error Handling & Debug Output β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
| Component | Technology | Purpose |
|---|---|---|
| Core Language | Python 3.7+ | Framework foundation |
| Computer Vision | OpenCV (cv2) | Template matching & image processing |
| Browser Automation | Selenium WebDriver | Web navigation & interaction |
| GUI Automation | PyAutoGUI | Cross-platform mouse/keyboard control |
| Image Processing | NumPy | Array operations & numerical processing |
| Image Handling | Pillow (PIL) | Image format conversion |
| Logging | Python logging | Activity tracking & debugging |
- Legacy System Integration: Automate mainframe terminals, AS/400, green screens
- Third-Party Software: Interact with vendor applications lacking APIs
- Desktop Application Testing: QA automation for GUI applications
- Report Generation: Extract data from visual interfaces
- Canvas-Based Dashboards: Automate Power BI, Tableau, or custom data visualizations
- Dynamic Web Apps: Handle applications where DOM selectors are unreliable
- Visual Validation: Verify UI rendering in automated tests
- Cross-Platform Automation: Work across web, desktop, and hybrid applications
- Understanding computer vision fundamentals
- Learning browser automation patterns
- Implementing state machines and decision logic
- Practicing event-driven programming
- Python 3.7 or higher
- Google Chrome browser
- ChromeDriver (matching your Chrome version)
- 2GB RAM minimum (4GB recommended)
- Clone the repository
git clone <repository-url>
cd visual-automation-framework- Install dependencies
pip install -r requirements.txtrequirements.txt:
opencv-python>=4.5.0
numpy>=1.19.0
selenium>=4.0.0
pillow>=8.0.0
pyautogui>=0.9.50
-
Download ChromeDriver
- Visit: https://chromedriver.chromium.org/
- Download version matching your Chrome browser
- Place
chromedriver.exein project root or add to PATH
-
Configure Settings
Update config.yaml (or create it):
browser:
user_data_dir: "C:\\Users\\<YOUR_USER>\\AppData\\Local\\Google\\Chrome\\User Data"
window_width: 1920
window_height: 1080
automation:
screenshot_delay: 0.5
click_duration: 0.3
confidence_threshold: 0.7
logging:
level: INFO
file: automation.logfrom src.browser import Browser
from src.vision import Vision
from src.bot import AutomationBot
# Initialize components
browser = Browser(url="https://example.com", width=1920, height=1080)
vision = Vision(template_folder="templates/")
bot = AutomationBot(browser, vision)
# Run automation workflow
bot.execute_workflow()from src.vision import Vision
import cv2
# Initialize vision module
vision = Vision()
# Find single element
location = vision.find("button_submit.png", confidence=0.8)
if location:
print(f"Found at: {location}")
# Find multiple elements
locations = vision.find_multiple("icon_notification.png", confidence=0.7)
print(f"Found {len(locations)} instances")import time
def execute_with_retry(action, max_attempts=3, delay=2):
"""Execute action with retry logic"""
for attempt in range(max_attempts):
try:
result = action()
if result:
return result
except Exception as e:
logging.warning(f"Attempt {attempt + 1} failed: {e}")
time.sleep(delay)
return None- H: Stop automation and exit
- P: Pause automation
- O: Resume automation
# Selenium captures full browser content
screenshot = browser.driver.get_screenshot_as_png()
image = Image.open(io.BytesIO(screenshot))# OpenCV searches for visual patterns
result = cv2.matchTemplate(screenshot, template, cv2.TM_CCOEFF_NORMED)
locations = np.where(result >= confidence_threshold)# Convert image coordinates to screen coordinates
screen_x = window_x + image_x
screen_y = window_y + image_y# PyAutoGUI performs the interaction
pyautogui.moveTo(screen_x, screen_y, duration=0.3)
pyautogui.click()Works with any application that has a visual interface, regardless of whether APIs exist.
Based on 40+ years of screen scraping methodology used in mission-critical business automation.
Same approach works for:
- Web applications (via Selenium)
- Desktop applications (via PyAutoGUI)
- Terminal/mainframe interfaces
- Virtual desktop environments (Citrix, RDP)
- No reverse engineering required
- No protocol analysis needed
- Visual debugging with screenshot analysis
- Quick proof-of-concept creation
Teaches fundamental concepts:
- Computer vision basics (template matching, image processing)
- Browser automation patterns
- State machine design
- Event-driven architecture
- Error handling strategies
The same techniques power:
- UiPath: Commercial RPA platform ($7B+ valuation)
- Automation Anywhere: Enterprise automation leader
- Blue Prism: Intelligent automation platform
- Legacy integrations: Still used extensively in Fortune 500 companies
Problem: Template images must match the exact display resolution.
Original Resolution: 1920Γ1080
Template Created: 1920Γ1080 β
Display Changes To: 2560Γ1440 β BREAKS
Impact: Changing resolution, zoom level, or DPI scaling breaks all detection.
Mitigations:
- Create template sets for multiple resolutions
- Use scale-invariant features (SIFT, SURF, ORB)
- Implement multi-scale template matching
- Store templates at multiple sizes
Problem: Any visual update breaks automation.
Examples:
- Button position changes
- Color scheme updates
- Font changes
- Icon redesigns
- Seasonal themes
Impact: Requires constant template recapture and testing.
Mitigations:
- Use OCR for text-based elements (Tesseract)
- Create multiple template variants
- Implement fuzzy matching
- Combine with DOM selectors where possible
Bottlenecks:
- Screenshot capture: 50-200ms per frame
- Template matching: 100-500ms per template per screenshot
- Mouse movements: 2-5 seconds with safety delays
- Total cycle time: 5-10 seconds per action
Impact: Much slower than API-based automation or human interaction.
Optimizations:
- Cache screenshots when possible
- Use region-of-interest (ROI) cropping
- Implement parallel template matching
- Optimize confidence thresholds
False Positives:
# Similar buttons may match incorrectly
"Save" button matches "Save As" button (visual similarity)False Negatives:
# Variations break matching
- Hover state (different color)
- Loading animations
- Transparency effects
- Shadows or gradientsTiming Problems:
- Network latency causes UI delays
- Hardcoded
sleep()calls are brittle - Race conditions during page loads
Mitigations:
- Implement smart waiting (wait for specific elements)
- Use multiple confidence thresholds
- Add context-aware matching (check surrounding elements)
- Implement retry logic with exponential backoff
Static Logic:
- Cannot adapt to unexpected UI states
- Follows predefined decision trees only
- No learning from failures
Poor Error Recovery:
# Example: Gets stuck if unexpected popup appears
if not find_button("ok"):
# No fallback strategy defined
# Automation hangs indefinitelySolutions:
- Implement state recovery mechanisms
- Add timeout-based failsafes
- Use ML models for adaptive recognition (future enhancement)
Setup Requirements:
β Chrome profile paths (OS-specific)
β Game/app credentials
β Template image creation (manual, tedious)
β Coordinate calibration per screen
β Confidence threshold tuning per template
β ChromeDriver version matchingImpact: High barrier to entry; difficult for non-technical users.
When using for automation of online services:
β οΈ May violate Terms of Serviceβ οΈ Could be considered unauthorized accessβ οΈ Risk of account suspension/banningβ οΈ Potential legal consequences
Use responsibly: Only automate applications you own or have explicit permission to automate.
- Windows-centric: PyAutoGUI behavior varies across operating systems
- Browser-specific: Current implementation only supports Chrome
- Single-threaded: Cannot run multiple automation instances easily
# β οΈ Current implementation issues:
username = "admin" # Plain text in code
password = "P@ssw0rd" # No encryption
browser_profile = "Default" # Full access to user dataRisks:
- Credentials exposed in source code
- Requires access to user's browser profile
- No secure credential storage
- Potential data exposure
Required Maintenance:
- Weekly: Check for UI changes
- Monthly: Update templates
- Quarterly: Adjust logic for new features
- Annually: Major refactoring for big updates
Cost: Can exceed initial development time significantly.
Before modern APIs, enterprises faced a critical challenge: How do you automate systems that only have visual interfaces?
βββββββββββββββββββββββββββββββββββ
β CUSTOMER RECORDS SYSTEM (CRS) β β Critical business application
β IBM Mainframe - Green Screen β β No API, no database access
β 3270 Terminal Protocol β β Only keyboard/display interface
βββββββββββββββββββββββββββββββββββ
Characteristics of legacy systems:
- Fixed-width text screens (80Γ24 characters)
- No mouse support (keyboard only)
- Position-based data (row 5, column 10 = customer name)
- No automation interface (human operators required)
Companies developed tools to:
- Capture terminal screens β Read text buffer
- Parse fixed positions β Extract data from known coordinates
- Identify fields β Recognize labels and values
- Automate keyboard entry β Send commands programmatically
- Extract for reporting β Export data to modern systems
Example Workflow:
Human Process:
1. Press F3 to access customer screen
2. Type customer ID at row 7, col 15
3. Press ENTER
4. Read name from row 9, col 20
5. Copy to Excel spreadsheet
6. Repeat for 1,000 customers Γ 8 hours = 3 days
Automated Process:
1. Script sends F3 key
2. Script types customer ID
3. Script sends ENTER
4. Script reads screen buffer position
5. Script writes to database
6. Complete 1,000 customers in 2 hours
Scenario: A bank needed to integrate a 1985 mainframe loan system with a new 2015 web portal.
Options:
- β Replace mainframe (cost: $50M, time: 3 years, risk: HIGH)
- β Develop API for legacy system (cost: $5M, time: 18 months)
- β Screen scraping integration (cost: $200K, time: 3 months)
Implementation:
# Pseudo-code for mainframe scraping
def get_customer_loan_balance(customer_id):
# Connect to 3270 emulator
terminal = connect_to_mainframe()
# Navigate using keyboard commands
terminal.send_key("F3") # Access loans menu
terminal.wait_for_screen("LOAN SYSTEM MAIN")
# Enter customer ID
terminal.move_cursor(7, 15)
terminal.type_text(customer_id)
terminal.send_key("ENTER")
# Read result from fixed position
terminal.wait_for_screen("CUSTOMER DETAILS")
loan_balance = terminal.read_position(12, 30, length=10)
return float(loan_balance)- 43% of banking systems run on COBOL (Reuter's 2017 survey)
- Average age of core enterprise systems: 12+ years
- Government agencies run software from the 1970s-80s
- Insurance companies still use green-screen mainframes
Scenario: Your company uses vendor software
Option A: Request API from vendor
βββ Response: "Not in roadmap"
βββ Timeline: 18-24 months (maybe)
βββ Cost: $$$$ enterprise licensing
Option B: Screen scraping
βββ Response: Immediate
βββ Timeline: 2-4 weeks
βββ Cost: Development time only
Modern use cases:
- Citrix/RDP Environments: Virtual desktops with no API access
- Third-Party SaaS: Vendors who won't provide APIs
- Legacy Desktop Apps: 20-year-old applications still in production
- Visual Testing: Verifying that UI renders correctly
- Capture at Target Resolution
# Game/app running at 1920Γ1080
# Screenshot must also be 1920Γ1080- Crop Precisely
# Include only the target element
# Too large = false positives
# Too small = false negatives- Save with Transparency
# PNG format with alpha channel preferred
# Helps with varying backgrounds- Test Multiple Thresholds
for confidence in [0.5, 0.6, 0.7, 0.8, 0.9]:
result = vision.find("button.png", confidence)
print(f"Confidence {confidence}: {result}")images/
βββ buttons/
β βββ ok.png
β βββ cancel.png
β βββ submit.png
βββ icons/
β βββ notification.png
β βββ settings.png
βββ indicators/
β βββ loading.png
β βββ complete.png
βββ fallbacks/
βββ ok_hover.png
βββ ok_disabled.png
# Strict matching (fewer false positives)
result = vision.find("critical_button.png", confidence=0.9)
# Lenient matching (catches variations)
result = vision.find("icon.png", confidence=0.6)
# Context-aware matching
button = vision.find("ok.png", confidence=0.7)
if button and vision.find_nearby("dialog_title.png", button, radius=100):
# Confirmed it's the right "OK" button
click(button)def smart_wait(template, timeout=30, poll_interval=0.5):
"""Wait for element to appear with timeout"""
start_time = time.time()
while time.time() - start_time < timeout:
location = vision.find(template)
if location:
return location
time.sleep(poll_interval)
raise TimeoutError(f"Element {template} not found after {timeout}s")- External Configuration: Move credentials to
config.yaml - Multi-Resolution Support: Scale templates automatically
- OCR Integration: Use Tesseract for text-based element detection
- Better Logging: Structured logging with JSON output
- Unit Tests: Test coverage for vision and browser modules
- Scale-Invariant Matching: Implement SIFT/SURF/ORB
- Machine Learning: Train models to recognize UI patterns
- Error Recovery: Automatic retry with fallback strategies
- Performance Optimization: Parallel template matching
- Dashboard: Web-based monitoring and control interface
- Cross-Platform Support: Native Linux/macOS support
- Cloud Deployment: Run headless in containers
- Visual Flow Builder: Drag-and-drop workflow design
- Adaptive Learning: Improve from historical successes/failures
- API Integration: Hybrid approach (API-first, vision fallback)
- Templates load correctly
- Screenshot capture works at target resolution
- Template matching finds elements with >90% accuracy
- Click coordinates are accurate (Β±5 pixels)
- Error handling triggers on missing elements
- Logging captures all significant events
# tests/test_vision.py
import unittest
from src.vision import Vision
class TestVision(unittest.TestCase):
def setUp(self):
self.vision = Vision()
def test_template_matching(self):
"""Test template matching accuracy"""
screenshot = cv2.imread("test_data/screenshot.png")
template = cv2.imread("test_data/button.png")
result = self.vision.find(screenshot, template, confidence=0.8)
self.assertIsNotNone(result)
self.assertTrue(0 <= result[0] <= screenshot.shape[1])- UiPath Academy (Free RPA training)
- Automation Anywhere University
- RPA Wikipedia
Contributions are welcome! Areas for improvement:
- Multi-resolution template support
- Better error handling and recovery
- Performance optimizations
- Documentation improvements
- Support for additional browsers (Firefox, Edge)
- OCR integration for text-based detection
- Machine learning-based element recognition
- Configuration GUI for non-technical users
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
FOR EDUCATIONAL PURPOSES ONLY
This project demonstrates automation techniques used in enterprise software development. It is intended for:
- β Learning computer vision and automation concepts
- β Experimenting with your own applications
- β Understanding legacy system integration patterns
- β Educational research and skill development
NOT intended for:
- β Violating Terms of Service of any application
- β Gaining unfair advantages in competitive environments
- β Accessing systems without authorization
- β Any activity that could be considered unethical or illegal
- Terms of Service: Automating online services may violate their ToS
- Account Risk: Could result in account suspension or permanent ban
- Legal Consequences: Unauthorized automation may have legal ramifications
- Ethical Considerations: Automation that harms others is not acceptable
The authors assume NO responsibility for consequences arising from the use of this software. Use at your own risk and only for lawful purposes.
This project is licensed under the MIT License - see the LICENSE file for details.
- β Commercial use allowed
- β Modification allowed
- β Distribution allowed
- β Private use allowed
β οΈ No warranty providedβ οΈ No liability accepted
- OpenCV Community - For the powerful computer vision library
- Selenium Project - For enabling browser automation
- PyAutoGUI Contributors - For cross-platform GUI automation
- Legacy System Pioneers - Who developed screen scraping techniques in the 1980s
- RPA Industry Leaders - UiPath, Automation Anywhere, Blue Prism for validating these approaches
Remember: This project represents 40+ years of enterprise automation history, demonstrating techniques that remain relevant today for legacy system integration and automation scenarios where APIs are unavailable. Use responsibly and ethically.