The GS Crawler system consists of Docker containers that automatically crawl various websites and data sources in the Goslar region and publish the results as JSON files.
# Local development
./scripts/dev.sh setup # First-time setup
./scripts/dev.sh up # Start containers
./scripts/dev.sh logs # View logs
# Health Monitor: http://localhost:5015

All crawlers are defined in crawlers.yaml, the single source of truth.
To add or modify a crawler:
- Edit crawlers.yaml
- Run ./scripts/generate-all.sh
- Commit the changes
Total Crawlers: 27 (18 custom containers, 9 config-driven)
System monitoring and management
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 000_health_monitor | Health Monitor | flask_monitor | Always running | - |
Local news sources and media outlets
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 001_senioren | Seniorenzeitung Goslar | XML Feed | Daily 02:00 | 001_senioren_feed.xml |
| 002_gz | Goslarsche Zeitung | News Crawler | Hourly | 002_goslarsche.json, 002_goslarsche-alle.json |
| 040_hp | Harzer Panorama | News Crawler | Twice daily (02:00, 14:00) | 040_hp.json |
Event calendars and activities
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 002_ferienpass | Ferienpass Events | JSON API | Twice daily (02:00, 14:00) | 002_fepa_events.json |
| 014_kunst_in_ar | Kunst in AR | Event Crawler | Daily 08:00 | 017-kunst-in-ar-single.json |
| 019_was_app | WasApp Community | Community Feed | Every 3 minutes | 019_was_app.json |
| 027_erster_freitag | Erster Freitag Events | Event Crawler | Daily 09:00 | 027-erster-freitag.json |
| 070_wochenmarkt | Wochenmarkt Goslar | Market Crawler | Every 2 hours | 070_wochenmarkt_card.json, 070_wochenmarkt_alle... |
Village and neighborhood news
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 041_immenrode | Immenrode News | Local News | Twice daily (02:00, 14:00) | 041_immenrode.json |
| 044_wiedelah | Wiedelah Events | Community Events | Twice daily (02:00, 14:00) | 044-wiedelah.json, 044-wiedelah_alle.json |
| 048_jerstedt | Jerstedt News | Local News | Twice daily (02:00, 14:00) | 048_jerstedt.json |
Community organizations and volunteer opportunities
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 031_goslarer_geschichten | Goslarer Geschichten | Forum Crawler | Daily 09:00 | 031-goslarer_geschichten.json |
| 042_freiwilligen | Freiwilligenagentur | Volunteer Portal | Twice daily (02:00, 14:00) | 042-freiwilligenagentur.json, 042-freiwilligena... |
Weather, water, and environmental data
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 035_talsperren | Talsperren Daten | Environmental Monitor | Hourly | 035-talsperren_alle.json |
| 045_naturgefahren | Naturgefahren Monitor | Weather Alert | Every 15 minutes | 045_naturgefahren_de.json |
| 047_bodenwasser | Bodenwasser Monitor | Environmental Monitor | Twice daily (02:00, 14:00) | 047_bodenwasser.json, 047_bodenwasser.gif |
Schools, courses, and educational opportunities
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 050_tschuessschule_studium | TschüssSchule Studium | Education Portal | Daily 06:00 | 050-tschuessschule-studium.json, 050-tschuesssc... |
| 051_vhs | VHS Kurse | Education Portal | Daily 09:00 | 051_vhs.json, 051_vhs-alle.json |
| 052_vhs_kinderuni | VHS Kinderuni | Education Portal | Daily 09:00 | 052_vhs_kinderuni.json, 052_vhs_kinderuni_alle.... |
| 053_tschuessschule_praktikum | TschüssSchule Praktikum | Education Portal | Daily 06:00 | 053-tschuessschule-praktikum.json, 053-tschuess... |
| 054_tschuessschule_ausbildung | TschüssSchule Ausbildung | Education Portal | Daily 06:00 | 054-tschuessschule-ausbildung.json, 054-tschues... |
Unique data sources requiring custom processing
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 032_webcams_goslar | Webcams Goslar | Webcam Processor | Daily 09:00 | 032_webcams_goslar.json, 032_webcams.gif |
| 033_goslar24-7 | Goslar24-7 Webcams | Webcam Processor | Hourly | 033_gif_schuhhof.json, 033_gif_schuhhof.gif, 03... |
| 056_serviceportal | Serviceportal Goslar | Service Portal | Daily 09:00 | 056-serviceportal.json, 056-serviceportal-alle.... |
API endpoints and services
| ID | Name | Type | Schedule | Output Files |
|---|---|---|---|---|
| 068_altstadtfest | Altstadtfest Goslar | API Endpoint | API Endpoint | - |
| 072_karriere | Karriere Goslar | Job Portal | Every 2 hours | 072_karriere_card.json |
crawlers.yaml <- Single source of truth
|
+-> Health Monitor <- Reads crawler definitions at runtime
+-> compose.yaml <- Generated via ./scripts/generate-compose.py
+-> compose.dev.yaml <- Generated via ./scripts/generate-compose.py
+-> README.md tables <- Generated via ./scripts/generate-readme.py
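
The README-generation step in the diagram above can be pictured with a minimal sketch. It is not the actual generate-readme.py: the function name `render_table`, the field names (`id`, `name`, `type`, `schedule`, `outputs`), and the sample rows are assumptions chosen for illustration, and the real crawlers.yaml schema may differ. The definitions are inlined as dictionaries so the example runs with the standard library alone.

```python
from typing import Iterable, Mapping

COLUMNS = ("ID", "Name", "Type", "Schedule", "Output Files")


def render_table(crawlers: Iterable[Mapping[str, object]]) -> str:
    """Render crawler definitions as a Markdown table with the README's columns."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "---|" * len(COLUMNS),
    ]
    for crawler in crawlers:
        # Hypothetical field names; adjust to the real crawlers.yaml schema.
        outputs = ", ".join(crawler.get("outputs", [])) or "-"
        row = (crawler["id"], crawler["name"], crawler["type"],
               crawler["schedule"], outputs)
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Illustrative sample data, not read from the real configuration.
example = [
    {"id": "000_health_monitor", "name": "Health Monitor",
     "type": "flask_monitor", "schedule": "Always running", "outputs": []},
    {"id": "072_karriere", "name": "Karriere Goslar", "type": "Job Portal",
     "schedule": "Every 2 hours", "outputs": ["072_karriere_card.json"]},
]

if __name__ == "__main__":
    print(render_table(example))
```

Keeping the rendering in one pure function means the same definitions could feed both the README tables and any other generated artifact without duplicating the field handling.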
| Type | Location | Description |
|---|---|---|
| custom | docker_instances/XXX_name/ | Full Dockerfile + custom Python script |
| config | crawler_configs/simple/ | YAML config for generic scraper |
| config | crawler_configs/tschuessschule/ | YAML config for nested scraper |
See CONTRIBUTING.md for detailed setup instructions.
./scripts/dev.sh setup # Build base images locally
./scripts/dev.sh up # Start all containers
./scripts/dev.sh down # Stop all containers
./scripts/dev.sh logs # Follow logs
./scripts/dev.sh test # Run scraper tests
./scripts/generate-all.sh # Regenerate compose files and README

| Port | Service |
|---|---|
| 5015 | Health Monitor Dashboard |
| 5016 | Altstadtfest API |
| 5017 | Karriere Goslar |
All crawlers write to httpdocs/crawler/ (mounted as /app/output in containers).
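
As a rough illustration of that mount, a custom crawler script might write its result like the sketch below. The helper name `write_output` and the `CRAWLER_OUTPUT_DIR` environment override are hypothetical and not taken from this repository; only the `/app/output` default comes from the sentence above. The write goes through a temporary file so that a consumer reading httpdocs/crawler/ never sees a half-written JSON file.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_output(filename: str, payload: object,
                 output_dir: Optional[str] = None) -> Path:
    """Write payload as JSON into the crawler output directory."""
    # /app/output is the in-container mount point; CRAWLER_OUTPUT_DIR is a
    # hypothetical override for running the script outside a container.
    target_dir = Path(output_dir or os.environ.get("CRAWLER_OUTPUT_DIR", "/app/output"))
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    tmp = target.with_name(target.name + ".tmp")
    # Write next to the target, then rename, so the swap is a single step.
    with tmp.open("w", encoding="utf-8") as handle:
        json.dump(payload, handle, ensure_ascii=False, indent=2)
    os.replace(tmp, target)
    return target


if __name__ == "__main__":
    # Demo against a throwaway directory instead of the real mount.
    with tempfile.TemporaryDirectory() as demo_dir:
        path = write_output("040_hp.json", [{"title": "Example"}], output_dir=demo_dir)
        print(path.read_text(encoding="utf-8"))
```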
- Build Test: Validates all containers on push/PR
- Daily Health Check: Runs containers daily and validates output
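
The checks the Daily Health Check actually performs are not listed here. As a sketch of what "validates output" could mean, the hypothetical helper below reports a file as unhealthy when it is missing, older than an allowed age, or (for .json files) not parseable. The function name, the age threshold, and the message format are all assumptions.

```python
import json
import time
from pathlib import Path
from typing import List, Optional


def check_output(path: Path, max_age_seconds: float,
                 now: Optional[float] = None) -> List[str]:
    """Return the problems found for one output file; an empty list means healthy."""
    if not path.is_file():
        return [f"{path.name}: missing"]
    problems = []
    # Compare the file's modification time against the allowed age.
    age = (time.time() if now is None else now) - path.stat().st_mtime
    if age > max_age_seconds:
        problems.append(f"{path.name}: stale ({age:.0f}s old)")
    if path.suffix == ".json":
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            problems.append(f"{path.name}: invalid JSON ({exc})")
    return problems
```

A caller could choose `max_age_seconds` per crawler from its schedule, for example allowing a few hours for an hourly crawler and more than a day for a daily one.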
Configuration: crawlers.yaml
Contributing: See CONTRIBUTING.md