The WaterCrawl plugin package is a Python library that provides a base for creating plugins for the WaterCrawl web crawling framework. It offers abstract classes and interfaces that standardize plugin development, with support for input validation, pipeline processing, and middleware integration. Its main features are:
- Abstract base classes for plugin development
- JSON Schema-based input validation
- Pipeline processing support
- Spider and Downloader middleware integration
- Cached property utilities
- Type hints and comprehensive documentation
You can install the WaterCrawl plugin package using pip:

    pip install watercrawl-plugin

Here's a comprehensive guide on how to create a WaterCrawl plugin. Start by importing the base classes:

    from typing import Type

    from watercrawl_plugin import AbstractInputValidator, AbstractPlugin, BasePipeline

Define your plugin's configuration schema using JSON Schema:

    class MyInputValidator(AbstractInputValidator):
        @classmethod
        def get_json_schema(cls):
            return {
                "title": "My Plugin",
                "description": "Plugin description",
                "type": "object",
                "properties": {
                    "model_name": {
                        "title": "Model Name",
                        "type": "string",
                        # The default must be one of the enum values
                        "default": "model-1",
                        "enum": ["model-1", "model-2"],
                        "ui": {
                            "widget": "select",
                            "options": [
                                {"label": "Model 1", "value": "model-1"},
                                {"label": "Model 2", "value": "model-2"},
                            ]
                        },
                    },
                    "config": {
                        "title": "Configuration",
                        "type": "object",
                        "ui": {
                            "widget": "json-editor"
                        }
                    }
                }
            }

        def get_model(self):
            # Fall back to the schema default when no model is configured
            return self.data.get('model_name', 'model-1')

        def get_config(self):
            return self.data.get('config', {})

Implement the processing logic:

    class MyPipeline(BasePipeline):
        def get_validator(self, spider):
            return spider.plugin_validators[MyPlugin.plugin_key()]

        def process_item(self, item, spider):
            validator = self.get_validator(spider)
            # Skip processing when no validator data is available
            if not validator or not validator.data:
                return item

            try:
                # Process the item using validator configuration.
                # process_data is assumed to be a method you implement yourself.
                processed_data = self.process_data(
                    item,
                    model=validator.get_model(),
                    config=validator.get_config()
                )
                item['processed_data'] = processed_data
            except Exception as e:
                # Chain the original exception so its traceback is preserved
                raise RuntimeError(f"Error processing item: {e}") from e

            return item

Define your main plugin class:

    class MyPlugin(AbstractPlugin):
        @classmethod
        def plugin_key(cls) -> str:
            return "my_plugin"

        @classmethod
        def get_pipeline_classes(cls) -> dict:
            return {
                'my_package.MyPipeline': 500,  # Priority 500
            }

        @classmethod
        def get_input_validator(cls) -> Type[MyInputValidator]:
            return MyInputValidator

        @classmethod
        def extended_fields(cls):
            return ["processed_data"]

        @classmethod
        def get_spider_middleware_classes(cls) -> dict:
            return {}

        @classmethod
        def get_downloader_middleware_classes(cls) -> dict:
            return {}

        @classmethod
        def get_author(cls) -> str:
            return "Your Name"

        @classmethod
        def get_version(cls) -> str:
            return "1.0.0"

        @classmethod
        def get_name(cls) -> str:
            return "MyPlugin"

        @classmethod
        def get_description(cls) -> str:
            return "Plugin description"

AbstractPlugin is the base class for plugins. It requires the following methods:
- plugin_key(): Unique identifier for the plugin
- get_pipeline_classes(): Dictionary of pipeline classes with priorities
- get_input_validator(): Returns the input validator class
- extended_fields(): List of fields added by the plugin
- get_spider_middleware_classes(): Spider middleware classes
- get_downloader_middleware_classes(): Downloader middleware classes
- get_author(), get_version(), get_name(), get_description(): Plugin metadata
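
To illustrate how the priority values returned by get_pipeline_classes() might be used, the sketch below merges the dictionaries from two plugins and orders the pipelines. It assumes the Scrapy convention that lower numbers run first, which this document does not confirm for WaterCrawl. The plugin classes are stand-ins that do not import watercrawl_plugin, and OtherPlugin, its dotted path, and collect_pipelines are hypothetical names.

```python
# Stand-in plugin classes: they only mimic the get_pipeline_classes() shape
# described above and do not use the real AbstractPlugin.
class MyPlugin:
    @classmethod
    def plugin_key(cls) -> str:
        return "my_plugin"

    @classmethod
    def get_pipeline_classes(cls) -> dict:
        return {"my_package.MyPipeline": 500}


class OtherPlugin:  # hypothetical second plugin
    @classmethod
    def plugin_key(cls) -> str:
        return "other_plugin"

    @classmethod
    def get_pipeline_classes(cls) -> dict:
        return {"other_package.CleanupPipeline": 300}


def collect_pipelines(plugins) -> list:
    """Merge pipeline dicts and sort by priority (assumed: lower runs first)."""
    merged = {}
    for plugin in plugins:
        merged.update(plugin.get_pipeline_classes())
    return sorted(merged.items(), key=lambda pair: pair[1])


ordered = collect_pipelines([MyPlugin, OtherPlugin])
for path, priority in ordered:
    print(f"{priority:>4}  {path}")
```

Under that assumption, a plugin that must see items before another plugin would choose a smaller priority number.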
AbstractInputValidator is the base class for input validation:
- get_json_schema(): Returns JSON Schema for configuration
- Custom getter methods for configuration values
- Access to validation data through self.data
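
The getter pattern above can be sketched without the framework. The class below is a stand-in, not the real AbstractInputValidator: its constructor signature, the fallback values, and the enum_errors helper are assumptions made for illustration, and the helper checks only the enum keyword rather than performing full JSON Schema validation.

```python
class SketchValidator:
    """Stand-in for an AbstractInputValidator subclass (assumed constructor)."""

    def __init__(self, data=None):
        # self.data holds the user-supplied configuration
        self.data = data or {}

    @classmethod
    def get_json_schema(cls) -> dict:
        return {
            "type": "object",
            "properties": {
                "model_name": {"type": "string", "enum": ["model-1", "model-2"]},
                "config": {"type": "object"},
            },
        }

    def get_model(self) -> str:
        # Illustrative fallback used when no model is configured
        return self.data.get("model_name", "model-1")

    def get_config(self) -> dict:
        return self.data.get("config", {})

    def enum_errors(self) -> list:
        """Hypothetical helper: report values that violate an enum constraint."""
        errors = []
        properties = self.get_json_schema()["properties"]
        for name, value in self.data.items():
            allowed = properties.get(name, {}).get("enum")
            if allowed is not None and value not in allowed:
                errors.append(f"{name}: {value!r} is not one of {allowed}")
        return errors


configured = SketchValidator({"model_name": "model-2", "config": {"retries": 3}})
empty = SketchValidator()
invalid = SketchValidator({"model_name": "model-9"})

print(configured.get_model(), configured.get_config())
print(empty.get_model(), empty.get_config())
print(invalid.enum_errors())
```

Keeping the fallbacks inside the getters means pipeline code can call get_model() and get_config() without checking whether each key is present.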
BasePipeline is the base class for item processing:
- process_item(item, spider): Main processing method
- get_validator(spider): Get plugin validator instance
- Support for cached properties and error handling
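
The process_item flow described above (look up the validator, pass the item through when there is no configuration, wrap processing errors) can be exercised in isolation. This is a minimal sketch under stated assumptions: SketchPipeline does not inherit from the real BasePipeline, the spider and validator are SimpleNamespace stand-ins, spider.plugin_validators is assumed to be a plain dict, and process_data, make_spider, and the "sketch_plugin" key are hypothetical.

```python
from types import SimpleNamespace


class SketchPipeline:
    """Stand-in for a BasePipeline subclass; 'sketch_plugin' is a made-up key."""

    plugin_key = "sketch_plugin"

    def get_validator(self, spider):
        # Assumes spider.plugin_validators is a dict keyed by plugin key
        return spider.plugin_validators.get(self.plugin_key)

    def process_data(self, item, model):
        # Hypothetical processing step: fail on items without a title
        if "title" not in item:
            raise KeyError("title")
        return f"{model}:{item['title'].upper()}"

    def process_item(self, item, spider):
        validator = self.get_validator(spider)
        if not validator or not validator.data:
            return item  # no validator data: pass the item through unchanged
        try:
            item["processed_data"] = self.process_data(item, validator.get_model())
        except Exception as e:
            # Chain the cause so the original error stays visible
            raise RuntimeError(f"Error processing item: {e}") from e
        return item


def make_spider(data):
    """Build a fake spider carrying one fake validator."""
    validator = SimpleNamespace(
        data=data,
        get_model=lambda: data.get("model_name", "model-1"),
    )
    return SimpleNamespace(plugin_validators={"sketch_plugin": validator})


pipeline = SketchPipeline()
skipped = pipeline.process_item({"title": "hello"}, make_spider({}))
processed = pipeline.process_item(
    {"title": "hello"}, make_spider({"model_name": "model-2"})
)
print(skipped)
print(processed)

wrapped_cause = None
try:
    pipeline.process_item({}, make_spider({"model_name": "model-2"}))
except RuntimeError as exc:
    wrapped_cause = type(exc.__cause__).__name__
print(wrapped_cause)
```

Using dict.get in get_validator is a design choice of this sketch: it lets the "not validator" check handle a missing plugin key instead of raising KeyError.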
We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.