A single PySpark API connector for all REST APIs.
Declarative API ingestion with PySpark. Uses the new PySpark 4 custom data sources under the hood.
Define a config file manually or use the recommended, lightweight Builder UI. Once you are happy with your config, all you need to do is register the Polymo reader and tell Spark where to find the config:
```python
from pyspark.sql import SparkSession
from polymo import ApiReader

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ApiReader)

df = (
    spark.read.format("polymo")
    .option("config_path", "./config.yml")  # YAML you saved from the Builder
    .option("token", "YOUR_TOKEN")          # Only if the API needs one
    .load()
)

df.show()
```
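For reference, the `config.yml` loaded above could look something like the sketch below. This is hypothetical: the field names simply mirror the `PolymoConfig` example further down (`base_url`, `path`) and are not the full schema; the Builder generates the real file.

```yaml
# Hypothetical minimal config; the Builder UI generates the real file.
# Field names mirror the PolymoConfig example below.
base_url: https://jsonplaceholder.typicode.com
path: /posts
```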
Streaming works too:

```python
spark.readStream.format("polymo")
```
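A fuller streaming sketch, assuming the streaming reader accepts the same `config_path` option as the batch reader (the console sink is just for a quick local check):

```python
from pyspark.sql import SparkSession
from polymo import ApiReader

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ApiReader)

stream_df = (
    spark.readStream.format("polymo")
    .option("config_path", "./config.yml")
    .load()
)

# Print incoming micro-batches; swap in a real sink for production use.
query = stream_df.writeStream.format("console").start()
query.awaitTermination()
```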
Prefer everything in Python? Use the `PolymoConfig` model:

```python
from pyspark.sql import SparkSession
from polymo import ApiReader, PolymoConfig

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ApiReader)

jp_posts = PolymoConfig(
    base_url="https://jsonplaceholder.typicode.com",
    path="/posts",
)

df = (
    spark.read.format("polymo")
    .option("config_json", jp_posts.config_json())
    .load()
)

df.show()
```

Polymo reads in batches and can read pages in parallel, so it can be much faster than row-based solutions like UDFs.
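For contrast, this is the kind of row-based UDF pattern being compared against: one blocking HTTP round trip per row, with no batching or page-level parallelism. A generic sketch against the same JSONPlaceholder API, not Polymo code:

```python
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=StringType())
def fetch_post(post_id: int) -> str:
    # One HTTP call per row: this per-row overhead is what batched,
    # page-parallel reads avoid.
    resp = requests.get(
        f"https://jsonplaceholder.typicode.com/posts/{post_id}"
    )
    return resp.text

ids = spark.range(1, 11)  # post ids 1..10
ids.withColumn("payload", fetch_post(col("id"))).show(truncate=False)
```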
Locally, you probably want to install polymo along with the Builder UI:
pip install "polymo[builder]"This comes with all UI deps such as pyspark
Running Polymo on a Spark cluster usually doesn't require these UI deps. In that case, just install the bare minimum:

```bash
pip install polymo
```

To launch the Builder UI, run:

```bash
polymo builder
```

or start it via Docker:

```bash
docker compose up --build builder
```

- The service listens on port 8000; open http://localhost:8000 once Uvicorn reports it is running.
Read the docs here
Other material:
- Step-by-step example: Medium blog post
It's still early days, but Polymo already supports a lot of features! Is there something missing? Raise an issue or contribute!
Contributions and early feedback welcome!
