JBang for UDFs #1489
Replies: 3 comments 2 replies
-
What about emphasizing Python for simple UDFs instead? Writing a simple Python function in one file will always be simpler and more lightweight than Java. Also, vectorized UDFs are Python-only in Flink, which is another cool thing. Of course, this brings in the burden of managing a Python version in our environments and configuring it properly for Flink (not an easy thing on its own, been there, done that...), and when someone wants to get more serious and bring in custom dependencies, the user side can become complex and hard to manage as well. But if we'd like to focus on the simple and lightweight stuff, IMO Python should be a first-class citizen.
-
Prototyping the solution out a bit, it gets more complex when you want to pull in dependencies that are not provided by the Flink cluster. It seems the best option is to do the following: the tricky bit is that this requires version alignment between Flink and the UDF implementations. The upside of this approach is that for multiple UDFs we only keep one copy of the dependencies. The goal is to cover use cases with a handful of UDFs that are fairly straightforward, i.e. <200 LOC with only a few generic dependencies.
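The "one copy of the dependencies for multiple UDFs" idea can be sketched roughly like this (class and method names here are illustrative, not DataSQRL code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: every UDF's resolved dependency jars are copied into a single
// shared lib folder, and a jar that is already present (same file name)
// is skipped instead of duplicated. This is why version alignment matters:
// all UDFs (and Flink) must agree on the one copy that lands in lib/.
public class SharedDepsCollector {

    // Copies each jar into sharedLib unless a jar with that name already
    // exists there. Returns how many jars were actually copied.
    public static int collect(Path sharedLib, List<Path> udfDependencyJars) throws IOException {
        Files.createDirectories(sharedLib);
        int copied = 0;
        for (Path jar : udfDependencyJars) {
            Path target = sharedLib.resolve(jar.getFileName());
            if (!Files.exists(target)) { // first UDF wins; later UDFs reuse it
                Files.copy(jar, target);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("udf-demo");
        Path dep = Files.createFile(tmp.resolve("flink-table-common-1.19.3.jar"));
        Path lib = tmp.resolve("lib");
        // Two UDFs declaring the same dependency: only one copy lands in lib/.
        int first = collect(lib, List.of(dep));
        int second = collect(lib, List.of(dep));
        System.out.println(first + " " + second); // prints "1 0"
    }
}
```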
-
I played with this a bit, and JBang does not track down transitive dependencies. So adding `//DEPS com.datasqrl.flinkrunner:stdlib-utils` is not enough; the script explicitly needs both the Google auto-service and the Flink table-common deps as well:

```
//DEPS org.apache.flink:flink-table-common:1.19.3
//DEPS com.google.auto.service:auto-service:1.1.1
```

Also, now there are a
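For reference, a complete minimal JBang UDF script would then have to start with all three directives. A rough sketch (no version is pinned for `stdlib-utils` since the thread doesn't give one; the class below deliberately omits the `org.apache.flink.table.functions.ScalarFunction` superclass and the auto-service annotation a real UDF would need, so the sketch stays self-contained):

```java
//DEPS com.datasqrl.flinkrunner:stdlib-utils
//DEPS org.apache.flink:flink-table-common:1.19.3
//DEPS com.google.auto.service:auto-service:1.1.1

// In a real script this class would extend ScalarFunction and carry the
// auto-service annotation for discovery; both are left out here so the
// sketch compiles without the Flink/auto-service jars on the classpath.
public class MyConcat {
    // Flink resolves scalar UDF calls against a public eval(...) method.
    public String eval(String a, String b) {
        return a + "_" + b;
    }

    public static void main(String[] args) {
        System.out.println(new MyConcat().eval("hello", "udf")); // prints "hello_udf"
    }
}
```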
-
To implement a user-defined function for a DataSQRL pipeline, the user currently has to set up an entire Maven or Gradle project and remember to compile it to a jar before compiling the pipeline. For simple UDFs that's a lot of work and a source of error.
To support scripting of UDFs, I propose that we add a preprocessor for `*.java` and `*.kt` (Kotlin) files which looks for `com.datasqrl.flinkrunner:stdlib-utils`, which is the base Java module for all DataSQRL UDFs (to support auto-discovery and provide general utility functions).
Any file containing this is a UDF implementation and we compile it to a jar.
We then add the jar to the lib folder so it can be used by the compiler and pipeline.
This would compile the Java and Kotlin scripts as part of the pipeline compilation, eliminating any extra steps the user has to take and greatly simplifying UDF implementations.
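The detection step of such a preprocessor could look roughly like this, assuming the marker scanned for is the `stdlib-utils` coordinate (class and helper names are hypothetical, not DataSQRL code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Sketch of the proposed preprocessor's detection step: walk the project
// and flag every *.java / *.kt file that mentions the stdlib-utils
// coordinate as a UDF script to hand off to JBang for jar compilation.
public class UdfScriptFinder {

    static final String MARKER = "com.datasqrl.flinkrunner:stdlib-utils";

    public static List<Path> findUdfScripts(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files
                .filter(p -> {
                    String name = p.getFileName().toString();
                    return name.endsWith(".java") || name.endsWith(".kt");
                })
                .filter(p -> {
                    try {
                        return Files.readString(p).contains(MARKER);
                    } catch (IOException e) {
                        return false; // unreadable file: not a UDF script
                    }
                })
                .toList();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("udf-scan");
        Files.writeString(dir.resolve("MyUdf.java"),
            "//DEPS com.datasqrl.flinkrunner:stdlib-utils\npublic class MyUdf {}\n");
        Files.writeString(dir.resolve("Other.java"), "public class Other {}\n");
        System.out.println(findUdfScripts(dir).size()); // prints 1
    }
}
```

Each hit would then be compiled to a jar and dropped into the lib folder, as described above.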
To make this efficient, we need to install JBang into the Docker image for the compiler and make sure we load all necessary JDK dependencies as well as the Maven dependency for `com.datasqrl.flinkrunner:stdlib-utils`, to avoid having to download it on every compile.
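One way to bake this into the compiler image, as a sketch only (the base image, paths, and the `warmup.java` script name are assumptions, not the actual DataSQRL Dockerfile):

```dockerfile
FROM eclipse-temurin:17-jdk

# Install JBang via its official installer script (curl may need to be
# installed first depending on the base image).
RUN apt-get update && apt-get install -y curl \
    && curl -Ls https://sh.jbang.dev | bash -s - app setup
ENV PATH="/root/.jbang/bin:${PATH}"

# warmup.java is a hypothetical dummy script declaring the shared UDF deps
# (stdlib-utils, flink-table-common, auto-service). Building it once here
# caches them in the image so they are not re-downloaded on every compile.
COPY warmup.java /opt/warmup.java
RUN jbang build /opt/warmup.java
```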