A few notes about building pipelines for ML purposes.
- look into mlflow for orchestration
- need to look into how to reuse custom artifacts, like dynamic data definition files, during the mlflow call (see the artifact-logging sketch after this list)
- imblearn's Pipeline right now seems most compatible with sklearn and feature-engine.
- joblib can be used to distribute sklearn's module execution on frameworks like dask, ray, and pyspark, hence one needs to check whether the imblearn pipeline supports it (see the joblib backend sketch after this list)
- take advantage of sklearn's FeatureUnion and ColumnTransformer to parallelize different feature extractors and encoders (see the ColumnTransformer sketch after this list)
- sklearn pipelines also have a convenient diagram view when printed in a Jupyter notebook; need to research how to standardize the code and integrate this into model-document (see the display-config sketch after this list)
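
A minimal sketch of reusing a custom artifact across mlflow calls, assuming a data definition file named `data_definition.yaml` (a placeholder) and whatever tracking backend is configured locally:

```python
import mlflow

# Log a dynamic data definition file as a run artifact (the file name
# and the "definitions" artifact path are placeholders).
with mlflow.start_run() as run:
    mlflow.log_artifact("data_definition.yaml", artifact_path="definitions")

# Later, pull the same artifact back down so another step can reuse it.
local_path = mlflow.artifacts.download_artifacts(
    run_id=run.info.run_id,
    artifact_path="definitions/data_definition.yaml",
)
print(local_path)  # local copy of the logged definition file
```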
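
A rough sketch of an imblearn Pipeline mixing a sampler with sklearn transformers, fit under a joblib backend; "loky" is joblib's default local backend, and dask/ray backends would be swapped in the same way (the toy data and estimators here are arbitrary placeholders):

```python
import joblib
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

# imblearn's Pipeline accepts samplers (SMOTE) alongside ordinary
# sklearn/feature-engine transformers and a final estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(n_jobs=-1, random_state=0)),
])

# joblib routes the estimator's internal parallelism through the chosen
# backend; dask/ray register their own backends that can replace "loky".
with joblib.parallel_backend("loky"):
    pipe.fit(X, y)
```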
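
A small sketch of parallelizing per-column preprocessing with ColumnTransformer; the toy DataFrame and column names are made up, and `n_jobs=-1` is what hands the per-transformer fits to joblib:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 91_000, 60_000],
    "city": ["NYC", "SF", "LA", "SF"],
})

# Each transformer runs on its own column subset; n_jobs=-1 lets joblib
# fit/transform the branches in parallel.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    n_jobs=-1,
)
X_t = preprocess.fit_transform(df)
```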
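
A sketch of turning on sklearn's HTML diagram rendering so a pipeline shows up as a diagram in a notebook cell (the pipeline itself is just a placeholder):

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Render estimators as an HTML diagram rather than a plain-text repr.
set_config(display="diagram")

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe  # in a notebook, the cell output is the interactive diagram
```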