Best Practices: Developing a Reusable Pipeline Package
When building a new pipeline, you'll often want to develop an associated Python package. This allows you to version control your code, share it between pipeline nodes, and reuse it in future projects. This guide outlines the recommended strategy for creating and integrating a reusable package into your pipeline workflow.
Part 1: Creating Your Python Package
First, set up a new Python package. We recommend using uv for initialization.
1. Initialize the Package
Create the package with uv init, requesting a packaged project so that the code lives under src/ (for example, uv init --package <mypackage>).
This command scaffolds a standard Python project structure for you:
<mypackage>/
├── src/
│   └── <mypackage>/
│       ├── __init__.py
│       └── your_function.py
└── pyproject.toml
- pyproject.toml: A configuration file where you can list your project's dependencies.
- src/<mypackage>: The directory where your package's source code lives.
2. Set Up Version Control with Git
The uv init command also initializes a Git repository. To publish your package, you'll need to create a remote repository on a platform like GitHub or GitLab.
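- Create an empty repository on GitHub or GitLab. Name it <mypackage>.
- Navigate into your package directory (cd <mypackage>).
- Link your local repository to the remote one with git remote add origin, followed by the URL of the repository you just created.
- Commit your files if you have not already, then push your initial code to the remote repository (for example, git push -u origin main).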
Your package is now set up for version-controlled development.
Switching Default Branch from master to main
If your remote repository uses master as the default branch, you can rename it to main for consistency. Run the following commands:
git branch -m master main
git push -u origin main
git symbolic-ref refs/remotes/origin/HEAD refs/remotes/origin/main
Then, update the default branch in your repository settings on GitHub or GitLab to main, and delete the old master branch if desired (for example, git push origin --delete master).
Part 2: Integrating the Package into a Pipeline Node
With your package created, you can now integrate it into a pipeline node.
1. Prepare the Node
First, create a new pipeline and a node within it. Then navigate to the node's code directory, <node_id>/code.
2. Clone Your Package
Clone your newly created package into this directory with git clone, using the URL of the remote repository you created in Part 1.
Your node's directory, including the code folder, should now look like this:
<node_id>/
├── code/
│   ├── <mypackage>/        # Your cloned package
│   │   ├── src/
│   │   └── pyproject.toml
│   ├── main.py             # The node's main script
│   └── pyproject.toml      # The node's environment
├── data/
└── reports/
3. Install the Package
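To make your package's functions available to the node's main.py script, install it in editable mode using uv, passing the --editable flag and the path to the cloned package (for example, uv add --editable ./<mypackage>).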
What is Editable Mode?
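The --editable flag makes your node's Python environment load the package directly from its source directory instead of installing a copy. This means any changes you make to the package's code are immediately available to the node without needing to reinstall it. Changes to the package's metadata, such as new dependencies in its pyproject.toml, still require a reinstall.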
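One way to confirm that the editable install took effect is to ask Python where it would load the package from. The following is a minimal sketch using only the standard library. The helper name module_origin is hypothetical, and mypackage is a placeholder for your package's import name.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(module_name: str) -> Optional[Path]:
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


if __name__ == "__main__":
    # "mypackage" is a placeholder for your package's import name.
    origin = module_origin("mypackage")
    if origin is None:
        print("mypackage is not importable from this environment")
    else:
        # For an editable install, this path should sit inside the cloned
        # <mypackage>/src/ directory rather than in site-packages.
        print(f"mypackage is loaded from: {origin}")
```

Run it with the node's environment active. If the printed path points into site-packages, the package was installed as a copy rather than in editable mode.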
Part 3: Developing and Using Your Code
Now you can develop your package's functionality and use it within the pipeline.
1. Add Code to Your Package
Write reusable functions inside your package's source directory (<mypackage>/src/<mypackage>). For example, create a file named dataset.py with a function to process data from parent nodes:
# In <mypackage>/src/<mypackage>/dataset.py
def print_parents(parent_folders):
    """Print the path of each parent node folder."""
    for folder in parent_folders:
        print(f"Found parent folder: {folder}")
2. Use Your Package in main.py
Import and call your package's functions from the node's main.py script.
# In <node_id>/code/main.py
# Import the user API to interact with the pipeline
from python_user_utils.node_api import get_all_parent_node_folder_paths, get_node_id
# Import the function from your package
from mypackage.dataset import print_parents

if __name__ == "__main__":
    node_id = get_node_id()
    # Get the paths of all parent nodes
    parent_folders = get_all_parent_node_folder_paths(node_id=node_id)
    # Use the function from your package to process them
    print_parents(parent_folders)
3. Commit and Push Your Changes
After developing a new feature, commit and push the changes to your package's repository:
# Navigate to your package directory
cd <mypackage>
# Stage, commit, and push your changes
git add src/
git commit -m "feat: Add print_parents function"
git push
Part 4: Testing Your Node
Before running the full pipeline, test your node's script in isolation.
- Navigate to the node's code directory (<node_id>/code).
- Run the main script, for example with uv run main.py.
Implement a preview Mode for Faster Development
When working with large datasets, consider adding a preview flag or option to your functions. This allows you to run them on a small subset of data for quick testing and iteration, saving you time.
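A preview option can be as simple as a boolean argument that caps how much input a function touches. The following is a minimal sketch of that idea. The function name list_input_files and the preview_limit parameter are hypothetical and are not part of the pipeline API.

```python
from pathlib import Path
from typing import Iterable, List


def list_input_files(
    parent_folders: Iterable[str],
    preview: bool = False,
    preview_limit: int = 5,
) -> List[Path]:
    """Collect files from the parent folders.

    With preview=True, return at most preview_limit files so that
    downstream processing runs on a small subset.
    """
    files: List[Path] = []
    for folder in parent_folders:
        # Sort for a stable order, so preview runs see the same subset.
        for path in sorted(Path(folder).rglob("*")):
            if not path.is_file():
                continue
            if preview and len(files) >= preview_limit:
                return files
            files.append(path)
    return files
```

In main.py you could call list_input_files(parent_folders, preview=True) while iterating on the code, then drop the flag for the full run.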
Once you confirm the script runs correctly, you can execute the entire pipeline from the UI. For more details, see the Develop a Node guide.
Part 5: Reusing Your Package in Other Nodes
A key benefit of this approach is reproducibility. Each node has a distinct snapshot of your package. When creating new nodes that depend on your package, you have two main options:
a) Duplicate an Existing Node
Duplicate a node that already contains your package. This is the simplest method. After duplicating, navigate to the package directory and pull the latest changes to ensure it's up to date.
b) Clone the Package into a New Node
Alternatively, clone the package into the new node's code directory and install it in editable mode, following the same steps outlined in Part 2.
By keeping a local copy of the package in each node, you ensure that your pipeline remains reproducible, even as your package evolves over time. For a deeper comparison of different dependency management strategies, see External Python Dependencies.
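Because each node carries its own clone, it can help to record which commit of the package a node was run with. The following is a minimal sketch, assuming git is available on PATH. The helper name package_commit is hypothetical, and the path in the usage section is a placeholder for your cloned package directory.

```python
import subprocess
from pathlib import Path
from typing import Optional


def package_commit(package_dir: Path) -> Optional[str]:
    """Return the checked-out commit hash of the clone, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "-C", str(package_dir), "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # git itself is not installed or not on PATH
        return None
    if result.returncode != 0:
        # Not a Git repository, missing directory, or no commits yet
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    # "mypackage" is a placeholder for the cloned package directory
    # inside the node's code folder.
    print(f"Package commit: {package_commit(Path('mypackage'))}")
```

Printing or logging this value from main.py makes it easy to tell afterwards which version of the package produced a node's outputs.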