Lessons in the Research-to-Production Pipeline: From Data Science to Software Engineering

Improving the Research-to-Production Pipeline

This repository hosts examples of taking messy code that lives primarily in Jupyter notebooks and turning it into something useful and reusable.

There are many levels of maturity on the road from the procedural-programming style encouraged by notebooks towards production-grade software. The examples here do not aim for "perfection"; each one starts from content that is difficult to consume and demonstrates the process of improving it.

How to use this repository

In each folder there is a complete example complemented by an internal README.md file to explain the intended learning outcome and context for the example.

General Guidelines

In general, here are some pieces of advice to follow on the topic of writing high-quality code and notebooks.

Aim for short but meaningful filenames which avoid using spaces and other inconvenient punctuation
Pay attention to the complete path of a file: how much redundant information is present? (e.g. 2022/06/01/Data-From-2022-06-01-12:23.txt vs. 2022/06/01/12-23.txt)
Use titles and markdown cells in notebooks to help contextualize content
Use inline comments (beginning with two spaces and an octothorpe #) to explain "clever" or "unclear" lines of code (put yourself in someone else's shoes and clarify any ambiguities about choices you made in writing the code)
If you find yourself copy/pasting cells more than three times, strongly consider creating a function to accomplish the task.
Functions should limit the scope of what operations they perform and be named in a way that reflects the underlying behavior
Avoid references to (global) variables within the context of a function: the only variables in a function should be the parameters that the function takes as explicit inputs
Avoid functions which take "too many" arguments and think about how you can simplify things (e.g. a "settings dictionary" or "config class" can help)
Try to define all the functions you will need in a cell near the top of your notebook
Once you have about a dozen functions within a notebook, it may be time to consider migrating them to a module and importing it.
Once you have several modules (e.g. plotting-related utilities, image-transformation functions, etc. all separated in their own files), it may be time to consider creating a package
Use .gitignore files to be intentional about what files you track (notebooks generate A LOT of invisible files)
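
Several of the guidelines above (explicit inputs instead of globals, a "config class" instead of many arguments, paths without redundant information) can be sketched together. This is a minimal illustration, not code from the examples in this repository: `SmoothingConfig`, `moving_average`, and `output_path` are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass(frozen=True)
class SmoothingConfig:
    """Hypothetical settings bundle that stands in for a long argument list."""

    window: int = 3
    output_root: Path = Path("results")


def moving_average(values: list[float], window: int) -> list[float]:
    """Return the simple moving average of `values` over `window` points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def output_path(config: SmoothingConfig, timestamp: datetime) -> Path:
    """Put the date in the folders so the filename does not repeat it."""
    folder = timestamp.strftime("%Y/%m/%d")
    filename = timestamp.strftime("%H-%M.txt")  # no spaces or colons
    return config.output_root / folder / filename


config = SmoothingConfig(window=2)
# The function only sees what it is handed: no reliance on notebook globals.
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], config.window)
target = output_path(config, datetime(2022, 6, 1, 12, 23))
```

Because each function depends only on its parameters, the cell can be re-run in any order and the functions can later be moved into a module unchanged.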

Next Steps

Moving towards packaging begins to touch on software engineering best practices, which is a lengthy topic on its own. Here are some general themes of the tasks involved in moving from research to something that more closely resembles production-grade code:

You begin to reach for different tooling.
- It is very possible to develop software in JupyterLab but it is not exactly an ideal experience for beginners.
- You can technically do everything on the command line, since JupyterLab / Notebook provide a Terminal for you, and an editor like vim or nano is often included in your environment.
- Some extensions in Jupyter such as jupyterlab-lsp bring in the same JavaScript libraries that VSCode uses for auto-completion, jump-to-definition, documentation-on-hover, etc. which can make developing software in JupyterLab more pleasant.
- Many people reach for PyCharm (will require Pro License for its container-runtime feature... BOO) or VSCode (free, can run in-browser in the same environment as JupyterLab; see jupyterhub-deploy-docker and the shell script in it) when developing software. The debugging tools there are mature (but not strictly necessary) and sometimes helpful when dealing with complex code-bases.
- Git management becomes something you have to learn well in order to collaborate with others on software. Some IDEs (listed above) have good visual interfaces for staging git commits and resolving conflicts (in particular VSCode is a crowd-favorite).
Testing becomes paramount. If you write a function, you generally also write test cases for it.
- Unit tests: cover independent functionality
- Functional tests: cover how functions interact with one another
- Integration tests: test how your code interacts with external services / dependencies
- LEARN FROM OUR PRIOR PAIN: configure your tests to automatically ignore paths like .ipynb_checkpoints which Jupyter creates whenever you browse the filesystem through the web app. You do not want cached versions of files to be run as part of your test suites.
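
As a rough sketch of the first two test levels above, the block below pairs two small functions with a unit test and a functional test, using plain `assert` statements. The functions `clean` and `mean` are hypothetical, and `is_checkpoint` is an assumed helper that only illustrates the idea of filtering out `.ipynb_checkpoints` paths; how such a filter is wired into a test runner depends on the runner's own configuration.

```python
from pathlib import Path


def clean(values):
    """Drop missing entries (None) from a list of readings."""
    return [v for v in values if v is not None]


def mean(values):
    """Return the arithmetic mean of a non-empty list."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


def test_clean_drops_missing():
    # Unit test: exercises one function in isolation.
    assert clean([1, None, 3]) == [1, 3]


def test_mean_of_cleaned_values():
    # Functional test: exercises how the two functions work together.
    assert mean(clean([2, None, 4])) == 3


def is_checkpoint(path: Path) -> bool:
    """True if any component of `path` is a Jupyter checkpoint folder."""
    return ".ipynb_checkpoints" in path.parts


def test_checkpoint_paths_are_flagged():
    assert is_checkpoint(Path("pkg/.ipynb_checkpoints/test_io-checkpoint.py"))
    assert not is_checkpoint(Path("pkg/test_io.py"))
```

Test runners such as pytest collect functions whose names start with `test_`, so the same file works both as a script you call by hand and as part of a larger suite.
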
Documentation becomes increasingly important
- Functions get docstrings (particularly when their name and parameters have some ambiguity about usage)
- Programs parse your package for these docstrings to generate complete documentation for you (manpages, html, pdf, etc.)
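
To make the docstring point concrete, here is a small sketch of a documented function and of reading its docstring programmatically, which is essentially what documentation generators do at a larger scale. The function `rescale` is hypothetical, and the sectioned layout of the docstring is just one common convention, not one this repository prescribes.

```python
import inspect


def rescale(values, lower=0.0, upper=1.0):
    """Linearly rescale `values` to the interval [lower, upper].

    Parameters
    ----------
    values : list of float
        Input data; must contain at least two distinct values.
    lower, upper : float
        Bounds of the output interval.

    Returns
    -------
    list of float
        The rescaled values, in the original order.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must contain at least two distinct numbers")
    scale = (upper - lower) / (hi - lo)
    return [lower + (v - lo) * scale for v in values]


# inspect.getdoc returns the docstring with its indentation cleaned up;
# the first line is the one-sentence summary.
summary = inspect.getdoc(rescale).splitlines()[0]
```

Here the parameter names `lower` and `upper` alone would leave it unclear whether the bounds are inclusive or what happens with constant input, which is the kind of ambiguity a docstring resolves.
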
You need to start validating your code in controlled environments to ensure its reproducibility
- Environment management (package versions, operating systems, underlying libraries) becomes a concern. You need to figure out a way to "lock" things into place for certain tests while also preventing your package from conflicting with newer versions of the language (or dependent packages) in the future.
- Cloud-based testing largely automates the testing, but it's up to you to configure it and then fix it when something inevitably breaks.
- Guaranteeing reproducibility is a time-consuming task. Be aware of what standards you want to impose to "prevent problems" vs. what you are willing to deal with "when it happens".
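
One lightweight way to notice environment drift is to compare the package versions you intended to "lock" against what is actually installed. The sketch below uses the standard library's `importlib.metadata`; the function name `find_mismatches` and the shape of the `pins` dictionary are assumptions made for illustration, and a real project would more likely rely on a dedicated lock file and tooling.

```python
from importlib import metadata


def find_mismatches(pins, lookup=metadata.version):
    """Compare pinned versions against installed ones.

    `pins` maps package names to the exact version string expected.
    Returns {package: (pinned, installed)} for every mismatch;
    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Calling `find_mismatches` with your own pins at the start of a test run gives an early, readable warning when the environment differs from what the tests assume. The `lookup` parameter exists only so the check can itself be tested without installing anything.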
