commit d3ac4f575d18b31e0e478ff5534c7dfda1d71ae4
Author: Michael Pilosov
Date:   Thu May 26 14:25:30 2022 +0000

    intro and gitignore

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..3f68948
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,54 @@
+# Temporary and binary files
+*~
+*.py[cod]
+*.so
+*.cfg
+!.isort.cfg
+!setup.cfg
+*.orig
+*.log
+*.pot
+__pycache__/*
+.cache/*
+.*.swp
+*/.ipynb_checkpoints/*
+.ipynb_checkpoints/*
+*/*/.ipynb_checkpoints/*
+.DS_Store
+
+# Project files
+.ropeproject
+.project
+.pydevproject
+.settings
+.idea
+.vscode
+tags
+
+# Package files
+*.egg
+*.eggs/
+.installed.cfg
+*.egg-info
+
+# Unittest and coverage
+htmlcov/*
+.coverage
+.tox
+junit.xml
+coverage.xml
+.pytest_cache/
+
+# Build and docs folder/files
+build/*
+dist/*
+sdist/*
+docs/api/*
+docs/_rst/*
+docs/_build/*
+cover/*
+MANIFEST
+
+# Per-project virtualenvs
+.venv*/
+
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..c7edeab
--- /dev/null
+++ b/README.md
@@ -0,0 +1,54 @@
+# Improving the Research-to-Production Pipeline
+
+This repository serves to host some examples of taking messy code that lives primarily in Jupyter Notebooks and turning it into something useful and re-usable.
+
+There are many levels of maturity on the road from the procedural-programming style encouraged by notebooks towards production-grade software.
+The examples here do not aim to achieve "perfection" with each example, but rather demonstrate content that is difficult to consume and the process of improving it.
+
+
+## How to use this repository
+
+In each folder there is a complete example complemented by an internal `README.md` file to explain the intended learning outcome and context for the example.
+
+
+## General Guidelines
+
+In general, here are some pieces of advice to follow on the topic of writing high-quality code and notebooks.
+
+- Aim for short but meaningful filenames which avoid using spaces and other inconvenient punctuation
+- Pay attention to the complete path of a file: how much redundant information is present? (e.g. `2022/06/01/Data-From-2022-06-01-12:23.txt` vs. `2022/06/01/12-23.txt`)
+- Use titles and markdown cells in notebooks to help contextualize content
+- Use inline comments (beginning with two spaces and an octothorpe `  #`) to explain "clever" or "unclear" lines of code (put yourself in someone else's shoes and clarify any ambiguities about choices you made in writing the code)
+- If you find yourself copy/pasting cells more than three times, strongly consider creating a function to accomplish the task.
+- Functions should limit the scope of what operations they perform and be named in a way that reflects the underlying behavior
+- Avoid references to (global) variables within the context of a function: the only variables in a function should be the parameters that the function takes as explicit inputs
+- Avoid functions which take "too many" arguments and think about how you can simplify things (e.g. a "settings dictionary" or "config class" can help)
+- Try to define all the functions you will need in a cell near the top of your notebook
+- Once you have about a dozen functions within a notebook, it may be time to consider migrating them to a module and importing it.
+- Once you have several modules (e.g. plotting-related utilities, image-transformation functions, etc. all separated in their own files), it may be time to consider creating a package
+- Use `.gitignore` files to be intentional about what files you track (notebooks generate **A LOT** of invisible files)
+
+
+## Next Steps
+
+Moving towards packaging begins to touch on software engineering best practices which is a lengthy topic on its own.
+Here are some general themes of what tasks are involved in moving from research to something that more closely resembles production-grade code:
+
+- You begin to reach for different tooling.
+  - It is very possible to develop software in JupyterLab but it is not exactly an ideal experience for beginners.
+  - You can technically do everything in a Terminal using the command-line, since JupyterLab / Notebook provide one for you; an editor like `vim` or `nano` is often included in your environment.
+  - Some extensions in Jupyter such as [jupyterlab-lsp](https://github.com/jupyter-lsp/jupyterlab-lsp) bring in the same JavaScript libraries that VSCode uses for auto-completion, jump-to-definition, documentation-on-hover, etc. which can make developing software in JupyterLab more pleasant.
+  - Many people reach for PyCharm (will require Pro License for its container-runtime feature... BOO), or VSCode (free, can run in-browser in the same environment as JupyterLab, see [jupyterhub-deploy-docker](https://github.com/ml-starter-packs/jupyterhub-deploy-docker.git) and [the shell script in it](https://github.com/ml-starter-packs/jupyterhub-deploy-docker/blob/main/singleuser/install_vscode.sh)) when developing software. The debugging tools there are mature (but not strictly necessary) and sometimes helpful when dealing with complex code-bases.
+  - Git management becomes something you have to learn well in order to collaborate with others on software. Some IDEs (listed above) have good visual interfaces for staging git commits and resolving conflicts (in particular VSCode is a crowd-favorite).
+- Testing becomes paramount. If you write a function, you generally also write test cases for it.
+  - Unit tests: cover independent functionality
+  - Functional tests: cover how functions interact with one another
+  - Integration tests: test how your code interacts with external services / dependencies
+  - LEARN FROM OUR PRIOR PAIN: configure your tests to automatically ignore paths like `.ipynb_checkpoints` which Jupyter creates whenever you browse the filesystem through the web app. You do not want cached versions of files to be run as part of your test suites.
+- Documentation becomes increasingly important
+  - Functions get docstrings (particularly when their name and parameters have some ambiguity about usage)
+  - Programs parse your package for these docstrings to generate complete documentation for you (`manpages`, `html`, `pdf`, etc.)
+- You need to start validating your code in controlled environments to ensure its reproducibility
+  - Environment management (package versions, operating systems, underlying libraries) becomes a concern. You need to figure out a way to "lock" things into place for certain tests while also preventing your package from conflicting with newer versions of the language (or dependent packages) in the future.
+  - Cloud-based testing largely automates the testing, but it's up to you to configure it and then fix it when something inevitably breaks.
+  - Guaranteeing reproducibility is a time-consuming task; be aware of what standards you want to impose to "prevent problems" vs. what you are willing to deal with "when it happens"
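
As a rough illustration of the testing and documentation points in the README added by this commit, the sketch below pairs a small documented function with a unit test for it. The names `is_checkpoint_path` and `test_is_checkpoint_path` are hypothetical and not part of the commit. The only assumption taken from the README is that Jupyter's cache directories are named `.ipynb_checkpoints`.

```python
from pathlib import PurePosixPath

CHECKPOINT_DIR = ".ipynb_checkpoints"  # directory name mentioned in the README


def is_checkpoint_path(path: str) -> bool:
    """Return True if any component of ``path`` is a Jupyter checkpoint directory.

    Hypothetical helper for illustration only. It assumes POSIX-style paths
    separated by forward slashes.
    """
    return CHECKPOINT_DIR in PurePosixPath(path).parts


def test_is_checkpoint_path() -> None:
    # Unit test: each assertion covers one independent behavior.
    assert is_checkpoint_path(".ipynb_checkpoints/analysis-checkpoint.py")
    assert is_checkpoint_path("src/.ipynb_checkpoints/utils-checkpoint.py")
    assert not is_checkpoint_path("src/utils.py")
    # A filename that merely contains the name is not a checkpoint directory.
    assert not is_checkpoint_path("notes/about.ipynb_checkpoints.md")


if __name__ == "__main__":
    test_is_checkpoint_path()
```

If this lived in a file named like `test_*.py`, a runner such as pytest would be expected to collect the `test_`-prefixed function on its own; the `__main__` guard is only there so the sketch can also be run directly.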