Lessons in the Research-to-Production Pipeline: From Data Science to Software Engineering
Michael Pilosov dc8848b895 occupancy example 3 years ago
Draft.ipynb occupancy example 3 years ago
Edit.ipynb occupancy example 3 years ago
README.md occupancy example 3 years ago
datatest.csv occupancy example 3 years ago

Occupancy Classifier

The goal here is to predict a binary outcome: whether a room is occupied by people or not.

To do this, we will use two time-series based features: humidity and light measurements.

Learning Outcome

We present here a very "messy" notebook with almost no documentation (one that was actually written by the author and re-visited for the purposes of this pedagogical example) based on a very limited dataset found online.

The goal of this demonstration is to show an example of "un-shareable code" and how a very simple refactor can dramatically improve the presentation of the content. The author of the [code in the] notebook is often the first person to benefit from such refactors: once enough time has passed, revisiting the content can feel like reading it for the first time.

By making small changes such as renaming variables, adding titles and labels to plots, and writing some brief comments, what started out as a rough proof-of-concept can become a foundation for future work on a predictive product. For example, imagine a smart thermostat equipped with an inexpensive photoresistor to measure light and a moisture meter to measure humidity. To conserve energy, the thermostat needs to "learn" when people are home and when they are away. A notebook like the one presented here could very well be how someone began investigating how to implement such a feature.

How to Read this Example

First start by opening the Draft.ipynb notebook and running it.

  • Notice how the environment isn't specified. The kernel the notebook is expecting may or may not exist on your machine.
  • You may run into import errors because certain packages aren't installed. How do you handle it?
    • Do you open a terminal up and run pip install <missing-package> each time you encounter an import error?
    • Do you create a new cell at the top of the notebook with lines such as # pip install <missing-package-list> > /dev/null && echo "installed dependencies" to handle installation?
    • Do you create a requirements.txt file and list the dependencies you will need in there so you can at least run pip install -r requirements.txt?
    • Do you specify versions for any of the above?
  • Notice that variable names are often a single letter. Is this easy to read?
  • Notice that graphs are missing labels.
  • Notice the lack of annotations: nothing tells you which cells are visualization vs. data exploration vs. training vs. testing, and this complicates reading the notebook from top to bottom.
  • The data is committed to git (granted, it is a small file). Do you see any potential problems that could arise if this becomes the norm in a mature shared repository?
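
One way to make the environment questions above concrete is to have the notebook check its own dependencies in the first cell. The following is a minimal sketch, not something taken from either notebook: the function name check_dependencies and the package list in EXPECTED are assumptions made for illustration, and the list should be replaced with whatever the notebook actually imports.

```python
from importlib import metadata


def check_dependencies(pinned_versions):
    """Compare installed package versions against the expected pins.

    pinned_versions maps a distribution name to the expected version
    string, or to None when any installed version is acceptable.
    Returns a dict of {package: problem}; an empty dict means all is well.
    """
    problems = {}
    for package, expected in pinned_versions.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if expected is not None and installed != expected:
            problems[package] = f"found {installed}, expected {expected}"
    return problems


# Hypothetical dependency list (an assumption, not read from the notebooks).
# Replace None with an exact version string to enforce a pin.
EXPECTED = {"pandas": None, "scikit-learn": None, "matplotlib": None}

if __name__ == "__main__":
    for package, problem in check_dependencies(EXPECTED).items():
        print(f"{package}: {problem}")
```

A check like this does not install anything. It only reports what is missing or mismatched, so the reader sees one clear message up front instead of discovering import errors one cell at a time. A requirements.txt with pinned versions remains the more conventional way to record the same information.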

Then open Edit.ipynb, observe the changes made, and reflect on the following:

  • What has been addressed and what is still missing? (Is what is missing worth doing?)
  • How much more readable is it?
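
To get a feel for the readability question, here is a small before-and-after sketch in the spirit of the refactor. It is hypothetical and not copied from either notebook: the threshold rule stands in for whatever model the notebooks actually train, and the light readings and labels are made-up values chosen only to make the snippet runnable.

```python
# Draft style (hypothetical): what are a, b, and t?
def f(a, b, t):
    p = [int(x > t) for x in a]
    return sum(int(i == j) for i, j in zip(p, b)) / len(b)


# Cleaned-up style: the same logic, split into named steps.
def predict_occupancy(light_readings, light_threshold):
    """Label a reading as occupied (1) when the light exceeds the threshold."""
    return [int(reading > light_threshold) for reading in light_readings]


def accuracy(predicted_labels, true_labels):
    """Return the fraction of predictions that match the true labels."""
    matches = sum(int(p == t) for p, t in zip(predicted_labels, true_labels))
    return matches / len(true_labels)


# Made-up illustrative values, not rows from datatest.csv.
light_readings = [0.0, 12.5, 430.0, 455.0, 20.0]
true_occupancy = [0, 0, 1, 1, 0]

predictions = predict_occupancy(light_readings, light_threshold=100.0)
print(accuracy(predictions, true_occupancy))  # prints 1.0
```

Both versions compute the same number. The difference is that the second one can be read without running it, which is the property to look for when comparing the two notebooks.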