diff --git a/Makefile b/Makefile
index d7a8027..faee5e2 100644
--- a/Makefile
+++ b/Makefile
@@ -2,6 +2,8 @@ city_distances.csv: check generate_data.py
 	@echo "Generating distance data..."
 	@bash -c 'time python generate_data.py'
 
+data: city_distances.csv
+
 train: check train.py
 	@echo "Training embeddings..."
 	@bash -c 'time python train.py'
@@ -23,4 +25,6 @@ check: lint
 clean:
 	@echo "Removing outputs/ and checkpoints/ directories"
 	@rm -rf output/
-	@rm -rf checkpoints/
\ No newline at end of file
+	@rm -rf checkpoints/
+
+.PHONY: data train eval lint check clean
\ No newline at end of file
diff --git a/README.md b/README.md
index e53b4a7..165e233 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,73 @@
-# citybert
-
-1. Generates dataset of cities (US only for now) and their pair-wise geodesic distances.
-2. Uses that dataset to fine-tune a neural-net to understand that cities closer to one another are more similar.
-3. Distances become `labels` through the formula `1 - distance/MAX_DISTANCE`, where `MAX_DISTANCE=20_037.5 # km` represents half of the Earth's circumfrence.
-
-There are other factors that can make cities that are "close together" on the globe "far apart" in reality, due to political borders.
-Factors like this are not considered in this model, it is only considering geography.
-
-However, for use-cases that involve different measures of distances (perhaps just time-zones, or something that considers the reality of travel), the general principals proven here should be applicable (pick a metric, generate data, train).
-
-A particularly useful addition to the dataset here:
-- airports: they (more/less) have unique codes, and this semantic understanding would be helpful for search engines.
-- aliases for cities: the dataset used for city data (lat/lon) contains a pretty exhaustive list of aliases for the cities. It would be good to generate examples of these with a distance of 0 and train the model on this knowledge.
-- time-zones: encode difference in hours (relative to worst-possible-case) as labels associated with the time-zone formatted-strings.
-
-# notes
-- see `Makefile` for instructions.
-- Generating the data took about 13 minutes (for 3269 US cities) on 8-cores (Intel 9700K), yielding 2,720,278 records (combinations of cities).
-- Training on an Nvidia 3090 FE takes about an hour per epoch with an 80/20 test/train split. Batch size is 16, so there were 136,014 steps per epoch
-- **TODO**`**: Need to add training / validation examples that involve city names in the context of sentences. _It is unclear how the model performs on sentences, as it was trained only on word-pairs.
\ No newline at end of file
+# CityBert
+
+CityBert is a machine learning project that fine-tunes a neural network model to understand the similarity between cities based on their geodesic distances.
+The project generates a dataset of US cities and their pair-wise geodesic distances, which are then used to train the model.
+Note that this model only considers geographic distances and does not take into account other factors such as political borders or transportation infrastructure.
+
+The project can be extended to include other distance metrics or additional data, such as airport codes, city aliases, or time zones.
+
+## Overview of Project Files
+
+- `generate_data.py`: Generates a dataset of US cities and their pairwise geodesic distances.
+- `train.py`: Trains the neural network model using the generated dataset.
+- `eval.py`: Evaluates the trained model by comparing the similarity between city vectors before and after training.
+- `Makefile`: Automates tasks such as generating the data, training the model, and running the evaluation.
+- `README.md`: Provides a description of the project, instructions on how to use it, and expected results.
+- `requirements.txt`: Lists the Python dependencies needed to reproduce the results.
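+
+For reference, distances become training labels via `1 - distance/MAX_DISTANCE`, where `MAX_DISTANCE = 20_037.5` km (half of the Earth's circumference), so co-located cities get a label of 1 and antipodal cities a label of 0. A minimal sketch of that mapping (the helper name is illustrative, not the project's exact code):
+
+```python
+MAX_DISTANCE = 20_037.5  # km; half of the Earth's circumference
+
+def distance_to_label(distance_km: float) -> float:
+    """Map a geodesic distance to a similarity label in [0, 1]."""
+    return 1 - distance_km / MAX_DISTANCE
+```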
+
+## How to Use
+
+1. Install the required dependencies by running `pip install -r requirements.txt`.
+2. Run `make data` (or `make city_distances.csv`) to generate the dataset of city distances.
+3. Run `make train` to train the neural network model.
+4. Run `make eval` to evaluate the trained model and generate evaluation plots.
+
+## What to Expect
+
+After training, the model should be able to understand the similarity between cities based on their geodesic distances.
+You can inspect the evaluation plots generated by the `eval.py` script to see the improvement in similarity scores before and after training.
+The script randomly samples 10k rows from the training dataset and plots the similarity between city vectors against the distance between the cities, before and after training.
+
+After five epochs:
+![Evaluation plot](./plots/progress_35845_sm.png)
+
+After ten epochs:
+![Evaluation plot](./plots/progress_680065_sm.png)
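+
+For a quick spot check outside of `eval.py`, you can load a trained checkpoint and compare a few city embeddings directly. A minimal sketch (the checkpoint path is a placeholder; the cities are chosen arbitrarily):
+
+```python
+from sentence_transformers import SentenceTransformer, util
+
+model = SentenceTransformer("checkpoints/<your-checkpoint>")  # placeholder path
+embeddings = model.encode(["San Francisco", "Oakland", "New York City"])
+print(util.cos_sim(embeddings[0], embeddings[1]))  # nearby pair: higher similarity
+print(util.cos_sim(embeddings[0], embeddings[2]))  # distant pair: lower similarity
+```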
+
+## Future Extensions
+
+There are several potential improvements and extensions to the current model:
+
+1. **Incorporate airport codes**: Train the model to understand the unique codes of airports, which could be useful for search engines and other applications.
+2. **Add city aliases**: Enhance the dataset with city aliases so the model can recognize different names for the same city. The `geonamescache` package already includes these.
+3. **Include time zones**: Train the model to understand time zone differences between cities, which could be helpful for various time-sensitive use cases. The `geonamescache` package already includes this data, but how to calculate the hours between zones is an open question.
+4. **Expand to other distance metrics**: Adapt the model to consider other measures of distance, such as transportation infrastructure or travel time.
+5. **Train on sentences**: Improve the model's performance on sentences by adding training and validation examples that involve city names in the context of sentences.
+6. **Global city support**: Extend the model to support cities outside the US and cover a broader range of geographic locations.
+
+## Notes
+
+- Generating the data took about 13 minutes (for 3,269 US cities) on 8 cores (Intel 9700K), yielding 2,720,278 records (combinations of cities).
+- Training on an Nvidia 3090 FE takes about an hour per epoch with an 80/20 train/test split. With a batch size of 16, that is 136,014 steps per epoch.
+- **TODO**: Need to add training / validation examples that involve city names in the context of sentences. _It is unclear how the model performs on sentences, as it was trained only on word-pairs._
\ No newline at end of file
diff --git a/eval.py b/eval.py
index d2dc2dc..fb8c872 100644
--- a/eval.py
+++ b/eval.py
@@ -58,7 +58,7 @@ def make_plot(data):
     )
     ax.set_xlabel("distance between cities (km)")
     ax.set_ylabel("similarity between vectors\n(cosine)")
-    fig.legend(loc="upper right")
+    ax.legend(loc="upper right")
 
     return fig
 
@@ -69,7 +69,7 @@ if __name__ == "__main__":
     data = pd.read_csv("city_distances_full.csv")
     # data_sample = data.sample(1_000)
     checkpoint_dir = "checkpoints_absmax_split"  # no slash
-    for checkpoint in sorted(glob.glob(f"{checkpoint_dir}/*"))[14::]:
+    for checkpoint in sorted(glob.glob(f"{checkpoint_dir}/*")):
         print(f"Evaluating {checkpoint}")
         data_sample = data.sample(1_000)
         trained_model = SentenceTransformer(checkpoint, device="cuda")
diff --git a/plots/progress_35845_sm.png b/plots/progress_35845_sm.png
new file mode 100644
index 0000000..304452a
Binary files /dev/null and b/plots/progress_35845_sm.png differ
diff --git a/plots/progress_680065_sm.png b/plots/progress_680065_sm.png
new file mode 100644
index 0000000..42d6873
Binary files /dev/null and b/plots/progress_680065_sm.png differ