teaching a transformer to understand how far apart (common) cities are.

# CityBert
CityBert is a machine learning project that fine-tunes a neural network model to understand the similarity between cities based on their geodesic distances.
The project generates a dataset of US cities and their pairwise geodesic distances, which is then used to train the model.
The project can be extended to include other distance metrics or additional data, such as airport codes, city aliases, or time zones.
> Note that this model only considers geographic distances and does not take into account other factors such as political borders or transportation infrastructure.
These factors contribute to a sense of "distance as it pertains to travel difficulty," which is not directly reflected by this model.
## Overview of Project Files
- `generate_data.py`: Generates a dataset of US cities and their pairwise geodesic distances.
- `train.py`: Trains the neural network model using the generated dataset.
- `eval.py`: Evaluates the trained model by comparing the similarity between city vectors before and after training.
- `Makefile`: Automates the execution of various tasks, such as generating data, training, and evaluation.
- `README.md`: Provides a description of the project, instructions on how to use it, and expected results.
- `requirements.txt`: Lists the Python dependencies used to produce the results.
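
As a rough illustration of the kind of table `generate_data.py` produces, the sketch below builds every unordered pair from a small set of cities and writes one distance per pair as CSV. It is not the project's actual script: it uses the haversine (spherical great-circle) formula as a stand-in for a true ellipsoidal geodesic, which a library such as `geopy` could supply. The coordinates, the three-city sample, and the column names `city_from`, `city_to`, and `distance_km` are assumptions made for the example.

```python
import csv
import io
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative sample only; the real dataset covers thousands of US cities.
CITIES = {
    "New York": (40.7128, -74.0060),
    "Los Angeles": (34.0522, -118.2437),
    "Chicago": (41.8781, -87.6298),
}


def pairwise_rows(cities):
    """Return one (city_from, city_to, distance_km) row per unordered pair."""
    rows = []
    for (name_a, (lat_a, lon_a)), (name_b, (lat_b, lon_b)) in combinations(cities.items(), 2):
        rows.append((name_a, name_b, round(haversine_km(lat_a, lon_a, lat_b, lon_b), 1)))
    return rows


rows = pairwise_rows(CITIES)

# Write to an in-memory buffer; a real script would write to a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["city_from", "city_to", "distance_km"])  # hypothetical column names
writer.writerows(rows)
print(buffer.getvalue())
```

Because each unordered pair appears once, `n` cities yield `n * (n - 1) / 2` rows, which is why the record count grows quickly with the number of cities.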
## How to Use
1. Install the required dependencies by running `pip install -r requirements.txt`.
2. Run `make city_distances.csv` to generate the dataset of city distances.
3. Run `make train` to train the neural network model.
4. Run `make eval` to evaluate the trained model and generate evaluation plots.
**You can also just run `make` (i.e., `make all`) which will run through all of those steps.**
## What to Expect
After training, the model should be able to understand the similarity between cities based on their geodesic distances.
You can inspect the evaluation plots generated by the `eval.py` script to see the improvement in similarity scores before and after training.
After five epochs, the model no longer treats the terms as unrelated:
![Evaluation plot](./plots/progress_35845_sm.png)
After ten epochs, we can see the model has learned to correlate our desired quantities:
![Evaluation plot](./plots/progress_680065_sm.png)
*The above plots are examples showing the relationship between geodesic distance and the similarity between the embedded vectors (1 = more similar), for 10,000 randomly selected pairs of US cities (re-sampled for each image).*
*Note the (vertical) "gap" in each image, which corresponds to the size of the continental United States (~5,000 km).*
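
The quantity on the similarity axis of plots like these can be sketched as the cosine similarity between two embedded vectors, paired with the known distance between the corresponding cities. The snippet below is a minimal illustration of that idea, not the contents of `eval.py`: the two-dimensional vectors, the placeholder city keys, and the distances are made-up toy values, and the Pearson correlation is added here only as one possible summary number.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy embeddings (assumed values); a real model emits much longer vectors.
embeddings = {
    "city_a": [1.0, 0.0],
    "city_b": [0.9, 0.1],
    "city_c": [0.0, 1.0],
}

# Toy distances in km for each pair (assumed values, not measurements).
distances_km = {
    ("city_a", "city_b"): 100.0,
    ("city_a", "city_c"): 4000.0,
    ("city_b", "city_c"): 3900.0,
}

# One (distance, similarity) point per pair, as in a scatter plot.
points = [
    (dist, cosine_similarity(embeddings[a], embeddings[b]))
    for (a, b), dist in distances_km.items()
]

r = pearson([d for d, _ in points], [s for _, s in points])
print(points)
print("correlation:", r)
```

In this toy setup nearby cities have similar vectors, so the correlation between distance and similarity comes out negative; an untrained model would be expected to show little or no such relationship.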
---
## Future Improvements
There are several potential improvements and extensions to the current model:
1. **Incorporate airport codes**: Train the model to understand the unique codes of airports, which could be useful for search engines and other applications.
2. **Add city aliases**: Enhance the dataset with city aliases, so the model can recognize different names for the same city. The `geonamescache` package already includes these.
3. **Include time zones**: Train the model to understand time zone differences between cities, which could be helpful for various time-sensitive use cases. The `geonamescache` package already includes this data, but how to calculate the hours between them is an open question.
4. **Expand to other distance metrics**: Adapt the model to consider other measures of distance, such as transportation infrastructure or travel time.
5. **Train on sentences**: Improve the model's performance on sentences by adding training and validation examples that use city names in the context of sentences. Generative AI could be used to write template sentences (mad-libs style), which are then filled in with city names to produce random and diverse training examples.
6. **Global city support**: Extend the model to support cities outside the US and cover a broader range of geographic locations.
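
Improvement (5) could be prototyped along the following lines. This is only a sketch of the mad-libs idea, not something the project implements: the templates, the `{city}` placeholder, the `make_example` helper, and the distance value are all hypothetical.

```python
import random

# Hypothetical templates; "{city}" marks where a city name is substituted.
TEMPLATES = [
    "I am flying to {city} next week.",
    "What is the weather like in {city}?",
    "My cousin just moved to {city}.",
]


def make_example(city_a, city_b, distance_km, rng):
    """Fill one randomly chosen template with each city of a pair.

    Returns (sentence_a, sentence_b, distance_km), so the two sentences
    differ only in the city name they mention.
    """
    template = rng.choice(TEMPLATES)
    return (template.format(city=city_a), template.format(city=city_b), distance_km)


rng = random.Random(0)  # seeded so the chosen template is reproducible
# 1000.0 is a placeholder value, not a measured distance.
example = make_example("Chicago", "Boston", 1000.0, rng)
print(example)
```

Using the same template for both sentences keeps the city name as the only difference between them, so any change in similarity can be attributed to the cities rather than to the wording.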
## Notes
- Generating the data took about 13 minutes (for 3269 US cities) on 8 cores (Intel 9700K), yielding 2,720,278 records (combinations of cities).
- Training on an Nvidia 3090 FE takes about an hour per epoch with an 80/20 train/test split. The batch size is 16, so there were 136,014 steps per epoch.
- Evaluation on the above hardware took about 15 minutes for 20 epochs at 10k samples each.
- **WARNING**: _It is unclear how the model performs on sentences, as it was trained and evaluated only on word-pairs._ See improvement (5) above.
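
The step count quoted above can be checked with a little arithmetic. The sketch assumes that 80% of the records go to training and that the final partial batch is kept (rounded up); both are assumptions about the training setup rather than facts taken from `train.py`.

```python
import math

TOTAL_RECORDS = 2_720_278  # record count from the notes
TRAIN_PERCENT = 80         # assumption: 80% of records are used for training
BATCH_SIZE = 16            # batch size from the notes

# Integer arithmetic avoids floating-point rounding in the split.
train_records = TOTAL_RECORDS * TRAIN_PERCENT // 100

# Assumption: the last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(train_records / BATCH_SIZE)

print(train_records, steps_per_epoch)  # 2176222 136014
```

The result matches the 136,014 steps per epoch reported in the notes, which is consistent with the larger share of the split being the training set.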