# CityBert

CityBert is a machine learning project that fine-tunes a neural network model to capture the similarity between cities based on their geodesic distances.

The project generates a dataset of US cities and their pairwise geodesic distances, which is then used to train the model.

Note that the model only considers geographic distance; it does not account for other factors such as political borders or transportation infrastructure.

The project can be extended to include other distance metrics or additional data, such as airport codes, city aliases, or time zones.
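
For illustration, here is a minimal sketch of how such a dataset could be produced, assuming the `geonamescache` package (mentioned under Future Improvements below) and `geopy` for geodesic distances; the actual `generate_data.py`, including the CSV column names, may differ.

```python
# Sketch only: the real generate_data.py may differ.
import csv
from itertools import combinations

from geonamescache import GeonamesCache
from geopy.distance import geodesic

# Collect US cities with their coordinates from geonamescache.
cities = [
    (c["name"], (c["latitude"], c["longitude"]))
    for c in GeonamesCache().get_cities().values()
    if c["countrycode"] == "US"
]

# Write every pairwise geodesic distance to CSV.
with open("city_distances.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["city_a", "city_b", "distance_km"])
    for (name_a, loc_a), (name_b, loc_b) in combinations(cities, 2):
        writer.writerow([name_a, name_b, round(geodesic(loc_a, loc_b).km, 1)])
```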
## Overview of Project Files

- `generate_data.py`: Generates the dataset of US cities and their pairwise geodesic distances.
- `train.py`: Trains the neural network model on the generated dataset.
- `eval.py`: Evaluates the trained model by comparing the similarity between city vectors before and after training.
- `Makefile`: Automates tasks such as generating the data, training, and evaluation.
- `README.md`: Describes the project, how to use it, and the expected results.
- `requirements.txt`: Lists the Python dependencies used to produce the results.
## How to Use

1. Install the required dependencies by running `pip install -r requirements.txt`.
2. Run `make city_distances.csv` to generate the dataset of city distances.
3. Run `make train` to train the neural network model (a sketch of a possible training setup follows this list).
4. Run `make eval` to evaluate the trained model and generate evaluation plots.
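
The Makefile targets are not reproduced here, but as a rough sketch of what `make train` might do, assuming the `sentence-transformers` training API, an assumed base model, a linear distance-to-similarity mapping with an assumed `MAX_KM` constant, and the CSV layout from the sketch above:

```python
# Sketch only: the real train.py may differ.
import csv

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

MAX_KM = 5000.0  # assumed normalization: 0 km -> label 1.0, MAX_KM or more -> 0.0

# Convert each city pair's distance into a similarity label in [0, 1].
examples = []
with open("city_distances.csv") as f:
    for row in csv.DictReader(f):
        label = max(0.0, 1.0 - float(row["distance_km"]) / MAX_KM)
        examples.append(InputExample(texts=[row["city_a"], row["city_b"]], label=label))

model = SentenceTransformer("bert-base-uncased")  # assumed base model
loader = DataLoader(examples, shuffle=True, batch_size=16)  # batch size from Notes below
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))], epochs=10)
model.save("citybert")
```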
## What to Expect

After training, the model should capture the similarity between cities based on their geodesic distances.

You can inspect the evaluation plots generated by the `eval.py` script to see how the similarity scores improve between the untrained and the trained model (a sketch of this comparison follows the plots below).

After five epochs:

![similarity after five epochs](./plots/progress_12800_sm.png)

After ten epochs:

![similarity after ten epochs](./plots/progress_25600_sm.png)

*The plots above show the similarity between city vectors for 10,000 randomly selected pairs of cities.*
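
As a rough sketch of the comparison behind these plots, reusing the model names assumed in the training sketch above (the real `eval.py` likely differs, and the city list here is illustrative):

```python
# Sketch only: compare cosine similarities from the base and fine-tuned models.
import random

from sentence_transformers import SentenceTransformer, util

city_names = ["New York City", "Los Angeles", "Chicago", "Houston", "Phoenix"]
pairs = [random.sample(city_names, 2) for _ in range(10)]

for name in ("bert-base-uncased", "citybert"):  # before vs. after fine-tuning
    model = SentenceTransformer(name)
    vectors = {city: model.encode(city) for city in city_names}
    sims = [float(util.cos_sim(vectors[a], vectors[b])) for a, b in pairs]
    print(f"{name}: mean similarity {sum(sims) / len(sims):.3f}")
```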
---
## Future Improvements

There are several potential improvements and extensions to the current model:

1. **Incorporate airport codes**: Train the model to associate airports' unique codes with their cities, which could be useful for search engines and other applications.
2. **Add city aliases**: Enhance the dataset with city aliases so the model can recognize different names for the same city. The `geonamescache` package already includes these.
3. **Include time zones**: Train the model to understand time-zone differences between cities, which could help with time-sensitive use cases. The `geonamescache` package already includes this data, but how to compute the hour offset between two zones remains an open question.
4. **Expand to other distance metrics**: Adapt the model to consider other measures of distance, such as transportation infrastructure or travel time.
5. **Train on sentences**: Improve the model's performance on sentences by adding training and validation examples that use city names in the context of sentences. Generative AI could produce template sentences (mad-libs style) to create random and diverse training examples (see the sketch after this list).
6. **Global city support**: Extend the model to cities outside the US, covering a broader range of geographic locations.
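
As a sketch of improvement (5), mad-libs-style sentence generation could look like this (the templates and cities are made up for illustration):

```python
# Sketch only: generate diverse sentences that embed city-name pairs.
import random

TEMPLATES = [
    "How far is {a} from {b}?",
    "I am flying from {a} to {b} next week.",
    "{a} is much closer to {b} than you might think.",
]
CITIES = ["Seattle", "Miami", "Denver", "Boston"]

def random_sentence_pair():
    """Return a templated sentence plus the city pair it mentions."""
    a, b = random.sample(CITIES, 2)
    return random.choice(TEMPLATES).format(a=a, b=b), (a, b)

print(random_sentence_pair())
```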
## Notes

- Generating the data took about 13 minutes (for 3269 US cities) on 8 cores (Intel 9700K), yielding 2,720,278 records (combinations of cities).
- Training on an Nvidia 3090 FE takes about an hour per epoch with an 80/20 train/test split. With a batch size of 16, that works out to 136,014 steps per epoch (2,720,278 records × 0.8 ÷ 16).
- **WARNING**: _It is unclear how the model performs on sentences, as it was trained and evaluated only on word pairs._ See improvement (5) above.