mm
2 years ago
5 changed files with 82 additions and 19 deletions
@ -1,21 +1,67 @@ |
|||
# citybert |
|||
# CityBert |
|||
|
|||
1. Generates dataset of cities (US only for now) and their pair-wise geodesic distances. |
|||
2. Uses that dataset to fine-tune a neural-net to understand that cities closer to one another are more similar. |
|||
3. Distances become `labels` through the formula `1 - distance/MAX_DISTANCE`, where `MAX_DISTANCE=20_037.5 # km` represents half of the Earth's circumference. |
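The label formula in step 3 can be sketched as follows. This is a minimal illustration rather than the project's actual code: the names `haversine_km` and `distance_to_label` are hypothetical, and the haversine great-circle formula (on a sphere of assumed mean radius 6371 km) stands in for a true geodesic calculation.

```python
import math

MAX_DISTANCE = 20_037.5   # km, half of the Earth's circumference (value from the README)
EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption made for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (a spherical stand-in for a geodesic)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_label(distance_km):
    """Map a distance to a similarity label: 1.0 = same place, 0.0 = antipodal."""
    return 1 - distance_km / MAX_DISTANCE


# Example: New York City to Los Angeles (approximate coordinates)
d = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
label = distance_to_label(d)
```

Because the label is linear in distance, any two US cities end up with fairly high labels; the model has to learn fine differences within a narrow band near 1.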
|||
CityBert is a machine learning project that fine-tunes a neural network model to understand the similarity between cities based on their geodesic distances. |
|||
|
|||
Other factors, such as political borders, can make cities that are "close together" on the globe "far apart" in reality. |
|||
Factors like this are not considered in this model; it considers only geography. |
|||
The project generates a dataset of US cities and their pair-wise geodesic distances, which are then used to train the model. |
|||
|
|||
However, for use-cases that involve different measures of distance (perhaps just time-zones, or something that considers the reality of travel), the general principles proven here should be applicable (pick a metric, generate data, train). |
|||
The project can be extended to include other distance metrics or additional data, such as airport codes, city aliases, or time zones. |
|||
|
|||
A particularly useful addition to the dataset here: |
|||
- airports: they have more or less unique codes, and this semantic understanding would be helpful for search engines. |
|||
- aliases for cities: the dataset used for city data (lat/lon) contains a pretty exhaustive list of aliases for the cities. It would be good to generate examples of these with a distance of 0 and train the model on this knowledge. |
|||
- time-zones: encode difference in hours (relative to worst-possible-case) as labels associated with the time-zone formatted-strings. |
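The alias idea in the list above could be sketched like this. It is an illustration under stated assumptions: `alias_pairs` and the sample dictionary are hypothetical, and the real alias data would come from the city dataset rather than a hand-written mapping.

```python
def alias_pairs(aliases_by_city):
    """Yield (name, alias, label) rows, one per alias of each city."""
    for city, aliases in aliases_by_city.items():
        for alias in aliases:
            if alias != city:
                # Distance 0 between a city and its alias gives
                # label = 1 - 0 / MAX_DISTANCE = 1.0
                yield (city, alias, 1.0)


# Hypothetical sample data, for illustration only
sample = {
    "New York City": ["NYC", "New York", "Big Apple"],
    "Los Angeles": ["LA"],
}
rows = list(alias_pairs(sample))
```

Rows like these could be appended to the distance dataset so that the model sees different names for the same place as maximally similar.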
|||
> Note that this model only considers geographic distances and does not take into account other factors such as political borders or transportation infrastructure. |
|||
These factors contribute to a sense of "distance as it pertains to travel difficulty," which is not directly reflected by this model. |
|||
|
|||
# notes |
|||
- see `Makefile` for instructions. |
|||
- Generating the data took about 13 minutes (for 3269 US cities) on 8-cores (Intel 9700K), yielding 2,720,278 records (combinations of cities). |
|||
|
|||
## Overview of Project Files |
|||
|
|||
- `generate_data.py`: Generates a dataset of US cities and their pairwise geodesic distances. |
|||
- `train.py`: Trains the neural network model using the generated dataset. |
|||
- `eval.py`: Evaluates the trained model by comparing the similarity between city vectors before and after training. |
|||
- `Makefile`: Automates the execution of various tasks, such as generating data, training, and evaluation. |
|||
- `README.md`: Provides a description of the project, instructions on how to use it, and expected results. |
|||
- `requirements.txt`: Defines requirements used for creating the results. |
|||
|
|||
|
|||
## How to Use |
|||
|
|||
1. Install the required dependencies by running `pip install -r requirements.txt`. |
|||
2. Run `make city_distances.csv` to generate the dataset of city distances. |
|||
3. Run `make train` to train the neural network model. |
|||
4. Run `make eval` to evaluate the trained model and generate evaluation plots. |
|||
|
|||
**You can also just run `make` (i.e., `make all`) which will run through all of those steps.** |
|||
|
|||
|
|||
## What to Expect |
|||
|
|||
After training, the model should be able to understand the similarity between cities based on their geodesic distances. |
|||
You can inspect the evaluation plots generated by the `eval.py` script to see the improvement in similarity scores before and after training. |
|||
|
|||
After five epochs, the model no longer treats the terms as unrelated: |
|||
![Evaluation plot](./plots/progress_35845_sm.png) |
|||
|
|||
After ten epochs, we can see the model has learned to correlate our desired quantities: |
|||
![Evaluation plot](./plots/progress_680065_sm.png) |
|||
|
|||
|
|||
*The above plots are examples showing the relationship between geodesic distance and the similarity between the embedded vectors (1 = more similar), for 10,000 randomly selected pairs of US cities (re-sampled for each image).* |
|||
|
|||
*Note the (vertical) "gap" we see in the image, corresponding to the size of the continental United States (~5,000 km)* |
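The similarity axis in these plots compares pairs of embedded vectors. The README does not say which similarity measure is used; assuming it is cosine similarity (a common choice for embeddings), a minimal sketch looks like this. The function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated vectors
```

Under this assumption, a well-trained model would place nearby cities at a similarity close to their `1 - distance/MAX_DISTANCE` label.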
|||
|
|||
--- |
|||
|
|||
## Future Improvements |
|||
|
|||
There are several potential improvements and extensions to the current model: |
|||
|
|||
1. **Incorporate airport codes**: Train the model to understand the unique codes of airports, which could be useful for search engines and other applications. |
|||
2. **Add city aliases**: Enhance the dataset with city aliases, so the model can recognize different names for the same city. The `geonamescache` package already includes these. |
|||
3. **Include time zones**: Train the model to understand time zone differences between cities, which could be helpful for various time-sensitive use cases. The `geonamescache` package already includes this data, but how to calculate the hours between them is an open question. |
|||
4. **Expand to other distance metrics**: Adapt the model to consider other measures of distance, such as transportation infrastructure or travel time. |
|||
5. **Train on sentences**: Improve the model's performance on sentences by adding training and validation examples that involve city names in the context of sentences. Generative AI could be used to write template sentences (mad-libs style), which are then filled in to produce random and diverse training examples. |
|||
6. **Global city support**: Extend the model to support cities outside the US and cover a broader range of geographic locations. |
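As a rough sketch of improvement 5, template sentences can be filled in mad-libs style. Everything here is hypothetical: the templates, the `make_sentence_pair` helper, and the choice to reuse one template for both cities are assumptions for illustration, and the label is a placeholder rather than a computed value.

```python
import random

# Hypothetical templates; in practice these could be written by hand
# or produced by a generative model.
TEMPLATES = [
    "I am flying to {city} next week.",
    "What is the weather like in {city}?",
    "My family moved from {city} last year.",
]


def make_sentence_pair(city_a, city_b, label, rng):
    """Fill one randomly chosen template with each city, keeping the pair's label."""
    template = rng.choice(TEMPLATES)
    return (template.format(city=city_a), template.format(city=city_b), label)


rng = random.Random(0)  # seeded so the sketch is reproducible
example = make_sentence_pair("Boston", "Denver", 0.86, rng)  # 0.86 is a placeholder label
```

Using the same template for both sides keeps the city name as the only difference between the two sentences; mixing templates within a pair would be a different design that also tests robustness to context.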
|||
|
|||
|
|||
# Notes |
|||
- Generating the data took about 13 minutes (for 3269 US cities) on 8-cores (Intel 9700K), yielding 2,720,278 records (combinations of cities). |
|||
- Training on an Nvidia 3090 FE takes about an hour per epoch with an 80/20 train/test split. Batch size is 16, so there were 136,014 steps per epoch. |
|||
- **TODO**: Need to add training / validation examples that involve city names in the context of sentences. _It is unclear how the model performs on sentences, as it was trained only on word-pairs._ |
|||
- Evaluation on the above hardware took about 15 minutes for 20 epochs at 10k samples each. |
|||
- **WARNING**: _It is unclear how the model performs on sentences, as it was trained and evaluated only on word-pairs._ See improvement (5) above. |
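The step count in the notes above can be reproduced with a little arithmetic, assuming 80% of the records are used for training and the last partial batch counts as a step. The variable names are illustrative only.

```python
import math

records = 2_720_278   # city pairs reported in the notes
batch_size = 16       # batch size reported in the notes

# Assumption: 80% of the records go to training
train_records = records * 80 // 100
steps_per_epoch = math.ceil(train_records / batch_size)

# Side observation: the record count equals the number of unordered pairs
# over 2,333 items, not over 3,269. Why that is (for example, duplicate
# city names being collapsed) is an inference, not stated in the README.
pairs_over_2333 = math.comb(2333, 2)
pairs_over_3269 = math.comb(3269, 2)
```

With these assumptions `steps_per_epoch` comes out to 136,014, matching the figure in the notes.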
|||
|