CityBert is a machine learning project that fine-tunes a neural network model to understand the similarity between cities based on their geodesic distances.
> Note that this model only considers geographic distances and does not take into account other factors such as political borders or transportation infrastructure.
> These factors contribute to a sense of "distance as it pertains to travel difficulty," which is not directly reflected by this model.
*The above plots are examples showing the relationship between geodesic distance and the similarity between the embedded vectors (1 = more similar), for 10,000 randomly selected pairs of US cities (re-sampled for each image).*
*Note the vertical "gap" in each image, which corresponds to the size of the continental United States (~5,000 km).*
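
As a rough illustration of the similarity score on the plots' vertical axis, here is a minimal sketch that assumes the score is cosine similarity (the text above only says that 1 means more similar). The function name and the toy vectors are hypothetical and are not taken from the project; real embeddings would come from the fine-tuned model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (1.0 = same direction)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for city embeddings (hypothetical values).
emb_a = [0.2, 0.9, 0.1]
emb_b = [0.25, 0.85, 0.05]  # points in nearly the same direction as emb_a
emb_c = [-0.7, 0.1, 0.6]    # points in a very different direction

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
```

If the model behaves as intended, nearby cities should score like `sim_ab` (close to 1) and distant cities should score lower, like `sim_ac`.
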
---
## Future Improvements
There are several potential improvements and extensions to the current model:
1. **Incorporate airport codes**: Train the model to understand the unique codes of airports, which could be useful for search engines and other applications.
2. **Add city aliases**: Enhance the dataset with city aliases, so the model can recognize different names for the same city. The `geonamescache` package already includes these.
3. **Include time zones**: Train the model to understand time zone differences between cities, which could be helpful for various time-sensitive use cases. The `geonamescache` package already includes this data, but how to calculate the hour offset between two zones is an open question.
4. **Expand to other distance metrics**: Adapt the model to consider other measures of distance, such as transportation infrastructure or travel time.
5. **Train on sentences**: Improve the model's performance on sentences by adding training and validation examples that use city names in the context of sentences. Generative AI could produce template sentences (mad-libs style), which are then filled in to create random and diverse training examples.
6. **Global city support**: Extend the model to support cities outside the US and cover a broader range of geographic locations.
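
The mad-libs idea in item 5 could be sketched roughly as follows. The templates, the `make_sentences` helper, and its parameters are all hypothetical; in practice the templates might be drafted by a generative model and then reviewed by hand.

```python
import random

# Hypothetical templates with a {city} slot; not part of the project.
TEMPLATES = [
    "I am flying to {city} next week.",
    "What is the weather like in {city} today?",
    "She grew up near {city} before moving away.",
]


def make_sentences(cities, n_per_city=2, seed=0):
    """Return (city, sentence) pairs by filling randomly chosen templates.

    n_per_city must not exceed len(TEMPLATES), because templates are
    sampled without replacement for each city.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    sentences = []
    for city in cities:
        for template in rng.sample(TEMPLATES, k=n_per_city):
            sentences.append((city, template.format(city=city)))
    return sentences


examples = make_sentences(["Boston", "Denver"])
```

Keeping the city name alongside each sentence makes it straightforward to pair sentences later with the same distance-based targets used for bare city names.
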
# Notes
- Generating the data took about 13 minutes (for 3,269 US cities) on 8 cores (Intel 9700K), yielding 2,720,278 records (combinations of cities).
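
For illustration, records of this kind (one per unordered pair of cities) could be built roughly as follows. This is a sketch, not the project's data-generation code: it uses the haversine great-circle formula on a sphere as a stand-in for a true geodesic distance, and the city list, coordinates (approximate), and function names are assumptions.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the sphere is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# (name, latitude, longitude) with approximate coordinates, for illustration only.
cities = [
    ("New York", 40.7128, -74.0060),
    ("Los Angeles", 34.0522, -118.2437),
    ("Chicago", 41.8781, -87.6298),
    ("Houston", 29.7604, -95.3698),
]


def build_records(cities):
    """One (name_a, name_b, distance_km) record per unordered city pair."""
    return [
        (a[0], b[0], haversine_km(a[1], a[2], b[1], b[2]))
        for a, b in combinations(cities, 2)
    ]


records = build_records(cities)
```

Because `combinations` yields each unordered pair once, `n` cities produce `n * (n - 1) / 2` records, so the record count grows quadratically with the number of cities.
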