# CityBert
CityBert is a machine learning project that fine-tunes a neural network model to understand the similarity between cities based on their geodesic distances.

The project generates a dataset of US cities and their pairwise geodesic distances, which is then used to train the model.
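
To make the data-generation step concrete, here is a minimal sketch of how such a dataset can be built. It is not the project's actual `generate_data.py`: it assumes the `geopy` library for the geodesic calculation (the `geonamescache` package is the project's city source, per the notes below), and the CSV column names are illustrative.

```python
# Minimal sketch of dataset generation (not the actual generate_data.py).
# Assumes the geonamescache and geopy packages; column names are illustrative.
import csv
from itertools import combinations

import geonamescache
from geopy.distance import geodesic

gc = geonamescache.GeonamesCache()
us_cities = [
    (c["name"], (c["latitude"], c["longitude"]))
    for c in gc.get_cities().values()
    if c["countrycode"] == "US"
]

with open("city_distances.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["city_a", "city_b", "distance_km"])
    # Every unordered pair of cities, with its geodesic distance in km.
    for (name_a, loc_a), (name_b, loc_b) in combinations(us_cities, 2):
        writer.writerow([name_a, name_b, round(geodesic(loc_a, loc_b).km, 1)])
```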
The project can be extended to include other distance metrics or additional data, such as airport codes, city aliases, or time zones.

> Note that this model only considers geographic distances and does not take into account other factors such as political borders or transportation infrastructure.
> These factors contribute to a sense of "distance as it pertains to travel difficulty," which is not directly reflected by this model.
## But Why?
The primary goal of this project is to demonstrate that pre-trained language models can be fine-tuned to understand geographic relationships between cities.
By training a model on pairs of city names and their geodesic distances, we can improve its ability to infer the similarity between cities based on their names alone.

This can be beneficial in various applications, such as search engines, recommendation systems, or other natural language processing tasks that involve understanding geographic context.
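
Concretely, pair-based fine-tuning along these lines could be set up with the `sentence-transformers` library, as in the sketch below. This is a hedged illustration rather than the project's actual `train.py`: the base model, the cosine-similarity loss, and the distance-to-label scaling are all assumptions, and the two example pairs use approximate distances.

```python
# Illustrative fine-tuning sketch (not the project's actual train.py).
# Assumes sentence-transformers; base model, loss, and label scaling are guesses.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("bert-base-uncased")

MAX_KM = 20_000.0  # roughly half of Earth's circumference, used to scale distances

# Each training example maps a city pair to a similarity label in [0, 1]:
# nearby cities get labels near 1, distant cities get labels near 0.
examples = [
    InputExample(texts=["New York City", "Newark"], label=1.0 - 14.0 / MAX_KM),
    InputExample(texts=["New York City", "Los Angeles"], label=1.0 - 3936.0 / MAX_KM),
]

loader = DataLoader(examples, batch_size=16, shuffle=True)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=10)
```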
### Tradeoffs
This approach captures geographic relationships through language patterns alone, without relying on traditional methods such as latitude/longitude lookups.

Compared with those methods, a neural network provides a more robust and flexible representation of location-based information, and it can capture complex patterns and relationships in natural language tasks that are not easily modeled with a lookup table.

While there is an initial computational overhead in training the model, the benefits include the ability to handle a variety of location-based queries (e.g., queries based on city landmarks) and better handling of aliases and alternate city names.

This tradeoff enables more efficient and context-aware processing of location-based information, making it a valuable choice in _certain_ applications.

In scenarios where high precision is required, or where the problem can be solved quickly using traditional methods, table lookups and geodesic calculations may still be more suitable.

Ultimately, whether a neural network is more efficient than traditional methods depends on the specific problem being addressed and the desired trade-offs between accuracy, efficiency, and flexibility.
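
To make the contrast concrete, the sketch below shows the two query styles side by side. It assumes `geopy` for the geodesic calculation and a hypothetical path to the fine-tuned model; the coordinates and values are illustrative.

```python
# Side-by-side sketch of the two approaches; paths and values are illustrative.
from geopy.distance import geodesic
from sentence_transformers import SentenceTransformer, util

# Traditional method: needs a coordinate lookup for every spelling or alias.
coords = {"New York City": (40.7128, -74.0060), "Los Angeles": (34.0522, -118.2437)}
print(geodesic(coords["New York City"], coords["Los Angeles"]).km)  # ~3936 km

# Fine-tuned model: scores similarity from the names alone.
model = SentenceTransformer("output/citybert")  # hypothetical path to the trained model
emb_a, emb_b = model.encode(["New York City", "Los Angeles"])
print(util.cos_sim(emb_a, emb_b))  # lower similarity => farther apart
```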
### Applicability to Other Tasks
The approach demonstrated in this project can also be extended to other metrics or features beyond geographic distance.

By adapting the dataset generation and fine-tuning processes, you can train models to learn various relationships and similarities between different entities.
## Overview of Project Files
- `generate_data.py`: Generates a dataset of US cities and their pairwise geodesic distances.
- `train.py`: Trains the neural network model using the generated dataset.
- `eval.py`: Evaluates the trained model by comparing the similarity between city vectors before and after training.
- `Makefile`: Automates the execution of various tasks, such as generating data, training, and evaluation.
- `README.md`: Provides a description of the project, instructions on how to use it, and expected results.
- `requirements.txt`: Pins the package versions used to produce the results.
## How to Use
1. Install the required dependencies by running `pip install -r requirements.txt`.
2. Run `make city_distances.csv` to generate the dataset of city distances.
3. Run `make train` to train the neural network model.
4. Run `make eval` to evaluate the trained model and generate evaluation plots.

**You can also just run `make` (i.e., `make all`), which will run through all of those steps.**
## What to Expect
After training, the model should be able to understand the similarity between cities based on their geodesic distances.

You can inspect the evaluation plots generated by the `eval.py` script to see the improvement in similarity scores before and after training.
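
Under the hood, the evaluation amounts to comparing embedding similarity against geodesic distance for sampled city pairs. Here is a minimal sketch of the idea (not the actual `eval.py`; the model path and CSV column names are assumptions):

```python
# Minimal sketch of the evaluation idea (not the actual eval.py).
# The model path and CSV column names are assumptions.
import pandas as pd
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

pairs = pd.read_csv("city_distances.csv").sample(10_000)
model = SentenceTransformer("output/citybert")  # assumed path to the trained model

emb_a = model.encode(pairs["city_a"].tolist(), convert_to_tensor=True)
emb_b = model.encode(pairs["city_b"].tolist(), convert_to_tensor=True)

# Row-wise cosine similarity between each embedded city pair; plotting these
# against the pairs' distances yields scatter plots like the ones below.
sims = F.cosine_similarity(emb_a, emb_b)
```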
After five epochs, the model no longer treats the city names as unrelated:

![Evaluation plot](./plots/progress_35845_sm.png)

After ten epochs, we can see the model has learned to correlate our desired quantities:

![Evaluation plot](./plots/progress_71690_sm.png)

*The above plots show the relationship between geodesic distance and the similarity between the embedded vectors (1 = more similar), for 10,000 randomly selected pairs of US cities (re-sampled for each image).*

*Note the (vertical) "gap" in the images, corresponding to the size of the continental United States (~5,000 km).*

---
## Future Improvements
There are several potential improvements and extensions to the current model:

1. **Incorporate airport codes**: Train the model to understand the unique codes of airports, which could be useful for search engines and other applications.
2. **Add city aliases**: Enhance the dataset with city aliases, so the model can recognize different names for the same city. The `geonamescache` package already includes these.
3. **Include time zones**: Train the model to understand time zone differences between cities, which could be helpful for various time-sensitive use cases. The `geonamescache` package already includes this data, but how to calculate the hours between two cities is an open question (see the sketch after this list).
4. **Expand to other distance metrics**: Adapt the model to consider other measures of distance, such as transportation infrastructure or travel time.
5. **Train on sentences**: Improve the model's performance on sentences by adding training and validation examples that use city names in the context of sentences. Generative AI could be used to fill template sentences (mad-libs style), producing random and diverse training examples.
6. **Global city support**: Extend the model to support cities outside the US and cover a broader range of geographic locations.
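
On point (3) above, here is a hedged sketch of one way to compute the hour offset between two cities' time zones using only the standard library. The zone names are illustrative (`geonamescache` provides a time zone name per city), and note that offsets shift with daylight saving time, which is part of why this remains an open question.

```python
# Hedged sketch for improvement (3): hour offset between two IANA time zones.
# Standard library only; the zone names below are illustrative.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def hour_offset(tz_a: str, tz_b: str) -> float:
    """Current difference in UTC offsets, in hours (shifts with daylight saving)."""
    now = datetime.now(timezone.utc)
    delta = (now.astimezone(ZoneInfo(tz_a)).utcoffset()
             - now.astimezone(ZoneInfo(tz_b)).utcoffset())
    return delta.total_seconds() / 3600

print(hour_offset("America/New_York", "America/Los_Angeles"))  # 3.0
```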
## Notes

- Generating the data took about 13 minutes (for 3,269 US cities) on 8 cores (Intel 9700K), yielding 2,720,278 records (combinations of cities).
- Training on an Nvidia 3090 FE takes about an hour per epoch with an 80/20 train/test split. The batch size is 16, so there were 136,014 steps per epoch.
- Evaluation on the above hardware took about 15 minutes for 20 epochs at 10k samples each.
- **WARNING**: _It is unclear how the model performs on sentences, as it was trained and evaluated only on word pairs._ See improvement (5) above.