teaching a transformer to understand how far apart (common) cities are.
Michael Pilosov fbe4503801 workers
2026-05-25 23:36:06 +00:00

Sign Coordinate Regressor

This project fine-tunes a sentence-transformer model to predict a latitude and longitude pair from text observed on signs at an intersection.

The raw dataset is expected at training_data_raw.csv. It may include the pandas-exported index column, which is ignored; the preprocessing script uses only these columns:

  • intersection
  • text_on_sign_exact
  • latitude
  • longitude
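As a rough illustration of the column selection described above, the following sketch reads a raw CSV and keeps only the four required columns, ignoring anything else such as an exported index column. It uses only the standard library and is not the code in prepare_training_data.py; the helper name read_raw_rows and the sample row are made up for this example.

```python
import csv
import io

REQUIRED_COLUMNS = ("intersection", "text_on_sign_exact", "latitude", "longitude")


def read_raw_rows(handle):
    """Yield dicts holding only the required columns (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for row in reader:
        # Extra columns (for example an unnamed index column) are dropped here.
        yield {
            "intersection": row["intersection"],
            "text_on_sign_exact": row["text_on_sign_exact"],
            "latitude": float(row["latitude"]),
            "longitude": float(row["longitude"]),
        }


# Made-up sample data; the leading unnamed column mimics a pandas-exported index.
sample = io.StringIO(
    ",intersection,text_on_sign_exact,latitude,longitude\n"
    "0,Main St & 1st Ave,STOP,39.7392,-104.9903\n"
)
rows = list(read_raw_rows(sample))
```

For a real file, the same helper could be called with open("training_data_raw.csv", newline="") in place of the in-memory sample.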

Workflow

Prepare bootstrapped training rows:

source .venv/bin/activate
python prepare_training_data.py --seed 1992 --bag-size 5 --samples-per-intersection 50

This writes training.csv with the following columns:

intersection,sample_id,text,latitude,longitude,unique_sign_count,raw_sign_count

Each row is a bootstrap sample of sign texts from one intersection, joined into a single text field. Sampling is deterministic for a given --seed. The model therefore trains on "some signs seen at this coordinate" rather than on a single sign or on every sign at that coordinate.
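The bagging idea can be sketched in a few lines. This is a minimal illustration, not the logic in prepare_training_data.py: it assumes sampling with replacement, a " | " separator between sign texts, and a hypothetical function name sample_bags. It does not model the unique_sign_count or raw_sign_count columns.

```python
import random


def sample_bags(sign_texts, bag_size, n_samples, seed, separator=" | "):
    """Draw n_samples bags of bag_size sign texts each (hypothetical helper).

    Assumptions for illustration: texts are drawn with replacement and
    joined with `separator`; the real script may differ on both points.
    """
    rng = random.Random(seed)  # a seeded generator makes the output repeatable
    bags = []
    for _ in range(n_samples):
        picked = rng.choices(sign_texts, k=bag_size)  # sampling with replacement
        bags.append(separator.join(picked))
    return bags


# Made-up sign texts for one intersection, with the option values shown above.
signs = ["STOP", "MAIN ST", "ONE WAY"]
bags = sample_bags(signs, bag_size=5, n_samples=50, seed=1992)
```

Because the generator is seeded, calling the function again with the same arguments returns the same bags, which is what makes the prepared dataset reproducible.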

Train the model:

python train.py

Evaluate and write predictions:

python eval.py

This writes predictions.csv and a map-style diagnostic plot:

Prediction map

It also writes a coordinate calibration plot:

Predicted vs actual coordinates

Or run the full pipeline:

make

Useful Options

prepare_training_data.py:

  • --seed: deterministic bootstrap seed.
  • --bag-size: number of sign texts per sampled bag.
  • --samples-per-intersection: number of bags generated per intersection.

train.py:

  • --model-name: sentence-transformers base model.
  • --epochs: training epochs.
  • --batch-size: batch size.
  • --device: device override such as cpu, cuda, or mps. Defaults to cuda.
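For orientation, the train.py options listed above could be declared with argparse roughly as follows. This is a sketch only: the flag names and the cuda default come from the list, while the other default values (the model name, epoch count, and batch size) are placeholders invented for this example and are not the values used by train.py.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented train.py flags (sketch)."""
    parser = argparse.ArgumentParser(description="Train the sign coordinate regressor.")
    # Placeholder default: the real base model name is not stated in this README.
    parser.add_argument("--model-name", default="PLACEHOLDER-MODEL",
                        help="sentence-transformers base model")
    # Placeholder defaults for epochs and batch size.
    parser.add_argument("--epochs", type=int, default=1, help="training epochs")
    parser.add_argument("--batch-size", type=int, default=8, help="batch size")
    # Documented default: cuda.
    parser.add_argument("--device", default="cuda",
                        help="device override such as cpu, cuda, or mps")
    return parser


# Equivalent to running: python train.py --device cpu --epochs 3
args = build_parser().parse_args(["--device", "cpu", "--epochs", "3"])
```

On a machine without a GPU, passing --device cpu (or --device mps on Apple silicon) overrides the cuda default.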

Outputs

  • training.csv: prepared bootstrapped dataset.
  • output/: saved sentence-transformer encoder, coordinate head, and coordinate normalization metadata.
  • predictions.csv: evaluation rows with predicted coordinates and error_km.
  • plots/prediction_map.png: actual vs predicted coordinates with line segments showing the prediction error.
  • plots/predicted_vs_actual.png: predicted vs actual latitude and longitude scatter plots.
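One common way to express the distance between a predicted and an actual coordinate in kilometres is the haversine great-circle formula. The sketch below assumes that error_km is a distance of this kind; eval.py may compute it differently, and the function name haversine_km is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of latitude along a meridian is roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
```

Applied row by row to the actual and predicted columns of predictions.csv, a function like this yields a per-row error that can be summarized with a mean or median.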