teaching a transformer to understand how far apart (common) cities are.

Michael Pilosov 099ac371c4 add training data		2026-05-28 16:51:17 +00:00
.gitignore	more training data + frozen layers options	2026-05-25 15:16:19 -06:00
.python-version	first train/eval	2026-05-25 14:11:05 -06:00
data_utils.py	Improve training signal and add honest eval metrics	2026-05-25 22:41:14 +00:00
debug_distance.py	major bugfix	2023-05-05 03:34:56 +00:00
eval.py	Improve training signal and add honest eval metrics	2026-05-25 22:41:14 +00:00
generate_data.py	fix bug in city lookups	2023-05-05 21:48:16 +00:00
Makefile	workers	2026-05-25 23:36:06 +00:00
prepare_training_data.py	Improve training signal and add honest eval metrics	2026-05-25 22:41:14 +00:00
progress_sample.png	fix bug in city lookups	2023-05-05 21:48:16 +00:00
README.md	cuda	2026-05-25 21:25:24 +00:00
requirements.txt	first train/eval	2026-05-25 14:11:05 -06:00
train.py	Improve training signal and add honest eval metrics	2026-05-25 22:41:14 +00:00
training_data_raw.csv	add training data	2026-05-28 16:51:17 +00:00

README.md

Sign Coordinate Regressor

This project fine-tunes a sentence-transformer model to predict a latitude and longitude pair from text observed on signs at an intersection.

The raw dataset is expected at training_data_raw.csv. It may include the pandas-exported index column; the preprocessing script only uses:

intersection
text_on_sign_exact
latitude
longitude
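
As a rough illustration of that column selection, the sketch below reads a CSV with the standard library and keeps only the four fields listed above. The helper name `load_raw_rows` and the sample rows are hypothetical and are not taken from prepare_training_data.py, which may well use pandas instead.

```python
import csv
import io

REQUIRED_COLUMNS = ("intersection", "text_on_sign_exact", "latitude", "longitude")


def load_raw_rows(handle):
    """Read raw rows from an open CSV handle, keeping only the required columns.

    Hypothetical helper for illustration; not the project's actual loader.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    rows = []
    for record in reader:
        rows.append({
            "intersection": record["intersection"],
            "text_on_sign_exact": record["text_on_sign_exact"],
            "latitude": float(record["latitude"]),
            "longitude": float(record["longitude"]),
        })
    return rows


# A pandas-exported index appears as an unnamed first column; it is simply ignored.
# The rows below are made-up sample data.
sample = io.StringIO(
    ",intersection,text_on_sign_exact,latitude,longitude\n"
    "0,Main St & 1st Ave,STOP,39.7392,-104.9903\n"
    "1,Main St & 1st Ave,ONE WAY,39.7392,-104.9903\n"
)
rows = load_raw_rows(sample)
```

For the real file, the same function could be called inside `with open("training_data_raw.csv", newline="") as handle:`.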

Workflow

Prepare bootstrapped training rows:

source .venv/bin/activate
python prepare_training_data.py --seed 1992 --bag-size 5 --samples-per-intersection 50

This writes training.csv with rows shaped like:

intersection,sample_id,text,latitude,longitude,unique_sign_count,raw_sign_count

Each row is a bootstrap sample of sign texts from one intersection, drawn deterministically from --seed and joined into a single text field. The model therefore trains on "some signs seen at this coordinate" rather than on a single sign or on every sign at that coordinate.
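
The bagging step can be sketched roughly as follows. This is a minimal illustration and not the code in prepare_training_data.py: the function name `make_bags`, the `" | "` separator, sampling with replacement, and the way the two count columns are computed are all assumptions.

```python
import random
from collections import defaultdict


def make_bags(rows, seed=1992, bag_size=5, samples_per_intersection=50):
    """Build bootstrapped bags of sign texts, grouped by intersection.

    Hypothetical helper; defaults mirror the example command above.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["intersection"]].append(row)

    rng = random.Random(seed)  # a fixed seed makes every run produce the same bags
    bags = []
    for intersection in sorted(grouped):  # visit intersections in a stable order
        group = grouped[intersection]
        texts = [r["text_on_sign_exact"] for r in group]
        for sample_id in range(samples_per_intersection):
            # Bootstrap draw: sampling with replacement, so a sign may repeat in a bag.
            drawn = rng.choices(texts, k=bag_size)
            bags.append({
                "intersection": intersection,
                "sample_id": sample_id,
                "text": " | ".join(drawn),  # separator is an assumption
                "latitude": group[0]["latitude"],
                "longitude": group[0]["longitude"],
                "unique_sign_count": len(set(texts)),  # assumed meaning of the column
                "raw_sign_count": len(texts),  # assumed meaning of the column
            })
    return bags


# Made-up input rows for illustration.
example_rows = [
    {"intersection": "A", "text_on_sign_exact": "STOP", "latitude": 1.0, "longitude": 2.0},
    {"intersection": "A", "text_on_sign_exact": "ONE WAY", "latitude": 1.0, "longitude": 2.0},
    {"intersection": "B", "text_on_sign_exact": "YIELD", "latitude": 3.0, "longitude": 4.0},
]
example_bags = make_bags(example_rows, seed=1992, bag_size=3, samples_per_intersection=4)
```

Calling `make_bags` twice with the same seed returns identical bags, which is what makes the prepared dataset reproducible.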

Train the model:

python train.py

Evaluate and write predictions:

python eval.py

This writes predictions.csv and a map-style diagnostic plot, plots/prediction_map.png.

It also writes a coordinate calibration plot, plots/predicted_vs_actual.png.

Or run the full pipeline:

make

Useful Options

prepare_training_data.py:

--seed: deterministic bootstrap seed.
--bag-size: number of sign texts per sampled bag.
--samples-per-intersection: number of bags generated per intersection.
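
For orientation, these three flags could be declared with argparse along the following lines. This is only a sketch: the defaults shown are copied from the example command in the Workflow section and are an assumption, since the script's real defaults are not stated here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(description="Prepare bootstrapped training rows.")
    # Defaults below are assumptions taken from the README's example command.
    parser.add_argument("--seed", type=int, default=1992,
                        help="deterministic bootstrap seed")
    parser.add_argument("--bag-size", type=int, default=5,
                        help="number of sign texts per sampled bag")
    parser.add_argument("--samples-per-intersection", type=int, default=50,
                        help="number of bags generated per intersection")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--bag-size", "8"])
```

argparse converts dashes to underscores, so the values are read as `args.seed`, `args.bag_size`, and `args.samples_per_intersection`.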

train.py:

--model-name: sentence-transformers base model.
--epochs: training epochs.
--batch-size: batch size.
--device: device override such as cpu, cuda, or mps. Defaults to cuda.

Outputs

training.csv: prepared bootstrapped dataset.
output/: saved sentence-transformer encoder, coordinate head, and coordinate normalization metadata.
predictions.csv: evaluation rows with predicted coordinates and error_km.
plots/prediction_map.png: actual vs predicted coordinates with line segments showing the prediction error.
plots/predicted_vs_actual.png: predicted vs actual latitude and longitude scatter plots.
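
The README does not say how error_km is computed. A common choice for the distance between an actual and a predicted coordinate is the haversine great-circle distance, sketched below under that assumption; the function name `haversine_km` and the Earth radius constant are illustrative and are not taken from eval.py.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# One degree of latitude along a meridian is roughly 111.2 km.
one_degree_km = haversine_km(0.0, 0.0, 1.0, 0.0)
```

Applied per row, this would take the actual and predicted latitude and longitude and return a single distance in kilometres.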