Sign Coordinate Regressor
This project fine-tunes a sentence-transformer model to predict a latitude and longitude pair from text observed on signs at an intersection.
The raw dataset is expected at training_data_raw.csv. It may include the
pandas-exported index column; the preprocessing script only uses:
intersection
text_on_sign_exact
latitude
longitude
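As a rough illustration of the column selection described above, the sketch below reads a CSV and keeps only the four listed columns. It uses the standard-library csv module rather than whatever prepare_training_data.py actually uses; load_required_columns is a hypothetical helper, and the two sample rows are made-up stand-ins for training_data_raw.csv.

```python
import csv
import io

# Columns the preprocessing script is documented to use.
REQUIRED_COLUMNS = ["intersection", "text_on_sign_exact", "latitude", "longitude"]


def load_required_columns(handle):
    """Read CSV rows and keep only the required columns (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for record in reader:
        rows.append({
            "intersection": record["intersection"],
            "text_on_sign_exact": record["text_on_sign_exact"],
            "latitude": float(record["latitude"]),
            "longitude": float(record["longitude"]),
        })
    return rows


# Made-up in-memory stand-in for training_data_raw.csv. The leading unnamed
# column mimics a pandas-exported index and is simply ignored.
sample = io.StringIO(
    ",intersection,text_on_sign_exact,latitude,longitude\n"
    "0,Main St & 1st Ave,MAIN ST,40.0,-75.0\n"
    "1,Main St & 1st Ave,1ST AVE,40.0,-75.0\n"
)
rows = load_required_columns(sample)
print(rows[0])
```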
Workflow
Prepare bootstrapped training rows:
source .venv/bin/activate
python prepare_training_data.py --seed 1992 --bag-size 5 --samples-per-intersection 50
This writes training.csv with rows shaped like:
intersection,sample_id,text,latitude,longitude,unique_sign_count,raw_sign_count
Each row is a deterministic bootstrap sample of sign texts from one intersection, joined into a single text field. The model therefore trains on "some signs seen at this coordinate" rather than on a single sign or on every sign at that coordinate.
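A minimal sketch of the bagging step described above, not the actual contents of prepare_training_data.py. It makes several assumptions: bags are drawn with replacement, texts are joined with " | ", unique_sign_count is the number of distinct sign texts at the intersection, and raw_sign_count is the number of raw rows. The function name make_bags and the sample rows are hypothetical.

```python
import random
from collections import defaultdict


def make_bags(rows, seed=1992, bag_size=5, samples_per_intersection=50, sep=" | "):
    """Build bootstrap bags of sign texts per intersection (illustrative sketch)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["intersection"]].append(row)

    rng = random.Random(seed)  # a seeded generator makes the output repeatable
    out = []
    for intersection in sorted(grouped):  # sorted for a stable iteration order
        group = grouped[intersection]
        texts = [r["text_on_sign_exact"] for r in group]
        for sample_id in range(samples_per_intersection):
            # Sampling with replacement is an assumption about the real script.
            bag = rng.choices(texts, k=bag_size)
            out.append({
                "intersection": intersection,
                "sample_id": sample_id,
                "text": sep.join(bag),
                "latitude": group[0]["latitude"],
                "longitude": group[0]["longitude"],
                "unique_sign_count": len(set(texts)),  # assumed meaning
                "raw_sign_count": len(texts),          # assumed meaning
            })
    return out


# Made-up rows for one intersection.
rows = [
    {"intersection": "A", "text_on_sign_exact": "MAIN ST", "latitude": 40.0, "longitude": -75.0},
    {"intersection": "A", "text_on_sign_exact": "1ST AVE", "latitude": 40.0, "longitude": -75.0},
    {"intersection": "A", "text_on_sign_exact": "MAIN ST", "latitude": 40.0, "longitude": -75.0},
]
bags = make_bags(rows, bag_size=2, samples_per_intersection=3)
print(len(bags), bags[0]["text"])
```

Because the generator is seeded and intersections are visited in sorted order, calling make_bags twice with the same arguments yields identical rows, which is the property the --seed option is meant to provide.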
Train the model:
python train.py
Evaluate and write predictions:
python eval.py
This writes predictions.csv and a map-style diagnostic plot, plots/prediction_map.png.
It also writes a coordinate calibration plot, plots/predicted_vs_actual.png.
Or run the full pipeline:
make
Useful Options
prepare_training_data.py:
--seed: deterministic bootstrap seed.
--bag-size: number of sign texts per sampled bag.
--samples-per-intersection: number of bags generated per intersection.
train.py:
--model-name: sentence-transformers base model.
--epochs: training epochs.
--batch-size: batch size.
--device: device override such as cpu, cuda, or mps. Defaults to cuda.
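For orientation, the train.py options listed above could be declared with argparse roughly as follows. This is a sketch, not the script's actual parser: only the flag names and the cuda default for --device come from this README, while the other defaults (all-MiniLM-L6-v2, 3 epochs, batch size 32) are placeholders chosen for illustration.

```python
import argparse


def build_parser():
    """Declare the documented train.py flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Train the sign coordinate regressor.")
    # Placeholder defaults, except --device, which is documented to default to cuda.
    parser.add_argument("--model-name", default="all-MiniLM-L6-v2",
                        help="sentence-transformers base model (placeholder default)")
    parser.add_argument("--epochs", type=int, default=3,
                        help="training epochs (placeholder default)")
    parser.add_argument("--batch-size", type=int, default=32,
                        help="batch size (placeholder default)")
    parser.add_argument("--device", default="cuda",
                        help="device override such as cpu, cuda, or mps")
    return parser


# Equivalent to running: python train.py --device cpu --epochs 2
args = build_parser().parse_args(["--device", "cpu", "--epochs", "2"])
print(args.device, args.epochs, args.batch_size)
```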
Outputs
training.csv: prepared bootstrapped dataset.
output/: saved sentence-transformer encoder, coordinate head, and coordinate normalization metadata.
predictions.csv: evaluation rows with predicted coordinates and error_km.
plots/prediction_map.png: actual vs predicted coordinates with line segments showing the prediction error.
plots/predicted_vs_actual.png: predicted vs actual latitude and longitude scatter plots.
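This README does not say how error_km is computed. One common choice for the distance between an actual and a predicted coordinate is the haversine great-circle distance, sketched below under that assumption; haversine_km is a hypothetical name and eval.py may use a different formula.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Distance between an actual and a predicted coordinate (made-up values).
error_km = haversine_km(40.0, -75.0, 40.01, -75.0)
print(round(error_km, 3))
```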

