Compare commits


3 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| mm | f193018ac2 | full training process on US cities | 2023-05-04 19:05:46 +00:00 |
| mm | 282c0466d8 | add check for git status | 2023-05-04 19:05:33 +00:00 |
| mm | e9adbed41a | details in readme | 2023-05-04 19:05:12 +00:00 |
3 changed files with 12 additions and 7 deletions

View File

```diff
@@ -1,16 +1,20 @@
-city_distances.csv: lint generate_data.py
+city_distances.csv: check generate_data.py
 	bash -c 'time python generate_data.py'
 
+train: check train.py
+	bash -c 'time python train.py'
+
+eval: check eval.py
+	bash -c 'time python eval.py'
+
 lint:
 	isort --profile=black .
 	black .
 	flake8 --max-line-length=88 .
 
-train: lint train.py
-	bash -c 'time python train.py'
-
-eval: lint eval.py
-	bash -c 'time python eval.py'
+check: lint
+	@echo "Checking for unstaged or untracked changes..."
+	@git diff-index --quiet HEAD -- || { echo "Unstaged or untracked changes detected!"; exit 1; }
 
 clean:
 	rm -rf output/
```
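The new `check` target aborts the build whenever the working tree differs from `HEAD`. A minimal sketch of that guard in a throwaway repository (paths and commit identities here are illustrative); note that `git diff-index --quiet HEAD --` only reports changes to *tracked* files, so genuinely untracked files slip past it despite the error message's wording:

```shell
# Throwaway repo to demonstrate the guard used by the `check` target.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m init

# Clean tree: exit status 0, so the guard passes.
git diff-index --quiet HEAD -- && echo "tree clean"

# Commit a file, then modify it without staging the change.
echo one > tracked.txt
git add tracked.txt
git -c user.email=ci@example.com -c user.name=ci commit -q -m "add tracked"
echo two >> tracked.txt

# Unstaged change to a tracked file: non-zero exit, the Makefile guard fires.
git diff-index --quiet HEAD -- || echo "Unstaged or untracked changes detected!"
```

Because each Makefile recipe line depending on `check` runs after this guard, a dirty tree stops `make train` or `make eval` before any long-running Python job starts.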

View File

```diff
@@ -12,6 +12,7 @@ However, for use-cases that involve different measures of distances (perhaps jus
 A particularly useful addition to the dataset here:
 - airports: they (more/less) have unique codes, and this semantic understanding would be helpful for search engines.
 - aliases for cities: the dataset used for city data (lat/lon) contains a pretty exhaustive list of aliases for the cities. It would be good to generate examples of these with a distance of 0 and train the model on this knowledge.
+- time-zones: encode difference in hours (relative to worst-possible-case) as labels associated with the time-zone formatted-strings.
 # notes
 - see `Makefile` for instructions.
```
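The alias idea from the README could be prototyped directly: emit one row per (alias, canonical name) pair with a distance of 0, in the same shape as `city_distances.csv`. A minimal sketch; the alias map, output filename, and column names below are hypothetical stand-ins for the real dataset's fields:

```python
import csv

# Hypothetical alias map; in practice this would come from the lat/lon dataset.
ALIASES = {
    "New York City": ["NYC", "New York"],
    "Los Angeles": ["LA"],
}

def alias_rows(aliases):
    """Yield (city_a, city_b, distance) rows pairing each alias with its canonical name."""
    for canonical, names in aliases.items():
        for name in names:
            yield (name, canonical, 0.0)

rows = list(alias_rows(ALIASES))

with open("alias_examples.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["city_a", "city_b", "distance"])
    writer.writerows(rows)
```

Concatenating such zero-distance rows with the real distance data would let the model learn that an alias and its canonical name refer to the same place.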

View File

```diff
@@ -66,7 +66,7 @@ if __name__ == "__main__":
     model_name = "sentence-transformers/all-MiniLM-L6-v2"
     base_model = SentenceTransformer(model_name, device="cuda")
-    data = pd.read_csv("city_distances_sample.csv")
+    data = pd.read_csv("city_distances_full.csv")
     # data_sample = data.sample(1_000)
     checkpoint_dir = "checkpoints_absmax_split"  # no slash
     for checkpoint in sorted(glob.glob(f"{checkpoint_dir}/*")):
```
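One caveat in the checkpoint loop above: `sorted(glob.glob(...))` orders paths lexicographically, so unpadded step numbers come back out of order (`checkpoint-10` before `checkpoint-2`). A small natural-sort key avoids that; the path names below are made up for illustration:

```python
import re

def natural_key(path: str):
    # Split into digit/non-digit runs and compare the digit runs numerically.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", path)]

paths = ["ckpt/step-10", "ckpt/step-2", "ckpt/step-1"]

print(sorted(paths))                   # lexicographic: step-1, step-10, step-2
print(sorted(paths, key=natural_key))  # numeric: step-1, step-2, step-10
```

Zero-padding checkpoint directory names when they are written is an equally workable fix and keeps the plain `sorted()` call.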