generate_results: Contains the sentence indices (idx) and the memorization score of each sentence.
File names follow the format memorization_evals_{model_size}_deduped-v0_{context size}_{continuation size}_143000.csv. Each file has two columns, idx and scores.
dedup_data: This contains the original deduplicated data.
dedup_merge: This contains the merged deduplicated data.
undeduped_data: This contains the original undeduplicated data.
undedup_merge: This contains the merged undeduplicated data.
pythia: The pythia package.
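The naming format for files in generate_results can be unpacked programmatically. The following is a minimal sketch, not code from this repository: the regular expression is derived only from the format string above, the helper names parse_result_name and read_scores are hypothetical, and it assumes idx is an integer and scores is a float.

```python
import csv
import io
import re

# Assumed pattern, derived from the naming format described above.
FILENAME_RE = re.compile(
    r"^memorization_evals_(?P<model_size>[^_]+)_deduped-v0_"
    r"(?P<context_size>\d+)_(?P<continuation_size>\d+)_143000\.csv$"
)


def parse_result_name(filename):
    """Extract model size, context size and continuation size from a file name."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "model_size": match.group("model_size"),
        "context_size": int(match.group("context_size")),
        "continuation_size": int(match.group("continuation_size")),
    }


def read_scores(handle):
    """Read (idx, score) pairs from an open CSV with 'idx' and 'scores' columns."""
    reader = csv.DictReader(handle)
    # Assumption: idx is an integer and scores is a float.
    return [(int(row["idx"]), float(row["scores"])) for row in reader]


# Hypothetical example values, not real results.
info = parse_result_name("memorization_evals_70m_deduped-v0_32_48_143000.csv")
sample = io.StringIO("idx,scores\n0,1.0\n1,0.25\n")
rows = read_scores(sample)
```

For a file on disk, pass an open file object, for example open(path, newline=""), to read_scores instead of the in-memory sample.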
File explanation
run_generate.sh: Launches the batch_generate.py script. The input parameters are model size, checkpoint (usually the last step), batch size (usually fixed), context size, and continuation size.
data_download.py: Downloads the pre-training data. It probably does not need to be run again.
cluster.py: Samples memorized and unmemorized data points, applies dimensionality reduction, and shows the result in a figure.
clmtraing.py: Trains a model on a causal language modelling task.
embedding_obtain.py: Shows how to obtain hidden-state embeddings for Pythia or any other model.
generate.py, csv_process.py, csv_reformat.py: Helper and format-conversion scripts that may not be used again.
example_explore.py: Shows how to generate a single example.
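To illustrate how context size and continuation size relate to a memorization score for a single example, here is a rough sketch. It is not taken from example_explore.py or batch_generate.py. It assumes the score is the fraction of continuation tokens that the model reproduces exactly, which may differ from the definition used in this repository. The function names and token values are hypothetical, and the model call is replaced by a hard-coded list.

```python
def split_example(token_ids, context_size, continuation_size):
    """Split a token sequence into the prompt (context) and the true continuation."""
    needed = context_size + continuation_size
    if len(token_ids) < needed:
        raise ValueError("sequence is shorter than context + continuation")
    return token_ids[:context_size], token_ids[context_size:needed]


def memorization_score(generated, reference):
    """Fraction of positions where the generated token equals the reference token.

    Assumption: this per-token accuracy stands in for the repository's score.
    """
    if not reference or len(generated) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for g, r in zip(generated, reference) if g == r)
    return matches / len(reference)


# Hypothetical token ids; a real run would come from the tokenizer and model.
tokens = list(range(10))
context, continuation = split_example(tokens, context_size=4, continuation_size=4)
generated = [4, 5, 0, 7]  # stand-in for the model's greedy continuation
score = memorization_score(generated, continuation)
```

In this sketch a score of 1.0 means the continuation was reproduced exactly, and lower values mean partial overlap.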