copy-paste-detective detects duplicated data within Excel spreadsheets.
- npm run detect excel <folder> [fileIndex] - Detect anomalous data in an Excel sheet stored on the file system.
  - Example: npm run detect excel benchmark-files/doi_10_5061_dryad_stqjq2cdp__v20250418 1 (analyzes the second Excel file in the folder)
  - Example: npm run detect excel benchmark-files/doi_10_5061_dryad_stqjq2cdp__v20250418 1 -- --strategies duplicateRows,individualNumbers (runs only some strategies)
- npm run dryad-index - Index all datasets in Dryad with at least one Excel sheet that fulfils the inclusion criteria (duration: ~1 day).
- npm run dryad-download - Download Excel files of previously indexed Dryad datasets.
- npm run dryad-detect - Detect anomalous data from a single downloaded Dryad dataset.
- npm run dryad-detect-all - Run the detection on all downloaded Dryad datasets.
- npm run dryad-report - Get an overview of all completed analyses of Dryad datasets, ordered by level of suspicion.
- npm run test - Run automated Jest tests.
- npm run test-ai - Check that the currently selected model returns the right output on the column-categorization prompt.
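The --strategies example above passes a comma-separated list of strategy names. The following is a minimal sketch of how such a value could be validated, not the project's actual argument parser. The helper name parseStrategies is hypothetical, and the fallback to all strategies when the flag is absent is an assumption inferred from the "runs only some strategies" note.

```typescript
// Strategy names as listed in this README.
const KNOWN_STRATEGIES = [
  "duplicateRows",
  "repeatedColumnSequences",
  "individualNumbers",
] as const;
type StrategyName = (typeof KNOWN_STRATEGIES)[number];

// Hypothetical helper: turns the raw value of --strategies into a validated list.
function parseStrategies(value: string | undefined): StrategyName[] {
  // Assumption: without the flag, every strategy runs.
  if (value === undefined || value.trim() === "") {
    return KNOWN_STRATEGIES.slice();
  }
  const known: readonly string[] = KNOWN_STRATEGIES;
  const requested = value
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  const unknown = requested.filter((name) => known.indexOf(name) === -1);
  if (unknown.length > 0) {
    throw new Error(`Unknown strategies: ${unknown.join(", ")}`);
  }
  // Drop repeated names while keeping the order given on the command line.
  return requested.filter(
    (name, index) => requested.indexOf(name) === index,
  ) as StrategyName[];
}
```

Rejecting unknown names up front means a typo such as duplicateRow fails immediately instead of silently running nothing.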
- Node.js (v18+)
- Docker
- Run npm i to install dependencies
- Create an .env file and add the environment variables specified in .env.dist
Start a PostgreSQL container:
docker run -d \
--name science-detective-db \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_DB=science_detective \
-p 5432:5432 \
-v science-detective-pgdata:/var/lib/postgresql/data \
postgres:16
Add the database URL to your .env file:
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/science_detective
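Each part of this URL corresponds to a setting in the docker run command: the user and password match POSTGRES_USER and POSTGRES_PASSWORD, the port matches the published 5432, and the path matches POSTGRES_DB. A small sketch using the standard URL class makes the mapping explicit. The function name parseDatabaseUrl and the DbConfig shape are illustrative and not part of the project.

```typescript
// Illustrative shape for the pieces of a PostgreSQL connection URL.
interface DbConfig {
  user: string;
  password: string;
  host: string;
  port: number;
  database: string;
}

// Hypothetical helper: splits a connection URL into its components.
function parseDatabaseUrl(databaseUrl: string): DbConfig {
  const url = new URL(databaseUrl);
  if (url.protocol !== "postgresql:" && url.protocol !== "postgres:") {
    throw new Error(`Unexpected protocol: ${url.protocol}`);
  }
  return {
    // Credentials may be percent-encoded inside a URL.
    user: decodeURIComponent(url.username),
    password: decodeURIComponent(url.password),
    host: url.hostname,
    // Fall back to the PostgreSQL default port when none is given.
    port: url.port === "" ? 5432 : Number(url.port),
    // The path is "/<database>", so strip the leading slash.
    database: url.pathname.replace(/^\//, ""),
  };
}

const config = parseDatabaseUrl(
  "postgresql://postgres:postgres@localhost:5432/science_detective",
);
console.log(config.database); // science_detective
```

If the container is started with different credentials or a different host port, the corresponding part of DATABASE_URL has to change with it.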
Generate and run database migrations:
npx drizzle-kit generate
npx drizzle-kit migrate
If you have existing data in data/dryad/datasets.json, run the migration script:
npx tsx -r dotenv/config src/scripts/migrateFromJson.ts
There are currently three pluggable algorithms:
- duplicateRows: Finds duplicate rows across sheets
- repeatedColumnSequences: Identifies repeated sequences in columns
- individualNumbers: Detects suspicious individual number patterns
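To make the idea behind duplicateRows concrete, here is a minimal sketch of finding identical rows across sheets. It is not the project's implementation: the Sheet type, the function name findDuplicateRows, and the minNumericCells threshold (used to skip header-like rows) are all assumptions made for illustration.

```typescript
type Cell = string | number | null;

// Illustrative in-memory representation of a parsed sheet.
interface Sheet {
  name: string;
  rows: Cell[][];
}

interface RowLocation {
  sheet: string;
  rowIndex: number;
}

interface DuplicateRowGroup {
  key: string;
  locations: RowLocation[];
}

// Groups rows by their serialized content and reports groups seen more than once.
function findDuplicateRows(
  sheets: Sheet[],
  minNumericCells = 2,
): DuplicateRowGroup[] {
  const seen = new Map<string, RowLocation[]>();
  for (const sheet of sheets) {
    sheet.rows.forEach((row, rowIndex) => {
      // Skip rows with too few numbers; they are likely headers or labels.
      const numericCount = row.filter((c) => typeof c === "number").length;
      if (numericCount < minNumericCells) return;
      const key = JSON.stringify(row);
      const locations = seen.get(key) ?? [];
      locations.push({ sheet: sheet.name, rowIndex });
      seen.set(key, locations);
    });
  }
  const groups: DuplicateRowGroup[] = [];
  seen.forEach((locations, key) => {
    if (locations.length > 1) groups.push({ key, locations });
  });
  return groups;
}

const exampleGroups = findDuplicateRows([
  { name: "A", rows: [[1.23, 4.56, "x"], [7, 8, "y"]] },
  { name: "B", rows: [[1.23, 4.56, "x"], ["header", "only", null]] },
]);
console.log(exampleGroups.length); // 1
```

Exact matching on the serialized row is the simplest choice. A real detector would likely also need to handle rounding, reordered columns, and rows that are expected to repeat legitimately.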
Tests that use real datasets should be located in the benchmark-files repository, next to the file they are using in the test. Unit tests for general functionality should be located next to the regular function file.