Research project investigating how ChatGPT is used in practice through shared conversation analysis.
Figure 1: Overview of the Alignment Under Use pipeline.
- Overview
- Data Sources & APIs
- Technical Requirements
- Usage
- Output Schema
- Data Folder Structure
- Topic Modeling (Three Models)
- Linguistic Style Matching (LSM)
- Lexical + Syntactic Alignment (LexSyn)
- Combined Analysis
- Reproducibility
- Project Structure
- Data Availability & Ethics
- License
This project collects and analyzes publicly shared ChatGPT conversations from Reddit to understand real-world usage patterns, interaction styles, and alignment in practice.
Data Collection: The pipeline leverages two key data sources:
- Arctic Shift API - Historical Reddit archive providing full-text search across Reddit posts and comments
- ChatGPT Backend API - OpenAI's share endpoint exposing rich conversation metadata
Collection occurs in three stages: Reddit post discovery, comment extraction, and conversation fetching with comprehensive metadata.
Alignment Analysis: The project computes semantic and sentiment alignment in human-AI conversations:
- Semantic Alignment - Measures meaning similarity between user and assistant messages using sentence embeddings (cosine similarity)
- Sentiment Alignment - Measures emotional tone alignment using sentiment analysis (polarity difference)
- Visualization - Generates plots analyzing alignment patterns across conversation dynamics, model versions, and message characteristics
Arctic Shift provides historical Reddit data with full-text search capabilities.
Endpoints:
- Posts: https://arctic-shift.photon-reddit.com/api/posts/search
- Comments: https://arctic-shift.photon-reddit.com/api/comments/search
Collection Strategy: Two-phase approach:
- Search for posts containing ChatGPT share URLs
- For each post with comments, fetch all comments and extract share URLs
Testing shows ~2.6% of comments on ChatGPT-related posts contain share URLs, yielding thousands of additional conversations.
OpenAI's backend API exposes rich metadata for shared conversations.
Endpoint: https://chatgpt.com/backend-api/share/{share_id}
Response includes:
- Conversation metadata (title, model, timestamps)
- Complete message tree with parent-child relationships
- Per-message metadata (role, timestamps, content, model version)
- Tool usage, reasoning traces, citations, attachments
Cloudflare Protection: Requires curl-cffi with browser impersonation to bypass bot detection.
Data collection:
- curl-cffi - Cloudflare bypass (critical for the ChatGPT API)
- requests - Arctic Shift API client
- tqdm - Progress bars
- pandas - Data processing
Data cleaning:
- ftfy - Text encoding fixes
- regex - Enhanced pattern matching
- langid - Language detection
Alignment analysis:
- sentence-transformers - Semantic embeddings (all-mpnet-base-v2)
- transformers - Sentiment analysis (distilbert-sst-2)
- torch - PyTorch backend for models
- numpy - Array operations and caching
- matplotlib, seaborn - Visualization
Optional:
- presidio-analyzer, presidio-anonymizer - PII detection (anonymization utility only)
Why curl-cffi? The ChatGPT backend API uses Cloudflare bot detection. curl-cffi provides browser impersonation that bypasses these protections.
pip install -r requirements.txt
To fetch ChatGPT conversations, you need a valid session cookie from your browser:
- Log into https://chatgpt.com
- Open Developer Tools (F12) → Network tab
- Refresh page, select any request
- Copy the full Cookie: header value
- Set the environment variable:
PowerShell:
[System.Environment]::SetEnvironmentVariable("CHATGPT_COOKIE", "your_cookie", "Process")
Bash:
export CHATGPT_COOKIE="your_cookie"
Note: Cookies expire periodically. Refresh the cookie if you get HTTP 403 errors.
If you only have a Cloudflare clearance token, set CF_CLEARANCE instead of CHATGPT_COOKIE.
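For orientation, here is a minimal sketch of fetching one shared conversation with curl-cffi and the cookie from the environment. It is illustrative only: the header set, impersonation target, and retry/error handling in src/collection/collect_conversations.py may differ, and the share ID is hypothetical.

```python
# Minimal sketch of fetching one shared conversation with curl-cffi.
# Browser impersonation is what lets the request pass Cloudflare bot detection.
import os
from curl_cffi import requests

share_id = "xyz"  # hypothetical share ID taken from a Reddit post or comment
cookie = os.environ.get("CHATGPT_COOKIE", "")

resp = requests.get(
    f"https://chatgpt.com/backend-api/share/{share_id}",
    headers={"Cookie": cookie} if cookie else None,
    impersonate="chrome",  # browser fingerprint so Cloudflare accepts the request
    timeout=15,
)
resp.raise_for_status()  # HTTP 403 usually means the cookie has expired
data = resp.json()
print(data.get("title"))
```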
Run all commands from the repository root so the src package is discoverable.
Run full pipeline:
python -m src.collection.main
Resume interrupted collection:
python -m src.collection.main --resume
Refresh failed fetches:
python -m src.collection.main --resume --refresh-missing
Run individual stages:
# Stage 1 only: Reddit posts
python -m src.collection.main --reddit-only
# Stage 2 only: Reddit comments (requires reddit_posts.jsonl)
python -m src.collection.main --comments-only
# Stage 3 only: Conversations (requires reddit_posts.jsonl and reddit_comments.jsonl)
python -m src.collection.main --conversations-only
Limit collection for testing:
# Limit conversations to fetch
python -m src.collection.main --limit 100
# Limit posts to process for comments
python -m src.collection.main --max-comments-posts 10
Clean and filter dataset:
python -m src.processing.cleaning
The cleaning script applies minimally destructive text normalization:
- Fixes text encoding issues (mojibake, HTML entities) with ftfy
- Cleans markdown formatting while preserving code blocks as [CODE_BLOCK_REMOVED]
- Extracts structural features (turn counts, message lengths)
- Detects language with langid (fast, deterministic)
- Filters to English-only conversations (see the sketch after the options below)
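As a sketch of the two library calls at the heart of this step: ftfy repairs broken encodings and langid classifies the language. The real src/processing/cleaning.py wraps these with markdown handling, feature extraction, and the CLI options below.

```python
# Illustrative only: encoding repair plus language detection,
# the core of the cleaning and English-filtering step.
import ftfy
import langid

raw = "Itâ€™s a mojibake example &amp; mixed encodings happen in scraped text."

fixed = ftfy.fix_text(raw)             # repairs mojibake and HTML entities
lang, score = langid.classify(fixed)   # e.g. ("en", -54.3); score is a log-probability

if lang == "en":
    print("keep:", fixed)
else:
    print("drop (non-English)")
```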
Options:
--input PATH # Input file (default: data/raw/conversations.jsonl)
--output PATH # Output file (default: data/processed/conversations_english.jsonl)
--output-all-clean PATH # Save all successful fetches before language filtering
--skip-language-filter # Skip language detection
--skip-markdown-cleaning # Skip markdown cleaning
Figure 2: Overview of data collection and cleaning.
Compute semantic alignment (sentence embeddings):
python -m src.measures.semantic_alignment
Computes semantic similarity using the all-mpnet-base-v2 model. Creates turn pairs (user→assistant and assistant→user) and computes cosine similarity between sentence embeddings.
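The core computation reduces to embedding both messages of a turn pair and taking their cosine similarity. A minimal sketch follows; the model name is the default above, the example messages are made up, and the actual script batches over the corpus and caches embeddings under data/derived/.

```python
# Sketch of the per-pair semantic alignment score: cosine similarity of
# sentence embeddings for the two sides of a turn pair.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-mpnet-base-v2")

user_msg = "Can you explain how photosynthesis works?"
assistant_msg = "Photosynthesis converts light energy into chemical energy in plants."

emb = model.encode([user_msg, assistant_msg], normalize_embeddings=True)
semantic_alignment = float(np.dot(emb[0], emb[1]))  # cosine similarity of unit vectors
print(round(semantic_alignment, 3))
```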
Compute sentiment alignment:
python -m src.measures.sentiment_alignment
Computes sentiment similarity using a DistilBERT sentiment model. Maps sentiment to [-1, 1] polarity and computes similarity as 1 - |difference| / 2.
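The polarity mapping and the 1 - |difference| / 2 similarity can be illustrated with a short sketch using the default model above; batching, caching, and truncation details in the actual script may differ.

```python
# Sketch of the sentiment-alignment idea: map each message to a signed
# polarity in [-1, 1], then score alignment as 1 - |difference| / 2, so
# identical polarity gives 1.0 and opposite extremes give 0.0.
from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def polarity(text: str) -> float:
    out = clf(text, truncation=True)[0]  # e.g. {"label": "POSITIVE", "score": 0.98}
    return out["score"] if out["label"] == "POSITIVE" else -out["score"]

p_user = polarity("I love how clearly you explained that!")
p_assistant = polarity("Glad it helped, happy to explain more.")
sentiment_alignment = 1 - abs(p_user - p_assistant) / 2
print(round(sentiment_alignment, 3))
```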
Options:
# Semantic alignment
--model NAME # Sentence transformer model (default: all-mpnet-base-v2)
--batch-size N # Batch size (default: 256)
--device auto|cpu|cuda # Computation device (default: auto)
--force-recompute # Ignore cached embeddings
# Sentiment alignment
--from-conversations PATH # Load from conversations JSONL instead of semantic_alignment.csv
--model NAME # Sentiment model (default: distilbert-base-uncased-finetuned-sst-2-english)
Run the full analysis pipeline (semantic, sentiment, LSM, optional topics) and archive outputs:
bash scripts/alignment_score_extraction.sh \
data/processed/conversations_english.jsonl \
data/derived \
data/outputs \
false \
true
Arguments (all optional, shown in order):
- input JSONL path (default: data/processed/conversations_english.jsonl)
- derived output dir (default: data/derived)
- outputs dir (default: data/outputs)
- skip topics (true or false, default: false)
- verbose (true or false, default: true)
Main Pipeline (src/collection/main.py):
- --reddit-only - Only run Reddit post collection
- --comments-only - Only run Reddit comment collection
- --conversations-only - Only run conversation collection
- --limit N - Limit conversations to fetch
- --max-pages N - Max pages to fetch for Reddit posts (default: 1)
- --continue - Continue Reddit post pagination from the last post
- --max-comments-posts N - Limit posts to process for comments
- --comments-delay N - Delay between comment API requests (default: 0.5s)
- --resume - Resume from a previous run
- --refresh-missing - Re-fetch failed attempts when resuming
- --output-dir PATH - Output directory (default: data/raw)
Reddit Posts Collection (src/collection/collect_reddit_posts.py):
- --output-dir PATH - Output directory (default: data/raw)
- --outfile NAME - Output filename (default: reddit_posts.jsonl)
- --max-pages N - Max pages to fetch (default: 1, ~1000 posts per page)
- --continue - Continue pagination from the last post
- --dry-run - Count matches without writing
Reddit Comments Collection (src/collection/collect_reddit_comments.py):
- --posts-file PATH - Input JSONL with posts (default: data/raw/reddit_posts.jsonl)
- --output-dir PATH - Output directory (default: data/raw)
- --outfile NAME - Output filename (default: reddit_comments.jsonl)
- --max-posts N - Max posts to process for comments
- --delay N - Delay between API requests (default: 0.5s)
- --dry-run - Count matches without writing
Conversation Collection (src/collection/collect_conversations.py):
- --input FILES - Input JSONL files (default: data/raw/reddit_posts.jsonl data/raw/reddit_comments.jsonl)
- --output FILE - Output file (default: data/raw/conversations.jsonl)
- --limit N - Max conversations to fetch
- --timeout N - Request timeout in seconds (default: 15)
- --sleep N - Sleep between requests in seconds (default: 1.0)
- --resume - Skip already fetched shares
- --refresh-missing - Re-fetch failed attempts when resuming
- --keep-raw - Include raw API response in output
Data Cleaning (src/processing/cleaning.py):
- --input PATH - Input file (default: data/raw/conversations.jsonl)
- --output PATH - Output file (default: data/processed/conversations_english.jsonl)
- --output-all-clean PATH - Save all successful fetches before language filtering
- --skip-language-filter - Skip language detection
- --skip-markdown-cleaning - Skip markdown normalization
Semantic Alignment (src/measures/semantic_alignment.py):
- --input PATH - Input JSONL (default: data/processed/conversations_english.jsonl)
- --output PATH - Output CSV (default: data/derived/semantic_alignment.csv)
- --embeddings-cache-dir PATH - Directory for cached embeddings (default: data/derived)
- --model NAME - Sentence transformer model (default: all-mpnet-base-v2)
- --batch-size N - Batch size (default: 256)
- --device auto|cpu|cuda - Computation device (default: auto)
- --force-recompute - Ignore cached embeddings
Sentiment Alignment (src/measures/sentiment_alignment.py):
- --input PATH - Input CSV from semantic_alignment (default: data/derived/semantic_alignment.csv)
- --from-conversations PATH - Alternatively, load from a conversations JSONL
- --conversations PATH - Conversations JSONL to load text for missing columns
- --output PATH - Output CSV (default: data/derived/sentiment_alignment.csv)
- --cache-dir PATH - Directory for cached sentiment scores (default: data/derived)
- --model NAME - Sentiment model (default: distilbert-base-uncased-finetuned-sst-2-english)
- --batch-size N - Batch size (default: 64)
- --device auto|cpu|cuda - Computation device (default: auto)
- --force-recompute - Ignore cached sentiment scores
The pipeline produces multiple JSONL files with comprehensive metadata.
Each line contains metadata for one Reddit post with a ChatGPT share link:
{
"id": "abc123",
"name": "t3_abc123",
"subreddit": "ChatGPT",
"author": "username",
"created_utc": 1234567890,
"title": "Post title",
"score": 42,
"num_comments": 5,
"url": "https://chatgpt.com/share/xyz",
"permalink": "https://reddit.com/r/ChatGPT/comments/..."
}
Each line contains metadata for one Reddit comment with one or more ChatGPT share links:
{
"id": "def456",
"name": "t1_def456",
"subreddit": "ChatGPT",
"author": "username",
"created_utc": 1234567890,
"body": "Comment text with https://chatgpt.com/share/xyz",
"score": 10,
"link_id": "t3_abc123",
"parent_id": "t3_abc123",
"permalink": "https://reddit.com/r/ChatGPT/comments/.../def456",
"share_urls": ["https://chatgpt.com/share/xyz"],
"source_post_id": "abc123"
}
To keep the repository clean while preserving a clear workflow, data artifacts are organized under data/:
- raw: source dumps collected from APIs
  - reddit_posts.jsonl, reddit_comments.jsonl, conversations.jsonl
- processed: cleaned, curated datasets ready for analysis
  - conversations_english.jsonl, anonymized_conversations.jsonl, df_pairs.csv
- derived: computed arrays and intermediate features
  - message_embeddings.npy, message_ids.npy, message_sentiment.npy, message_ids_sentiment.npy
  - semantic_alignment.csv, sentiment_alignment.csv, lsm_scores.csv
- outputs: analysis outputs and merged datasets
  - merged.csv (merged features from merge_all.py)
  - outputs/bayes: Bayesian model outputs
  - outputs/gamm: GAMM model outputs
  - outputs/other: misc analysis outputs
  - outputs/topics: topic modeling outputs
Only data/README.md is tracked in Git; all other files are ignored via .gitignore.
Run the KeyNMF pipeline over user-only, assistant-only, and combined documents:
python -m src.measures.topic_modeling --input data/processed/conversations_english.jsonl --keywords 9 --plot
Outputs are saved to data/outputs/topics/. Common options:
- --output-dir PATH - Output directory (default: data/outputs/topics)
- --topics N - Topics per model (default: 30)
- --max-chars-per-doc N - Truncate long documents (default: 20000)
Compute linguistic style matching scores between sequential user-assistant message pairs:
python -m src.measures.lsm_scoring
python -m src.measures.lsm_scoring --input data/processed/conversations_english.jsonl --output data/derived/lsm_scores.csv
LSM measures linguistic alignment across function word categories: articles, prepositions, pronouns, auxiliary verbs, conjunctions, negations, and common adverbs. No filtering is applied; all conversations are processed.
Output is saved to data/derived/lsm_scores.csv with columns: conv_id, turn, lsm_score.
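For intuition, the sketch below implements the standard per-category LSM formula (one minus the normalized absolute difference of category proportions, averaged over categories) with tiny stand-in word lists. The exact word lists, tokenization, and aggregation in src/measures/lsm_scoring.py may differ.

```python
# Sketch of the standard per-category LSM formula:
#   LSM_c = 1 - |p_user_c - p_assistant_c| / (p_user_c + p_assistant_c + eps)
# where p_*_c is the proportion of tokens falling in category c.
# The word lists here are illustrative stand-ins, not the project's lists.
CATEGORIES = {
    "articles": {"a", "an", "the"},
    "negations": {"not", "no", "never"},
    "conjunctions": {"and", "but", "or"},
}

def category_proportions(text: str) -> dict:
    tokens = text.lower().split()
    total = max(len(tokens), 1)
    return {c: sum(t in words for t in tokens) / total for c, words in CATEGORIES.items()}

def lsm_score(user_text: str, assistant_text: str, eps: float = 1e-8) -> float:
    pu, pa = category_proportions(user_text), category_proportions(assistant_text)
    per_cat = [1 - abs(pu[c] - pa[c]) / (pu[c] + pa[c] + eps) for c in CATEGORIES]
    return sum(per_cat) / len(per_cat)

print(round(lsm_score("The model did not answer the question.",
                      "It did answer, but not the question you asked."), 3))
```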
Compute lexical (word overlap) and syntactic (POS tag overlap) alignment per turn pair:
python -m src.measures.lexsyn_alignment --input data/processed/conversations_english.jsonl --output data/derived/lexsyn_alignment.csv
Requires spaCy and the English model:
pip install spacy
python -m spacy download en_core_web_sm
To merge topic assignments with sentiment and LSM outputs, use src/alignment/merge_all.py and write a combined CSV:
python -m src.alignment.merge_all \
--conv data/processed/conversations_english.jsonl \
--lsm data/derived/lsm_scores.csv \
--sentiment data/derived/sentiment_alignment.csv \
--semantic data/derived/semantic_alignment.csv \
--lexsyn data/derived/lexsyn_alignment.csv \
--topics data/outputs/topics/conversations_with_topics.csv \
--output data/outputs/merged.csv
Note: The share_urls field contains all ChatGPT share URLs extracted from the comment body. The pattern matches various URL formats:
- https://chatgpt.com/share/...
- http://chatgpt.com/share/...
- chatgpt.com/share/... (no protocol)
- www.chatgpt.com/share/...
- chat.openai.com/share/... (legacy domain)
All URLs are normalized to https:// in the output.
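As a rough sketch of the matching and https:// normalization described above; the regex used in the collection code is more thorough, and the example comment body is made up.

```python
# Illustrative share-URL extraction: match the URL formats listed above and
# prepend https:// to every hit. Not the pipeline's actual pattern.
import re

SHARE_RE = re.compile(
    r"(?:https?://)?((?:www\.)?(?:chatgpt\.com|chat\.openai\.com)/share/[A-Za-z0-9-]+)",
    re.IGNORECASE,
)

def extract_share_urls(body: str) -> list[str]:
    # Normalize every match to https:// as described above
    return [f"https://{m}" for m in SHARE_RE.findall(body)]

print(extract_share_urls(
    "see chat.openai.com/share/abc-123 and https://chatgpt.com/share/xyz"
))
```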
Each line contains full conversation data with enhanced metadata:
{
"share_id": "xyz",
"url": "https://chatgpt.com/share/xyz",
"fetched_at": 1234567890,
"fetched_at_iso": "2025-12-06T10:30:00Z",
"fetch_success": true,
"status_code": 200,
"error": null,
"reddit_sources": [
{
"id": "abc123",
"subreddit": "ChatGPT",
"author": "username",
"created_utc": 1234567890,
"title": "Post title",
"score": 42,
"num_comments": 5,
"permalink": "https://reddit.com/r/..."
}
],
"conversation_metadata": {
"title": "Conversation title",
"conversation_id": "conv_123",
"create_time": 1234567890.0,
"update_time": 1234567900.0,
"model": "gpt-4",
"is_public": true,
"is_archived": false,
"current_node": "node_xyz",
"memory_scope": "conversation",
"has_custom_instructions": false,
"tools_used": ["python", "dalle"],
"tool_call_count": 5,
"file_count": 2,
"file_types": ["txt", "pdf"],
"has_reasoning": true,
"reasoning_message_count": 3,
"total_thinking_seconds": 12.5,
"code_block_count": 4,
"code_execution_count": 2,
"citation_count": 3,
"custom_gpt_used": false,
"gizmo_ids": [],
"branch_count": 2
},
"messages": [
{
"id": "msg_123",
"node_id": "node_xyz",
"parent": "node_abc",
"children": ["node_def"],
"role": "user",
"create_time": 1234567890.0,
"update_time": 1234567890.0,
"content_type": "text",
"text": "Message content",
"model_slug": "gpt-4-turbo",
"citations": [],
"attachments": [],
"recipient": "all",
"metadata": {},
"reasoning": null,
"request_id": null
}
]
}
Conversation Metadata Fields:
- Basic: title, conversation_id, create_time, update_time
- Model: model (conversation-level model slug)
- Privacy: is_public, is_archived
- Structure: current_node, memory_scope, branch_count
- Features:
  - has_custom_instructions - Detects custom instruction presence (content redacted in shares)
  - tools_used - List of tool names used (e.g., ["python", "dalle", "file_search"])
  - tool_call_count - Total tool invocations
  - file_count - Number of files attached
  - file_types - List of file extensions
  - has_reasoning - Whether the conversation includes reasoning/thinking traces
  - reasoning_message_count - Count of messages with reasoning
  - total_thinking_seconds - Sum of reasoning durations
  - code_block_count - Count of code blocks in messages
  - code_execution_count - Count of executed code blocks
  - citation_count - Total citations across messages
  - custom_gpt_used - Whether a custom GPT was used
  - gizmo_ids - List of custom GPT IDs
Message Fields:
- Tree Structure: id, node_id, parent, children
- Timing: create_time, update_time
- Content: role, content_type, text, model_slug
- Metadata: citations, attachments, recipient, metadata, reasoning, request_id
Message Roles: user, assistant, system, tool
Content Types: text, code, execution_output, multimodal_text, model_editable_context (custom instructions)
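Because shared conversations can branch (see parent, children, and branch_count above), downstream turn-pairing needs a linear thread. Below is a minimal sketch of one way to recover it by walking parent links back from current_node; the field names match the output schema above, but the pipeline's own linearization may differ.

```python
# Sketch: recover the root-to-leaf thread from a branched message tree by
# following parent links back from current_node, then keep only user and
# assistant turns. Illustrative, not the pipeline's implementation.
def linear_thread(conversation: dict) -> list[dict]:
    by_node = {m["node_id"]: m for m in conversation["messages"]}
    node = conversation["conversation_metadata"]["current_node"]
    path = []
    while node is not None and node in by_node:
        msg = by_node[node]
        path.append(msg)
        node = msg["parent"]  # None (or an absent node) at the root
    path.reverse()  # root-to-leaf order
    return [m for m in path if m["role"] in ("user", "assistant")]

# Usage: turns = linear_thread(record) for each JSONL record in conversations.jsonl
```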
This project is designed for research reproducibility:
- Arctic Shift API: Historical Reddit archive provides consistent, queryable Reddit data
- Timestamps: All data includes collection timestamps (fetched_at, create_time, update_time)
- Source Tracking: Each conversation links back to Reddit posts via the reddit_sources array
- Idempotent Collection: Resume capability allows interrupted runs to continue without duplicates
- Install dependencies:
  pip install -r requirements.txt
- Extract ChatGPT cookies:
  - Log into https://chatgpt.com
  - Open browser DevTools (F12) → Network tab
  - Refresh page, select any request
  - Copy the full Cookie: header value
  - Set the CHATGPT_COOKIE environment variable (or CF_CLEARANCE if you only have the Cloudflare token)
- Run collection:
  # Full pipeline (posts + comments + conversations)
  python -m src.collection.main
  # Or with limits for testing
  python -m src.collection.main --limit 100 --max-comments-posts 10
- Resume if interrupted:
  python -m src.collection.main --resume
- Clean and filter data:
  python -m src.processing.cleaning
- Arctic Shift API: No documented rate limits
- ChatGPT Backend API:
  - Default: 1 second sleep between requests
  - Increase if experiencing HTTP 429 errors: --sleep 2.0
  - Timeout per request: 15 seconds (configurable via --timeout)
- Cookie Expiration: ChatGPT cookies expire periodically
- Cloudflare Protection: Requires curl-cffi with browser impersonation
- Custom Instructions: Content is redacted in shared conversations
- Deleted/Private Shares: May return 404 if the owner deleted the share or made it private
AlignmentUnderUse/
├── src/
│ ├── alignment/
│ │ ├── bayes_topic_alignment.Rmd # Bayesian topic alignment
│ │ ├── gamm_modeling.Rmd # GAMM alignment analysis
│ │ └── merge_all.py # Merge outputs for downstream analysis
│ ├── measures/
│ │ ├── topic_modeling.py # Three-model KeyNMF pipeline
│ │ ├── semantic_alignment.py # Semantic similarity (sentence embeddings)
│ │ ├── sentiment_alignment.py # Sentiment similarity
│ │ └── lsm_scoring.py # Linguistic style matching
│ ├── collection/
│ │ ├── arctic_shift_api.py # Arctic Shift API client (posts + comments)
│ │ ├── collect_reddit_posts.py # Stage 1: Reddit post collection
│ │ ├── collect_reddit_comments.py # Stage 2: Reddit comments collection
│ │ ├── collect_conversations.py # Stage 3: ChatGPT conversation fetching
│ │ └── main.py # Pipeline orchestrator (3 stages)
│ ├── processing/
│ │ ├── cleaning.py # Text normalization + language filtering
│ │ └── anonymize.py # Optional PII anonymization utility
│ ├── schemas/
│ │ └── turn.py # Turn-level schema constants
│ └── utils/
│ └── io_utils.py # JSONL IO utilities
├── data/ # Ignored in Git (except README.md)
│ ├── README.md # Tracked; documents data layout
│ ├── raw/ # Source dumps collected from APIs
│ │ ├── reddit_posts.jsonl
│ │ ├── reddit_comments.jsonl
│ │ └── conversations.jsonl
│ ├── processed/ # Cleaned datasets ready for analysis
│ │ ├── conversations_english.jsonl
│ │ └── anonymized_conversations.jsonl
│ ├── derived/ # Computed arrays and intermediate features
│ │ ├── message_embeddings.npy
│ │ ├── message_ids.npy
│ │ ├── message_sentiment.npy
│ │ ├── message_ids_sentiment.npy
│ │ ├── semantic_alignment.csv
│ │ ├── sentiment_alignment.csv
│ │ └── lsm_scores.csv
│ └── outputs/
│ ├── merged.csv
│ ├── bayes/
│ │ └── bayes_topic_alignment_outputs/
│ │ ├── figures/
│ │ ├── diagnostics/
│ │ └── ppc/
│ ├── gamm/
│ │ ├── figures/
│ │ └── gamm_models/
│ ├── other/
│ └── topics/
│ ├── conversations_with_topics.csv
│ ├── topic_distributions.png
│ └── combined_measures.csv
└── scripts/
└── alignment_score_extraction.sh # End-to-end alignment score extraction
All analysis is performed on raw data. The data cleaning pipeline (src/processing/cleaning.py) performs text normalization and language filtering but does not remove or obscure any identifying information. This preserves the full semantic and structural content needed for discourse and alignment analysis.
The conversational dataset is not released. Raw conversation data and source URLs are withheld for the following reasons:
- Privacy considerations: While conversations were publicly shared by users, re-distribution without explicit consent may violate reasonable privacy expectations.
- Ethical considerations: Individuals who shared conversations may not have anticipated research use or corpus aggregation.
- Terms of service: Redistribution of ChatGPT conversation data may conflict with OpenAI's terms of service.
Privacy risk is managed at the disclosure boundary, not during internal computation.
This repository provides:
- Data collection pipeline: Complete code for replicating the collection process
- Analysis code: Data cleaning, feature extraction, and analytical methods
- Anonymization utility: Optional tool for PII removal (src/processing/anonymize.py)
The anonymization script is not part of the analytical pipeline. It is provided as an optional utility for:
- Inspecting or sharing individual conversation excerpts
- Creating demonstration examples
- Use by third parties on independently collected data
This tool does not provide complete privacy protection and should not be relied upon as a sole safeguard.
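For reference, here is a minimal sketch of the kind of Presidio-based masking such a utility performs. The entities, replacement operators, and configuration in src/processing/anonymize.py may differ, and the example text is made up.

```python
# Sketch of Presidio-based PII masking: detect PII spans, then replace them
# with entity placeholders. Illustrative only; not the project's configuration.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "My name is Jane Doe and you can reach me at jane.doe@example.com."

results = analyzer.analyze(text=text, language="en")            # detect PII spans
masked = anonymizer.anonymize(text=text, analyzer_results=results)
print(masked.text)  # e.g. "My name is <PERSON> and you can reach me at <EMAIL_ADDRESS>."
```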
Results can be replicated by:
- Using the provided pipeline to collect newly shared conversations
- Following the data cleaning and analysis procedures documented in this repository
- Comparing findings with those reported in associated publications
The collection pipeline is deterministic given the same Reddit data source and time period.
Users of this pipeline are responsible for:
- Ensuring compliance with relevant terms of service
- Obtaining necessary ethical approvals for their research
- Making independent determinations about data sharing and privacy protection
- Respecting the privacy and dignity of individuals whose conversations are analyzed
This repository provides tools for research but does not make claims about the ethical status of any particular use.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
If you use this project in your research, please cite it as follows:
@software{alignment_under_use,
title = {Alignment Under Use: Analyzing ChatGPT Usage Patterns Through Shared Conversations},
author = {Gloria Stvol and Sabrina Zaki},
year = {2026},
url = {https://github.com/sabszh/AlignmentUnderUse},
note = {Open-source research pipeline}
}
Contributions are welcome! Please feel free to:
- Report bugs via GitHub Issues
- Submit feature requests
- Create pull requests with improvements
- Suggest enhancements to documentation
For substantial changes, please open an issue first to discuss proposed modifications.
Issue: HTTP 403 errors when fetching conversations
- Solution: Your ChatGPT cookie has expired. Re-extract it from your browser following the Authentication section.
Issue: ImportError for required packages
- Solution: Ensure all dependencies are installed:
pip install -r requirements.txt
Issue: CUDA memory errors during embedding computation
- Solution: Reduce the batch size with --batch-size 64 or use the CPU with --device cpu
Issue: Arctic Shift API timeouts
- Solution: Increase delays between requests:
--comments-delay 2.0
- Check the Known Limitations section
- Review the Command-Line Options for detailed parameter documentation
- Open an issue on GitHub with:
- Command used and any error messages
- Python version and OS
- Relevant output from running the command
- Arctic Shift - Historical Reddit data archive
- OpenAI - ChatGPT conversation sharing feature and backend API
- Hugging Face - Pre-trained models (sentence-transformers, transformers)
Made with ❤️ for research and understanding AI-human interaction

