
Blocktail Tokenization Analysis

This repository provides a proof-of-concept toolset for measuring how different naming approaches (including Blocktail) affect token usage in large language models (LLMs). The results focus on token counts and their cost implications in pay-per-token systems and context-limited AI workflows. While this is only one benchmark among many possible tests, it demonstrates how class-naming methodology can reduce code verbosity and preserve context space.

Overview

Key Scripts

  1. tokenize_tests.py

    • Reads class-naming examples from data/test_cases.json.
    • Applies multiple tokenizers (e.g., GPT-4, LLaMA) to these naming patterns.
    • Exports tokenized results as JSON in the results/ directory (one file per tokenizer); a minimal sketch of this loop follows the list.
  2. analysis-compiler.py

    • Processes the generated *_results.json files.
    • Aggregates token usage for each naming methodology (Blocktail, BEM, Traditional).
    • Calculates metrics such as:
      • Token reductions
      • Marker complexity
      • Projected iterative cost savings
    • Produces a detailed Markdown report (tokenization_summary.md) with cross-model comparisons.
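
The core of tokenize_tests.py amounts to encoding each naming pattern with each configured tokenizer and recording the token counts. A minimal sketch of that loop follows; the helper name, file layout, and test-case fields are illustrative assumptions, not the script's actual API:

    import json
    import tiktoken
    from transformers import AutoTokenizer

    def count_tokens(tokenizer_name: str, text: str) -> int:
        """Number of tokens `text` encodes to under the named tokenizer."""
        if tokenizer_name in ("gpt-4", "gpt-3.5-turbo"):
            return len(tiktoken.encoding_for_model(tokenizer_name).encode(text))
        # Hugging Face tokenizers: gpt2, bert-base-uncased, roberta-base, t5-base, ...
        tok = AutoTokenizer.from_pretrained(tokenizer_name)
        return len(tok.encode(text, add_special_tokens=False))

    with open("data/test_cases.json") as f:
        cases = json.load(f)

    # One results file per tokenizer, mirroring results/<name>_results.json
    results = {
        case["scenario"]: {
            conv: count_tokens("gpt2", case[conv])
            for conv in ("blocktail", "bem", "traditional")
        }
        for case in cases
    }
    with open("results/gpt2_results.json", "w") as f:
        json.dump(results, f, indent=2)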

Workflow

  1. Prepare Test Cases

    • Add or edit naming examples in data/test_cases.json.
    • Each entry corresponds to a potential "class naming scenario" (a sketch of one possible entry format follows this list).
  2. Run Tokenization Tests

    • To test all configured tokenizers at once:
      python tokenize_tests.py all
    • Or run a single tokenizer:
      python tokenize_tests.py gpt2
  3. Compile Analysis

    • Generate the final summary report:
      python analysis-compiler.py
    • This reads all JSON output from results/ and collates statistics.
  4. Review Results

    • Open tokenization_summary.md to see:
      • Average token counts per naming convention
      • Relative savings (e.g., "Blocktail reduces tokens by ~40% vs. BEM")
      • Marker-complexity breakdown
      • Potential cost/time gains for iterative AI prompts
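
The exact schema of data/test_cases.json is defined by the repository; one plausible entry for a single scenario might look like the following (field names and class strings are illustrative placeholders — in particular, the Blocktail string is not canonical syntax; see blocktail.io):

    [
      {
        "scenario": "product card, active state",
        "traditional": "product-card active",
        "bem": "product-card product-card--active",
        "blocktail": "product_card --active"
      }
    ]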

Dependencies

Make sure to install:

  • sentencepiece (for SentencePiece-based models)
  • transformers (Hugging Face)
  • tiktoken (OpenAI GPT models)
  • numpy, pandas, scipy (for basic statistical computations)
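
All of these are available from PyPI and can be installed in one step:

    pip install sentencepiece transformers tiktoken numpy pandas scipy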

Authentication

If you use private Hugging Face models (e.g., LLaMA 3), create a .env file with:

HF_AUTH_TOKEN=your_huggingface_token
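
The scripts can then pick the token up from the environment. One common pattern, assuming the python-dotenv package (the repository may load it differently), is:

    import os
    from dotenv import load_dotenv
    from transformers import AutoTokenizer

    load_dotenv()  # reads .env into the process environment

    # `token=` (transformers >= 4.32; older versions use `use_auth_token=`)
    # authenticates against gated/private Hugging Face repos
    tok = AutoTokenizer.from_pretrained(
        "meta-llama/Meta-Llama-3-70B",
        token=os.environ["HF_AUTH_TOKEN"],
    )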

Tokenization Analysis Summary

Cross Comparison

Tokenizer   Blocktail Avg   Traditional Avg   BEM Avg   vs Trad.   vs BEM
llama3      8.7             8.7               14.6       0.3%      40.5%
t5          14.5            13.8              24.1      -5.1%      39.7%
spiece      14.5            13.8              24.1      -5.1%      39.7%
roberta     10.9            13.1              19.6      16.8%      44.3%
gpt35       8.7             8.7               14.6       0.3%      40.5%
gpt2        10.9            13.1              19.6      16.8%      44.3%
bert        11.2            12.9              21.9      13.0%      48.9%
gpt4        8.7             8.7               14.6       0.3%      40.5%
mistral     10.6            13.4              19.3      20.5%      44.9%

(Averages are tokens per component; the percentage columns give Blocktail's token reduction relative to Traditional and BEM, so negative values mean Blocktail used more tokens.)

Practical Impact

  • Average token reduction per component: 4.5 tokens

  • In a typical page with 20 components: 90 tokens

  • Over 5 iterative refinements: 450 tokens saved per page

(In long-chain AI-assisted development, these savings compound across multiple revisions.)
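
These figures are straightforwardly multiplicative, as the sketch below shows:

    avg_saving_per_component = 4.5  # tokens, averaged across the models above
    components_per_page = 20
    refinement_iterations = 5

    per_page = avg_saving_per_component * components_per_page  # 90 tokens
    per_page_over_iterations = per_page * refinement_iterations  # 450 tokens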

Token Usage by Marker Complexity

  • 0 markers: 2.7 tokens (±0.5)
  • 1 marker: 5.2 tokens (±1.0)
  • 2 markers: 6.2 tokens (±1.1)
  • 3 markers: 8.4 tokens (±1.4)
  • 4 markers: 11.6 tokens (±2.4)
  • 5 markers: 13.7 tokens (±2.7)
  • 6 markers: 15.6 tokens (±2.8)
  • 7 markers: 14.4 tokens (±3.1)

Note: All naming conventions rely on some markers, and token usage inevitably rises as you add more states or contexts. Methods like Blocktail, however, aim to keep subword splitting minimal, so token counts stay lean even when multiple states or contexts are stacked.
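
As a purely illustrative example of what "marker count" means here (placeholder names, not canonical Blocktail syntax; see blocktail.io for the real conventions):

    card                          # 0 markers
    card --active                 # 1 marker
    card --active --featured      # 2 markers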

Sample Data

The sample data aims to represent best-case code practices, particularly for traditional HTML, but real-world naming patterns can vary significantly. When managing multiple states and dynamic behaviors, naming conventions often become more verbose and semantic than those used here.

Tokenizer Associations

Implementation Library   Tokenizers Used
tiktoken                 gpt-4, gpt-3.5-turbo
transformers             meta-llama/Meta-Llama-3-70B, mistralai/Mistral-7B-v0.3, bert-base-uncased, roberta-base, t5-base, gpt2
sentencepiece            spiece.model

This table reflects only the tokenizer implementations and specific tokenizers tested in our analysis. We access these tokenizers through their respective libraries but do not test or use the full models themselves.
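
For reference, each implementation exposes its own loading API. A brief sketch (the spiece.model path is an assumption about where the local SentencePiece model lives):

    import tiktoken
    import sentencepiece as spm
    from transformers import AutoTokenizer

    sample = "product-card__button--active"

    enc = tiktoken.encoding_for_model("gpt-4")                  # or "gpt-3.5-turbo"
    hf = AutoTokenizer.from_pretrained("roberta-base")          # or gpt2, t5-base, ...
    sp = spm.SentencePieceProcessor(model_file="spiece.model")  # local model file

    print(len(enc.encode(sample)))                           # tiktoken
    print(len(hf.encode(sample, add_special_tokens=False)))  # transformers
    print(len(sp.encode(sample)))                            # sentencepiece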

For full methodology and extended documentation, see blocktail.io
