The Hidden Web of YouTube: How Comments Connect Content and Communities

Abstract 📄

YouTube's recommendation algorithm is a closely guarded secret. We seek to circumvent this "black box" by mapping the social structure of the platform from the bottom up. Using a massive dataset including 8.6 billion comments, we construct a network where channels are connected solely by the users who comment on them.

This approach ignores standard metrics such as view counts in order to reveal organic communities formed by genuine human interactions. The result is a transparent recommendation engine that prioritizes users' active interests —what makes them react— rather than passive metadata that may result from a simple trend. By shifting the focus from static categories (simple views) to dynamic user behavior, we offer a new way to discover content based on where communities are actually active.

Alongside the structural analysis of channels and videos, we also study the humans behind the network. Using streaming methods on billions of comments, we analyze how users actually behave on YouTube: how active they are, how concentrated their attention is, and wether meaningful participation patterns exist at all.

We show that commenter behavior is not random: most users leave weak traces, while a smaller population exhibits stable and informative engagement. This behavioral backbone justifies the selection of "signal users", and ensures that the resulting content network reflects genuine community structure rather than statistical noise.

Research Questions 💭

User Level (Behavior): Is there any tendency in user commenting behavior? Is it possible to construct a network based on it?
Content Level (Structure): What organic structures emerge when we connect channels/videos/categories based on human behavior (comments) rather than algorithms? Which metric could be used to design a meaningful network based on comments?
Application (Recommendation): Can we build a transparent recommendation engine based on comments that bypasses the "Rich-Get-Richer" cycle? Can we predict a user's next favorite channel simply by knowing who their "digital neighbors" are?

Dataset 📚

Considering the size of YouNiverse, we chose not to explore another dataset.

For detailed documentation and methodology, see the original YouNiverse paper: YouNiverse: Large-Scale Channel and Video Metadata from English-Speaking YouTube

The dataset is available on Zenodo.

Methods 🛠️

We focused on the large-scale structure of the YouNiverse dataset, employing the following methods:

Louvain Community Detection (for identifying organic clusters)
Pointwise Mutual Information (PMI) (for edge weighting)
OLS Regression (for analyzing the link between subscribers and connectivity score)
Interactive Visualization (Chord Diagrams, Sankey Diagrams)
Large-Scale User Behavior Analysis (streaming computation of activity, breadth, and concentration over billions of comments)
Participation Regimes & Filtering (identifying meaningful users vs. noise and grounding “Super User” filtering in empirical evidence)
Robustness Diagnostics (noise evaluation, singleton impact, stability and sanity checks to ensure behavioral signal reliability)

1. Preprocessing & User Profiling (The Signal)

To build a robust graph, we defined the "Signal" by filtering for "Super Users" ($U_{super}$). A user $u$ is retained only if they satisfy the following engagement thresholds:

$$u \in U_{super} \iff (N_{videos}(u) \ge 24) \land (N_{likes}(u) \ge 5)$$

Where:

$N_{videos}(u)$: The number of unique videos user $u$ commented on (ensuring consistency).
$N_{likes}(u)$: The total likes received on their comments (ensuring social validation).
Bot Removal: We strictly removed the top $1%$ of most active accounts to eliminate non-human behavior.

2. Network Construction (The Map)

We aggregated billions of interactions into a graph where nodes $i$ and $j$ represent Channels (aggregated by category).

The Interaction Score ($W_{ij}$): We developed a custom edge weight that balances specificity (PMI) with volume (raw shared count). The weight $W_{ij}$ between two channels is defined as:

$$W_{ij} = \text{PMI}(i, j) \times \log(|U_i \cap U_j|)$$

Where the Pointwise Mutual Information (PMI) is calculated as:

$$\text{PMI}(i, j) = \log\left(\frac{P(i, j)}{P(i)P(j)}\right) = \log\left(\frac{N \cdot |U_i \cap U_j|}{|U_i| \cdot |U_j|}\right)$$
- $|U_i \cap U_j|$: Number of shared commentators between channel $i$ and $j$.
- $N$: Total number of users in the network.
- Logic: PMI penalizes generic links between massive channels, while the $\log$ term prevents statistically high PMI values from insignificant niche channels (e.g., 2 users sharing 2 channels) from dominating the graph.
Topology Analysis:
- We applied the Louvain Algorithm to maximize the modularity $Q$, partitioning the network into communities $C_1, ..., C_k$ where internal density is maximized.
- We calculated Degree to identify Hubs.

3. The Recommendation Engine (The Tool)

Finally, we operationalized the network structure.

Proximity-Based Logic: We built a tool that suggests channels based on network proximity. By locating a user within a specific behavioral cluster, the engine recommends the strongest neighboring nodes ("digital neighbors") that they haven't visited yet.
Value over Views: This topology-based approach prioritizes Appreciation (strong social links $W_{ij}$) over raw Views, effectively bypassing the "Rich-Get-Richer" loop of traditional algorithms.

4. User Behavior Analysis

We studied wether user behavior is structured, stable and meaningful or wether it's mostly random noise. Using streaming computation over billions of comments, we compute for every user:

Activity: total number of comments
Breadth: number of distinct channels interacted with
Focus/Concentration: wether engagement is centered on a few channels or widely scattered

5. Participation Regimes & Signal Extraction

From this analysis we identify Participation Regimes, separating users according to the strength and stability of their behavior:

casual, low-signal users
moderately engaged users
highly committed users with stable preferences

Only the later categories provide enough behavioral evidences to meaningfully infer relationships between channels. Thus justifies filtering not as arbitrary tresholds, but as an empirically grounded decision supported by the data.

6. Diagnostics & Robustness

To ensure that our framework truly reflects human behavior rather than artifacts, we perform a series of diagnostic checks:

Noise and Singletons: evaluate the influence of users appearing only once or in extremely sparse contexts
Stability Checks: verify that behavioral summaries remain consistent under thresholding
Distributional Sanity Checks: ensure results match expected large-scale engagement behavior

These diagnostics confirm that the network we build is supported by reliable human signal rather than statistical chance, reinforcing the robustness of the final model.

Data Story

Dive into the visual side of our analysis. Our data story moves beyond the code to visualize the full network of 449 million users, featuring interactive chord diagrams and a deep dive into the "Hubs" and "Bridges" that define the platform. It also shows you how user behavior emerges from the chaotic sea of YouTube comments, featuring graphs about user profiles and their characterization.

Repository Structure 📁

ada-2025-project-radatouille/
├── data/
│   ├── raw/
│   ├── models/                           # network modeling dataset, with additional values than juste filtered files
│   └── filtered/                             # first filtering files
│
├── utils/                                         
│   ├── __init__.py                      
│   ├── network_helper.py       #utils methods file for VIDEO-level part
│   └── community_helper.py     #utils methods file for USER-level part
│
├── web/
│                                           
├── .gitignore
├── README.md                                       # Project description and instructions
├── requirements.txt                           
├── results_content_network.ipynb                   # Jupyter notebook with the results for USER-level part
└── results_user_community.ipynb                    # Jupyter notebook with the results for VIDEO-level part

How to Run the Code 💻

Clone the repository:

git clone https://github.com/epfl-ada/ada-2025-project-radatouille.git
cd ada-2025-project-radatouille

Set up the environment:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Data Acquisition:
- Download the dataset from the Zenodo and place it in data/raw/.
Execution:
- Run the results_[...].ipynb notebooks to view the analysis pipeline and visualizations.
- ⚠️The notebook is long, and some cells may take a long time and be memory-intensive to run.

Contributions 📝

Team Member	Contribution Focus
Romain	Network Construction: Handled the crawling of edge data, implementation of the PMI/Score metric, and efficient handling of large dataframes.
Albert	Visualization & Story: Created the Chord diagrams, Gephi network exports, and led the design and implementation of the Data Story website.
Hugo	Algorithm & Analysis: Implemented community detection (Louvain), defined the specific metrics for "Hubs" and "Bridges," and managed the repository structure.
Thomas	User Analysis & Report: Built the user-level analysis pipeline, created and analyzed the user space and clustered the groups into communities
Matteo	User Analysis & Report: Developed user-level diagnostics, demonstrated the existence of meaningful behavioral structure and applied filtering strategies

Acknowledgments & AI Usage ☑️

AI coding assistants were used to assist with code implementation, debugging, data visualization, and technical documentation.
All analytical decisions, research design, and interpretations were made by the team.
The introductory image was created using ChatGPT.

Contributors 👥

SltMatteo, Tkemper2, albertfares, jeanninhugo, frossardr

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Hidden Web of YouTube: How Comments Connect Content and Communities

Abstract 📄

Research Questions 💭

Dataset 📚

Methods 🛠️

1. Preprocessing & User Profiling (The Signal)

2. Network Construction (The Map)

3. The Recommendation Engine (The Tool)

4. User Behavior Analysis

5. Participation Regimes & Signal Extraction

6. Diagnostics & Robustness

Data Story

Repository Structure 📁

How to Run the Code 💻

Contributions 📝

Acknowledgments & AI Usage ☑️

Contributors 👥

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 167 Commits
.github/workflows		.github/workflows
data		data
utils		utils
web		web
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
results_content_network.ipynb		results_content_network.ipynb
results_user_community.ipynb		results_user_community.ipynb

Folders and files

Latest commit

History

Repository files navigation

The Hidden Web of YouTube: How Comments Connect Content and Communities

Abstract 📄

Research Questions 💭

Dataset 📚

Methods 🛠️

1. Preprocessing & User Profiling (The Signal)

2. Network Construction (The Map)

3. The Recommendation Engine (The Tool)

4. User Behavior Analysis

5. Participation Regimes & Signal Extraction

6. Diagnostics & Robustness

Data Story

Repository Structure 📁

How to Run the Code 💻

Contributions 📝

Acknowledgments & AI Usage ☑️

Contributors 👥

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages