The Hidden Web of YouTube: How Comments Connect Content and Communities
YouTube's recommendation algorithm is a closely guarded secret. We seek to circumvent this "black box" by mapping the social structure of the platform from the bottom up. Using a massive dataset including 8.6 billion comments, we construct a network where channels are connected solely by the users who comment on them.
This approach ignores standard metrics such as view counts in order to reveal organic communities formed by genuine human interactions. The result is a transparent recommendation engine that prioritizes users' active interests (what makes them react) rather than passive metadata that may merely reflect a passing trend. By shifting the focus from static categories (simple views) to dynamic user behavior, we offer a new way to discover content based on where communities are actually active.
Alongside the structural analysis of channels and videos, we also study the humans behind the network. Using streaming methods on billions of comments, we analyze how users actually behave on YouTube: how active they are, how concentrated their attention is, and whether meaningful participation patterns exist at all.
We show that commenter behavior is not random: most users leave weak traces, while a smaller population exhibits stable and informative engagement. This behavioral backbone justifies the selection of "signal users", and ensures that the resulting content network reflects genuine community structure rather than statistical noise.
- User Level (Behavior): Is there any tendency in user commenting behavior? Is it possible to construct a network based on it?
- Content Level (Structure): What organic structures emerge when we connect channels/videos/categories based on human behavior (comments) rather than algorithms? Which metric could be used to design a meaningful network based on comments?
- Application (Recommendation): Can we build a transparent recommendation engine based on comments that bypasses the "Rich-Get-Richer" cycle? Can we predict a user's next favorite channel simply by knowing who their "digital neighbors" are?
Considering the size of YouNiverse, we chose not to explore another dataset.
For detailed documentation and methodology, see the original YouNiverse paper: YouNiverse: Large-Scale Channel and Video Metadata from English-Speaking YouTube
The dataset is available on Zenodo.
We focused on the large-scale structure of the YouNiverse dataset, employing the following methods:
- Louvain Community Detection (for identifying organic clusters)
- Pointwise Mutual Information (PMI) (for edge weighting)
- OLS Regression (for analyzing the link between subscribers and connectivity score)
- Interactive Visualization (Chord Diagrams, Sankey Diagrams)
- Large-Scale User Behavior Analysis (streaming computation of activity, breadth, and concentration over billions of comments)
- Participation Regimes & Filtering (identifying meaningful users vs. noise and grounding “Super User” filtering in empirical evidence)
- Robustness Diagnostics (noise evaluation, singleton impact, stability and sanity checks to ensure behavioral signal reliability)
To build a robust graph, we defined the "Signal" by filtering for "Super Users" (
Where:
- $N_{videos}(u)$: The number of unique videos user $u$ commented on (ensuring consistency).
- $N_{likes}(u)$: The total likes received on their comments (ensuring social validation).
- Bot Removal: We strictly removed the top $1\%$ of most active accounts to eliminate non-human behavior.
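The filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, the `min_videos` and `min_likes` parameters, the use of comment count as the measure of "most active", and the choice to remove suspected bots before applying the thresholds are all assumptions, since the exact thresholds are not given here.

```python
from math import ceil


def select_super_users(stats, min_videos, min_likes, bot_fraction=0.01):
    """Return the set of users kept as "signal".

    stats: dict user_id -> (n_comments, n_videos, n_likes)
    min_videos, min_likes: hypothetical thresholds on N_videos(u) and N_likes(u)
    bot_fraction: share of the most active accounts to discard (top 1% by default)
    """
    # Rank accounts by activity (comment count is an assumed proxy for activity).
    ranked = sorted(stats, key=lambda u: stats[u][0], reverse=True)
    n_bots = ceil(len(ranked) * bot_fraction)
    suspected_bots = set(ranked[:n_bots])

    # Keep users that pass both thresholds and are not flagged as bots.
    return {
        user
        for user, (_, n_videos, n_likes) in stats.items()
        if user not in suspected_bots
        and n_videos >= min_videos
        and n_likes >= min_likes
    }


# Toy data: "a" is discarded as the most active account, "b" passes both thresholds.
toy_stats = {"a": (1000, 50, 10), "b": (20, 12, 30), "c": (5, 2, 0), "d": (15, 10, 1)}
kept = select_super_users(toy_stats, min_videos=10, min_likes=5, bot_fraction=0.25)
```

Keeping the thresholds as parameters makes it easy to rerun the selection with different values and compare the resulting graphs.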
We aggregated billions of interactions into a graph where nodes
- The Interaction Score ($W_{ij}$): We developed a custom edge weight that balances specificity (PMI) with volume (raw shared count). The weight $W_{ij}$ between two channels is defined as:

  $$W_{ij} = \text{PMI}(i, j) \times \log(|U_i \cap U_j|)$$

  Where the Pointwise Mutual Information (PMI) is calculated as:

  $$\text{PMI}(i, j) = \log\left(\frac{P(i, j)}{P(i)P(j)}\right) = \log\left(\frac{N \cdot |U_i \cap U_j|}{|U_i| \cdot |U_j|}\right)$$

  - $|U_i \cap U_j|$: Number of shared commentators between channel $i$ and $j$.
  - $N$: Total number of users in the network.
  - Logic: PMI penalizes generic links between massive channels, while the $\log$ term prevents statistically high PMI values from insignificant niche channels (e.g., 2 users sharing 2 channels) from dominating the graph.
- Topology Analysis:
  - We applied the Louvain Algorithm to maximize the modularity $Q$, partitioning the network into communities $C_1, ..., C_k$ where internal density is maximized.
  - We calculated Degree to identify Hubs.
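The interaction score and the degree computation described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the natural logarithm is assumed, and degree is counted here as the number of neighbours with a strictly positive weight. Community detection itself is not reproduced.

```python
from collections import Counter
from itertools import combinations
from math import log


def interaction_scores(channel_users, n_users):
    """Compute W_ij = PMI(i, j) * log(|U_i ∩ U_j|) for every channel pair.

    channel_users: dict channel -> set of commenter ids (U_i)
    n_users: total number of users in the network (N)
    """
    scores = {}
    for i, j in combinations(sorted(channel_users), 2):
        shared = len(channel_users[i] & channel_users[j])
        if shared == 0:
            continue  # no shared commenter, so no edge
        pmi = log(n_users * shared / (len(channel_users[i]) * len(channel_users[j])))
        scores[(i, j)] = pmi * log(shared)
    return scores


def degree(scores):
    """Count, per channel, the neighbours linked by a strictly positive weight."""
    counts = Counter()
    for (i, j), weight in scores.items():
        if weight > 0:
            counts[i] += 1
            counts[j] += 1
    return counts


# Toy network with N = 100 users: A and B share 5 commenters, A and C share 1.
toy_channels = {
    "A": set(range(1, 11)),
    "B": set(range(6, 16)),
    "C": {1, 20},
}
toy_scores = interaction_scores(toy_channels, n_users=100)
toy_degree = degree(toy_scores)
```

Two properties of the formula are visible in the toy data: a pair with a single shared commenter gets a weight of exactly zero because $\log(1) = 0$, and a pair that co-occurs less often than chance has a negative PMI and therefore a negative weight.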
Finally, we operationalized the network structure.
- Proximity-Based Logic: We built a tool that suggests channels based on network proximity. By locating a user within a specific behavioral cluster, the engine recommends the strongest neighboring nodes ("digital neighbors") that they haven't visited yet.
- Value over Views: This topology-based approach prioritizes Appreciation (strong social links $W_{ij}$) over raw Views, effectively bypassing the "Rich-Get-Richer" loop of traditional algorithms.
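A proximity-based suggestion of this kind might look like the sketch below. It is an assumption-laden illustration rather than the engine itself: the `recommend` function, the representation of a user as a set of already visited channels, and the choice to sum the weights of all links from visited channels to a candidate are hypothetical.

```python
from collections import defaultdict


def recommend(weights, visited, k=3):
    """Suggest up to k unvisited channels, strongest "digital neighbors" first.

    weights: dict (channel_i, channel_j) -> W_ij (undirected edges)
    visited: set of channels the user has already commented on
    """
    candidate_score = defaultdict(float)
    for (i, j), weight in weights.items():
        if weight <= 0:
            continue  # only positive links count as evidence of proximity
        if i in visited and j not in visited:
            candidate_score[j] += weight
        elif j in visited and i not in visited:
            candidate_score[i] += weight

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(candidate_score.items(), key=lambda item: (-item[1], item[0]))
    return [channel for channel, _ in ranked[:k]]


toy_weights = {("A", "B"): 2.0, ("A", "C"): 0.5, ("B", "D"): 1.0, ("C", "D"): 3.0}
first = recommend(toy_weights, visited={"A"})
second = recommend(toy_weights, visited={"A", "C"})
```

View counts never enter the ranking: a small channel with a strong link to the user's neighborhood can outrank a much larger one.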
We studied whether user behavior is structured, stable and meaningful, or whether it is mostly random noise. Using streaming computation over billions of comments, we compute for every user:
- Activity: total number of comments
- Breadth: number of distinct channels interacted with
- Focus/Concentration: whether engagement is centered on a few channels or widely scattered
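The three per-user quantities above can be computed in a single pass over the comments, as in the following sketch. The function name and the input format are hypothetical, and concentration is measured here with a Herfindahl-style index (sum of squared channel shares), which is an assumed choice rather than the measure used in the project. The sketch keeps all counters in memory, so at the scale of billions of comments it would need chunking or out-of-core aggregation.

```python
from collections import Counter, defaultdict


def user_profiles(comment_stream):
    """Compute (activity, breadth, concentration) for every user.

    comment_stream: iterable of (user_id, channel_id) pairs, one per comment.
    """
    per_user = defaultdict(Counter)
    for user, channel in comment_stream:  # single pass over the stream
        per_user[user][channel] += 1

    profiles = {}
    for user, counts in per_user.items():
        activity = sum(counts.values())  # total number of comments
        breadth = len(counts)            # number of distinct channels
        # Assumed concentration measure: 1.0 when all comments go to one
        # channel, 1 / breadth when they are spread evenly.
        concentration = sum((c / activity) ** 2 for c in counts.values())
        profiles[user] = (activity, breadth, concentration)
    return profiles


toy_stream = [("u1", "A"), ("u1", "A"), ("u1", "B"), ("u1", "B"), ("u2", "C")]
toy_profiles = user_profiles(toy_stream)
```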
From this analysis we identify Participation Regimes, separating users according to the strength and stability of their behavior:
- casual, low-signal users
- moderately engaged users
- highly committed users with stable preferences
Only the latter categories provide enough behavioral evidence to meaningfully infer relationships between channels. This justifies filtering not as a set of arbitrary thresholds, but as an empirically grounded decision supported by the data.
To ensure that our framework truly reflects human behavior rather than artifacts, we perform a series of diagnostic checks:
- Noise and Singletons: evaluate the influence of users appearing only once or in extremely sparse contexts
- Stability Checks: verify that behavioral summaries remain consistent under thresholding
- Distributional Sanity Checks: ensure results match expected large-scale engagement behavior
These diagnostics confirm that the network we build is supported by reliable human signal rather than statistical chance, reinforcing the robustness of the final model.
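As an illustration of the singleton check, the sketch below measures how many users commented exactly once and how much of the total comment volume they account for. The function name and input format are hypothetical, and this is only one possible way to quantify singleton impact.

```python
def singleton_impact(activity_by_user):
    """Return (share of users with one comment, share of comments they wrote).

    activity_by_user: dict user_id -> number of comments
    """
    n_users = len(activity_by_user)
    total_comments = sum(activity_by_user.values())
    if n_users == 0 or total_comments == 0:
        return 0.0, 0.0

    # Each singleton user contributes exactly one comment.
    singletons = sum(1 for n in activity_by_user.values() if n == 1)
    return singletons / n_users, singletons / total_comments


user_share, comment_share = singleton_impact({"a": 1, "b": 1, "c": 8})
```

When the user share is large and the comment share is small, dropping singletons removes many weak traces while discarding little of the overall activity.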
Dive into the visual side of our analysis. Our data story moves beyond the code to visualize the full network of 449 million users, featuring interactive chord diagrams and a deep dive into the "Hubs" and "Bridges" that define the platform. It also shows you how user behavior emerges from the chaotic sea of YouTube comments, featuring graphs about user profiles and their characterization.
ada-2025-project-radatouille/
├── data/
│ ├── raw/
│ ├── models/ # network modeling dataset, with more values than just the filtered files
│ └── filtered/ # first filtering files
│
├── utils/
│ ├── __init__.py
│ ├── network_helper.py #utils methods file for VIDEO-level part
│ └── community_helper.py #utils methods file for USER-level part
│
├── web/
│
├── .gitignore
├── README.md # Project description and instructions
├── requirements.txt
├── results_content_network.ipynb # Jupyter notebook with the results for VIDEO-level part
└── results_user_community.ipynb # Jupyter notebook with the results for USER-level part
- Clone the repository:

      git clone https://github.com/epfl-ada/ada-2025-project-radatouille.git
      cd ada-2025-project-radatouille

- Set up the environment:

      python3 -m venv venv
      source venv/bin/activate
      pip install -r requirements.txt

- Data Acquisition:
  - Download the dataset from Zenodo and place it in data/raw/.
- Execution:
  - Run the results_[...].ipynb notebooks to view the analysis pipeline and visualizations. ⚠️ The notebooks are long, and some cells may take a long time and be memory-intensive to run.
| Team Member | Contribution Focus |
|---|---|
| Romain | Network Construction: Handled the crawling of edge data, implementation of the PMI/Score metric, and efficient handling of large dataframes. |
| Albert | Visualization & Story: Created the Chord diagrams, Gephi network exports, and led the design and implementation of the Data Story website. |
| Hugo | Algorithm & Analysis: Implemented community detection (Louvain), defined the specific metrics for "Hubs" and "Bridges," and managed the repository structure. |
| Thomas | User Analysis & Report: Built the user-level analysis pipeline, created and analyzed the user space and clustered the groups into communities |
| Matteo | User Analysis & Report: Developed user-level diagnostics, demonstrated the existence of meaningful behavioral structure and applied filtering strategies |
- AI coding assistants were used to assist with code implementation, debugging, data visualization, and technical documentation.
- All analytical decisions, research design, and interpretations were made by the team.
- The introductory image was created using ChatGPT.
