This repository contains experiments comparing Mixture-of-Experts (MoE) and Fast Feed-Forward (FFF) models, introduced in the FFF and UltraFastBert papers (author's repository).
The experiments folder contains (almost) self-contained Jupyter notebooks with benchmarks and experiments on the architecture.
The FastFF folder contains several implementations of the FFF model, including the reference one, along with tools for collecting data from models and training them.
Use pip or another package manager to install the package from this repository:

```sh
pip install git+https://github.com/ssslakter/FastFF
```

The main results are:
- SMEAR gives slight improvements in the FFF model as well as in MoE, although the hierarchical structure makes it harder to train (see the SMEAR sketch after this list). (Jupyter notebook)
- The data distribution between experts shifts to a single peak as the number of neurons per expert increases. (Jupyter notebook)
- FFF can be formulated as an MoE with a sparse binary matrix of transitions and an additional activation function (Softplus in the reference formulation). Additional experiments show that a linear activation function performs better; see the matrix-formulation sketch after this list. (Jupyter notebook)
- With the matrix formulation, parallelism is utilized better than in the reference implementation, so shallow layers see a speedup. For deep layers the sequential branch selection becomes faster, since the dense matrices require a lot of memory. (Jupyter notebook)
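
To illustrate the SMEAR result above, here is a minimal sketch of the idea, assuming two-layer ReLU experts; the class and parameter names are illustrative and not taken from this repository. Instead of routing each input to a discrete expert, the router's probabilities are used to average the experts' parameters into a single merged expert, which is then applied to the input:

```python
# Minimal SMEAR-style sketch (hypothetical names, not the code used in the notebooks):
# merge expert parameters with routing probabilities, then apply the merged expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # One weight/bias tensor per expert, stacked so they can be averaged in one einsum.
        self.w1 = nn.Parameter(torch.randn(n_experts, d_hidden, d_model) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(n_experts, d_hidden))
        self.w2 = nn.Parameter(torch.randn(n_experts, d_model, d_hidden) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(n_experts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); p: (batch, n_experts) routing probabilities.
        p = F.softmax(self.router(x), dim=-1)
        # Probability-weighted average of expert parameters: one merged expert per example.
        w1 = torch.einsum("be,ehd->bhd", p, self.w1)
        b1 = torch.einsum("be,eh->bh", p, self.b1)
        w2 = torch.einsum("be,edh->bdh", p, self.w2)
        b2 = torch.einsum("be,ed->bd", p, self.b2)
        h = F.relu(torch.einsum("bhd,bd->bh", w1, x) + b1)
        return torch.einsum("bdh,bh->bd", w2, h) + b2
```

For example, `SmearFFN(d_model=64, d_hidden=256, n_experts=4)(torch.randn(8, 64))` returns a tensor of shape `(8, 64)`. Because the mixture is taken over parameters rather than over expert outputs, only the single merged expert is evaluated per input and the routing stays fully differentiable.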

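The last two results can be made concrete with a small sketch of the "FFF as MoE" view, assuming single-neuron leaf experts as in the reference FFF; the class and variable names are hypothetical and the code is illustrative, not the implementation used in the notebooks. Traversing the decision tree yields a one-hot leaf index, so the output can be written as an MoE mixture whose routing matrix is binary:

```python
# Sketch of FFF expressed as an MoE with a binary (one-hot) routing matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFFAsBinaryMoE(nn.Module):
    def __init__(self, d_model: int, depth: int):
        super().__init__()
        self.depth = depth
        self.n_nodes = 2 ** depth - 1       # internal decision neurons
        self.n_leaves = 2 ** depth          # leaf "experts"
        self.node_w = nn.Parameter(torch.randn(self.n_nodes, d_model) * 0.02)
        # Each leaf expert is a single hidden neuron, as in the reference FFF.
        self.leaf_w1 = nn.Parameter(torch.randn(self.n_leaves, d_model) * 0.02)
        self.leaf_w2 = nn.Parameter(torch.randn(self.n_leaves, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Walk the tree: the sign of each node activation
        # picks the left/right child; after `depth` steps the leaf is known.
        node = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)
        for _ in range(self.depth):
            logit = (x * self.node_w[node]).sum(-1)     # (batch,)
            go_right = (logit > 0).long()
            node = 2 * node + 1 + go_right              # child index in heap layout
        leaf = node - self.n_nodes                      # leaf index in [0, n_leaves)
        # Binary routing matrix: a sparse one-hot MoE router.
        route = F.one_hot(leaf, self.n_leaves).to(x.dtype)   # (batch, n_leaves)
        # Dense "matrix formulation": evaluate all leaves in parallel, then mix with
        # the binary matrix. Softplus is the reference activation; the notebooks find
        # a linear activation works at least as well.
        hidden = F.softplus(x @ self.leaf_w1.T)              # (batch, n_leaves)
        leaf_out = hidden.unsqueeze(-1) * self.leaf_w2       # (batch, n_leaves, d_model)
        return torch.einsum("bl,bld->bd", route, leaf_out)
```

The dense formulation above evaluates every leaf and selects with the binary matrix, which parallelizes well for shallow trees; for deep trees the dense leaf matrices grow large, and a sequential traversal that evaluates only the selected leaf becomes the faster option, which is the tradeoff measured in the last notebook.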

