75 commits
f5f393f
Ran 1st test locally for lab 1
tonytarizzo Jan 27, 2025
d52abad
Putting all data into git
tonytarizzo Jan 27, 2025
2bf20f4
Update it for tutorial 5 progress
tonytarizzo Jan 30, 2025
468b0a5
Update git to finish lab 2
tonytarizzo Feb 3, 2025
8369124
Finish lab 3, created plots
tonytarizzo Feb 4, 2025
401a355
push final markdown sheet with answers
tonytarizzo Feb 7, 2025
38267b0
Added cleaned-up colab files to directory for lab oral
tonytarizzo Feb 7, 2025
21b5efb
Add wave2vec model to supported models, add CTC head for inference an…
tonytarizzo Mar 7, 2025
3b36eb9
Added/Changed multiple file to implement Wav2Vec correctly. This push…
tonytarizzo Mar 10, 2025
403a2c7
testing change to WER function to be not batched
tonytarizzo Mar 10, 2025
1fa7cc8
It worked! evaluation produced the correct output
tonytarizzo Mar 10, 2025
e3e4e98
tested to confirm pipeline working, both train and evaluate
tonytarizzo Mar 10, 2025
2e9d2a3
added quantisation, lora, NAS implementation scripts
tonytarizzo Mar 11, 2025
4622028
pruning stuff
Mar 11, 2025
24851a2
pruning stuff
Mar 11, 2025
c9134bd
Modified fetch_info() function in prune.py to fall back to module met…
Mar 11, 2025
740c90c
Added lilibrispeech_asr in MaseDataModule
Matthieu6 Mar 11, 2025
2ac7bc2
Fixed movement local and global, address l1-normal global issues and …
tonytarizzo Mar 13, 2025
96044da
Weight movement pruning done, activation movement pruning still to be…
tonytarizzo Mar 13, 2025
409c481
added gradient accumalation and mini batch sizes
tonytarizzo Mar 14, 2025
d77c8b7
add librspeech_asr dataset and processor logic to get_tokenized_datas…
tonytarizzo Mar 15, 2025
822413b
adding dataset
Nyal101 Mar 15, 2025
18ff07a
add dataset config chagnes for condensed lib speach
Nyal101 Mar 15, 2025
72bec36
fix run prepare data error
Nyal101 Mar 15, 2025
f6125db
orgnaised datacollator, movement pruning into mase
tonytarizzo Mar 15, 2025
f496dfe
Merge branch 'main' of https://github.com/tonytarizzo/mase-individual
tonytarizzo Mar 15, 2025
bd72a87
change epset_version from 11 to 14.
Matthieu6 Mar 15, 2025
96f561f
adding hwpq pruning
Nyal101 Mar 15, 2025
5ca31ab
adding hwpq pruning
Nyal101 Mar 15, 2025
29f9862
try correcting greedy to beam decoding ctc head
tonytarizzo Mar 15, 2025
84e29ea
Merge branch 'main' of https://github.com/tonytarizzo/mase-individual
tonytarizzo Mar 15, 2025
5762e1a
update runtime_analysis_pass method for WER and ctc, still needs testing
tonytarizzo Mar 16, 2025
1a38589
Finished rewriting runtime analysis, including evaluate function for …
tonytarizzo Mar 16, 2025
3828a54
add name to keyword runtime config
tonytarizzo Mar 16, 2025
bfa60b9
add debugging prints to runtimeanalysis
tonytarizzo Mar 16, 2025
0804bf0
quick fix to print statmenets
tonytarizzo Mar 16, 2025
84942f6
ammend
tonytarizzo Mar 16, 2025
e10eeb5
ammend actually
tonytarizzo Mar 16, 2025
ad36d11
added new prints
tonytarizzo Mar 16, 2025
7204838
add debug
tonytarizzo Mar 16, 2025
558b8de
add trt import check
tonytarizzo Mar 16, 2025
dfbea25
add proper imports
tonytarizzo Mar 16, 2025
5064e88
add tesnorrt and pycuda to setup
tonytarizzo Mar 16, 2025
4a176b7
addlib speach dataset to get tokenizer function in hugging face
Nyal101 Mar 16, 2025
c388350
detatch logic fix
tonytarizzo Mar 16, 2025
67aef40
Merge branch 'main' of https://github.com/tonytarizzo/mase-individual
tonytarizzo Mar 16, 2025
35d3b02
test preds keys
tonytarizzo Mar 16, 2025
c17a084
added ctc head to runtime_analysis
tonytarizzo Mar 16, 2025
0decd5b
fix pred logic handling
tonytarizzo Mar 16, 2025
08c9374
add final_pipeline.py
Nyal101 Mar 16, 2025
ce1dd38
remove stupid print
tonytarizzo Mar 17, 2025
0acc7fa
Add FlexRound implementation and related changes
Matthieu6 Mar 17, 2025
2d016d1
move ctc_head and pred logic to same device
tonytarizzo Mar 17, 2025
1d8cff3
cahnge hwpq to just prune
Nyal101 Mar 17, 2025
aa9fde3
cahnge hwpq to just prune and remove unecessary code
Nyal101 Mar 17, 2025
fa44778
cahnge hwpq to just prune and remove unecessary code
Nyal101 Mar 17, 2025
82dbdfc
cahnge hwpq to just prune and remove unecessary code
Nyal101 Mar 17, 2025
f6e5672
calibration for FlexRound
Matthieu6 Mar 22, 2025
883adda
add examples
Matthieu6 Mar 22, 2025
1583cf6
Merge branch 'Mat_Flexround'
Matthieu6 Mar 22, 2025
ddb4022
Merge Flexround
Matthieu6 Mar 22, 2025
1f5b3af
Merge branch 'main' of https://github.com/tonytarizzo/mase-individual
Matthieu6 Mar 22, 2025
d5d08c8
clean up FlexRound
Matthieu6 Mar 22, 2025
17ecd1d
fix hwpq error
Nyal101 Mar 22, 2025
b04fd01
fixed onnx_runtime_interface_pass by adding attention_mask, edited ru…
tonytarizzo Mar 23, 2025
99e5a9f
Merge branch 'main' of https://github.com/tonytarizzo/mase-individual
tonytarizzo Mar 23, 2025
74bf35d
removed pytorch quant
tonytarizzo Mar 23, 2025
a39e29a
imports of tesnorrt and cuda readded
tonytarizzo Mar 23, 2025
a8f093e
move imports inside the function
tonytarizzo Mar 23, 2025
62db3d4
change global method of importing tensortt
tonytarizzo Mar 23, 2025
149ab7a
added attention mask handlign to runtime_analysis_pass
tonytarizzo Mar 23, 2025
8dc3270
added attention mask functioanlity to the inference functions
tonytarizzo Mar 23, 2025
e1ff225
Add SNIP pruning implementation
Mar 23, 2025
69697b8
Add SNIP pruning implementation
Mar 23, 2025
e7cae66
Improve parameter counting display in SNIP example
Mar 23, 2025
Binary file added Working_Data/Question_5b_results.docx
Binary file added Working_Data/Random Search_search_results.png
185 changes: 185 additions & 0 deletions Working_Data/adls_labs_at5424.md
@@ -0,0 +1,185 @@
## Tutorial 1 (Lab 0): Introduction

### Run-Time Error
Here is the run-time error encountered during the session:

```plaintext
RuntimeError: Tried to erase Node bert_embeddings_dropout but it still had 6 users in the graph:
{getattr_2: None, size_4: None, bert_encoder_layer_0_attention_self_query: None,
bert_encoder_layer_0_attention_self_key: None, bert_encoder_layer_0_attention_self_value: None,
add_7: None}!
```

## Tutorial 2 (Lab 0): Lora Finetune

### Removing attention_mask, labels from hf_input_names and its effect on the graphs:
The graph was created with and without the extra information and compared.

Firstly, omitting the labels means that no cross-entropy loss is calculated or displayed at the end of the process (removing 4 blocks from the graph). This is because ground-truth labels are required for loss calculation, so without them no losses can be computed. Secondly, when no attention_mask is specified, the model inserts an extra getattr_1 block after the input instead of having a separate attention_mask input block. When no mask is given, the mask is derived from the input information itself, whereas an externally supplied mask lets the caller choose manually which information to mask.
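The derived-versus-manual mask distinction can be sketched minimally (a hypothetical illustration, not the lab's MASE graph code):

```python
import torch

# When no attention_mask is supplied, a model typically derives one from the
# input ids themselves (attend to everything) - the role of the extra
# getattr_1 block. A manual mask instead lets the caller pick positions.
input_ids = torch.tensor([[101, 2054, 2003, 102]])
derived_mask = torch.ones_like(input_ids)   # built from the input information
manual_mask = torch.tensor([[1, 1, 1, 0]])  # externally chosen masking
```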


## Tutorial 3 (Lab 1): QAT
The following combinations of widths were tested: (4,2) (8,4) (12,8) (16,8) (20,10) (24,12) (28,14) (32,16).

![Fixed point width vs highest achieved accuracy](fixed_point_width_vs_accuracy.png)
![PTQ and QAT comparison vs highest achieved accuracy](ptq_vs_qat_accuracy.png)

It is clear that quantisation-aware training is much more effective at improving accuracy than quantising after training alone, as it allows the model to adapt to the lower precision during training, reducing the loss that would otherwise be incurred when quantisation is applied at the end of the process.
The model that offered the strongest balance between accuracy and degree of quantisation used fixed-point width 16, i.e. the (16,8) configuration.
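The QAT behaviour can be sketched with a minimal fixed-point fake quantiser using a straight-through estimator (a generic illustration, not the MASE implementation; `width`/`frac_width` mirror the (width, frac_width) pairs above):

```python
import torch

def fake_quantize(x: torch.Tensor, width: int = 16, frac_width: int = 8) -> torch.Tensor:
    """Symmetric fixed-point fake quantisation with a straight-through estimator."""
    scale = 2.0 ** frac_width
    qmax = 2.0 ** (width - 1) - 1
    # Round to the fixed-point grid in the forward pass...
    q = torch.clamp(torch.round(x * scale), -qmax - 1, qmax) / scale
    # ...but let gradients flow through unchanged, so training adapts to the grid.
    return x + (q - x).detach()
```

Inserting this after each layer during training is what lets the weights settle onto the quantisation grid, instead of being snapped to it only at the end as in PTQ.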

## Tutorial 4 (Lab 1): Pruning
A range of sparsity levels was then applied, from 0.1 to 0.9. Each sparsity level was tested with both the random and the l1-norm method. For each combination, 5 epochs of training were run so that the accuracies reached would closely approximate what many more epochs of training would achieve.

![Sparsity vs highest achieved accuracy](highest_accuracy_by_sparsity.png)
![Random vs L1-Norm comparison](pruning_accuracy_by_sparsity.png)

L1-norm performed better than random in all cases and allowed for more drastic pruning.
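The l1-norm criterion above can be sketched as a simple magnitude-based mask (an illustrative stand-in for the lab's pruning pass, with a hypothetical helper name):

```python
import torch

def l1_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest |w|."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    # kthvalue gives the k-th smallest magnitude; prune everything at or below it.
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()
```

Random pruning replaces the threshold test with a Bernoulli draw, which is why it removes informative large-magnitude weights and degrades faster at high sparsity.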

## Tutorial 5 (Lab 2): Nas Optuna
Each sampler was given the same search space and 10 iterations to find the optimal setup. Each sampled configuration was trained for 3 epochs so that the accuracies reached would stabilise and more faithfully reflect the quality of that point in the search space.
![Random vs Grid vs TPE search method comparison](combined_optuna_results.png)

TPE found the highest-accuracy combination fastest and reached the highest overall accuracy, making it the best search method.
TPE was then used in part b, where compression-aware search was run and tested.

![Effects of compression and post-compression fine-tuning](compression_aware_results.png)

No compression eventually performed best, mainly because the compression applied was quite severe, but the compression-aware training method reached accuracy levels similar to the non-compressed model with a much smaller model size.

## Tutorial 6 (Lab 3): Mixed Precision Search
![Number of trials vs maximum achieved accuracy](optuna_search_results.png)

The TPE sampler was used to search for the optimal configuration, which was found on the 3rd iteration.
The search was then extended to include the following quantised configurations for each layer:

- torch.nn.Linear (no quantisation)
- LinearInteger
- LinearMinifloatDenorm
- LinearMinifloatIEEE
- LinearLog
- LinearBlockFP
- LinearBlockLog
- LinearBinary
- LinearBinaryScaling

(I wasn't able to use LinearBlockMinifloat without errors, and LinearBinaryResidualSign had not been implemented yet, so these were omitted.) I initially used the Optuna sampler (TPE in this case) to search for the optimal layer types, which yielded the following results.

![Number of trials vs maximum achieved accuracy](extended_optuna_search_results_2_curves.png)

Iteration 2 (the highest-accuracy iteration) used LinearMinifloatIEEE. After realising that separate curves for each precision were wanted, the code was rewritten to run 5 iterations of 3 epochs for each layer type. Here are those results.

![Maximum achieved accuracy after 5 iterations for each precision layer type](optuna_combined_precision_progress.png)
LinearLog and LinearBlockLog were both found to be completely ineffective (possibly implemented incorrectly), with LinearBinary and LinearBinaryScaling achieving only slightly higher accuracy.

To see the trends in the rest of the results, here is a zoomed-in view, showing that LinearInteger eventually performed best and even surpassed the full-precision version.
![Best performing precision layer types](optuna_combined_precision_progress_zoomed.png)

## Lab 4: ADLS Software Stream
### Part 1
The pre-compiled model showed a longer processing time of 24.5413 s, compared with 15.3287 s for the JIT-compiled model. This could be down to a few reasons, which were tested by modifying the code.

- Variations in time can be introduced by the different backend compilation methods, which are optimised for different purposes. The compilers available to test were inductor, cudagraphs and eager, with the default, inductor, offering a trade-off between efficiency and memory overhead.
- The timing results varied considerably depending on whether the script was being run for the first time or had been run before. This indicates that some degree of warm-up was required for the pre-compiled method, potentially due to memory bottlenecks on the first run-through. A likely cause is that the first run-through includes tracing and graph transformations that later ones do not.
- In addition, the low number of iterations chosen in the code (n=5) meant that the results were more susceptible to initial memory-overhead effects than later iterations; the reliability of the timing tests could be improved with more iterations.

To test how much the timings were affected by the number of iterations, CPU/GPU and compiler method, the following combinations were run (the Colab environment was reset between every 5 or 20 runs to account for potential warm-up effects). Cudagraphs was omitted from the CPU runs since it requires a GPU.
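The "ignoring first N" methodology can be sketched as a benchmark helper that discards warm-up iterations (an illustrative sketch, not the lab's timing code; the helper name is hypothetical):

```python
import time
import torch

def benchmark(fn, x, n_iters=20, n_warmup=10):
    """Average per-iteration time, discarding warm-up runs that absorb
    one-off tracing, compilation and memory-allocation overhead."""
    for _ in range(n_warmup):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # GPU kernels are async: drain before timing
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters
```

Passing a `torch.compile(model)` callable as `fn` pushes its tracing cost into the warm-up phase, which is why the "ignoring first 10" rows below converge.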

5 runs on "cpu":
- Original model: 15.3287 s
- Optimised model: 24.5413 s
- Optimised model (inductor): 10.8946 s
- Optimised model (eager): 14.0508 s

20 runs on "cpu":
- Original model: 15.0116 s
- Optimised model: 14.1991 s
- Optimised model (inductor): 11.3416 s
- Optimised model (eager): 13.3609 s

5 runs on "cuda":
- Original model: 0.2519 s
- Optimised model: 5.0008 s
- Optimised model (inductor): 0.1103 s
- Optimised model (eager): 0.2530 s
- Optimised model (cudagraphs): 0.9695 s

20 runs on "cuda":
- Original model: 0.1099 s
- Optimised model: 0.4991 s
- Optimised model (inductor): 0.1129 s
- Optimised model (eager): 0.1394 s
- Optimised model (cudagraphs): 0.3148 s

20 runs but ignoring first 10 on "cuda":
- Original model: 0.0950 s
- Optimised model: 0.1106 s
- Optimised model (inductor): 0.1106 s
- Optimised model (eager): 0.0952 s
- Optimised model (cudagraphs): 0.1577 s

100 runs but ignoring first 10 on "cuda":
- Original model: 0.0950 s
- Optimised model: 0.1120 s
- Optimised model (inductor): 0.1128 s
- Optimised model (eager): 0.0968 s
- Optimised model (cudagraphs): 0.0990 s

The main takeaways from this are:
- there is significant initial overhead in all cases due to tracing and graph transformations;
- a low number of iterations (n=5) skews the results because of this, so more iterations are needed to stabilise them;
- changing the device from cpu to cuda allowed the GPU to be used much more effectively, giving the biggest improvement in performance;
- overall, the pre-compiled method did not show improvements, indicating that the optimisation under the hood was already well suited to the already-optimised ResNet18. For more custom models this would potentially not be the case.

### Part 2
For 5 iterations:
- Naive SDPA time (cpu): 0.033738 s
- Fused SDPA time (cpu): 0.020238 s
- Naive SDPA time (cuda): 0.000379 s
- Fused SDPA time (cuda): 0.000047 s

For 20 iterations:
- Naive SDPA time (cpu): 0.026931 s
- Fused SDPA time (cpu): 0.020990 s
- Naive SDPA time (cuda): 0.000140 s
- Fused SDPA time (cuda): 0.000022 s

For 100 iterations:
- Naive SDPA time (cpu): 0.025309 s
- Fused SDPA time (cpu): 0.020964 s
- Naive SDPA time (cuda): 0.000202 s
- Fused SDPA time (cuda): 0.000044 s

The fused SDPA kernel outperformed the naive implementation on every device and case, most significantly on CUDA. On CPU, the fused version showed around a 20-25% speed-up, whilst on GPU it was 500-10000% faster than the naive version. This shows that kernel fusion on the CPU was still limited by memory bandwidth and the CPU's lack of parallelism, whilst the GPU case could take full advantage of fused kernels, the reduced memory overhead removing the memory bottleneck.
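The two implementations being compared can be sketched as follows (the naive version materialises the full attention matrix, which is the memory traffic the fused kernel avoids):

```python
import math
import torch
import torch.nn.functional as F

def naive_sdpa(q, k, v):
    # Builds the full (seq x seq) score matrix in memory before the softmax.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 16, 32)            # (batch, heads, seq, head_dim)
fused = F.scaled_dot_product_attention(q, k, v)  # dispatches to a fused kernel
```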

### Part 3:
Answer 3a: MXINT8 benefits custom hardware if both activations and weights are quantised in the same way, for a few reasons:
- Consistent data types mean that fusing multiple operations, e.g. a matrix multiplication followed by an activation layer, can be done more easily without extra data-type conversions that would otherwise require additional memory reads/writes. If all data types are consistent, many layers can be fused into one highly optimised kernel, which is beneficial for larger GPU-based models.
- Using MXINT8 for both also means no intermediate tensors are stored in a wider format, reducing memory requirements. This is beneficial since memory bandwidth is often the limiting factor in custom hardware.
- There are also dedicated hardware units optimised for particular data types, such as INT8 multiply-accumulate (MAC) units. Consistency allows these units to be used and optimised more effectively.
- Lastly, more efficient processing means lower overall power consumption, reducing costs, hardware wear and maintenance, and increasing working life.

Answer 3b: The role of the dont_need_abs variable and the bias variable is to correctly scale the 'value' part of the number, so that when the exponent is applied the result represents the desired number. This is required because MXINT has no implicit leading bit for the mantissa. For example, if a number with a 7-bit mantissa magnitude (bit count starting from 0) is SXXX_XXXX, the left-most X is bit 6, and this bit determines whether the magnitude lies in the range 0-63 or 64-127. As in scientific notation, if the magnitude is smaller than 64 it still carries some scaling information that should already be held by the exponent. The magnitude therefore has to be normalised into the correct range, so that its only job is to represent the value, with the exponent handling the scaling. If the extra scaling information (from too small a magnitude) were retained and the exponent then applied, the result would be off by some power of two. To implement this, bit 6 is checked: if it is 1, nothing is changed (dont_need_abs = True); if it is 0, the mantissa values need shifting left until bit 6 is 1, achieved by adding a corrective bias. This bias is calculated as bias_variable = 2^(exp - 127) * C, where exp is the original MXINT8 exponent value for that group.
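The bit-6 check described above can be sketched in a few lines (hypothetical helper names, standing in for the lab's kernel logic):

```python
def dont_need_abs(mantissa_mag: int) -> bool:
    """True when bit 6 of the 7-bit mantissa magnitude is set, i.e. the
    magnitude is already normalised into [64, 127] and needs no correction."""
    return (mantissa_mag & 0b100_0000) != 0

def bias_variable(exp: int, c: float) -> float:
    """Corrective bias for under-normalised magnitudes: 2^(exp - 127) * C."""
    return 2.0 ** (exp - 127) * c
```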

Answer 3c: The cta_tiler partitions the data in a memory-efficient way. It allows smaller sub-tensors to be extracted from a large input tensor, processed in separate units (such as threads, warps and thread blocks), and then recombined correctly into a new large tensor with the indexing structure retained. This allows for more efficient parallelism and more efficient, more reliable memory/cache access.

The cta_tiler dynamically adjusts the block sizes sent for processing based on the incoming group size. The larger the group size, the larger BLK_M and the smaller BLK_K, changing the sub-tensor shape. This helps maintain a consistent computational workload per tile, since roughly the same amount of data is sent to each thread in the Cooperative Thread Arrays (CTAs), and it is how memory requirements stay consistent and stable during processing.

layout_sX partitions the threads in a threadblock by applying the make_layout command from CuTe. It takes a shape (e.g. an 8-by-8 matrix) and creates a layout object, which is essential for mapping from coordinate space(s) to an index space. This is what ensures the larger tensor can be reconstructed correctly after tiling. The layout object, like cta_tiler, dynamically changes the shape of the sub-tensor based on the input group size (since the shape passed to make_layout is also BLK_M by BLK_K), maintaining a consistent amount of data per tile and therefore per processing thread.

Answer 3d:
There are a few reasons why the saved GPU memory is not exactly (32 - (4 + 8/32))/32 = 86.7%. The first is that the code actually uses MXINT8, not MXINT4, so the equation above is slightly incorrect: the correct theoretical saving should be (32 - (8 + 8/32))/32 = 74.2%, since the mantissa is 8 bits, not 4. Even with this correction, the theoretical saving is not reached, because quantisation is only applied to certain layer types. The following code block shows that only linear layers are quantised; any others (such as activation layers) are left in full precision.

```python
for layer_name, layer in model.named_modules():
    if not isinstance(layer, torch.nn.Linear):
        continue
    if "classifier" in layer_name:
        continue
    layer.cuda()
    layer_q = QLinearPacked.build_from_linear(layer, group_size=mxint8_group_size)
    set_layer_by_name(model, layer_name, layer_q)
    del layer
    torch.cuda.empty_cache()
```

Therefore, even though the quantised weights take up far less space, the overall memory saving is reduced to about 66.4% instead of the upper theoretical bound of 74.2%, because non-quantised layers are also held on the GPU.
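A quick arithmetic check of the corrected theoretical bound:

```python
# Effective bits per weight for MXINT8 with a group size of 32:
# an 8-bit mantissa plus one shared 8-bit exponent amortised over the group.
bits_fp32 = 32
bits_mxint8 = 8 + 8 / 32                       # = 8.25 bits per weight
saving = (bits_fp32 - bits_mxint8) / bits_fp32
print(f"{saving:.1%}")                         # theoretical upper bound, 74.2%
```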
Binary file added Working_Data/adls_labs_at5424.zip
Binary file added Working_Data/combined_optuna_results.png
Binary file added Working_Data/compression_aware_results.png
1 change: 1 addition & 0 deletions Working_Data/edited-colab-files/lab4-software.ipynb


1 change: 1 addition & 0 deletions Working_Data/edited-colab-files/tutorial_3_qat.ipynb


1 change: 1 addition & 0 deletions Working_Data/edited-colab-files/tutorial_4_pruning.ipynb


Binary file added Working_Data/extended_optuna_search_results.png
Binary file added Working_Data/fixed_point_width_vs_accuracy.png
Binary file added Working_Data/highest_accuracy_by_sparsity.png
Binary file added Working_Data/mase_graph_not_removed.pdf
Binary file added Working_Data/mase_graph_removed.pdf
Binary file added Working_Data/optuna_search_results.png