NUS Hackers x QRT - Intro to ML Workshop Notes
Today I attended the NUS Hackers x QRT workshop: Introduction to Machine Learning.
I decided to share what I learned through my notes. It was a really good experience for both learning and networking.
I thought there would be pizza, but there wasn't. Sorry NF. QAQ
Data
Types of data
- Structured Data: Key $\rightarrow$ Value, Relational DB, Documents, Graph DB, Vectors, etc.
- Semi-Structured Data: JSON, XML
- Unstructured Data: Videos, Images
Data Cleaning
- Missing Values
- Duplicates
- Outliers
- Imbalanced Dataset
- Oversampling (minority)
- Undersampling (majority)
- Hybrid
- Data Standardization
- Z-score; note that standardizing two datasets separately may break the relationship between them
- Data Normalization
- Map all values into a fixed interval, e.g. $[0, 1]$ (see the sketch below)
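A minimal sketch of both rescalings with scikit-learn (the toy feature matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy feature matrix: two columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: zero mean, unit variance per column (z-score).
X_std = StandardScaler().fit_transform(X)

# Normalization: map each column into [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_norm)
```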
Data Split
- Train/Validation/Test datasets with proper ratio
- Cross-validation: K-fold, stratified K-fold, leave-one-out
- Stratified: Split the data following the original distribution
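For example, a stratified K-fold split with scikit-learn (the imbalanced toy labels are made up); each fold preserves the original class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(40).reshape(20, 2)        # 20 samples, 2 features
y = np.array([0] * 14 + [1] * 6)        # imbalanced labels (70% / 30%)

# Each fold keeps roughly the original 70/30 class distribution.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    print("train:", train_idx, "val:", val_idx)
```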
Data Augmentation
- Modify the original data, mostly in CV (e.g. flipping images) and NLP (e.g. back-translation)
- Increase robustness
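For instance, a simple flip-and-rotate pipeline with torchvision (the specific transforms and parameters are just illustrative; assumes a recent torchvision that accepts tensor images):

```python
import torch
from torchvision import transforms

# Randomly flip and slightly rotate images to increase robustness.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
])

img = torch.rand(3, 32, 32)   # fake RGB image tensor
augmented = augment(img)
print(augmented.shape)        # torch.Size([3, 32, 32])
```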
Data Loading
- Batch processing at every stage
- Asynchronous execution
- Prefetching queue hides latency (buffering)
- Longer queue $\rightarrow$ bigger memory footprint
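A PyTorch sketch of this: worker processes load batches asynchronously while the main loop trains, and `prefetch_factor` controls the queue length (and hence the memory footprint):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

    # num_workers > 0 loads batches asynchronously in background processes;
    # each worker keeps prefetch_factor batches queued (longer queue = more memory).
    loader = DataLoader(dataset, batch_size=32, shuffle=True,
                        num_workers=2, prefetch_factor=4, pin_memory=True)

    for xb, yb in loader:
        pass  # training step would go here

if __name__ == "__main__":
    main()  # guard needed because workers run as subprocesses
```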
Models
Model Layers
- Embedding Layers
- Function: convert categorical data into vectors
- Usage: Often used in NLP
- Dense Layers (fully connected)
- Function: A complete bipartite graph between inputs and outputs; lets the network learn complex representations by integrating all input features
- Usage: Final stages of networks, e.g. regression or classification heads
- Conv. Layer
- Function: Apply convolutions to the data, capturing spatial hierarchies by learning local patterns
- Usage: Image recognition
- Pooling Layers
- Function: Similar to a conv. layer: selects or computes a value within a pooling window, retaining the most important features
- Usage: Reduce computation in CNNs
- Recurrent Layers
- Function: Process sequential data by maintaining a state that carries information across time steps
- Usage: Often used for time series or sequence data, e.g. language models and speech recognition
- Downside: The model may forget information from earlier time steps
- Attention Layer
- Function: Improves on recurrent layers. Computes query-key inner products so the model can weight the importance of each element within the same input sequence, capturing long-range dependencies
- Usage: NLP, Transformers; models relationships between words irrespective of their position
- Reference: Attention Is All You Need
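A minimal PyTorch sketch instantiating each of the layer types above (all dimensions are made-up toy numbers):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=1000, embedding_dim=64)  # categorical -> vector
dense     = nn.Linear(64, 10)                                    # fully connected
conv      = nn.Conv2d(3, 16, kernel_size=3, padding=1)           # local spatial patterns
pool      = nn.MaxPool2d(kernel_size=2)                          # keep strongest activation
rnn       = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)  # state across steps
attn      = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

tokens = torch.randint(0, 1000, (2, 8))      # batch of 2 sequences, 8 tokens each
x = embedding(tokens)                        # (2, 8, 64)
out, _ = attn(x, x, x)                       # self-attention: query = key = value = x
print(dense(out).shape)                      # torch.Size([2, 8, 10])
```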
Model Architecture
- Dense vs. Sparse (trading computation cost against performance)
- Different ways of storing parameters, analogous to the trade-off between an adjacency matrix and an adjacency list
- Mixture of Experts (DeepSeek-R1?)
- Multiple sub-models, each an expert at one task; a router dispatches each query to the right expert (sketched below)
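A toy mixture-of-experts router in PyTorch (a deliberately simplified sketch, not how DeepSeek actually implements it):

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=32, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)   # scores each expert per input

    def forward(self, x):
        weights = torch.softmax(self.router(x), dim=-1)           # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], -1)   # (batch, dim, n_experts)
        return (outputs * weights.unsqueeze(1)).sum(-1)           # weighted mix

x = torch.randn(8, 32)
print(ToyMoE()(x).shape)  # torch.Size([8, 32])
```

Real MoE layers typically route each token to only the top-k experts so that most experts stay idle, which is where the compute savings come from.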
Loss Function
- Regression losses
- Mean Square Error (MSE)
- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
- $R^2$ value: $1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}$
- L1, L2 Regularization
- To avoid overfitting.
- Penalize large coefficient values.
- L1: Least Absolute Shrinkage and Selection Operator (LASSO)
- L2: Ridge Regression
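All of these metrics and the regularized regressors are one-liners in scikit-learn (toy data below):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import Lasso, Ridge

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mse))
print("R^2 :", r2_score(y_true, y_pred))

# Regularized linear regression on synthetic data.
X = np.random.randn(100, 5)
y = X @ np.array([1.0, 0.0, 0.0, 2.0, -1.0]) + 0.1 * np.random.randn(100)
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives some coefficients to exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)   # L2: shrinks coefficients toward 0
print("LASSO coefs:", lasso.coef_)
print("Ridge coefs:", ridge.coef_)
```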
Evaluation
- Accuracy; Precision (how many retrieved items are relevant); Recall (sensitivity: how many relevant items are retrieved)
- Confusion Matrix
| Actual vs. Predicted | Positive | Negative |
|---|---|---|
| Positive | True Positive (TP) | False Negative (FN), Type II Error |
| Negative | False Positive (FP), Type I Error | True Negative (TN) |
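scikit-learn computes all of these from label vectors (the predictions below are made up; note that `confusion_matrix` orders labels ascending, so it prints `[[TN, FP], [FN, TP]]`):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
```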
Training
GPU Training
GPUs provide many cores to perform calculations in parallel, supporting massive parallelism far better than a CPU.
- CUDA Core
- Tensor Core
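Moving work onto the GPU in PyTorch is just a device transfer (assuming CUDA is available; this falls back to CPU otherwise):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(512, 512).to(device)   # parameters live on the GPU
x = torch.randn(64, 512, device=device)        # data created directly on the GPU
y = model(x)                                   # matmul runs on CUDA/Tensor cores
print(y.device)
```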
Parallelism
- Data Parallelism (replicating the same model across GPUs, each processing a different slice of the data)
- Frameworks support: PyTorch DDP, TensorFlow distribution strategies, Megatron, DeepSpeed
- Tensor Parallelism (splitting the model itself)
- Sharding large matrices across devices
- Pipeline Parallelism
- Micro-batch processing
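A skeletal data-parallel setup with PyTorch DDP (a sketch, not a full training script; it must be launched with `torchrun`, which sets the rank and world-size environment variables):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Run with: torchrun --nproc_per_node=<num_gpus> this_script.py
dist.init_process_group(backend="nccl")       # reads rank/world size from torchrun's env vars
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Linear(128, 128).cuda(rank)
ddp_model = DDP(model, device_ids=[rank])     # replicates the model, syncs gradients

x = torch.randn(32, 128).cuda(rank)
loss = ddp_model(x).sum()
loss.backward()                               # gradients all-reduced across GPUs
dist.destroy_process_group()
```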
Fine-tuning
For most people, fine-tuning means adapting a large pretrained model (built on big companies' data and compute) to our own smaller dataset. Lower cost!
- Transfer Learning
- Few-shot Learning
- One-shot Learning
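A common transfer-learning recipe in PyTorch: freeze a pretrained backbone and train only a new head (using torchvision's ResNet-18 and a hypothetical 10-class task as the example):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained on ImageNet

# Freeze the backbone: only the new head will be updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for our own 10-class task.
model.fc = nn.Linear(model.fc.in_features, 10)

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```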
Profiling
- Performance monitoring: Memory Usage, Runtime
- Bottleneck Identification: CPU, GPU, I/O bottlenecks
- Optimization Strategies: Based on profiling results
- Tools: TensorBoard Profiler, PyTorch Profiler, NVIDIA Nsight
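A short PyTorch Profiler run (the toy model is a stand-in; results can also be exported for TensorBoard):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
x = torch.randn(64, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# Top operators by CPU time, i.e. where the bottleneck is.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```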
Inference
Model compression
- Quantization
- Integer Quantization
- Float16/BFloat16 Precision
- Maps values from one interval onto a smaller one, compressing the model size
- GPUs compute on lower-precision data faster
- Knowledge Distillation
- Teacher-Student Framework
- Distillation loss on soft labels
- Low-rank Factorization
- SVD-based factorization
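As a taste of quantization, here is dynamic int8 quantization of the linear layers in PyTorch (the layer sizes are toy numbers):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Weights stored as int8; activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface, smaller and faster on CPU
```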
Application
- Computer Vision
- Image Classification
- Object Detection
- Re-Identification
- Image Segmentation (Medical)
- Image Search
- Natural Language Processing
- Sentiment Analysis
- Spam Email Detection
- Tabular Data Query
- Large Language Models
- Stock Price Prediction
- Fraud Detection in Finance
OSEMN
Obtain $\rightarrow$ Scrub $\rightarrow$ Explore $\rightarrow$ Model $\rightarrow$ Interpret