Data Science for Electron Microscopy
Lecture 4: CNN Classification & Self-supervised Learning

Philipp Pelz

FAU Erlangen-Nürnberg

What is an Autoencoder?

  • Neural network architecture that learns to:
    • Compress (encode) data into a lower-dimensional representation
    • Reconstruct (decode) the original data from this representation
  • Trained to minimize reconstruction error
  • Learns efficient data representations in an unsupervised manner (no labels required)

Basic Autoencoder Architecture

Autoencoder Components

  • Encoder: Compresses input into latent representation
  • Latent Space: Compressed representation of the data
  • Decoder: Reconstructs input from latent representation
  • Training objective: minimize difference between input and output
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),  # [B, 1, 28, 28] -> [B, 16, 14, 14]
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), # [B, 16, 14, 14] -> [B, 32, 7, 7]
            nn.ReLU(),
            nn.Conv2d(32, 64, 7)                       # [B, 32, 7, 7] -> [B, 64, 1, 1]
        )
        
        # Decoder
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 7),              # [B, 64, 1, 1] -> [B, 32, 7, 7]
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),  # [B, 32, 7, 7] -> [B, 16, 14, 14]
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),   # [B, 16, 14, 14] -> [B, 1, 28, 28]
            nn.Sigmoid()
        )
    
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

Training an Autoencoder

def train_autoencoder(model, train_loader, num_epochs=10):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(num_epochs):
        for data in train_loader:
            img = data[0].to(device)
            
            # Forward pass
            output = model(img)
            loss = criterion(output, img)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Applications of Autoencoders

  • Dimensionality Reduction
    • Alternative to PCA
    • Can capture non-linear relationships
  • Denoising
    • Train to reconstruct clean data from noisy input
    • Useful for image restoration
  • Feature Learning
    • Learn meaningful representations for downstream tasks
    • Transfer learning

Variations of Autoencoders

  • Denoising Autoencoders
    • Add noise to input during training
    • Learn to recover original data
  • Variational Autoencoders (VAE)
    • Learn probabilistic encodings
    • Generate new samples
  • Sparse Autoencoders
    • Add sparsity constraints to latent representation
    • Learn more efficient encodings (see the sketch below)
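
As a minimal sketch of the sparse variant (assuming the ConvAutoencoder defined earlier; sparsity_weight is an illustrative hyperparameter, not a value from this lecture), an L1 penalty on the latent activations can be added to the reconstruction loss:

import torch
import torch.nn as nn

def sparse_ae_loss(model, img, sparsity_weight=1e-3):
    # Encode, decode, and measure reconstruction error
    z = model.encoder(img)
    recon = model.decoder(z)
    recon_loss = nn.functional.mse_loss(recon, img)
    # L1 penalty drives most latent activations toward zero
    sparsity_loss = z.abs().mean()
    return recon_loss + sparsity_weight * sparsity_loss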

Example: Denoising Autoencoder

def add_noise(img, noise_factor=0.3):
    noisy = img + noise_factor * torch.randn_like(img)  # noise on the same device as img
    return torch.clamp(noisy, 0., 1.)

def train_denoising_autoencoder(model, train_loader, num_epochs=10):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(num_epochs):
        for data in train_loader:
            img = data[0].to(device)
            noisy_img = add_noise(img)
            
            # Forward pass
            output = model(noisy_img)
            loss = criterion(output, img)  # Compare with clean image
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Practical Tips for Autoencoders

  • Choose appropriate architecture for your data type
    • CNNs for images
    • RNNs for sequences
    • Dense layers for tabular data
  • Consider:
    • Latent space dimension
    • Depth of encoder/decoder
    • Loss function
    • Regularization techniques
  • Common issues:
    • Overfitting
    • Underfitting
    • Posterior collapse (in VAEs): the decoder learns to ignore the latent code
    • Reconstruction quality vs. compression trade-off

Variational Autoencoders (VAEs)

  • Extension of traditional autoencoders that learns a probabilistic latent representation
  • Instead of encoding to fixed points, encodes to probability distributions
  • Enables:
    • Principled generation of new samples
    • Meaningful latent space interpolation
    • Better regularization of the latent space

VAE vs. Traditional Autoencoder

Traditional Autoencoder

  • Deterministic encoding
  • Point-wise latent representation
  • No guarantee of continuous latent space
  • Focus on reconstruction

Variational Autoencoder

  • Probabilistic encoding
  • Distribution-based latent representation
  • Continuous, structured latent space
  • Balance between reconstruction and regularization

VAE Mathematics

Instead of encoding input \(x\) to a point, VAE encodes to parameters of a distribution:

  • Encoder outputs \(\mu\) and \(\log \sigma^2\) for each latent dimension
  • Latent vector is sampled: \(z = \mu + \sigma \odot \epsilon\), where \(\epsilon \sim \mathcal{N}(0, I)\)

The VAE loss has two terms: \[\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{reconstruction}} + \beta \cdot \mathcal{L}_{\text{KL}}\]

where: \[\mathcal{L}_{\text{KL}} = \frac{1}{2}\sum_{j=1}^J (\mu_j^2 + \sigma_j^2 - \log(\sigma_j^2) - 1)\]
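
To connect the formula to code, the closed-form KL term can be checked against torch.distributions; this is a minimal sketch with arbitrary shapes:

import torch
from torch.distributions import Normal, kl_divergence

mu = torch.randn(4, 32)               # batch of 4, J = 32 latent dimensions
log_var = torch.randn(4, 32)
sigma = torch.exp(0.5 * log_var)

# Closed-form KL between N(mu, sigma^2) and N(0, I), summed over dimensions
kl_closed = 0.5 * (mu.pow(2) + sigma.pow(2) - log_var - 1).sum(dim=1)

# The same quantity via torch.distributions
kl_lib = kl_divergence(Normal(mu, sigma), Normal(0.0, 1.0)).sum(dim=1)
print(torch.allclose(kl_closed, kl_lib, atol=1e-5))  # expect True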

VAE Implementation

class ConvVAE(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.latent_dim = latent_dim  # stored so samplers can query it later
        
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1),  # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256)
        )
        
        # Latent space
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_var = nn.Linear(256, latent_dim)
        
        # Decoder
        self.decoder_input = nn.Linear(latent_dim, 64 * 7 * 7)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid()
        )
        
    def encode(self, x):
        x = self.encoder(x)
        mu = self.fc_mu(x)
        log_var = self.fc_var(x)
        return mu, log_var
    
    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std
    
    def decode(self, z):
        x = self.decoder_input(z)
        x = self.decoder(x)
        return x
    
    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decode(z), mu, log_var

Training a VAE

import torch.nn.functional as F

def vae_loss(recon_x, x, mu, log_var, beta=1.0):
    # Reconstruction loss (binary cross entropy)
    BCE = F.binary_cross_entropy(recon_x, x, reduction='sum')
    
    # KL divergence loss
    KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    
    return BCE + beta * KLD

def train_vae(model, train_loader, num_epochs=10):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(num_epochs):
        for data in train_loader:
            img = data[0].to(device)
            
            # Forward pass
            recon_batch, mu, log_var = model(img)
            loss = vae_loss(recon_batch, img, mu, log_var)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

VAE Latent Space Properties

  • Continuous: Similar points in latent space decode to similar images
  • Structured: Enforced by KL divergence term
  • Meaningful: Can perform interpolation and arithmetic in latent space

VAE Latent Space Visualization

Figure: samples decoded from points throughout the latent space

Generating New Samples with VAE

def generate_samples(model, num_samples=1):
    model.eval()
    device = next(model.parameters()).device  # use the model's own device
    with torch.no_grad():
        # Sample from the standard normal prior
        z = torch.randn(num_samples, model.latent_dim).to(device)
        # Decode the samples
        samples = model.decode(z)
    return samples

def interpolate(model, img1, img2, steps=10):
    model.eval()
    with torch.no_grad():
        # Encode both images (use the posterior means)
        mu1, _ = model.encode(img1)
        mu2, _ = model.encode(img2)
        
        # Interpolation points from img1 (alpha=0) to img2 (alpha=1)
        alphas = torch.linspace(0, 1, steps)
        interpolated = []
        
        for alpha in alphas:
            z = (1 - alpha) * mu1 + alpha * mu2
            interpolated.append(model.decode(z))
            
    return interpolated

Key Differences Summary

  1. Latent Space
    • Vanilla: Unstructured, potentially discontinuous
    • VAE: Continuous, probabilistic
  2. Loss Function
    • Vanilla: Only reconstruction loss
    • VAE: Reconstruction + KL divergence loss
  3. Generation Capabilities
    • Vanilla: Limited/unreliable
    • VAE: Principled generation of new samples
  4. Training Stability
    • Vanilla: Can be unstable
    • VAE: More stable due to regularization

Example: Training a Denoising VAE

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt


def add_noise(x, noise_factor=0.3):
    noisy = x + noise_factor * torch.randn_like(x)
    return torch.clamp(noisy, 0., 1.)

def train_epoch(model, dataloader, optimizer, device, noise_factor=0.3):
    model.train()
    train_loss = 0
    
    for batch_idx, (data, _) in enumerate(dataloader):
        data = data.to(device)
        noisy_data = add_noise(data, noise_factor)
        
        optimizer.zero_grad()
        
        recon_batch, mu, log_var = model(noisy_data)
        
        # Reconstruction loss
        recon_loss = F.binary_cross_entropy(recon_batch, data, reduction='sum')
        
        # KL divergence loss
        kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        
        # Total loss
        loss = recon_loss + kl_loss
        
        loss.backward()
        train_loss += loss.item()
        optimizer.step()
        
        if batch_idx % 100 == 0:
            print(f'Batch [{batch_idx}/{len(dataloader)}]: Loss = {loss.item()/len(data):.4f}')
    
    return train_loss / len(dataloader.dataset)

def visualize_results(model, test_loader, device, noise_factor=0.3):
    model.eval()
    with torch.no_grad():
        data = next(iter(test_loader))[0][:8].to(device)
        noisy_data = add_noise(data, noise_factor)
        recon_data, _, _ = model(noisy_data)
        
        # Plot results
        plt.figure(figsize=(12, 4))
        for i in range(8):
            # Original
            plt.subplot(3, 8, i + 1)
            plt.imshow(data[i][0].cpu(), cmap='gray')
            plt.axis('off')
            if i == 0:
                plt.title('Original')
                
            # Noisy
            plt.subplot(3, 8, i + 9)
            plt.imshow(noisy_data[i][0].cpu(), cmap='gray')
            plt.axis('off')
            if i == 0:
                plt.title('Noisy')
                
            # Reconstructed
            plt.subplot(3, 8, i + 17)
            plt.imshow(recon_data[i][0].cpu(), cmap='gray')
            plt.axis('off')
            if i == 0:
                plt.title('Reconstructed')
        
        plt.tight_layout()
        plt.savefig('vae_results.png')
        plt.close()

def main():
    # Parameters
    batch_size = 128
    epochs = 0 # increase to actually train it
    latent_dim = 32
    learning_rate = 1e-3
    noise_factor = 0.3
    
    # Device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    
    # Data loading
    transform = transforms.Compose([
        transforms.ToTensor()
    ])
    
    train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST('./data', train=False, transform=transform)
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    # Model setup
    model = ConvVAE(latent_dim=latent_dim).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    # Training loop
    for epoch in range(1, epochs + 1):
        print(f"\nEpoch {epoch}/{epochs}")
        train_loss = train_epoch(model, train_loader, optimizer, device, noise_factor)
        print(f'====> Epoch: {epoch} Average loss: {train_loss:.4f}')
        
        # Visualize results every few epochs
        if epoch % 2 == 0:
            visualize_results(model, test_loader, device, noise_factor)
    
    # Save model
    torch.save(model.state_dict(), 'denoising_vae.pth')
    print("Training completed and model saved!")

if __name__ == "__main__":
    main()
Using device: cuda
Training completed and model saved!

VAE vs. Traditional Autoencoder

Traditional Autoencoder

Vanilla AE Latent Space Visualization

Variational Autoencoder

VAE Latent Space Visualization

Classification Introduction

  • From linear regression to classification
  • Moving from “how much?” to “which category?” questions
  • Examples:
    • Spam vs. inbox classification
    • Customer subscription prediction
    • Image classification (donkey, dog, cat, rooster)
    • Movie recommendation
    • Book section prediction

Classification Problem Types

Hard vs. Soft Classification

  • Hard Classification: Direct assignment to categories
  • Soft Classification: Probability assessment for each category
  • The distinction is often blurred in practice: even hard assignments are typically derived from soft probabilities

Multi-label Classification

  • Multiple labels can be true simultaneously
  • Example: News article covering entertainment, business, and space flight
  • Not mutually exclusive categories
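
A common multi-label setup (a hedged sketch, not code from this lecture) replaces softmax with independent per-label sigmoids and nn.BCEWithLogitsLoss:

import torch
import torch.nn as nn

num_features, num_labels = 784, 3          # e.g. entertainment, business, space flight
model = nn.Linear(num_features, num_labels)
criterion = nn.BCEWithLogitsLoss()          # sigmoid + BCE, applied per label

x = torch.randn(8, num_features)                    # toy batch
y = torch.randint(0, 2, (8, num_labels)).float()    # labels are NOT mutually exclusive

logits = model(x)
loss = criterion(logits, y)
probs = torch.sigmoid(logits)   # per-label probabilities; need not sum to 1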

Simple Image Classification Example

Problem Setup

  • Input: \(2 \times 2\) grayscale image
  • Features: \(x_1, x_2, x_3, x_4\) (pixel values)
  • Categories: “cat”, “chicken”, “dog”

Label Representation

Two approaches:

  1. Integer encoding: \(y \in \{1, 2, 3\}\)
  2. One-hot encoding:
    • “cat”: \((1, 0, 0)\)
    • “chicken”: \((0, 1, 0)\)
    • “dog”: \((0, 0, 1)\)
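
Both encodings are easy to produce in PyTorch; note that PyTorch uses 0-based class indices (minimal sketch):

import torch
import torch.nn.functional as F

y = torch.tensor([0, 1, 2])             # integer encoding: cat=0, chicken=1, dog=2
y_onehot = F.one_hot(y, num_classes=3)  # rows are the one-hot vectors above
print(y_onehot)
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1]])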

Linear Model for Classification

Model Structure

For 4 features and 3 categories:

\[ \begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1\\ o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2\\ o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3 \end{aligned} \]

Vectorized Form

\(\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}\)

  • \(\mathbf{W} \in \mathbb{R}^{3 \times 4}\): weight matrix
  • \(\mathbf{b} \in \mathbb{R}^3\): bias vector
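
In code, for a single \(2 \times 2\) image this is one matrix-vector product (sketch with random weights):

import torch

W = torch.randn(3, 4)   # 3 categories x 4 features
b = torch.randn(3)
x = torch.rand(4)       # the four pixel values

o = W @ x + b           # logits, shape (3,)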

The Softmax Function

Why Softmax?

Problems with direct regression:

  • No guarantee outputs sum to 1
  • No guarantee of non-negative outputs
  • No upper bound: outputs can exceed 1, so they cannot be read as probabilities

Softmax Definition

\[\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \textrm{where}\quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}\]

Key properties:

  • Outputs are non-negative
  • Sum to 1
  • Preserves ordering: \(\operatorname*{argmax}_j \hat y_j = \operatorname*{argmax}_j o_j\)
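
These properties are easy to verify numerically (minimal sketch):

import torch

o = torch.tensor([2.0, -1.0, 0.5])
y_hat = torch.softmax(o, dim=0)

print(y_hat)                        # all entries non-negative
print(y_hat.sum())                  # tensor(1.)
print(o.argmax(), y_hat.argmax())   # same index: ordering is preserved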

Vectorization for Efficiency

Batch Processing

For minibatch \(\mathbf{X} \in \mathbb{R}^{n \times d}\):

\[ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}\\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}) \end{aligned} \]

  • The matrix-matrix product \(\mathbf{X} \mathbf{W}\) is the dominant operation
  • Softmax is computed rowwise
  • Numerical stability requires care (a stable implementation is sketched later in this lecture)
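
A minibatch forward pass in code (sketch with random data; shapes follow the formulas above):

import torch

n, d, q = 256, 784, 10             # batch size, features, classes
X = torch.rand(n, d)
W = torch.randn(d, q) * 0.01
b = torch.zeros(q)

O = X @ W + b                      # one matrix-matrix product dominates the cost
Y_hat = torch.softmax(O, dim=1)    # softmax applied rowwise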

Loss Function: Cross-Entropy

Log-Likelihood

For dataset with features \(\mathbf{X}\) and one-hot labels \(\mathbf{Y}\):

\[ P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \]

Cross-Entropy Loss

\[l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j\]

Properties:

  • Bounded below by 0
  • Zero only with perfect prediction
  • Never actually reaches zero for finite weights
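
A quick numeric check of these properties (sketch):

import torch

y = torch.tensor([0.0, 1.0, 0.0])        # one-hot label, e.g. "chicken"
y_hat = torch.tensor([0.1, 0.8, 0.1])    # predicted probabilities
loss = -(y * torch.log(y_hat)).sum()
print(loss)  # tensor(0.2231) = -log(0.8); approaches 0 only as y_hat[1] -> 1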

The Softmax Operation

Why Implement from Scratch?

  • Fundamental understanding of softmax regression
  • Builds on linear regression components
  • Essential for deep learning foundations

Mathematical Definition

The softmax function transforms input values into probabilities:

\[\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}\]

Where:

  • \(\mathbf{X}\) is the input matrix
  • \(i,j\) are matrix indices
  • The denominator is called the partition function (its logarithm is the log-partition function)

Implementation

from d2l import torch as d2l
import torch

def softmax(X):
    X_exp = torch.exp(X)
    partition = X_exp.sum(1, keepdim=True)
    return X_exp / partition  # Broadcasting applied here

Important Note

The implementation above is not robust against very large or very small arguments: exp can overflow or underflow in floating point. Deep learning frameworks guard against this internally.
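
A minimal sketch of the standard fix: subtracting the rowwise maximum before exponentiating leaves the result mathematically unchanged but keeps exp in a safe range.

import torch

def stable_softmax(X):
    # softmax is invariant to subtracting a constant from each row
    X_shifted = X - X.max(dim=1, keepdim=True).values
    X_exp = torch.exp(X_shifted)
    return X_exp / X_exp.sum(dim=1, keepdim=True)

X = torch.tensor([[100.0, 0.0], [-200.0, -100.0]])
print(stable_softmax(X))  # finite and well-behaved
# The naive version overflows: torch.exp(torch.tensor(100.0)) is inf in float32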

The Model Architecture

Key Components

  • Input: Flattened \(28 \times 28\) pixel images (784-dimensional vectors)
  • Output: 10 classes (Fashion-MNIST dataset)
  • Weights: \(784 \times 10\) matrix
  • Biases: \(1 \times 10\) vector

Model Implementation

import torch.nn as nn

class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        # Initialize weights and biases as nn.Parameter
        self.W = nn.Parameter(torch.normal(0, sigma, size=(num_inputs, num_outputs)))
        self.b = nn.Parameter(torch.zeros(num_outputs))
    
    def parameters(self):
        return [self.W, self.b]

Forward Pass

@d2l.add_to_class(SoftmaxRegressionScratch)
def forward(self, X):
    X = X.reshape((-1, self.W.shape[0]))  # flatten images to (batch, 784)
    return softmax(torch.matmul(X, self.W) + self.b)

Cross-Entropy Loss

Mathematical Definition

The cross-entropy loss is defined as:

\[L(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k y_{ij} \log(\hat{y}_{ij})\]

Where:

  • \(\mathbf{y}\) is the true label (one-hot encoded)
  • \(\hat{\mathbf{y}}\) is the predicted probability
  • \(n\) is the batch size
  • \(k\) is the number of classes

Implementation

def cross_entropy(y_hat, y):
    # Pick out the predicted probability of the true class for each example
    return -torch.log(y_hat[list(range(len(y_hat))), y]).mean()

Training Process

Hyperparameters

  • Number of epochs: 10
  • Batch size: 256
  • Learning rate: 0.1

Training Code

data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
# trainer.fit(model, data)

Model Evaluation

Prediction

X, y = next(iter(data.val_dataloader()))
preds = model(X).argmax(axis=1)

Error Analysis

We focus on misclassified examples to understand model weaknesses.
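
A minimal sketch of pulling out the misclassified examples, reusing X, y, and preds from the prediction step above:

# Boolean mask of wrong predictions on the validation batch
wrong = preds.type(y.dtype) != y
X_wrong, y_wrong, preds_wrong = X[wrong], y[wrong], preds[wrong]
print(f'{wrong.sum().item()} of {len(y)} examples misclassified')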

Summary

  • Implemented softmax regression from scratch
  • Used cross-entropy loss for classification
  • Trained on Fashion-MNIST dataset
  • Achieved reasonable classification performance

Exercises

  1. Numerical Stability
    • Test softmax with an input value of 100
    • Test with all inputs smaller than -100
    • Implement a numerically stable fix
  2. Cross-Entropy Implementation
    • Implement alternative cross-entropy function
    • Analyze performance differences
    • Consider domain of logarithm
