Data Science for Electron Microscopy
Lecture 4: CNN Classification & Self-supervised Learning

Philipp Pelz

FAU Erlangen-Nürnberg

What is an Autoencoder?

  • Neural network architecture that learns to:
    • Compress (encode) data into a lower-dimensional representation
    • Reconstruct (decode) the original data from this representation
  • Trained to minimize reconstruction error
  • Learns efficient data representations in an unsupervised manner (no labels required)

Basic Autoencoder Architecture

Autoencoder Components

  • Encoder: Compresses input into latent representation
  • Latent Space: Compressed representation of the data
  • Decoder: Reconstructs input from latent representation
  • Training objective: minimize difference between input and output
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),  # [B, 1, 28, 28] -> [B, 16, 14, 14]
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), # [B, 16, 14, 14] -> [B, 32, 7, 7]
            nn.ReLU(),
            nn.Conv2d(32, 64, 7)                       # [B, 32, 7, 7] -> [B, 64, 1, 1]
        )
        
        # Decoder
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 7),              # [B, 64, 1, 1] -> [B, 32, 7, 7]
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),  # [B, 32, 7, 7] -> [B, 16, 14, 14]
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),   # [B, 16, 14, 14] -> [B, 1, 28, 28]
            nn.Sigmoid()
        )
    
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

Training an Autoencoder

def train_autoencoder(model, train_loader, num_epochs=10):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(num_epochs):
        for data in train_loader:
            img = data[0].to(device)
            
            # Forward pass
            output = model(img)
            loss = criterion(output, img)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Applications of Autoencoders

  • Dimensionality Reduction
    • Alternative to PCA
    • Can capture non-linear relationships
  • Denoising
    • Train to reconstruct clean data from noisy input
    • Useful for image restoration
  • Feature Learning
    • Learn meaningful representations for downstream tasks
    • Transfer learning

Variations of Autoencoders

  • Denoising Autoencoders
    • Add noise to input during training
    • Learn to recover original data
  • Variational Autoencoders (VAE)
    • Learn probabilistic encodings
    • Generate new samples
  • Sparse Autoencoders
    • Add sparsity constraints to latent representation
    • Learn more efficient encodings (see the sketch below)
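
As a minimal sketch of the sparse variant (assuming the ConvAutoencoder defined earlier; sparsity_weight is an illustrative hyperparameter, not a value from this lecture), an L1 penalty on the latent activations can be added to the reconstruction loss:

import torch
import torch.nn as nn

def sparse_ae_loss(model, img, sparsity_weight=1e-3):
    # Encode, decode, and measure reconstruction error
    z = model.encoder(img)
    recon = model.decoder(z)
    recon_loss = nn.functional.mse_loss(recon, img)
    # L1 penalty drives most latent activations toward zero
    sparsity_loss = z.abs().mean()
    return recon_loss + sparsity_weight * sparsity_loss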

Example: Denoising Autoencoder

def add_noise(img, noise_factor=0.3):
    noisy = img + noise_factor * torch.randn_like(img)  # noise on the same device as img
    return torch.clamp(noisy, 0., 1.)

def train_denoising_autoencoder(model, train_loader, num_epochs=10):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(num_epochs):
        for data in train_loader:
            img = data[0].to(device)
            noisy_img = add_noise(img)
            
            # Forward pass
            output = model(noisy_img)
            loss = criterion(output, img)  # Compare with clean image
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Practical Tips for Autoencoders

  • Choose appropriate architecture for your data type
    • CNNs for images
    • RNNs for sequences
    • Dense layers for tabular data
  • Consider:
    • Latent space dimension
    • Depth of encoder/decoder
    • Loss function
    • Regularization techniques
  • Common issues:
    • Overfitting
    • Underfitting
    • Posterior collapse (in VAEs): the decoder learns to ignore the latent code
    • Reconstruction quality vs. compression trade-off

Variational Autoencoders (VAEs)

  • Extension of traditional autoencoders that learns a probabilistic latent representation
  • Instead of encoding to fixed points, encodes to probability distributions
  • Enables:
    • Principled generation of new samples
    • Meaningful latent space interpolation
    • Better regularization of the latent space

VAE vs. Traditional Autoencoder

Traditional Autoencoder

  • Deterministic encoding
  • Point-wise latent representation
  • No guarantee of continuous latent space
  • Focus on reconstruction

Variational Autoencoder

  • Probabilistic encoding
  • Distribution-based latent representation
  • Continuous, structured latent space
  • Balance between reconstruction and regularization

VAE Mathematics

Instead of encoding input \(x\) to a point, VAE encodes to parameters of a distribution:

  • Encoder outputs \(\mu\) and \(\log \sigma^2\) for each latent dimension
  • Latent vector is sampled: \(z = \mu + \sigma \odot \epsilon\), where \(\epsilon \sim \mathcal{N}(0, I)\)

The VAE loss has two terms: \[\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{reconstruction}} + \beta \cdot \mathcal{L}_{\text{KL}}\]

where: \[\mathcal{L}_{\text{KL}} = \frac{1}{2}\sum_{j=1}^J (\mu_j^2 + \sigma_j^2 - \log(\sigma_j^2) - 1)\]
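
To connect the formula to code, the closed-form KL term can be checked against torch.distributions; this is a minimal sketch with arbitrary shapes:

import torch
from torch.distributions import Normal, kl_divergence

mu = torch.randn(4, 32)               # batch of 4, J = 32 latent dimensions
log_var = torch.randn(4, 32)
sigma = torch.exp(0.5 * log_var)

# Closed-form KL between N(mu, sigma^2) and N(0, I), summed over dimensions
kl_closed = 0.5 * (mu.pow(2) + sigma.pow(2) - log_var - 1).sum(dim=1)

# The same quantity via torch.distributions
kl_lib = kl_divergence(Normal(mu, sigma), Normal(0.0, 1.0)).sum(dim=1)
print(torch.allclose(kl_closed, kl_lib, atol=1e-5))  # expect True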

VAE Implementation

class ConvVAE(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.latent_dim = latent_dim  # stored so samplers can query it later
        
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1),  # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256)
        )
        
        # Latent space
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_var = nn.Linear(256, latent_dim)
        
        # Decoder
        self.decoder_input = nn.Linear(latent_dim, 64 * 7 * 7)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid()
        )
        
    def encode(self, x):
        x = self.encoder(x)
        mu = self.fc_mu(x)
        log_var = self.fc_var(x)
        return mu, log_var
    
    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std
    
    def decode(self, z):
        x = self.decoder_input(z)
        x = self.decoder(x)
        return x
    
    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decode(z), mu, log_var

Training a VAE

import torch.nn.functional as F

def vae_loss(recon_x, x, mu, log_var, beta=1.0):
    # Reconstruction loss (binary cross entropy)
    BCE = F.binary_cross_entropy(recon_x, x, reduction='sum')
    
    # KL divergence loss
    KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    
    return BCE + beta * KLD

def train_vae(model, train_loader, num_epochs=10):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(num_epochs):
        for data in train_loader:
            img = data[0].to(device)
            
            # Forward pass
            recon_batch, mu, log_var = model(img)
            loss = vae_loss(recon_batch, img, mu, log_var)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

VAE Latent Space Properties

  • Continuous: Similar points in latent space decode to similar images
  • Structured: Enforced by KL divergence term
  • Meaningful: Can perform interpolation and arithmetic in latent space

VAE Latent Space Visualization

Figure: samples decoded from points throughout the latent space

Generating New Samples with VAE

def generate_samples(model, num_samples=1):
    model.eval()
    device = next(model.parameters()).device  # use the model's own device
    with torch.no_grad():
        # Sample from the standard normal prior
        z = torch.randn(num_samples, model.latent_dim).to(device)
        # Decode the samples
        samples = model.decode(z)
    return samples

def interpolate(model, img1, img2, steps=10):
    model.eval()
    with torch.no_grad():
        # Encode both images (use the posterior means)
        mu1, _ = model.encode(img1)
        mu2, _ = model.encode(img2)
        
        # Interpolation points from img1 (alpha=0) to img2 (alpha=1)
        alphas = torch.linspace(0, 1, steps)
        interpolated = []
        
        for alpha in alphas:
            z = (1 - alpha) * mu1 + alpha * mu2
            interpolated.append(model.decode(z))
            
    return interpolated

Key Differences Summary

  1. Latent Space
    • Vanilla: Unstructured, potentially discontinuous
    • VAE: Continuous, probabilistic
  2. Loss Function
    • Vanilla: Only reconstruction loss
    • VAE: Reconstruction + KL divergence loss
  3. Generation Capabilities
    • Vanilla: Limited/unreliable
    • VAE: Principled generation of new samples
  4. Training Stability
    • Vanilla: Can be unstable
    • VAE: More stable due to regularization

Example: Training a Denoising VAE

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt


def add_noise(x, noise_factor=0.3):
    noisy = x + noise_factor * torch.randn_like(x)
    return torch.clamp(noisy, 0., 1.)

def train_epoch(model, dataloader, optimizer, device, noise_factor=0.3):
    model.train()
    train_loss = 0
    
    for batch_idx, (data, _) in enumerate(dataloader):
        data = data.to(device)
        noisy_data = add_noise(data, noise_factor)
        
        optimizer.zero_grad()
        
        recon_batch, mu, log_var = model(noisy_data)
        
        # Reconstruction loss
        recon_loss = F.binary_cross_entropy(recon_batch, data, reduction='sum')
        
        # KL divergence loss
        kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        
        # Total loss
        loss = recon_loss + kl_loss
        
        loss.backward()
        train_loss += loss.item()
        optimizer.step()
        
        if batch_idx % 100 == 0:
            print(f'Batch [{batch_idx}/{len(dataloader)}]: Loss = {loss.item()/len(data):.4f}')
    
    return train_loss / len(dataloader.dataset)

def visualize_results(model, test_loader, device, noise_factor=0.3):
    model.eval()
    with torch.no_grad():
        data = next(iter(test_loader))[0][:8].to(device)
        noisy_data = add_noise(data, noise_factor)
        recon_data, _, _ = model(noisy_data)
        
        # Plot results
        plt.figure(figsize=(12, 4))
        for i in range(8):
            # Original
            plt.subplot(3, 8, i + 1)
            plt.imshow(data[i][0].cpu(), cmap='gray')
            plt.axis('off')
            if i == 0:
                plt.title('Original')
                
            # Noisy
            plt.subplot(3, 8, i + 9)
            plt.imshow(noisy_data[i][0].cpu(), cmap='gray')
            plt.axis('off')
            if i == 0:
                plt.title('Noisy')
                
            # Reconstructed
            plt.subplot(3, 8, i + 17)
            plt.imshow(recon_data[i][0].cpu(), cmap='gray')
            plt.axis('off')
            if i == 0:
                plt.title('Reconstructed')
        
        plt.tight_layout()
        plt.savefig('vae_results.png')
        plt.close()

def main():
    # Parameters
    batch_size = 128
    epochs = 0 # increase to actually train it
    latent_dim = 32
    learning_rate = 1e-3
    noise_factor = 0.3
    
    # Device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    
    # Data loading
    transform = transforms.Compose([
        transforms.ToTensor()
    ])
    
    train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST('./data', train=False, transform=transform)
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    # Model setup
    model = ConvVAE(latent_dim=latent_dim).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    # Training loop
    for epoch in range(1, epochs + 1):
        print(f"\nEpoch {epoch}/{epochs}")
        train_loss = train_epoch(model, train_loader, optimizer, device, noise_factor)
        print(f'====> Epoch: {epoch} Average loss: {train_loss:.4f}')
        
        # Visualize results every few epochs
        if epoch % 2 == 0:
            visualize_results(model, test_loader, device, noise_factor)
    
    # Save model
    torch.save(model.state_dict(), 'denoising_vae.pth')
    print("Training completed and model saved!")

if __name__ == "__main__":
    main()
Using device: cuda
Training completed and model saved!

VAE vs. Traditional Autoencoder

Traditional Autoencoder

Vanilla AE Latent Space Visualization

Variational Autoencoder

VAE Latent Space Visualization

Classification Introduction

  • From linear regression to classification
  • Moving from “how much?” to “which category?” questions
  • Examples:
    • Spam vs. inbox classification
    • Customer subscription prediction
    • Image classification (donkey, dog, cat, rooster)
    • Movie recommendation
    • Book section prediction

Classification Problem Types

Hard vs. Soft Classification

  • Hard Classification: Direct assignment to categories
  • Soft Classification: Probability assessment for each category
  • The distinction is often blurred in practice: even hard assignments are typically derived from soft probabilities

Multi-label Classification

  • Multiple labels can be true simultaneously
  • Example: News article covering entertainment, business, and space flight
  • Not mutually exclusive categories
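
A common multi-label setup (a hedged sketch, not code from this lecture) replaces softmax with independent per-label sigmoids and nn.BCEWithLogitsLoss:

import torch
import torch.nn as nn

num_features, num_labels = 784, 3          # e.g. entertainment, business, space flight
model = nn.Linear(num_features, num_labels)
criterion = nn.BCEWithLogitsLoss()          # sigmoid + BCE, applied per label

x = torch.randn(8, num_features)                    # toy batch
y = torch.randint(0, 2, (8, num_labels)).float()    # labels are NOT mutually exclusive

logits = model(x)
loss = criterion(logits, y)
probs = torch.sigmoid(logits)   # per-label probabilities; need not sum to 1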

Simple Image Classification Example

Problem Setup

  • Input: \(2 \times 2\) grayscale image
  • Features: \(x_1, x_2, x_3, x_4\) (pixel values)
  • Categories: “cat”, “chicken”, “dog”

Label Representation

Two approaches:

  1. Integer encoding: \(y \in \{1, 2, 3\}\)
  2. One-hot encoding:
    • “cat”: \((1, 0, 0)\)
    • “chicken”: \((0, 1, 0)\)
    • “dog”: \((0, 0, 1)\)
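
Both encodings are easy to produce in PyTorch; note that PyTorch uses 0-based class indices (minimal sketch):

import torch
import torch.nn.functional as F

y = torch.tensor([0, 1, 2])             # integer encoding: cat=0, chicken=1, dog=2
y_onehot = F.one_hot(y, num_classes=3)  # rows are the one-hot vectors above
print(y_onehot)
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1]])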

Linear Model for Classification

Model Structure

For 4 features and 3 categories:

\[ \begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1\\ o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2\\ o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3 \end{aligned} \]

Vectorized Form

\(\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}\)

  • \(\mathbf{W} \in \mathbb{R}^{3 \times 4}\): weight matrix
  • \(\mathbf{b} \in \mathbb{R}^3\): bias vector
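
In code, for a single \(2 \times 2\) image this is one matrix-vector product (sketch with random weights):

import torch

W = torch.randn(3, 4)   # 3 categories x 4 features
b = torch.randn(3)
x = torch.rand(4)       # the four pixel values

o = W @ x + b           # logits, shape (3,)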

The Softmax Function

Why Softmax?

Problems with direct regression:

  • No guarantee outputs sum to 1
  • No guarantee of non-negative outputs
  • No upper bound: outputs can exceed 1, so they cannot be read as probabilities

Softmax Definition

\[\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \textrm{where}\quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}\]

Key properties:

  • Outputs are non-negative
  • Sum to 1
  • Preserves ordering: \(\operatorname*{argmax}_j \hat y_j = \operatorname*{argmax}_j o_j\)
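
These properties are easy to verify numerically (minimal sketch):

import torch

o = torch.tensor([2.0, -1.0, 0.5])
y_hat = torch.softmax(o, dim=0)

print(y_hat)                        # all entries non-negative
print(y_hat.sum())                  # tensor(1.)
print(o.argmax(), y_hat.argmax())   # same index: ordering is preserved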

Vectorization for Efficiency

Batch Processing

For minibatch \(\mathbf{X} \in \mathbb{R}^{n \times d}\):

\[ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}\\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}) \end{aligned} \]

  • The matrix-matrix product \(\mathbf{X} \mathbf{W}\) is the dominant operation
  • Softmax is computed rowwise
  • Numerical stability requires care (a stable implementation is sketched later in this lecture)
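
A minibatch forward pass in code (sketch with random data; shapes follow the formulas above):

import torch

n, d, q = 256, 784, 10             # batch size, features, classes
X = torch.rand(n, d)
W = torch.randn(d, q) * 0.01
b = torch.zeros(q)

O = X @ W + b                      # one matrix-matrix product dominates the cost
Y_hat = torch.softmax(O, dim=1)    # softmax applied rowwise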

Loss Function: Cross-Entropy

Log-Likelihood

For dataset with features \(\mathbf{X}\) and one-hot labels \(\mathbf{Y}\):

\[ P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \]

Cross-Entropy Loss

\[l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j\]

Properties:

  • Bounded below by 0
  • Zero only with perfect prediction
  • Never actually reaches zero for finite weights
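
A quick numeric check of these properties (sketch):

import torch

y = torch.tensor([0.0, 1.0, 0.0])        # one-hot label, e.g. "chicken"
y_hat = torch.tensor([0.1, 0.8, 0.1])    # predicted probabilities
loss = -(y * torch.log(y_hat)).sum()
print(loss)  # tensor(0.2231) = -log(0.8); approaches 0 only as y_hat[1] -> 1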

The Softmax Operation

Why Implement from Scratch?

  • Fundamental understanding of softmax regression
  • Builds on linear regression components
  • Essential for deep learning foundations

Mathematical Definition

The softmax function transforms input values into probabilities:

\[\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}\]

Where:

  • \(\mathbf{X}\) is the input matrix
  • \(i,j\) are matrix indices
  • The denominator is called the partition function (its logarithm is the log-partition function)

Implementation

from d2l import torch as d2l
import torch

def softmax(X):
    X_exp = torch.exp(X)
    partition = X_exp.sum(1, keepdim=True)
    return X_exp / partition  # Broadcasting applied here

Important Note

The implementation above is not robust against very large or very small arguments: exp can overflow or underflow in floating point. Deep learning frameworks guard against this internally.
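
A minimal sketch of the standard fix: subtracting the rowwise maximum before exponentiating leaves the result mathematically unchanged but keeps exp in a safe range.

import torch

def stable_softmax(X):
    # softmax is invariant to subtracting a constant from each row
    X_shifted = X - X.max(dim=1, keepdim=True).values
    X_exp = torch.exp(X_shifted)
    return X_exp / X_exp.sum(dim=1, keepdim=True)

X = torch.tensor([[100.0, 0.0], [-200.0, -100.0]])
print(stable_softmax(X))  # finite and well-behaved
# The naive version overflows: torch.exp(torch.tensor(100.0)) is inf in float32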

The Model Architecture

Key Components

  • Input: Flattened \(28 \times 28\) pixel images (784-dimensional vectors)
  • Output: 10 classes (Fashion-MNIST dataset)
  • Weights: \(784 \times 10\) matrix
  • Biases: \(1 \times 10\) vector

Model Implementation

import torch.nn as nn

class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        # Initialize weights and biases as nn.Parameter
        self.W = nn.Parameter(torch.normal(0, sigma, size=(num_inputs, num_outputs)))
        self.b = nn.Parameter(torch.zeros(num_outputs))
    
    def parameters(self):
        return [self.W, self.b]

Forward Pass

@d2l.add_to_class(SoftmaxRegressionScratch)
def forward(self, X):
    X = X.reshape((-1, self.W.shape[0]))  # flatten images to (batch, 784)
    return softmax(torch.matmul(X, self.W) + self.b)

Cross-Entropy Loss

Mathematical Definition

The cross-entropy loss is defined as:

\[L(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k y_{ij} \log(\hat{y}_{ij})\]

Where:

  • \(\mathbf{y}\) is the true label (one-hot encoded)
  • \(\hat{\mathbf{y}}\) is the predicted probability
  • \(n\) is the batch size
  • \(k\) is the number of classes

Implementation

def cross_entropy(y_hat, y):
    # Pick out the predicted probability of the true class for each example
    return -torch.log(y_hat[list(range(len(y_hat))), y]).mean()

Training Process

Hyperparameters

  • Number of epochs: 10
  • Batch size: 256
  • Learning rate: 0.1

Training Code

data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
# trainer.fit(model, data)

Model Evaluation

Prediction

X, y = next(iter(data.val_dataloader()))
preds = model(X).argmax(axis=1)

Error Analysis

We focus on misclassified examples to understand model weaknesses.
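
A minimal sketch of pulling out the misclassified examples, reusing X, y, and preds from the prediction step above:

# Boolean mask of wrong predictions on the validation batch
wrong = preds.type(y.dtype) != y
X_wrong, y_wrong, preds_wrong = X[wrong], y[wrong], preds[wrong]
print(f'{wrong.sum().item()} of {len(y)} examples misclassified')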

Summary

  • Implemented softmax regression from scratch
  • Used cross-entropy loss for classification
  • Trained on Fashion-MNIST dataset
  • Achieved reasonable classification performance

Exercises

  1. Numerical Stability
    • Test softmax with an input value of 100
    • Test with all inputs smaller than -100
    • Implement a numerically stable fix
  2. Cross-Entropy Implementation
    • Implement alternative cross-entropy function
    • Analyze performance differences
    • Consider domain of logarithm
