In Python:

# e.g. list everything available in torch.nn
dir(torch.nn)

# e.g. get help on a function, class, etc.
help(torch.nn.Linear)

PyTorch Forum
PyTorch Tutorials
PyTorch Documentation

torch.tensor & other functions

When creating Tensor using ndarray

The two will share storage:

arr = np.array([0.,1.,2.])
A = torch.from_numpy(arr)
arr[:] = 0
A # tensor([0., 0., 0.], dtype=torch.float64)

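## a 0-d array shares storage too, but rebinding the name does not touch that storage
arr_scalar = np.array(3.)
B = torch.from_numpy(arr_scalar)
arr_scalar = 0 # rebinds the Python name only; arr_scalar[...] = 0 would write in place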
B # tensor(3., dtype=torch.float64)

# tolist

Basic Operations for Tensors
'''create tensors'''

X = torch.tensor([[1,1],[1,2]])
X = torch.zeros([10,10])
X = torch.ones([1,2,3])
X = torch.arange(3)
X = torch.linspace(-2,2,10)
X_repeat = X.repeat(3, 2) # repeat 3 times along dim=0, 2 times along dim=1

X.data # acquire the tensor without gradient
X_clone = X.clone() # creates a tensor equal to X that does not share storage with it
# ATTENTION: clone() is recorded by autograd (when requires_grad=True); use X_clone = X.detach().clone() to avoid this

'''tensor's shape and number of dimensions'''

arrayX = np.arange(100).reshape(2,5,10)
arrayX.shape # return (2,5,10)

X = torch.tensor(arrayX)
X.shape # return torch.Size([2,5,10]), the sizes of
# dim 0, dim 1, dim 2. Here dim is the same as axis in NumPy

X.ndim, X.dim(), X.ndimension() # no difference: all three return the number of dimensions

'''tensor's dtype'''

X = torch.tensor([1,2])
X.dtype # return torch.int64
X = torch.Tensor([1,2])
X.dtype # return torch.float32
X = torch.tensor([1,2],dtype=torch.float32)
X = X.type(torch.int64)

'''common arithmetic functions'''

Z = X + Y
Z = X - Y
Y = X.pow(2)
X_sum = X.sum()
X_sum_axis0 = X.sum(axis=0) # decrease dimension along axis=0
X_mean = X.mean()
X.max(), X.min()
X.numel() # count num of elements

'''matrix operation'''

A = torch.ones(2,4)
B = torch.ones(4,1)
torch.matmul(A,B) # like matrix multiplication, you can try A = torch.ones(4), which will be different with torch.ones(1,4)
A@B == torch.matmul(A, B)
X_trans = X.transpose(0,1)

X = torch.ones((3,2,4))
Y = torch.ones((3,4,6))
torch.bmm(X, Y).shape # return torch.Size([3,2,6]) batch matrix multiplication

'''squeeze and unsqueeze'''

X = torch.zeros([1,2,5])
X.squeeze(0) # Remove length-1 dimension only, otherwise returning the same tensor
X.unsqueeze(0) # Expand a dimension

'''concatenate multiple tensors'''
X = torch.ones([1,1,3])
Y = torch.ones([1,4,3])
Z = torch.ones([1,5,3])
cat_XYZ = torch.cat([X,Y,Z], dim=1)
cat_XYZ.shape # return torch.Size([1,10,3])

'''stack multiple tensors'''

A = torch.ones(3,4)
B = torch.ones(3,4)
torch.stack([A,B], dim=0)

# difference between torch.cat and torch.stack

torch.stack([A,B], dim=0).shape # return torch.Size([2,3,4]), concatenate along a new dimension
torch.cat([A,B], dim=0).shape # return torch.Size([6,4]), concatenate in the given dimension

torch.unbind(torch.tensor([[1.,2.], [3.,4.]]), dim=1) # return (tensor([1., 3.]), tensor([2., 4.]))
X = torch.arange(4.) + 1
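Difference between reshape and view?

reshape: returns a view when the memory layout allows it, otherwise a copy
view: always shares storage, and raises an error if the layout is not compatible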


Use cuda

torch.cuda.is_available() # check if your gpu works

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

X = torch.ones([1,2,3])
X = X.to('cpu')
X = X.to('cuda') # move X to the GPU (raises an error if CUDA is unavailable)

# here is an example

device = torch.device('cpu')
if torch.cuda.is_available():
    device = torch.device('cuda')


torch.autograd.backward(tensors, grad_tensors=None, 
            retain_graph=None, create_graph=False)

X = torch.tensor([[1.,1.],[1.,2.]], requires_grad=True) # only floating-point tensors can require grad
f_X = X.pow(2).mean()
f_X.backward() # fills X.grad
X.grad.zero_() # clear the grad

'''quicker ways to require grad when you have many tensors to do so'''
X = torch.ones(10)
Y = torch.randn(1,4)
Z = torch.arange(2.)
for t in [X, Y, Z]:
    t.requires_grad_() # in-place: sets requires_grad=True


More About Autograd

Sources: PyTorch Tutorials, stackoverflow, Documentation, PyTorch Autograd

Tensor.backward(gradient=None, retain_graph=None, create_graph=False, inputs=None)

  • create_graph: if True, the graph of the derivative will be constructed, allowing higher-order derivative products to be computed.
    X = torch.tensor(1., requires_grad=True)
    Y = X.pow
    d1 = X.grad

    By default (with no arguments), backward() must be called on a scalar
    X = torch.tensor([2.,3.,4.], requires_grad=True)
    X.sum().backward() # backward() returns None, the gradient is stored in X.grad
    A = torch.arange(6.).reshape(2,-1)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            out = A[i,j].pow(2)

    Note that PyTorch does not support derivatives of non-scalar functions directly. Any non-scalar tensor \(\mathbf{M}\) is regarded as an intermediate (a local node in the computational graph). PyTorch always expects that some scalar loss \(L\) exists, so that it can calculate \(\frac{\partial{L}}{\partial{\mathbf{M}}}\) with the chain rule.

e.g. the following Y is a vector. When you call backward on it, PyTorch expects you to supply the "upstream" gradient \(\frac{\partial{L}}{\partial{\mathbf{Y}}}\) (it imagines that \(L\) exists). Below, torch.ones_like(X) is given as the "upstream" gradient. This is as if \(L\) were computed from \(\mathbf{Y}\) by (it actually is not) \(L=y_1+y_2+y_3+C\), where \(C\) is a constant, so \(\frac{\partial L}{\partial\mathbf{Y}}=[1,1,1]\). The gradient of X is then calculated as \(\frac{\partial L}{\partial\mathbf{X}}=\frac{\partial L}{\partial\mathbf{Y}}\circ\frac{\partial\mathbf{Y}}{\partial\mathbf{X}}=[1,1,1]\circ[4,6,8]=[4,6,8]\)

'''You can 'add' grad to a tensor'''
X = torch.tensor([2.,3.,4.], requires_grad=True)
Y = X.pow(2)
Y # a vector
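
A minimal sketch of the step described above, assuming torch.ones_like(X) is the upstream gradient: for a vector Y, backward() needs that gradient passed explicitly.

```python
import torch

X = torch.tensor([2., 3., 4.], requires_grad=True)
Y = X.pow(2)                     # Y is a vector, so backward() needs a gradient argument
Y.backward(torch.ones_like(X))   # upstream gradient dL/dY = [1, 1, 1]
print(X.grad)                    # tensor([4., 6., 8.]), i.e. 2 * X
```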

Some things about the chain rule are still confusing; see the example below
X = torch.tensor([1.,2.], requires_grad=True)
A = torch.tensor([[1.,2.], [3.,4.]], requires_grad=True)
Y = torch.matmul(X,A)
X.grad, A.grad

??? need more study
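
One way to read the matmul example, sketched under the assumption that the upstream gradient G is all ones: backward() computes a vector-Jacobian product, so each input gets the upstream gradient multiplied through the local derivative.

```python
import torch

X = torch.tensor([1., 2.], requires_grad=True)
A = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
Y = torch.matmul(X, A)    # Y_j = sum_i X_i * A_ij, shape (2,)
G = torch.ones_like(Y)    # assumed upstream gradient dL/dY
Y.backward(G)

# dL/dX_i  = sum_j A_ij * G_j  -> A @ G
# dL/dA_ij = X_i * G_j         -> outer(X, G)
print(X.grad)   # tensor([3., 7.])
print(A.grad)   # rows [1., 1.] and [2., 2.]
```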

Difference between detach and with torch.no_grad

Sources: stackoverflow

  • tensor.detach() creates a tensor that does not require grad and shares storage with the original tensor. The new tensor is detached from the computational graph.
    net = nn.Linear(4,1)
    X = torch.ones(4, requires_grad=True)
    X # tensor([1., 1., 1., 1.], requires_grad=True)
    X_de = X.detach()
    X_de # tensor([1., 1., 1., 1.])
    Y = net(X)
    Y # tensor([-0.1955], grad_fn=<AddBackward0>)
    Y_de = Y.detach()
    Y_de # tensor([-0.1955])
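
A small sketch contrasting the two, with made-up tensors: detach() acts on a single tensor, while torch.no_grad() switches off graph recording for a whole block of code.

```python
import torch

X = torch.ones(4, requires_grad=True)

# detach(): acts on one tensor; the result shares storage with X
X_de = X.detach()
print(X_de.requires_grad, X_de.data_ptr() == X.data_ptr())   # False True

# torch.no_grad(): acts on a block of code; nothing inside is recorded
with torch.no_grad():
    Y_ng = X * 2
Y = X * 2
print(Y_ng.requires_grad, Y.requires_grad)                   # False True
```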
What is PyTorch Graph?

Run the code below and you will get:

X = torch.ones(1, requires_grad=True)
X # tensor([1.], requires_grad=True)
Y = 2 * X
Y # tensor([2.], grad_fn=<MulBackward0>)
Y.backward()
X.grad # tensor([2.])
Y.backward() # a second backward through the same graph fails:
# ...... Calls into the C++ engine to run the backward pass 
# RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)
# Saved intermediate values of the graph are freed when you call .backward() or autograd.grad()
# Specify retain_graph = True if you need to backward through the graph a second or if you need to access saved tensors after calling backward()
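
A minimal sketch of what the error message suggests, using pow (which saves its input for the backward pass) as an assumed stand-in graph: retain_graph=True keeps the saved tensors so a second backward pass works.

```python
import torch

X = torch.ones(1, requires_grad=True)
Z = X.pow(2)                    # pow saves X for its backward pass

Z.backward(retain_graph=True)   # keep the saved tensors alive
Z.backward()                    # second pass works; gradients accumulate
print(X.grad)                   # tensor([4.]) = 2 + 2

Z2 = X.pow(2)
Z2.backward()                   # saved tensors are freed here
try:
    Z2.backward()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print(second_call_failed)       # True
```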

Y = 2 * X + 1
Y # 
Z = Y.pow(2) / 2
Z # 



Tensor.masked_fill_(mask, value)

  • mask (BoolTensor)
    mask = torch.rand(10) >= 0.6
    X = torch.arange(20.).reshape(2,-1)
    X.masked_fill_(mask, 19.)
PyTorch BroadCasting
X = torch.arange(10.).unsqueeze(1) # let X.shape be (10,1)
Y = torch.tensor([2.,2.])
X / Y
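
A short sketch of what happens in the division above: the shapes (10, 1) and (2,) are aligned from the right and each size-1 dimension is expanded.

```python
import torch

X = torch.arange(10.).unsqueeze(1)   # shape (10, 1)
Y = torch.tensor([2., 2.])           # shape (2,), treated as (1, 2)
Z = X / Y                            # both are expanded to (10, 2)
print(Z.shape)                       # torch.Size([10, 2])
```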

Random Variables


Get a uniform distribution in \([r_1,r_2]\)

r1 = 1
r2 = 5
shape = (3,3)
(r2 - r1) * torch.rand(shape) + r1 # torch.rand samples from [0, 1), so the result lies in [r1, r2)

Make masks:
mask = torch.rand(10) > 0.6


Standard normal distribution.

torch.randn((3,3), dtype=torch.float32, requires_grad=False)

  • out: the output tensor
  • layout: the storage layout of the tensor
  • device: torch.device('cuda' if cuda.is_available() else 'cpu')
Tensors' in-place random sampling


X = torch.tensor([1.,4.])

'''Uniform distribution'''

X.uniform_(1, 5) # uniform_(from=0, to=1, *, generator=None); pass from/to positionally, since from is a Python keyword

'''Bernoulli Distribution'''

'''Cauchy distribution'''

'''Exponential distribution'''

'''Geometric distribution'''

'''Log-normal distribution'''

'''Normal distribution'''

'''Discrete uniform distribution'''

'''Continuous uniform distribution'''




Creates and returns a generator object, used as a keyword argument in many in-place random sampling.

  • get_state(): Returns the Generator state as a torch.ByteTensor, which contains all the necessary bits to restore a Generator.
  • set_state()
  • manual_seed(seed): sets the seed (an integer) and returns the torch.Generator object
    X = torch.tensor([1.,2.])
    g = torch.Generator().manual_seed(19)
    state_0 = g.get_state() # get generator's current state
    X.normal_(0, 2, generator=g) # this will change X's value in place
    state_1 = g.get_state()
    X.normal_(0, 2, generator=g) # this will return a different tensor
    # note that when you pass the keyword argument generator to in-place sampling (e.g. normal_)
    # you just tell PyTorch which generator to use, it will not reset the states for you
    X.normal_(0, 2, generator=g)
    X.normal_(0, 2, generator=g)
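
A small sketch of how get_state() and set_state() fit together, assuming a CPU generator: restoring a saved state replays the same samples.

```python
import torch

g = torch.Generator().manual_seed(19)
state_0 = g.get_state()                              # snapshot before sampling

first = torch.empty(2).normal_(0, 2, generator=g)
second = torch.empty(2).normal_(0, 2, generator=g)   # the generator has advanced

g.set_state(state_0)                                 # rewind to the snapshot
replay = torch.empty(2).normal_(0, 2, generator=g)
print(torch.equal(first, replay))                    # True
```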



torch.repeat_interleave(input, repeats, dim=None, *, output_size=None)

Repeat elements of a tensor.

  • repeats(Tensor or int): number of repetitions for each element
  • dim(int, opt): the dimension along which to repeat values. By default, the input is flattened and a flat output is returned.
repeat_interleave and repeat in making attention masks (the former is the right one)
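
A sketch of the difference, with a hypothetical pair of per-sample masks and an assumed num_heads of 3: repeat_interleave keeps the copies of each sample next to each other, which is the batch-major layout a (N * num_heads, L, S) mask uses, while repeat tiles the whole batch.

```python
import torch

X = torch.tensor([1, 2])
print(torch.repeat_interleave(X, 2))   # tensor([1, 1, 2, 2]): each element repeated in place
print(X.repeat(2))                     # tensor([1, 2, 1, 2]): the whole tensor tiled

# Hypothetical masks: one (L, S) mask per sample, expanded to 3 heads per sample
masks = torch.tensor([[[True]], [[False]]])           # shape (N=2, L=1, S=1)
per_head = torch.repeat_interleave(masks, 3, dim=0)   # shape (6, 1, 1)
print(per_head.flatten().tolist())   # [True, True, True, False, False, False]
```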


torch.nan_to_num(input, nan=0.0)

Common Functions for Tensors

Elementary Functions

Sources: [Wikipedia](https://en.wikipedia.org/wiki/Elementary_function)

X = torch.tensor([3.])
torch.sin(0.01 * X)

Activation functions
X = torch.tensor([2.,1.,1.])
torch.argmax and torch.max
torch.argmax(input, dim, keepdim=False)
torch.max(input, dim, keepdim=False, out=None)


Triangular matrices


a = torch.rand((4,4))




torch.linalg.vector_norm(x, ord=2, dim=None, keepdim=False, dtype=None, out=None)

  • if x is complex valued, the norm of x.abs() is computed
  • dim=None: flatten x and compute the norm. If dim is an int or a tuple, compute the norm along those dimensions.
  • ord: inf for max(abs(x)), -inf for min(abs(x)), 0 for sum(x!=0). Other int or float: \(\left(\sum_i\lvert x_i\rvert^{\text{ord}}\right)^{1/\text{ord}}\)




Base class for all PyTorch map-style datasets. All subclasses should override __getitem__(), because each dataset maps keys to samples in its own way.

It supports fetching a data sample for a given key (any hashable object)

Custom Your Dataset

Override __getitem__, __len__

class MyDataset(Dataset):
    def __init__(self, file):
        self.data = ...

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)       

Here is an advanced example
Sources: PyTorch Documentation

import os
import numpy as np
import pandas as pd
from skimage import io
import torch
from torch.utils.data import Dataset
from torchvision import transforms

class FaceLandmarksDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        self.landmarks_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # column 0 holds the image file name, the remaining columns the landmarks
        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)
        landmarks = self.landmarks_frame.iloc[idx, 1:]
        landmarks = np.array([landmarks], dtype=float).reshape(-1, 2)
        sample = {'image': image, 'landmarks': landmarks}

        if self.transform:
            sample = self.transform(sample)

        return sample

Map-style Dataset: Get Image and its target
import torch
from PIL import Image
About Dataset Class

Wrapping tensors as a dataset, you can then get each sample by indexing tensors along the first dimension.

dataset = data.TensorDataset(*tensors) # the tensors need to have the same size of the first dimension


Create a data iterator from arrays
import torch
from torch.utils import data
def load_array(data_arrays, batch_size, is_train=True):
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
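
A usage sketch for load_array with made-up data (the features and labels tensors are placeholders): the returned DataLoader yields mini-batches, and the last batch is smaller when the dataset size is not a multiple of batch_size.

```python
import torch
from torch.utils import data

def load_array(data_arrays, batch_size, is_train=True):
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

features = torch.arange(20.).reshape(10, 2)   # 10 samples, 2 features (made-up data)
labels = torch.arange(10.)
data_iter = load_array((features, labels), batch_size=4, is_train=False)
batch_sizes = [len(y) for _, y in data_iter]
print(batch_sizes)   # [4, 4, 2]
```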


What is DataLoader
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, 
        batch_sampler=None, num_workers=0, collate_fn=None,
        pin_memory=False, drop_last=False, timeout=0,
        worker_init_fn=None, *, prefetch_factor=2, persistent_workers=False)

At the heart of PyTorch data loading utility is the torch.utils.data.DataLoader class, which represents a Python iterable over a dataset (combines a dataset and a sampler)

  • dataset(Dataset)
  • batch_size(int,opt,1):
  • shuffle(bool,opt,False): set to True to reshuffle the data every epoch
  • sampler(Sampler or Iterable,opt,None): define the method to get samples from the dataset. ❗If specified, shuffle must not be specified
Load Fashion-MNIST Dataset

import torch
import torchvision
from torch.utils import data
from torchvision import transforms
mnist_train = torchvision.datasets



Base class for all Samplers

Every Sampler subclass has to provide an __iter__() method, which is a way to iterate over indices or lists of indices (for batches) of dataset elements, and a __len__() method that returns the length of the returned iterators



torch.utils.data.random_split(dataset, lengths,
            generator=<torch._C.Generator object>)

Split a dataset into non-overlapping new datasets. Note that random_split returns Dataset objects, so you can then pass them to DataLoader

  • generator: the generator used for the random permutation (see torch.Generator above)
  • lengths: a sequence (lengths or fractions of splits to be produced)

g = torch.Generator().manual_seed(20)
train_dataset, validation_dataset = torch.utils.data.random_split(range(10), [0.3,0.7], generator=g)
train_iter = torch.utils.data.DataLoader(train_dataset)


Split a dataset

Sources: stackoverflow

import torch.utils.data as data
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = data.random_split(dataset, [train_size, test_size])



base class for all neural network modules

  • apply(fn): Applies function recursively to every submodule as well as self.

    # initialize the parameters
    def init_weights(m):
        if type(m) == nn.Linear:
    net = nn.Sequential(nn.Linear(4,4), nn.ReLU(), nn.Linear(2,2))
    net.apply(init_weights) # this will initialize nn.Linear(4,4) and nn.Linear(2,2) parameters
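
A complete sketch of apply(), assuming a constant initialization chosen only to make the effect easy to check (the notes do not say which initialization was intended):

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Assumed initialization, chosen only for illustration
    if type(m) == nn.Linear:
        nn.init.constant_(m.weight, 1.0)
        nn.init.zeros_(m.bias)

net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
net.apply(init_weights)       # visits every submodule, then net itself
out = net(torch.ones(4))
print(out)                    # both outputs are 16
```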

  • zero_grad(set_to_none=False): Reset gradients of all model parameters.

  • eval(): Set the module in evaluation mode, which is equivalent to `self.train(False)`
Cases using eval()
'''Batch normalization'''


'''Forbid gradient calculation'''
  • children(): Returns an iterator over immediate children modules. Here's an example:
    class ParentNetwork(nn.Module):
        def __init__(self):
            super(ParentNetwork, self).__init__()
            self.layer1 = nn.Linear(4,2)
            self.bastard = torch.tensor([[2.,3.],[3.,4.]]) # this will not be registered as the module's child
            self.layer2 = nn.ReLU()
            self.layer3 = nn.Linear(2,1)
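        def forward(self, input):
            return self.layer3(self.layer2(self.layer1(input))) # the plain tensor is not a layer, so it is not called here
    model = ParentNetwork()
    for layer in model.children():
        print(layer) # layer1, layer2 and layer3 only
    for layer in model.children():
        if hasattr(layer, 'reset_parameters'):
            layer.reset_parameters()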
    About module.train() and module.eval()
Why do we need module.train()/module.eval()?

Sources: stackoverflow

You can define names for submodules:

class NN(nn.Module):
    def __init__(self):
        super().__init__()

    def diy_add_modules(self, submodules, names):
        for name, submodule in zip(names, submodules):
            self._modules[name] = submodule

    def forward(self, X, who_gonna_do):
        for name in who_gonna_do:
            X = self._modules[name](X)
        return X

My_nn = NN()
My_nn.diy_add_modules([nn.ReLU(), nn.Linear(1,1), nn.Linear(1,2), nn.Linear(2,1)], ['A','B','C','D'])
X = torch.tensor([-1.])
My_nn(X, ['A'])
My_nn(X, ['C','A','D'])

# PyTorch also provides the add_module(name, module) method
My_nn.add_module('A_', nn.ReLU())

  • add_module(name, module)
  • modules(): Returns an iterator over all modules.
nn.ModuleDict, OrderedDict and dict

Sources: PyTorch Forum
Note that submodules are registered using _modules.

Difference between .modules() and .children() & Remove NN layers


net = nn.Sequential(nn.Linear(2,2), nn.Sequential(nn.Sigmoid(), nn.ReLU()))
'''.modules() will recursively go into all modules in the network'''
list(net.modules())
'''[Sequential(
(0): Linear(in_features=2, out_features=2, bias=True)
(1): Sequential(
    (0): Sigmoid()
    (1): ReLU()
)
), Linear(in_features=2, out_features=2, bias=True), Sequential(
    (0): Sigmoid()
    (1): ReLU()
), Sigmoid(), ReLU()]'''

'''.children() will not go into the submodule'''
list(net.children())
'''[Linear(in_features=2, out_features=2, bias=True), Sequential(
    (0): Sigmoid()
    (1): ReLU()
)]'''

  • parameters(recurse=True): Returns an iterator over module parameters
how to reset a module's parameters?
for layer in model.children():
    if hasattr(layer, 'reset_parameters'):
        layer.reset_parameters()
  • state_dict() Returns an OrderedDict containing references to the whole state of a module.
  • register_buffer(name, tensor, persistent): Adds a buffer to the module
Difference between register_buffer and register_parameter?
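
A short sketch of the difference, using a hypothetical Scaler module: both a parameter and a (persistent) buffer are stored in state_dict() and moved by .to(device), but only the parameter is returned by parameters() and so reaches the optimizer.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):   # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(2))        # trainable, seen by optimizers
        self.register_buffer('offset', torch.zeros(2))   # state only, never trained

    def forward(self, x):
        return x * self.weight + self.offset

m = Scaler()
param_names = [name for name, _ in m.named_parameters()]
state_keys = list(m.state_dict().keys())
print(param_names)   # ['weight']
print(state_keys)    # ['weight', 'offset']
```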


  • cpu(): move all model parameters and buffer to CPU
  • cuda(device=None): move all model parameters and buffer to GPU
How to save and load my Module/Tensor?
'''save & load tensor'''
T = torch.tensor('......')
torch.save(T, 'path/name.pt')
T_load = torch.load('path/name.pt')

'''save & load tensor list'''
X, Y = torch.tensor([2.,3.]), torch.tensor(3.)
torch.save([X, Y], 'path\\anyname')
X, Y = torch.load('path\\anyname')

'''save & load dict'''
mydict = {'module':module, 'param':param}
torch.save(mydict, 'path/mydict.pt')
mydict_ = torch.load('path/mydict.pt')

'''save a module and its parameters'''
torch.save(net.state_dict(), 'module.params')
clone = MLP() # must be the same architecture as net
clone.load_state_dict(torch.load('module.params'))

add modules in order
You can see nn.Sequential as an ordered container.

import torch
import torch.nn as nn
from collections import OrderedDict

net = nn.Sequential(
        nn.Conv2d(1,20,5),
        nn.ReLU(),
        nn.Conv2d(20,64,5),
        nn.ReLU()
)

# functionally the same as above
named_net = nn.Sequential(OrderedDict([
                ('conv1', nn.Conv2d(1,20,5)),
                ('relu1', nn.ReLU()),
                ('conv2', nn.Conv2d(20,64,5)),
                ('relu2', nn.ReLU())
]))

# you can also append layers to a list, and then create a net using nn.Sequential
layers = []
net = nn.Sequential(*layers)

# acquire specific layer
named_net.conv1.weight.data # call by the name of the layer

Build a simple Net: Toy MLP
import torch.nn as nn

class ToyMLP(nn.Module):
    def __init__(self):
        super(ToyMLP, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(8, 4),
            nn.Linear(4, 4)
        )

    def forward(self, X):
        return self.net(X)

Another example

class ToyMLP(nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(ToyMLP, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.relu = torch.nn.ReLU()
        self.out = torch.nn.Linear(n_hidden, n_output)
        self.softmax = torch.nn.Softmax(dim=0)

    def forward(self, X):
        X = self.hidden(X)
        X = self.relu(X)
        X = self.out(X)
        X = self.softmax(X)
        return X
Custom Your Network: A more flexible example

Sources: D2l

  • append(module): appends a given module to the end of the list
  • extend(modules): appends the modules from an iterable to the end of the list

Linear Layer
\(\text{Linear}(X) = XW^{\top} + b\)

torch.nn.Linear(in_features, out_features)
Note that the input tensor's last dimension must equal the layer's in_features
e.g. Input Tensor: \(*\times2\) -> Linear Layer: nn.Linear(2, 3) -> Output Tensor: \(*\times3\)

import torch
import torch.nn as nn

lin_layer = nn.Linear(10,20) # you can simply think this as a function, with random parameters

# you can call the parameters and initialize them as you want


lin_layer.weight.data = torch.ones(lin_layer.weight.shape)
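
A quick sketch of the shape rule, with a made-up input: the weight is stored as (out_features, in_features), so the layer's output can be reproduced by hand as X @ W.T + b.

```python
import torch
import torch.nn as nn

lin_layer = nn.Linear(10, 20)
X = torch.rand(3, 5, 10)     # any leading dims, last dim = in_features
Y = lin_layer(X)
print(Y.shape)               # torch.Size([3, 5, 20])

# weight has shape (out_features, in_features)
manual = X @ lin_layer.weight.T + lin_layer.bias
print(torch.allclose(Y, manual, atol=1e-6))
```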

net = nn.Sequential(nn.LazyLinear(4), nn.ReLU(), nn.LazyLinear(1))
net[0].weight # <UninitializedParameter>
net[0] # LazyLinear(in_features=0, out_features=4, bias=True)
X = torch.rand(2,2)
net(X) # the framework will initialize sequentially
nn.Sigmoid, torch.nn.ReLU:

Common Activation Functions

import torch.nn as nn

acti_sigmoid = nn.Sigmoid()
acti_Relu = nn.ReLU()

  • input & output: any number of additional dimensions, the shapes of input and output are same.
  • dim(int): the dimension along which Softmax calculates.
How to copy my module?

Python has copy.deepcopy() to handle this
Sources: geeksforgeeks

import copy
LN = torch.nn.Linear(4,1)
LN_clone = copy.deepcopy(LN)
LN_clone == LN, LN_clone.weight == LN.weight, LN_clone.bias == LN.bias
# the first will return False while others True


Sources: Documentation




torch.nn.Conv2d(in_channels, out_channels,
        kernel_size, stride=1, padding=0, dilation=1,
        groups=1, bias=True, padding_mode='zeros',
        device=None, dtype=None)

  • padding: note that if you use padding=2, then 2 rows (columns) are added to every side of the input
nn.MaxPool, nn.AvgPool

# MaxPool and AvgPool keep the number of channels: output channels = input channels

X = torch.arange(16.).reshape(1, 1, 4, -1) # float input, shape (1, 1, 4, 4)
X = torch.cat((X, X+1),1)

max_pool2d = nn.MaxPool2d(3, padding=1, stride=2)
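
A worked sketch of the pooling call above (float input assumed): each channel is pooled independently, so the two input channels give two output channels.

```python
import torch
import torch.nn as nn

X = torch.arange(16.).reshape(1, 1, 4, 4)
X = torch.cat((X, X + 1), 1)                        # shape (1, 2, 4, 4): two channels
max_pool2d = nn.MaxPool2d(3, padding=1, stride=2)
out = max_pool2d(X)
print(out.shape)                                    # torch.Size([1, 2, 2, 2])
print(out[0, 0])                                    # channel 0: rows [5., 7.] and [13., 15.]
```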


torch.nn.Flatten(start_dim=1, end_dim=-1)
torch.nn.Unflatten(dim, unflattened_size)
  • dim: the dimension of the input tensor to be unflattened


torch.nn.ConvTranspose2d(in_channels, out_channels, kernel_size, 
            stride=1, padding=0, output_padding=0,
            groups=1, bias=True, 
            dilation=1, padding_mode='zeros',
            device=None, dtype=None)


Sources: PyTorch discussion



Sources: [Documentation], stackoverflow

torch.nn.RNN(input_size, hidden_size,
        num_layers=1, nonlinearity='tanh',
        bias=True, batch_first=False, dropout=0.0,
        bidirectional=False, device=None, dtype=None)

  • Note that all weights and biases are initialized from the uniform distribution \(\mathcal{U}(-\sqrt{k},\sqrt{k}),k=\frac{1}{\text{hidden\_size}}\)
  • batch_first: If True, then the input and output tensors are provided as (batch, seq, feature) instead of (seq, batch, feature). This does not apply to hidden or cell states.
  • Return: 1) output: \((L,N,D*H_{out})\), or \((N,L,D*H_{out})\) when batch_first=True, with \(D=2\) if bidirectional=True, otherwise 1; 2) h_n: tensor of shape \((D\times \text{num\_layers}, H_{out})\) for unbatched input or \((D\times\text{num\_layers}, N, H_{out})\), containing the final hidden state of each layer.
    rnn_layer = nn.RNN(4, 3, bias=False, num_layers=2)
    rnn_layer.weight_ih_l0 # ih for input-hidden, l0 for layer 0
    rnn_layer.weight_hh_l0 # hh for hidden-hidden
    rnn_layer.weight_ih_l1 # acquire the second layer
    input = torch.ones(size=(1, 3, 4)) # the last dimension must equal input_size=4
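
A shape-check sketch for the layer above, with an assumed input of 5 time steps and batch size 2:

```python
import torch
import torch.nn as nn

rnn_layer = nn.RNN(4, 3, bias=False, num_layers=2)   # input_size=4, hidden_size=3
inputs = torch.ones(5, 2, 4)                         # (seq L=5, batch N=2, feature=4), batch_first=False
output, h_n = rnn_layer(inputs)
print(output.shape)   # torch.Size([5, 2, 3]): last layer's hidden state at every step
print(h_n.shape)      # torch.Size([2, 2, 3]): final hidden state of each layer
print(rnn_layer.weight_ih_l0.shape, rnn_layer.weight_ih_l1.shape)   # (3, 4) and (3, 3)
```

The last time step of output is the final hidden state of the top layer, so output[-1] and h_n[-1] hold the same values.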





  • I/O:
    • (*) The input should be IntTensor or LongTensor of arbitrary shape.
    • (*, H) H = embedding_dim
  • The weights initialized from \(\mathcal{N}(0,1)\)




torch.nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0,
            bias=True, add_bias_kv=False, add_zero_attn=False,
            kdim=None, vdim=None, batch_first=False,
            device=None, dtype=None)

  • embed_dim: Total dimension of the model
  • num_heads: Each head will have dimension embed_dim // num_heads
  • bias: Default: True. If specified, adds bias to input / output projection layers.
  • batch_first: Default: False \((\text{seq}, \text{batch}, \text{feature})\). If True, the input and output tensors are provided as \((\text{batch}, \text{seq}, \text{feature})\)
forward(query, key, value, key_padding_mask=None,
    need_weights=True, attn_mask=None,
    average_attn_weights=True, is_causal=False)
  • query:
    • \((\text{L}, \text{E}_q)\) for unbatched input.
    • \((\text{L}, \text{N}, \text{E}_q)\) when batch_first=False
    • \((\text{N}, \text{L}, \text{E}_q)\) when batch_first=True
  • key:

Examples: Singlehead Dot Scaled Attention

input = torch.randn(2, 10, 4) # batch, time steps, embedding's dimension
Singlehead = nn.MultiheadAttention(4, 1, bias=True, batch_first=True) # be careful with batch_first

'''Single Attention Mechanism'''
Query = input@Singlehead.in_proj_weight[:4].t() + Singlehead.in_proj_bias[:4] 
# note that the bias was initialized as 0 vector
Key = input@Singlehead.in_proj_weight[4:8].t()
Value = input@Singlehead.in_proj_weight[8:].t()
Sa = torch.bmm(Query, torch.transpose(Key, 1, 2)) / torch.sqrt(torch.tensor(4.))
Sa_weight = torch.softmax(Sa, dim=-1)
Weighted_value = Sa_weight@Value
Output = Weighted_value@Singlehead.out_proj.weight.t()
a, _ = Singlehead(input, input, input)
a - Output
# the difference is very close to zero but not exactly zero: the two computations order their floating-point operations differently, so their rounding errors differ

  • attn_mask: if specified, a 2D/3D mask preventing attention to certain positions. Note: (1,X,X) won't broadcast!
    • Binary and float tensors are both supported. Binary: True indicates that the position is not allowed to attend. Float: the value is added to the attention weight.
  • key_padding_mask: A mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e. ignore padding)
attn_mask and key_padding_mask
  • key_padding_mask:




torch.nn.Transformer(d_model=512, nhead=8, 
        num_encoder_layers=6, num_decoder_layers=6, 
        dim_feedforward=2048, activation=<function relu>, dropout=0.1,
        batch_first=False, norm_first=False,
        custom_encoder=None, custom_decoder=None,
        device=None, dtype=None)

An example: Word Language Model
forward(src, tgt, src_mask=None, tgt_mask=None, 
    memory_mask=None, src_key_padding_mask=None,
    tgt_key_padding_mask=None, memory_key_padding_mask=None,
    src_is_causal=None, tgt_is_causal=None, memory_is_causal=False)

  • src: \((S,E)\) for unbatched input, \((N, S, E)\) if batch_first=True
  • tgt: \((T,E)\) ... \((N, T, E)\) ...
  • src_mask: \((S, S)\) or \((N\cdot num_{heads}, S, S)\)
  • tgt_mask: \((T, T)\) or \((N\cdot num_{heads}, T, T)\)

Loss Functions

torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')
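  • reduction: $$l(x,y)=\left\{\begin{aligned}&\operatorname{mean}(L), &&\text{if reduction}=\text{'mean'}\\&\operatorname{sum}(L), &&\text{if reduction}=\text{'sum'}\end{aligned}\right.$$ where \(L=\{l_1,\cdots,l_N\}\) and \(l_n=(x_n-y_n)^2\).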


  • reduce: Deprecated
  • size_average: Deprecated


torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)

  • weight: if provided, the input should be a 1D tensor assigning weight to each of the classes. (useful when you have an unbalanced training set)
  • Note that the input has to contain the unnormalized logits for each class. The shape of input has to be \((\text{minibatch}, C)\) or \((\text{minibatch}, C, d_1,d_2,\cdots,d_K)\) (for the K-dimensional case, i.e. computing cross entropy loss per-pixel for 2D images).
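  • reduction: $$l(x,y)=\left\{\begin{aligned}
&\sum_{n=1}^N l_n \Big/ \sum_{n=1}^N w_{y_n}, &&\text{if reduction}=\text{'mean'}\\
&\sum_{n=1}^N l_n, &&\text{if reduction}=\text{'sum'}
\end{aligned}\right.$$ Here \(w_{y_n}=1\) when weight is not given. If reduction is set to 'none', then for an input of batch size \(N\), the loss function returns \(\{l_1,l_2,\cdots,l_N\}\), one loss per sample.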

criterion_mean = nn.CrossEntropyLoss()
criterion_sum = nn.CrossEntropyLoss(reduction='sum')
criterion_none = nn.CrossEntropyLoss(reduction='none')
y_pred = torch.cat([torch.ones(17), torch.tensor([2.,3.,4.])]).reshape(2,-1)
y_real = torch.ones([2,10])
criterion_mean(y_pred, y_real)
criterion_sum(y_pred, y_real)
criterion_none(y_pred, y_real)


CrossEntropyLoss does not compute the cross entropy of its raw input! It first applies LogSoftmax to the logits and then NLLLoss.
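
A minimal check of this point, with random logits as an assumed input: CrossEntropyLoss on raw logits matches NLLLoss applied to log_softmax of the same logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 5)          # (minibatch, C), unnormalized scores
target = torch.tensor([0, 2, 4])    # class indices

ce = nn.CrossEntropyLoss()(logits, target)
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(torch.allclose(ce, nll))      # True
```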

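Measures the binary cross entropy between the target \(\hat{y}_n\) and the input probabilities \(y_n\): \(l_n=-w_n[\hat{y}_n\log y_n+(1-\hat{y}_n)\log(1-y_n)]\)


With logits \(x_n\) as input and target \(y_n\), the sigmoid is applied inside the loss: \(l_n=-w_n[y_n\log \sigma(x_n)+(1-y_n)\log(1-\sigma(x_n))]\)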

BCELoss vs BCEWithLogitsLoss
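
A small comparison sketch with made-up scores and targets: BCEWithLogitsLoss takes raw logits and applies the sigmoid internally, while BCELoss expects probabilities, so the two agree once the sigmoid is applied by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4)                    # raw scores (logits)
y = torch.tensor([1., 0., 1., 0.])    # targets

loss_logits = nn.BCEWithLogitsLoss()(x, y)
loss_probs = nn.BCELoss()(torch.sigmoid(x), y)
print(torch.allclose(loss_logits, loss_probs, atol=1e-6))
```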
tensor(nan, grad_fn=<MseLossBackward>)

State: Unknown Reason
Possible solution: Reddit
To reproduce the bug:




torch.nn.Dropout(p=0.5, inplace=False)

  • inplace: if set True this will do this operation in-place.
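
A short sketch of Dropout's two modes, using an all-ones input as an assumption to make the scaling visible: in training mode surviving elements are scaled by 1/(1-p), and in evaluation mode the layer is the identity.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
X = torch.ones(1000)

drop.train()
Y_train = drop(X)               # each element is zeroed with prob. p, survivors scaled by 1/(1-p)
print(Y_train.unique())         # values are 0. and 2.

drop.eval()
Y_eval = drop(X)                # identity in evaluation mode
print(torch.equal(Y_eval, X))   # True
```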


torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1,
            affine=True, track_running_stats=True,
            device=None, dtype=None)

Applies Batch Normalization over a 4D input \((N,C,H,W)\) (the 2d refers to 2D images)
BN = nn.BatchNorm2d(1, affine=False, eps=0, momentum=0) # set channel's number to 1
N1 = torch.ones(1,2,2)
N2 = torch.ones(1,2,2) * 2
B = torch.stack([N1, N2], dim=0)
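
A sketch that reproduces the layer's output by hand for this batch: in training mode BatchNorm2d normalizes each channel with the mean and the biased variance taken over the (N, H, W) dimensions.

```python
import torch
import torch.nn as nn

BN = nn.BatchNorm2d(1, affine=False, eps=0, momentum=0)    # one channel
N1 = torch.ones(1, 2, 2)
N2 = torch.ones(1, 2, 2) * 2
B = torch.stack([N1, N2], dim=0)                           # shape (N=2, C=1, H=2, W=2)

out = BN(B)                                                # training mode: uses batch statistics
mean = B.mean(dim=(0, 2, 3), keepdim=True)                 # 1.5
var = B.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # 0.25 (biased)
manual = (B - mean) / var.sqrt()
print(torch.allclose(out, manual))                         # N1 maps to -1, N2 maps to 1
```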

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, 
        bias=True, device=None, dtype=None)
  • The mean and standard deviation are calculated over the last D dimensions, where D is the length of normalized_shape
    • The variance is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False)
  • elementwise_affine: set True to learn a per-element scale \(\gamma\) and shift \(\beta\) that are applied after normalization.
    X = torch.tensor([[[1.,2.],[2.,3.]]]) # X.shape: (1,2,2)
    LayerNorm = nn.LayerNorm((2,2), eps=0, elementwise_affine=False)
    (X - X.mean()) / X.std(correction=0) # same as LayerNorm(X)
    LayerNorm_ = nn.LayerNorm(2, eps=0)
    (X - X.mean(dim=-1, keepdim=True)) / X.std(dim=-1, correction=0, keepdim=True) # same as LayerNorm_(X)


Difference between torch.nn.functional and torch.nn?

Sources: Stackexchange

  • nn.functional.xxx: You'll need to handle the parameters yourself (passing them to the optimizer or moving them to the GPU)
  • nn.xxx: easy to handle parameters with net.parameters(), net.to(device) etc.
  • You can see nn.functional.xxx as a more flexible function than nn.xxx.
torch.nn.functional.softmax(input, dim=None, _stacklevel=3, dtype=None)
torch.nn.functional.one_hot(tensor, num_classes=-1)
  • num_classes: if set to -1, the number of classes is inferred as one greater than the largest value in the input.
    X = torch.tensor([0,3,5])
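
A quick sketch of the num_classes rule on this X:

```python
import torch
import torch.nn.functional as F

X = torch.tensor([0, 3, 5])
onehot = F.one_hot(X)                       # num_classes inferred as max(X) + 1 = 6
print(onehot.shape)                         # torch.Size([3, 6])
print(F.one_hot(X, num_classes=8).shape)    # torch.Size([3, 8])
```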


Why do we need nn.parameter.Parameter?

Sources: stackoverflow

  • Tensors are multi-dimensional matrices, while Parameters are Tensor subclasses. When a Parameter is assigned to a module as an attribute, it is automatically added to the module's parameter list and can be accessed through the .parameters() iterator. A plain tensor attribute is not.
  • This comes with a
    class simple_function(nn.Module):
        def __init__(self):
            super(simple_function, self).__init__()
            self.weight0 = torch.tensor([1.,1.], requires_grad=True)
            self.weight1 = torch.nn.Parameter(torch.tensor([5.,6.])) # by default, the parameter will be set `requires_grad=True`
        def forward(self, input):
            return torch.matmul(self.weight1, input) + self.weight0.sum() * 1/2
    sf = simple_function()
    for param in sf.parameters():
        print(type(param.data), param)
    # return <class 'torch.Tensor'> Parameter containing: tensor([5., 6.], requires_grad=True)
    X = torch.tensor([1.,1.])
    Y = sf(X)
    Y # tensor(12., grad_fn=<AddBackward0>)
    Y.backward()
    sf.weight1.grad # tensor([1., 1.])
    sf.weight0.grad # tensor([0.5000, 0.5000])


torch.nn.parameter.Parameter(data=None, requires_grad=True)

A kind of tensor used as a module parameter. Parameters



'''Normal distribution'''
nn.init.normal_(torch.empty(3,3), mean=3, std=1) # default 0.0, 1.0

LinearNet = nn.Linear(4,1)

'''Uniform distribution'''
def init_uniform(net):
    if isinstance(net, nn.Linear):
        nn.init.uniform_(net.weight) # default range: a=0.0, b=1.0

Xavier Uniform: generates a tensor from \(\mathcal{U}(-a,a), a=\text{gain}\times \sqrt{\frac{6}{\text{fan\_in}+\text{fan\_out}}}\)
w = torch.empty(2,2)
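
A small sketch checking the bound from the formula for this 2x2 tensor (gain=1.0 is the default and is written out here only for clarity):

```python
import math
import torch
import torch.nn as nn

w = torch.empty(2, 2)                   # fan_in = fan_out = 2
nn.init.xavier_uniform_(w, gain=1.0)    # fills w in place
a = 1.0 * math.sqrt(6 / (2 + 2))        # bound from the formula, about 1.2247
print(bool(w.abs().max() <= a))         # True
```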





base class for all optimizers
torch.optim.Optimizer(params, defaults)

  • params: an iterable of torch.Tensors or dicts
  • defaults: a dict containing default values of optimization options

  • Optimizer.zero_grad(set_to_none=True): Resets the gradients of all optimized torch.Tensors
    • set_to_none (bool): set the grads to None instead of to zero.

Why do we need to call zero_grad?

Sources: stackoverflow

PyTorch accumulates gradients across successive backward passes. This is useful when we want the gradient of a loss summed over multiple batches, or when training RNNs; otherwise the gradients must be cleared before each new backward pass.
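
A minimal sketch of the accumulation on a single tensor (the values are made up):

```python
import torch

X = torch.tensor([1., 2.], requires_grad=True)

X.pow(2).sum().backward()
g1 = X.grad.clone()
print(g1)          # tensor([2., 4.])

X.pow(2).sum().backward()
g2 = X.grad.clone()
print(g2)          # tensor([4., 8.]): added to the previous gradient

X.grad.zero_()     # zero_grad() clears every parameter's gradient like this (or sets it to None)
print(X.grad)      # tensor([0., 0.])
```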


Stochastic Gradient Descent

torch.optim.SGD(params, lr, momentum=0, dampening=0,
        weight_decay=0, nesterov=False, maximize=False, foreach=None, differentiable=False)

  • momentum: momentum factor \(\mu\)
    • dampening: dampening \(\tau\) for momentum
  • params: iterable of parameters to optimize or dicts
    net = nn.Sequential(nn.Linear(4,1), nn.Linear(1,2), nn.ReLU(), nn.Linear(2,1, bias=False))
    optimizer = torch.optim.SGD([{"params": net[0].weight, "weight_decay": 0.01}, {"params": net[0].bias}], lr=1.) # although all net.parameters are set to require gradient, you can choose which to update using optimizer
    rand_samples = torch.randn(5, 4)
    for i in range(len(rand_samples)):
        Y = net(rand_samples[i])
        for param in net.parameters():
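The loop above is cut off in these notes. As a sketch of the point it was making, the following checks which parameters actually change after one optimizer step when only net[0] is handed to the optimizer. The bookkeeping (before, changed) is my own; the network and optimizer are copied from above.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 1), nn.Linear(1, 2), nn.ReLU(), nn.Linear(2, 1, bias=False))
optimizer = torch.optim.SGD(
    [{"params": net[0].weight, "weight_decay": 0.01}, {"params": net[0].bias}],
    lr=1.0,
)

before = {name: p.detach().clone() for name, p in net.named_parameters()}

rand_samples = torch.randn(5, 4)
optimizer.zero_grad()
net(rand_samples).sum().backward()  # every parameter receives a gradient
optimizer.step()                    # but only net[0] is updated

changed = {
    name: not torch.equal(before[name], p.detach())
    for name, p in net.named_parameters()
}
for name, flag in changed.items():
    print(name, "changed" if flag else "unchanged")
```

Parameters of net[1] and net[3] still get a .grad from backward(); they stay fixed only because the optimizer never sees them.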

torch.optim.Adam(params, lr=0.001, betas=(0.9,0.999),
        eps=1e-08, weight_decay=0, amsgrad=False,
        maximize=False, capturable=False,
  • amsgrad: whether to use the AMSGrad variant of this algorithm
Change the learning rate based on the number of epochs
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(100):
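The epoch loop is cut off here. A sketch of what a StepLR loop typically looks like, with a dummy parameter w and a dummy loss standing in for real training (both are assumptions for illustration): with step_size=5 and gamma=0.1, the learning rate is multiplied by 0.1 every 5 epochs.

```python
import torch
from torch.optim.lr_scheduler import StepLR

w = torch.zeros(1, requires_grad=True)  # dummy parameter
optimizer = torch.optim.SGD([w], lr=0.1)
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)

lrs = []
for epoch in range(12):
    lrs.append(optimizer.param_groups[0]["lr"])  # lr used during this epoch
    optimizer.zero_grad()
    w.sum().backward()   # stand-in for the real training step
    optimizer.step()
    scheduler.step()     # call once per epoch, after optimizer.step()

print(lrs)  # about 0.1 for epochs 0-4, 0.01 for epochs 5-9, 0.001 afterwards
```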






All transformations accept PIL Image, Tensor Image \((C,H,W)\) or batch of Tensor Images \((B,C,H,W)\) as input.



torchtext.data.utils.get_tokenizer(tokenizer, language='en')
  • tokenizer: if None, it returns the split() function (splits by whitespace)
    • basic_english: returns the _basic_english_normalize() function, which first normalizes the string and then splits it by whitespace.
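To make the difference between the two options concrete without depending on torchtext, here is a rough pure-Python approximation. basic_english_sketch is a hypothetical helper of my own: it only lowercases and pads a few punctuation marks, and is not torchtext's actual normalization.

```python
import re

def whitespace_tokenizer(line):
    # what tokenizer=None amounts to: plain str.split()
    return line.split()

def basic_english_sketch(line):
    # hypothetical approximation: lowercase, separate punctuation, then split
    line = line.lower()
    line = re.sub(r"([.,!?()])", r" \1 ", line)
    return line.split()

text = "I miss you, PyTorch!"
print(whitespace_tokenizer(text))   # ['I', 'miss', 'you,', 'PyTorch!']
print(basic_english_sketch(text))   # ['i', 'miss', 'you', ',', 'pytorch', '!']
```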


from torchtext.vocab import vocab
from collections import Counter, OrderedDict
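These imports suggest the usual pattern of counting tokens and ordering them by frequency. A sketch of that step in plain Python; the toy sentence and the final token_to_index mapping are my own, and I believe (but have not checked here) that the resulting ordered dict is the kind of object torchtext's vocab() expects as input.

```python
from collections import Counter, OrderedDict

tokens = "i miss you . i miss pytorch .".split()
counter = Counter(tokens)

# most frequent first; ties broken alphabetically so the order is deterministic
sorted_by_freq = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
ordered = OrderedDict(sorted_by_freq)

# hand-rolled token -> index lookup, standing in for a vocab object
token_to_index = {tok: idx for idx, tok in enumerate(ordered)}
print(token_to_index)  # {'.': 0, 'i': 1, 'miss': 2, 'pytorch': 3, 'you': 4}
```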

Unsorted Notes

The Busy Person's Intro to LLMs


Model Inference: What is a LLM?

Two files (llama-2-70b by Meta AI): Llama series, 2nd iteration, 70 billion parameters. The paper, parameters, and architecture are all openly available.

  • Parameters: \(140\)GB
    • each parameter is stored in 2 bytes (a float16 number)
  • run.c \(\sim500\) lines of C code
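The 140GB figure follows directly from the parameter count; a quick check (taking 1GB as \(10^9\) bytes):

```python
n_params = 70e9          # 70 billion parameters
bytes_per_param = 2      # float16
size_gb = n_params * bytes_per_param / 1e9
print(size_gb)  # 140.0
```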

You only need a device...(don't need network)

Model Training: How do we get the parameters?
  • Chunk of the internet (\(\sim10\)TB of text)
  • \(6000\) GPUs for \(12\) days, \(\sim\)$2M, \(\sim1e24\) FLOPs (total floating-point operations, not operations per second)
    • like compressing the chunk of the internet into a 'zip' file (but we can't get the original chunk back out of it)
    • these are only rookie numbers; state-of-the-art models are larger by a factor of \(10\) or more...
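A back-of-the-envelope check of these numbers, as a sketch: dividing the total compute by the GPU-seconds gives the sustained throughput each GPU would need, and dividing the text size by the parameter file size gives the rough 'compression' ratio. All inputs are the approximate figures quoted above.

```python
gpus, days = 6000, 12
total_flop = 1e24                        # total operations, not a rate

gpu_seconds = gpus * days * 24 * 3600    # 6,220,800,000 GPU-seconds
flop_per_gpu_second = total_flop / gpu_seconds
print(f"{flop_per_gpu_second:.2e}")      # about 1.6e14, i.e. ~160 TFLOP/s per GPU

text_bytes = 10e12                       # ~10TB of text
param_bytes = 140e9                      # ~140GB of parameters
compression_ratio = text_bytes / param_bytes
print(round(compression_ratio))          # about 71
```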

The LLM is simply predicting the next word in the sequence

Next word prediction forces the neural network to learn a lot about the world
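As a toy illustration of the next-word prediction task only (a real LLM is a neural network, not a lookup table), here is a bigram counter over a made-up corpus that predicts the most frequent follower of a word:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate the fish .".split()

# count which word follows which
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(word):
    # most frequent follower seen in the corpus
    return follow_counts[word].most_common(1)[0][0]

total = sum(follow_counts["the"].values())
prob_cat = follow_counts["the"]["cat"] / total

print(predict_next("the"), prob_cat)  # cat 0.5
```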

LLM 'Dreams' / Hallucination
  • Fake Links
  • Fake Codes
  • ...
    It just puts in whatever it 'thinks' is reasonable
How do they work?

We know every math operation. But little is known in full detail

  • We can measure that this works, but we don't really know how the billions of parameters collaborate to do it. (Equivalently, we do not actually know how to adjust the parameters precisely to make it work; we just 'train' it, like human teachers do.)
  • The model's database is strange: reversal curse (one-dimensional?)
Fine tuning: Training the Assistant

Training will be the same, but with different datasets.

First you write labeling instructions, then hire people to write ideal Q&A responses. That gives you the dataset, which you use to train your model (fine-tuning), and then you deploy the model. While the model is running you will see misbehaviours; you can then fine-tune again (let people write the correct Q&A responses and add them to the dataset).

  • \(\sim100K\) conversations (people write: questions and ideal answers)
  • Quality over quantity
    After fine tuning you have the Assistant model.

The model somehow still has access to the first-stage (pretraining) knowledge.

  • Another way to fine-tune: have people compare candidate answers and feed the comparisons back.
LLM Scaling Laws


LLMs Use tools
  • Browser
  • Calculator
  • Python Interpreter
  • Vision: See and Generate images
  • Audio: Hear and Speak

LLMs only have the instinctive part; they cannot yet reason deliberately.

Take 30 minutes to think

Create tree of thoughts.



What does step 2 look like in LLMs? There is no clear reward criterion.

Data poisoning and Backdoor attacks

Hugging Face

from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier(['I miss you', 'I just miss your so much. I guess'])
  • by default, the pipeline selects a particular trained model. The model is downloaded and cached.
Zero-shot classification
classifier = pipeline('zero-shot-classification')
    "this is my life",
  • You don't need to fine-tune the model on your data to use it.
Text Generation
generator = pipeline('text-generation')
generator('I miss you')

# specify a model
generator = pipeline('text-generation', model='distilgpt2')
    'I miss you.',
Save models

