Prerequisite · Linear Algebra

Norms, Dot Products, and Angles

15 min read
By the end of this reading you will be able to:
  • Compute the l1, l2, and l-inf norms of a vector and explain what each penalises geometrically
  • Calculate the dot product of two vectors and use it to find the angle between them
  • State the Cauchy-Schwarz inequality and explain why it justifies defining angle via the dot product
  • Compute cosine similarity as a normalised dot product and interpret the score for embedding retrieval tasks
  • Identify orthogonality from a zero dot product and explain its significance for projections
  • Contrast l1 and l2 regularisation in terms of sparsity, geometry, and gradient behaviour

Length of a Vector

Given a vector v=(v1,v2,,vn)TRn\vec{v} = (v_1, v_2, \ldots, v_n)^T \in \mathbb{R}^n, its Euclidean length (or 2\ell^2 norm) is:

v=v12+v22++vn2=i=1nvi2|\vec{v}| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} = \sqrt{\sum_{i=1}^{n} v_i^2}

This generalises the familiar Pythagorean theorem to nn dimensions. A vector v\vec{v} has v=0|\vec{v}| = 0 if and only if v=0\vec{v} = \vec{0}.

Scaling and the Unit Vector

For any scalar c0c \geq 0, scaling a vector scales its length: cv=cv|c\vec{v}| = c|\vec{v}|. A unit vector has length exactly 1. Given any nonzero v\vec{v}, the vector

v^=vv\hat{v} = \frac{\vec{v}}{|\vec{v}|}

is the unit vector in the same direction — this process is called normalisation. Unit vectors matter because they carry only directional information, with magnitude removed.

The Dot Product

The dot product of u,vRn\vec{u}, \vec{v} \in \mathbb{R}^n is:

uv=u1v1+u2v2++unvn=i=1nuivi\vec{u} \cdot \vec{v} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n = \sum_{i=1}^{n} u_i v_i

Notice that vv=v2\vec{v} \cdot \vec{v} = |\vec{v}|^2: the dot product of a vector with itself is the square of its length.

Algebraic Properties

For all u,v,wRn\vec{u}, \vec{v}, \vec{w} \in \mathbb{R}^n and cRc \in \mathbb{R}:

Property Statement
Commutativity uv=vu\vec{u} \cdot \vec{v} = \vec{v} \cdot \vec{u}
Distributivity (u+v)w=uw+vw(\vec{u} + \vec{v}) \cdot \vec{w} = \vec{u} \cdot \vec{w} + \vec{v} \cdot \vec{w}
Scalar pull-out (cu)v=c(uv)(c\vec{u}) \cdot \vec{v} = c(\vec{u} \cdot \vec{v})
Positive-definiteness vv0\vec{v} \cdot \vec{v} \geq 0, with equality iff v=0\vec{v} = \vec{0}

These four properties define a general inner product on a vector space. The Euclidean dot product is the canonical example.

The Cauchy–Schwarz Inequality

For any u,vRn\vec{u}, \vec{v} \in \mathbb{R}^n:

uvuv|\vec{u} \cdot \vec{v}| \leq |\vec{u}|\,|\vec{v}|

Equality holds if and only if one vector is a scalar multiple of the other (they are parallel). Cauchy–Schwarz is the foundational inequality of linear algebra — it underlies the triangle inequality, the definition of angles, and the theory of Fourier series.

Angles Between Vectors

Because uvuv|\vec{u} \cdot \vec{v}| \leq |\vec{u}||\vec{v}|, the ratio uvuv\frac{\vec{u} \cdot \vec{v}}{|\vec{u}||\vec{v}|} always lies in [1,1][-1, 1]. The angle θ\theta between two nonzero vectors is the unique θ[0,π]\theta \in [0, \pi] satisfying:

cosθ=uvuv\cos\theta = \frac{\vec{u} \cdot \vec{v}}{|\vec{u}|\,|\vec{v}|}

Key cases:

  • θ=0\theta = 0: vectors point in the same direction, uv=uv\vec{u} \cdot \vec{v} = |\vec{u}||\vec{v}|
  • θ=π/2\theta = \pi/2: vectors are orthogonal (perpendicular), uv=0\vec{u} \cdot \vec{v} = 0
  • θ=π\theta = \pi: vectors point in opposite directions, uv=uv\vec{u} \cdot \vec{v} = -|\vec{u}||\vec{v}|

Orthogonality

Two vectors are orthogonal if uv=0\vec{u} \cdot \vec{v} = 0. Orthogonality is the algebraic definition of perpendicularity — it requires no notion of angle, just the dot product. A set of vectors is orthogonal if every pair is orthogonal, and orthonormal if additionally every vector has unit length.

The standard basis {e1,,en}\{\vec{e}_1, \ldots, \vec{e}_n\} is orthonormal: eiej=δij\vec{e}_i \cdot \vec{e}_j = \delta_{ij} (the Kronecker delta).

General Norms

Beyond Euclidean length, ML regularly uses:

Norm Formula Intuition
1\ell^1 (Manhattan) v1=ivi\|\vec{v}\|_1 = \sum_i |v_i| Sum of absolute values; sparsity-promoting
2\ell^2 (Euclidean) v2=ivi2\|\vec{v}\|_2 = \sqrt{\sum_i v_i^2} Geometric length
\ell^\infty (Max) v=maxivi\|\vec{v}\|_\infty = \max_i |v_i| Largest component
p\ell^p vp=(ivip)1/p\|\vec{v}\|_p = (\sum_i |v_i|^p)^{1/p} Continuous family; 2\ell^2 is p=2p=2

All norms satisfy the triangle inequality u+vu+v\|\vec{u}+\vec{v}\| \leq \|\vec{u}\| + \|\vec{v}\|.

ML Connections

Cosine Similarity

The cosine of the angle between two vectors depends only on direction, not magnitude:

cosine_sim(u,v)=uvuv=u^v^\text{cosine\_sim}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{|\vec{u}||\vec{v}|} = \hat{u} \cdot \hat{v}

In embedding models (word2vec, BERT, CLIP), each token or image is encoded as a high-dimensional vector. Cosine similarity measures semantic relatedness: +1+1 means identical direction (synonyms), 00 means unrelated (orthogonal), 1-1 means opposite meaning (antonyms).

Regularisation

2\ell^2 regularisation (weight decay) penalises λw22\lambda\|\vec{w}\|_2^2 to prevent large weights. 1\ell^1 regularisation penalises λw1\lambda\|\vec{w}\|_1 and promotes sparsity (many weights exactly zero). Both are norm-based geometric constraints on the weight space.

Distance Metrics

Euclidean distance between two points x,y\vec{x}, \vec{y} is xy|\vec{x} - \vec{y}|. kk-NN, kk-means, and many other algorithms are expressed in terms of Euclidean or p\ell^p distances — which is why feature scaling (normalising columns to unit variance) matters: it prevents one feature's scale from dominating the norm.

Computing Norms and Similarities

import numpy as np

u = np.array([3., 1., -2.])
v = np.array([1., 4.,  1.])

# L2 norm
norm_u = np.linalg.norm(u)          # sqrt(9+1+4) = sqrt(14)
norm_v = np.linalg.norm(v)          # sqrt(1+16+1) = sqrt(18)
print(f'|u| = {norm_u:.4f}')        # 3.7417
print(f'|v| = {norm_v:.4f}')        # 4.2426

# Dot product
dot = np.dot(u, v)                  # 3*1 + 1*4 + (-2)*1 = 5
print(f'u · v = {dot}')             # 5

# Angle
cosine = dot / (norm_u * norm_v)
angle_rad = np.arccos(np.clip(cosine, -1, 1))
angle_deg = np.degrees(angle_rad)
print(f'cos θ = {cosine:.4f}')      # 0.3143
print(f'θ     = {angle_deg:.2f}°') # 71.67°

# Unit vectors (normalisation)
u_hat = u / norm_u
v_hat = v / norm_v
print(f'|û| = {np.linalg.norm(u_hat):.6f}')  # 1.000000

# Cosine similarity via unit-vector dot product
cos_sim = np.dot(u_hat, v_hat)
print(f'cosine similarity = {cos_sim:.4f}')  # 0.3143 (same as cosine above)

# L1 and Linf norms
print(f'||u||_1   = {np.linalg.norm(u, ord=1)}')   # 6.0
print(f'||u||_inf = {np.linalg.norm(u, ord=np.inf)}') # 3.0

# Cauchy-Schwarz check: |u · v| <= |u| * |v|
print(f'|u · v| = {abs(dot):.4f}  <=  |u||v| = {norm_u*norm_v:.4f}')  # True
References
Hefferon 2020 — Linear Algebra, Ch. One §II: Linear Geometry of n-Space