Spark GraphX and Graph Processing

GraphX is Spark's library for processing graph-structured data. A graph represents entities (vertices) and the relationships between them (edges). Social networks, road maps, supply chains, and recommendation systems are all naturally modeled as graphs.

What Is a Graph?

Social network graph:

    Alice ---[follows]--> Bob
    Alice ---[follows]--> Carol
    Bob   ---[follows]--> Carol
    Carol ---[follows]--> Dave

Vertices (nodes): Alice, Bob, Carol, Dave
Edges (links):    follows relationships

Each vertex and edge can carry properties:
Vertex: {name: "Alice", age: 32, city: "NY"}
Edge:   {relationship: "follows", since: "2022-01"}

Core Graph Concepts

Vertex — an entity (a person, a city, a webpage)
Edge — a relationship between two vertices (follows, connects to, links to)
Directed graph — edges have a direction (Alice follows Bob ≠ Bob follows Alice)
Weighted edge — edges have a numeric value (distance between cities, strength of friendship)
Degree — number of edges connected to a vertex

Creating a Graph in GraphX (Scala)

GraphX is a Scala/Java API. Python users work with GraphFrames (a community library) which wraps GraphX.

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Create vertices: (id, property)
val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
    (1L, "Alice"),
    (2L, "Bob"),
    (3L, "Carol"),
    (4L, "Dave")
))

// Create edges: Edge(srcId, dstId, property)
val edges: RDD[Edge[String]] = sc.parallelize(Array(
    Edge(1L, 2L, "follows"),
    Edge(1L, 3L, "follows"),
    Edge(2L, 3L, "follows"),
    Edge(3L, 4L, "follows")
))

// Build the graph
val graph = Graph(vertices, edges)

// Count vertices and edges
println(s"Vertices: ${graph.vertices.count()}")  // 4
println(s"Edges: ${graph.edges.count()}")         // 4

GraphFrames — Python-Friendly Graph Processing

# Install: pip install graphframes
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

# Vertices DataFrame — must have an 'id' column
v = spark.createDataFrame([
    ("alice", "Alice", 32),
    ("bob",   "Bob",   28),
    ("carol", "Carol", 35),
    ("dave",  "Dave",  29)
], ["id", "name", "age"])

# Edges DataFrame — must have 'src' and 'dst' columns
e = spark.createDataFrame([
    ("alice", "bob",   "follows"),
    ("alice", "carol", "follows"),
    ("bob",   "carol", "follows"),
    ("carol", "dave",  "follows"),
], ["src", "dst", "relationship"])

# Build the graph
g = GraphFrame(v, e)

print(g.vertices.count())  # 4
print(g.edges.count())     # 4

Common Graph Algorithms

PageRank — Finding Important Vertices

PageRank assigns a score to each vertex based on how many other vertices point to it. Google originally used PageRank to rank web pages. High-score vertices are considered more important.

# Run PageRank for 5 iterations
results = g.pageRank(resetProbability=0.15, maxIter=5)
results.vertices.select("id", "name", "pagerank").orderBy("pagerank", ascending=False).show()

# Output:
# +-----+-----+----------+
# |   id| name|  pagerank|
# +-----+-----+----------+
# |carol|Carol|  1.452310|
# |  bob|  Bob|  0.987612|
# | dave| Dave|  0.876432|
# |alice|Alice|  0.683646|
# +-----+-----+----------+
# Carol ranks highest — most people follow Carol

Shortest Path — Finding the Quickest Route

# Find shortest path from each vertex to "dave"
results = g.shortestPaths(landmarks=["dave"])
results.select("id", "distances").show()

# +-----+------------------+
# |   id|         distances|
# +-----+------------------+
# |alice|{dave -> 2}       |  # alice -> carol -> dave
# |  bob|{dave -> 2}       |  # bob -> carol -> dave
# |carol|{dave -> 1}       |  # carol -> dave directly
# | dave|{dave -> 0}       |  # already there
# +-----+------------------+

Connected Components — Finding Groups

Connected components finds groups of vertices where every member can reach every other member. In a social network, each connected component is an isolated community.

result = g.connectedComponents()
result.select("id", "component").show()
# All vertices with the same component number belong to the same group

Triangle Count — Measuring Density

A triangle exists when three vertices all connect to each other. High triangle counts indicate tight-knit communities. This algorithm counts triangles per vertex.

result = g.triangleCount()
result.select("id", "count").show()
# Vertices inside tight clusters will show higher triangle counts

Real-World Graph Use Cases

Domain	Vertices	Edges	Algorithm
Social network	Users	Friendships	PageRank, community detection
Navigation / maps	Locations	Roads (weighted by distance)	Shortest path
Fraud detection	Accounts, devices	Transactions	Connected components
Recommendation	Users, products	Purchases, ratings	Collaborative filtering via graph
Supply chain	Factories, warehouses	Shipment routes	Shortest path, bottleneck detection

GraphX vs GraphFrames

Feature	GraphX	GraphFrames
Language	Scala, Java	Python, Scala, Java
API style	RDD-based	DataFrame-based
Built into Spark?	Yes	External library (pip install)
SQL queries on graph?	No	Yes (uses DataFrames)
Recommended for Python	No	Yes

Previous lesson

Back to course