Spark GraphX and Graph Processing
GraphX is Spark's library for processing graph-structured data. A graph represents entities (vertices) and the relationships between them (edges). Social networks, road maps, supply chains, and recommendation systems are all naturally modeled as graphs.
What Is a Graph?
Social network graph:
Alice ---[follows]--> Bob
Alice ---[follows]--> Carol
Bob ---[follows]--> Carol
Carol ---[follows]--> Dave
Vertices (nodes): Alice, Bob, Carol, Dave
Edges (links): follows relationships
Each vertex and edge can carry properties:
Vertex: {name: "Alice", age: 32, city: "NY"}
Edge: {relationship: "follows", since: "2022-01"}
Core Graph Concepts
- Vertex — an entity (a person, a city, a webpage)
- Edge — a relationship between two vertices (follows, connects to, links to)
- Directed graph — edges have a direction (Alice follows Bob ≠ Bob follows Alice)
- Weighted edge — edges have a numeric value (distance between cities, strength of friendship)
- Degree — number of edges connected to a vertex
Creating a Graph in GraphX (Scala)
GraphX is a Scala/Java API. Python users work with GraphFrames (a community library) which wraps GraphX.
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
// Create vertices: (id, property)
val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
(1L, "Alice"),
(2L, "Bob"),
(3L, "Carol"),
(4L, "Dave")
))
// Create edges: Edge(srcId, dstId, property)
val edges: RDD[Edge[String]] = sc.parallelize(Array(
Edge(1L, 2L, "follows"),
Edge(1L, 3L, "follows"),
Edge(2L, 3L, "follows"),
Edge(3L, 4L, "follows")
))
// Build the graph
val graph = Graph(vertices, edges)
// Count vertices and edges
println(s"Vertices: ${graph.vertices.count()}") // 4
println(s"Edges: ${graph.edges.count()}") // 4
GraphFrames — Python-Friendly Graph Processing
# Install: pip install graphframes
from pyspark.sql import SparkSession
from graphframes import GraphFrame
spark = SparkSession.builder.getOrCreate()
# Vertices DataFrame — must have an 'id' column
v = spark.createDataFrame([
("alice", "Alice", 32),
("bob", "Bob", 28),
("carol", "Carol", 35),
("dave", "Dave", 29)
], ["id", "name", "age"])
# Edges DataFrame — must have 'src' and 'dst' columns
e = spark.createDataFrame([
("alice", "bob", "follows"),
("alice", "carol", "follows"),
("bob", "carol", "follows"),
("carol", "dave", "follows"),
], ["src", "dst", "relationship"])
# Build the graph
g = GraphFrame(v, e)
print(g.vertices.count()) # 4
print(g.edges.count()) # 4
Common Graph Algorithms
PageRank — Finding Important Vertices
PageRank assigns a score to each vertex based on how many other vertices point to it. Google originally used PageRank to rank web pages. High-score vertices are considered more important.
# Run PageRank for 5 iterations
results = g.pageRank(resetProbability=0.15, maxIter=5)
results.vertices.select("id", "name", "pagerank").orderBy("pagerank", ascending=False).show()
# Output:
# +-----+-----+----------+
# | id| name| pagerank|
# +-----+-----+----------+
# |carol|Carol| 1.452310|
# | bob| Bob| 0.987612|
# | dave| Dave| 0.876432|
# |alice|Alice| 0.683646|
# +-----+-----+----------+
# Carol ranks highest — most people follow Carol
Shortest Path — Finding the Quickest Route
# Find shortest path from each vertex to "dave"
results = g.shortestPaths(landmarks=["dave"])
results.select("id", "distances").show()
# +-----+------------------+
# | id| distances|
# +-----+------------------+
# |alice|{dave -> 2} | # alice -> carol -> dave
# | bob|{dave -> 2} | # bob -> carol -> dave
# |carol|{dave -> 1} | # carol -> dave directly
# | dave|{dave -> 0} | # already there
# +-----+------------------+
Connected Components — Finding Groups
Connected components finds groups of vertices where every member can reach every other member. In a social network, each connected component is an isolated community.
result = g.connectedComponents()
result.select("id", "component").show()
# All vertices with the same component number belong to the same group
Triangle Count — Measuring Density
A triangle exists when three vertices all connect to each other. High triangle counts indicate tight-knit communities. This algorithm counts triangles per vertex.
result = g.triangleCount()
result.select("id", "count").show()
# Vertices inside tight clusters will show higher triangle counts
Real-World Graph Use Cases
| Domain | Vertices | Edges | Algorithm |
|---|---|---|---|
| Social network | Users | Friendships | PageRank, community detection |
| Navigation / maps | Locations | Roads (weighted by distance) | Shortest path |
| Fraud detection | Accounts, devices | Transactions | Connected components |
| Recommendation | Users, products | Purchases, ratings | Collaborative filtering via graph |
| Supply chain | Factories, warehouses | Shipment routes | Shortest path, bottleneck detection |
GraphX vs GraphFrames
| Feature | GraphX | GraphFrames |
|---|---|---|
| Language | Scala, Java | Python, Scala, Java |
| API style | RDD-based | DataFrame-based |
| Built into Spark? | Yes | External library (pip install) |
| SQL queries on graph? | No | Yes (uses DataFrames) |
| Recommended for Python | No | Yes |
