Spark GraphX and Graph Processing

GraphX is Spark's library for processing graph-structured data. A graph represents entities (vertices) and the relationships between them (edges). Social networks, road maps, supply chains, and recommendation systems are all naturally modeled as graphs.

What Is a Graph?

Social network graph:

    Alice ---[follows]--> Bob
    Alice ---[follows]--> Carol
    Bob   ---[follows]--> Carol
    Carol ---[follows]--> Dave

Vertices (nodes): Alice, Bob, Carol, Dave
Edges (links):    follows relationships

Each vertex and edge can carry properties:
Vertex: {name: "Alice", age: 32, city: "NY"}
Edge:   {relationship: "follows", since: "2022-01"}

Core Graph Concepts

  • Vertex — an entity (a person, a city, a webpage)
  • Edge — a relationship between two vertices (follows, connects to, links to)
  • Directed graph — edges have a direction (Alice follows Bob ≠ Bob follows Alice)
  • Weighted edge — edges have a numeric value (distance between cities, strength of friendship)
  • Degree — number of edges connected to a vertex

Creating a Graph in GraphX (Scala)

GraphX is a Scala/Java API. Python users work with GraphFrames (a community library) which wraps GraphX.

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Create vertices: (id, property)
val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
    (1L, "Alice"),
    (2L, "Bob"),
    (3L, "Carol"),
    (4L, "Dave")
))

// Create edges: Edge(srcId, dstId, property)
val edges: RDD[Edge[String]] = sc.parallelize(Array(
    Edge(1L, 2L, "follows"),
    Edge(1L, 3L, "follows"),
    Edge(2L, 3L, "follows"),
    Edge(3L, 4L, "follows")
))

// Build the graph
val graph = Graph(vertices, edges)

// Count vertices and edges
println(s"Vertices: ${graph.vertices.count()}")  // 4
println(s"Edges: ${graph.edges.count()}")         // 4

GraphFrames — Python-Friendly Graph Processing

# Install: pip install graphframes
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

# Vertices DataFrame — must have an 'id' column
v = spark.createDataFrame([
    ("alice", "Alice", 32),
    ("bob",   "Bob",   28),
    ("carol", "Carol", 35),
    ("dave",  "Dave",  29)
], ["id", "name", "age"])

# Edges DataFrame — must have 'src' and 'dst' columns
e = spark.createDataFrame([
    ("alice", "bob",   "follows"),
    ("alice", "carol", "follows"),
    ("bob",   "carol", "follows"),
    ("carol", "dave",  "follows"),
], ["src", "dst", "relationship"])

# Build the graph
g = GraphFrame(v, e)

print(g.vertices.count())  # 4
print(g.edges.count())     # 4

Common Graph Algorithms

PageRank — Finding Important Vertices

PageRank assigns a score to each vertex based on how many other vertices point to it. Google originally used PageRank to rank web pages. High-score vertices are considered more important.

# Run PageRank for 5 iterations
results = g.pageRank(resetProbability=0.15, maxIter=5)
results.vertices.select("id", "name", "pagerank").orderBy("pagerank", ascending=False).show()

# Output:
# +-----+-----+----------+
# |   id| name|  pagerank|
# +-----+-----+----------+
# |carol|Carol|  1.452310|
# |  bob|  Bob|  0.987612|
# | dave| Dave|  0.876432|
# |alice|Alice|  0.683646|
# +-----+-----+----------+
# Carol ranks highest — most people follow Carol

Shortest Path — Finding the Quickest Route

# Find shortest path from each vertex to "dave"
results = g.shortestPaths(landmarks=["dave"])
results.select("id", "distances").show()

# +-----+------------------+
# |   id|         distances|
# +-----+------------------+
# |alice|{dave -> 2}       |  # alice -> carol -> dave
# |  bob|{dave -> 2}       |  # bob -> carol -> dave
# |carol|{dave -> 1}       |  # carol -> dave directly
# | dave|{dave -> 0}       |  # already there
# +-----+------------------+

Connected Components — Finding Groups

Connected components finds groups of vertices where every member can reach every other member. In a social network, each connected component is an isolated community.

result = g.connectedComponents()
result.select("id", "component").show()
# All vertices with the same component number belong to the same group

Triangle Count — Measuring Density

A triangle exists when three vertices all connect to each other. High triangle counts indicate tight-knit communities. This algorithm counts triangles per vertex.

result = g.triangleCount()
result.select("id", "count").show()
# Vertices inside tight clusters will show higher triangle counts

Real-World Graph Use Cases

DomainVerticesEdgesAlgorithm
Social networkUsersFriendshipsPageRank, community detection
Navigation / mapsLocationsRoads (weighted by distance)Shortest path
Fraud detectionAccounts, devicesTransactionsConnected components
RecommendationUsers, productsPurchases, ratingsCollaborative filtering via graph
Supply chainFactories, warehousesShipment routesShortest path, bottleneck detection

GraphX vs GraphFrames

FeatureGraphXGraphFrames
LanguageScala, JavaPython, Scala, Java
API styleRDD-basedDataFrame-based
Built into Spark?YesExternal library (pip install)
SQL queries on graph?NoYes (uses DataFrames)
Recommended for PythonNoYes

Leave a Comment