GraphFrames: graph algorithms at scale
GraphFrames is a package for graph processing and analytics at scale. It is built on top of Apache Spark and relies on the DataFrame abstraction. It provides built-in, easy-to-use distributed graph algorithms as well as flexible APIs such as Pregel and AggregateMessages for custom graph processing. Users can write highly expressive queries by combining the DataFrame API with a new API for network motif finding, and they also benefit from the DataFrame performance optimizations in the Spark SQL engine. GraphFrames works in Java, Scala, and Python.
GraphFrames use cases
There are some popular use cases where GraphFrames is almost irreplaceable, including but not limited to:
- Compliance analytics with a scalable shortest paths algorithm and motif analysis;
- Anti-fraud with scalable cycle detection in large networks;
- Identity resolution at the scale of billions with highly efficient connected components;
- Search result ranking with a distributed, Pregel-based PageRank;
- Clustering huge graphs with Label Propagation and Power Iteration Clustering;
- Building knowledge graph systems with the Property Graph Model.
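To make the identity-resolution use case above more concrete, here is a minimal single-machine sketch in plain Python of what a connected-components pass computes. It is only an illustration, not GraphFrames code: the function `connected_components`, the record ids, and the `similar_pairs` list are hypothetical, and GraphFrames runs a distributed algorithm on Spark rather than this union-find.

```python
from collections import defaultdict


def connected_components(vertex_ids, edges):
    """Group vertex ids into connected components, ignoring edge direction."""
    # Every vertex starts as the root of its own single-member component.
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Walk up to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Keep the smaller id as the representative so labels are deterministic.
            low, high = sorted((root_src, root_dst))
            parent[high] = low

    groups = defaultdict(list)
    for v in vertex_ids:
        groups[find(v)].append(v)
    return {root: sorted(members) for root, members in groups.items()}


# Hypothetical records: each pair was judged "similar" by some matching rule.
record_ids = [1, 2, 3, 4, 5]
similar_pairs = [(1, 2), (2, 3), (4, 5)]
components = connected_components(record_ids, similar_pairs)
print(components)  # {1: [1, 2, 3], 4: [4, 5]}
```

Each key is the smallest id in its group, so records 1, 2, and 3 would be treated as one entity and records 4 and 5 as another.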
Documentation
- Installation
- Creating Graphs
- Basic Graph Manipulations
- Centrality Metrics
- Motif finding
- Traversals and Connectivity
- Community Detection
- Scala API
- Python API
- Apache Spark compatibility
Quick Start
After installing GraphFrames (see Installation), you can create a GraphFrame as follows.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

nodes = [
    (1, "Alice", 30),
    (2, "Bob", 25),
    (3, "Charlie", 35)
]
nodes_df = spark.createDataFrame(nodes, ["id", "name", "age"])

edges = [
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "enemy")  # eek!
]
edges_df = spark.createDataFrame(edges, ["src", "dst", "relationship"])

g = GraphFrame(nodes_df, edges_df)
Now let's run some graph algorithms at scale!
g.inDegrees.show()

# +---+--------+
# | id|inDegree|
# +---+--------+
# |  2|       2|
# |  1|       1|
# |  3|       1|
# +---+--------+

g.outDegrees.show()

# +---+---------+
# | id|outDegree|
# +---+---------+
# |  1|        1|
# |  2|        2|
# |  3|        1|
# +---+---------+

g.degrees.show()

# +---+------+
# | id|degree|
# +---+------+
# |  1|     2|
# |  2|     4|
# |  3|     2|
# +---+------+

g2 = g.pageRank(resetProbability=0.15, tol=0.01)
g2.vertices.show()

# +---+-------+---+------------------+
# | id|   name|age|          pagerank|
# +---+-------+---+------------------+
# |  1|  Alice| 30|0.7758750474847483|
# |  2|    Bob| 25|1.4482499050305027|
# |  3|Charlie| 35|0.7758750474847483|
# +---+-------+---+------------------+

# GraphFrames' most used feature...
# Connected components can do big data entity resolution on billions or even trillions of records!
# First connect records with a similarity metric, then run connectedComponents.
# This gives you groups of identical records, which you then link by same_as edges or merge into list-based master records.
# A checkpoint directory is required by GraphFrames.connectedComponents
spark.sparkContext.setCheckpointDir("/tmp/graphframes-example-connected-components")
g.connectedComponents().show()

# +---+-------+---+---------+
# | id|   name|age|component|
# +---+-------+---+---------+
# |  1|  Alice| 30|        1|
# |  2|    Bob| 25|        1|
# |  3|Charlie| 35|        1|
# +---+-------+---+---------+

# Find frenemies with network motif finding! See how graph and relational queries are combined?
(
    g.find("(a)-[e]->(b); (b)-[e2]->(a)")
    .filter("e.relationship = 'friend' and e2.relationship = 'enemy'")
    .show()
)

# These are paths, which you can aggregate and count to find complex patterns.
# +------------+--------------+----------------+-------------+
# |           a|             e|               b|           e2|
# +------------+--------------+----------------+-------------+
# |{2, Bob, 25}|{2, 3, friend}|{3, Charlie, 35}|{3, 2, enemy}|
# +------------+--------------+----------------+-------------+
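For intuition about the `pagerank` column above, the following plain-Python sketch runs a simple power iteration on the same three-vertex graph. It is an illustration under stated assumptions, not the GraphFrames implementation: the function name `pagerank` and its parameters `max_iter` and `tol` are hypothetical, ranks are left unnormalized (the values above sum to about 3, which suggests the same convention), and the sketch iterates to a much tighter tolerance than the `tol=0.01` used in the Quick Start.

```python
def pagerank(vertex_ids, edges, reset_probability=0.15, max_iter=200, tol=1e-9):
    """Unnormalized PageRank by power iteration; assumes every vertex has an out-edge."""
    out_neighbors = {v: [] for v in vertex_ids}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    ranks = {v: 1.0 for v in vertex_ids}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertex_ids}
        for src, dsts in out_neighbors.items():
            for dst in dsts:
                # Each vertex splits its current rank evenly among its out-edges.
                incoming[dst] += ranks[src] / len(dsts)
        new_ranks = {
            v: reset_probability + (1.0 - reset_probability) * incoming[v]
            for v in vertex_ids
        }
        # Stop once no rank moved by more than the tolerance.
        delta = max(abs(new_ranks[v] - ranks[v]) for v in vertex_ids)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


# Same edges as the Quick Start graph; the relationship label is ignored.
ranks = pagerank([1, 2, 3], [(1, 2), (2, 1), (2, 3), (3, 2)])
for vertex_id, rank in sorted(ranks.items()):
    print(vertex_id, round(rank, 3))
```

The sketch settles at roughly 0.770 for vertices 1 and 3 and 1.459 for vertex 2. The Quick Start values are close but not identical, presumably because that run stops at the looser tolerance.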
Learn GraphFrames
To learn more about GraphFrames, check out these resources:
GraphFrames tutorials
Community Resources
These resources are provided by the community:
- Introducing GraphFrames
- GraphFrames Google Group
- #graphframes Discord Channel on GraphGeeks
- Graph Operations in Apache Spark Using GraphFrames
- Executing Graph Algorithms with GraphFrames on Databricks
- On-Time Flight Performance with GraphFrames for Apache Spark
- Sustainability in Aluminum Production
GraphFrames Internals
- A top level overview of GraphFrames internals
- GraphFrames: An Integrated API for Mixing Graph and Relational Queries, Dave et al. 2016.
Contributing
GraphFrames was created as a collaborative effort among UC Berkeley, MIT, Databricks, and the open source community. GraphFrames is currently maintained by a group of individual contributors.
See the contribution guide and the local development setup walkthrough for step-by-step instructions on preparing your environment, running tests, and submitting changes.
Releases
See the release notes.
