Welcome to pyspark-analyzer’s documentation!


pyspark-analyzer is a profiling library for Apache Spark DataFrames. It computes statistics for each column so that data engineers and data scientists can understand a dataset quickly.

Key Features

  • Comprehensive Statistics: Automatic computation of data type-specific statistics

  • Performance Optimized: Intelligent sampling and batch processing for large datasets

  • Type-Aware: Different statistics for numeric, string, and temporal columns

  • Flexible Output: Multiple output formats (dict, JSON, summary report)

  • Easy Integration: Simple API that works with any PySpark DataFrame
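To make the "type-aware" idea in the feature list concrete, here is a minimal plain-Python sketch that picks different statistics depending on a column's value type. It does not use Spark and is not pyspark-analyzer's implementation or API: the helper name `profile_column` and the particular statistics it returns are assumptions chosen purely for illustration.

```python
from datetime import date
from statistics import mean


def profile_column(values):
    """Return statistics chosen by the type of the non-null values.

    Hypothetical helper for illustration only; not part of pyspark-analyzer.
    """
    non_null = [v for v in values if v is not None]
    stats = {"count": len(values), "null_count": len(values) - len(non_null)}
    if not non_null:
        # Nothing to summarize beyond the counts.
        return stats

    first = non_null[0]
    if isinstance(first, (int, float)):
        # Numeric column: range and central tendency.
        stats.update(min=min(non_null), max=max(non_null), mean=mean(non_null))
    elif isinstance(first, str):
        # String column: length range and number of distinct values.
        lengths = [len(v) for v in non_null]
        stats.update(
            min_length=min(lengths),
            max_length=max(lengths),
            distinct=len(set(non_null)),
        )
    elif isinstance(first, date):
        # Temporal column: earliest and latest value.
        stats.update(earliest=min(non_null), latest=max(non_null))
    return stats


numeric_stats = profile_column([1, 2, None, 3])
string_stats = profile_column(["a", "bbb", "a"])
temporal_stats = profile_column([date(2024, 1, 2), date(2023, 5, 1)])
```

A real profiler would dispatch on the DataFrame schema rather than inspecting the first value, and would compute the statistics in Spark instead of in local Python; the sketch only shows why numeric, string, and temporal columns end up with different result keys.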

Installation

pip install pyspark-analyzer

Quick Start

from pyspark.sql import SparkSession
from pyspark_analyzer import analyze

# Create a Spark session
spark = SparkSession.builder.appName("ProfilerExample").getOrCreate()

# Load your DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Analyze the DataFrame
profile = analyze(df)

# Get a summary report instead of the default output
summary = analyze(df, output_format="summary")
print(summary)
