Welcome to pyspark-analyzer’s documentation!
pyspark-analyzer is a comprehensive profiling library for Apache Spark DataFrames. It helps data engineers and data scientists understand their data quickly.
Key Features
Comprehensive Statistics: Automatic computation of data type-specific statistics
Performance Optimized: Intelligent sampling and batch processing for large datasets
Type-Aware: Different statistics for numeric, string, and temporal columns
Flexible Output: Multiple output formats (dict, JSON, summary report)
Easy Integration: Simple API that works with any PySpark DataFrame
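To make the "Type-Aware" idea concrete, here is a minimal sketch in plain Python of how a profiler might pick different statistics depending on a column's type. It is an illustration only, not pyspark-analyzer's implementation: the function name `profile_column`, the statistic names, and the sample data are all hypothetical, and it works on in-memory lists rather than Spark DataFrames.

```python
from datetime import date
from numbers import Number


def profile_column(values):
    """Return a small statistics dict whose keys depend on the column type.

    Hypothetical helper for illustration; not part of pyspark-analyzer.
    """
    present = [v for v in values if v is not None]
    stats = {"count": len(values), "null_count": len(values) - len(present)}
    if not present:
        return stats

    if all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
        # Numeric columns: range and mean.
        stats.update(kind="numeric", min=min(present), max=max(present),
                     mean=sum(present) / len(present))
    elif all(isinstance(v, str) for v in present):
        # String columns: length range and distinct count.
        lengths = [len(v) for v in present]
        stats.update(kind="string", min_length=min(lengths),
                     max_length=max(lengths), distinct=len(set(present)))
    elif all(isinstance(v, date) for v in present):
        # Temporal columns: earliest and latest value.
        stats.update(kind="temporal", earliest=min(present), latest=max(present))
    else:
        stats["kind"] = "mixed"
    return stats


# Made-up sample data, one list per column.
columns = {
    "age": [34, 29, None, 41],
    "city": ["Oslo", "Lima", "Oslo", None],
    "signup": [date(2023, 1, 5), date(2024, 3, 9), None, date(2022, 7, 1)],
}
profile = {name: profile_column(values) for name, values in columns.items()}
```

Every column gets the shared `count` and `null_count` fields, and the remaining keys vary by type. A Spark-based profiler would compute comparable aggregates on the cluster instead of in local Python lists.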
Installation
pip install pyspark-analyzer
Quick Start
from pyspark.sql import SparkSession
from pyspark_analyzer import analyze
# Create a Spark session
spark = SparkSession.builder.appName("ProfilerExample").getOrCreate()
# Load your DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)
# Analyze the DataFrame using the default output format
profile = analyze(df)
# Run the analysis again, this time requesting the summary report
summary = analyze(df, output_format="summary")
print(summary)