.. pyspark-analyzer documentation master file

Welcome to pyspark-analyzer's documentation!
============================================

.. image:: https://img.shields.io/pypi/v/pyspark-analyzer.svg
   :target: https://pypi.python.org/pypi/pyspark-analyzer
   :alt: PyPI version

.. image:: https://img.shields.io/pypi/pyversions/pyspark-analyzer.svg
   :target: https://pypi.python.org/pypi/pyspark-analyzer
   :alt: Python versions

.. image:: https://github.com/bjornvandijkman1993/pyspark-analyzer/workflows/CI/badge.svg
   :target: https://github.com/bjornvandijkman1993/pyspark-analyzer/actions
   :alt: CI Status

**pyspark-analyzer** is a comprehensive profiling library for Apache Spark DataFrames, designed to help data engineers and scientists understand their data quickly and efficiently.

Key Features
------------

* **Comprehensive Statistics**: Automatic computation of data type-specific statistics
* **Performance Optimized**: Intelligent sampling and batch processing for large datasets
* **Type-Aware**: Different statistics for numeric, string, and temporal columns
* **Flexible Output**: Multiple output formats (dict, JSON, summary report)
* **Easy Integration**: Simple API that works with any PySpark DataFrame

Installation
------------

.. code-block:: bash

   pip install pyspark-analyzer

Quick Start
-----------

.. code-block:: python

   from pyspark.sql import SparkSession
   from pyspark_analyzer import analyze

   # Create a Spark session
   spark = SparkSession.builder.appName("ProfilerExample").getOrCreate()

   # Load your DataFrame
   df = spark.read.csv("data.csv", header=True, inferSchema=True)

   # Analyze the DataFrame
   profile = analyze(df)

   # Get summary report
   summary = analyze(df, output_format="summary")
   print(summary)

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   installation
   quickstart
   user_guide
   api_reference
   examples
   contributing
   changelog

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`