pyspark-analyzer

Contents:

  • Installation
    • Requirements
    • Install from PyPI
    • Install from Source
      • Using pip
      • Using uv (recommended for development)
    • Verify Installation
    • Troubleshooting
      • Java Not Found
      • Spark Configuration Issues
  • Quick Start Guide
    • Basic Usage
      • 1. Import and Initialize
      • 2. Load Your Data
      • 3. Profile Your DataFrame
    • Output Formats
      • Pandas DataFrame (default)
      • Dictionary Format
      • JSON Format
    • Working with Large Datasets
      • Automatic Sampling
      • Custom Sampling Configuration
    • Profile Specific Columns
    • Common Use Cases
      • Data Quality Assessment
      • Pre-Processing Analysis
    • Next Steps
  • User Guide
    • Overview
    • Understanding Profile Output
      • Profile Structure
      • Column Statistics by Type
        • Numeric Columns
        • String Columns
        • Temporal Columns
    • Performance Optimization
      • Automatic Optimization
      • Manual Performance Tuning
        • 1. Sampling Configuration
        • 2. Column Selection
        • 3. Partition Optimization
    • Advanced Sampling
      • Quality-Based Sampling
      • Stratified Sampling (Future Feature)
    • Integration Patterns
      • With MLlib
      • With Data Quality Frameworks
      • With Reporting Tools
    • Best Practices
      • 1. Cache Management
      • 2. Memory Management
      • 3. Error Handling
    • Customization
      • Custom Statistics (Future Feature)
      • Output Formatters (Future Feature)
    • Troubleshooting
      • Common Issues
      • Debug Mode
  • API Reference
    • Main Function
      • analyze
        • analyze()
    • Sampling
      • SamplingConfig
        • SamplingConfig
      • SamplingMetadata
        • SamplingMetadata
      • SamplingDecisionEngine
    • Statistics
      • StatisticsComputer
        • StatisticsComputer
    • Performance
      • BatchStatisticsComputer
    • Utility Functions
      • escape_column_name()
      • format_profile_output()
    • Examples
  • Examples
    • Basic Examples
      • Example 1: Simple CSV Profiling
      • Example 2: Data Quality Check
    • Advanced Examples
      • Example 3: Comparative Profiling
      • Example 4: Automated Feature Engineering
      • Example 5: Performance Monitoring
    • Real-World Scenarios
      • Example 6: E-commerce Data Profiling
      • Example 7: Time Series Data Profiling
    • Integration Examples
      • Example 8: Integration with MLflow
      • Example 9: Automated Report Generation
  • Contributing to pyspark-analyzer
    • Development Setup
      • 1. Fork and Clone
      • 2. Install Development Dependencies
      • 3. Set Up Pre-commit Hooks
    • Development Workflow
      • 1. Create a Feature Branch
      • 2. Make Your Changes
      • 3. Run Tests
      • 4. Check Code Quality
      • 5. Update Documentation
    • Code Style Guidelines
      • Python Style
      • Docstring Format
      • Import Order
    • Testing Guidelines
      • Test Structure
      • Test Coverage
    • Submitting Pull Requests
      • 1. Commit Messages
      • 2. Pull Request Template
      • 3. Review Process
    • Adding New Features
      • 1. Discuss First
      • 2. Feature Structure
      • 3. Performance Considerations
    • Reporting Issues
      • Bug Reports
      • Feature Requests
    • Community
      • Code of Conduct
      • Getting Help
    • Release Process
    • License

© Copyright 2025, Bjorn van Dijkman.
