# Quick Start Guide This guide will help you get started with pyspark-analyzer in just a few minutes. ## Basic Usage ### 1. Import and Initialize ```python from pyspark.sql import SparkSession from pyspark_analyzer import analyze # Create Spark session spark = SparkSession.builder \ .appName("SparkProfilerQuickStart") \ .config("spark.sql.adaptive.enabled", "true") \ .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \ .getOrCreate() ``` ### 2. Load Your Data ```python # From CSV df = spark.read.csv("data.csv", header=True, inferSchema=True) # From Parquet df = spark.read.parquet("data.parquet") # From JSON df = spark.read.json("data.json") ``` ### 3. Profile Your DataFrame ```python # Generate profile with the analyze function profile = analyze(df) # View results as a pandas DataFrame print(profile) ``` ## Output Formats ### Pandas DataFrame (default) ```python # Default output is a pandas DataFrame profile = analyze(df) print(profile) ``` ### Dictionary Format ```python # Get dictionary output profile_dict = analyze(df, output_format="dict") print(profile_dict["overview"]) print(profile_dict["columns"]["age"]) ``` ### JSON Format ```python # Get JSON string output json_profile = analyze(df, output_format="json") print(json_profile) ``` ## Working with Large Datasets ### Automatic Sampling ```python # Enable automatic sampling for large datasets profile = analyze(df, sampling=True) # Specify target number of rows profile = analyze(df, sampling=True, target_rows=100_000) # Or specify sampling fraction profile = analyze(df, sampling=True, fraction=0.1) ``` ### Custom Sampling Configuration ```python from pyspark_analyzer import SamplingConfig # For advanced control, use SamplingConfig config = SamplingConfig( target_size=100_000, # Target 100k rows min_fraction=0.01, # At least 1% of data quality_threshold=0.8 # Minimum quality score ) profile_dict = analyze(df, sampling_config=config, output_format="dict") # Check sampling info print(profile_dict["sampling"]) ``` ## Profile Specific Columns ```python # Profile only specific columns profile = analyze(df, columns=["age", "salary", "department"]) ``` ## Common Use Cases ### Data Quality Assessment ```python # Get profile with quality metrics profile_dict = analyze(df, include_quality=True, output_format="dict") # Check for data quality issues for col_name, col_stats in profile_dict["columns"].items(): null_ratio = col_stats["null_count"] / col_stats["count"] if null_ratio > 0.5: print(f"Warning: {col_name} has {null_ratio:.1%} null values") if col_stats["distinct_count"] == 1: print(f"Warning: {col_name} has only one unique value") ``` ### Pre-Processing Analysis ```python # Identify columns that need cleaning profile_dict = analyze(df, output_format="dict") numeric_cols = [] categorical_cols = [] for col_name, col_stats in profile_dict["columns"].items(): if col_stats["data_type"] in ["integer", "double", "float"]: numeric_cols.append(col_name) elif col_stats["distinct_count"] < 100: # Potential categorical categorical_cols.append(col_name) print(f"Numeric columns: {numeric_cols}") print(f"Categorical candidates: {categorical_cols}") ``` ## Next Steps - Explore the [User Guide](user_guide.md) for advanced features - Check out [Examples](examples.md) for more use cases - Read the [API Reference](api_reference.rst) for detailed documentation