API Reference
This section provides detailed API documentation for all public classes and functions in pyspark-analyzer.
Main Function
analyze
- pyspark_analyzer.analyze(df, *, sampling=None, target_rows=None, fraction=None, columns=None, output_format='pandas', include_advanced=True, include_quality=True, seed=None, show_progress=None)
Analyze a PySpark DataFrame and generate comprehensive statistics.
This is the simplified entry point for profiling DataFrames. It automatically handles sampling configuration based on the provided parameters.
Note: Compatible with PySpark 3.0.0+. Uses native median() function when available (PySpark 3.4.0+) for better performance, with automatic fallback to percentile_approx for older versions.
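The version rule in the note can be illustrated with a small, pure-Python sketch. The helper name median_strategy is hypothetical and is not part of pyspark-analyzer; it only shows one way to decide between the two strategies from a version string, assuming the 3.4 cut-off stated above.

```python
import re


def median_strategy(pyspark_version: str) -> str:
    """Hypothetical helper: pick a median strategy from a PySpark version string."""
    # Read only the leading "major.minor" so suffixes such as ".dev1" are ignored.
    match = re.match(r"(\d+)\.(\d+)", pyspark_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {pyspark_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Compare as integer tuples so that 3.10 sorts after 3.4.
    if (major, minor) >= (3, 4):
        return "median"
    return "percentile_approx"


print(median_strategy("3.3.2"))
print(median_strategy("3.4.0"))
```

In a live session the argument would typically be pyspark.__version__. How the library itself detects the available function is not documented here, so treat this only as an illustration of the rule.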
- Parameters:
df (DataFrame) – PySpark DataFrame to analyze
sampling (Optional[bool]) – Whether to enable sampling. If None, auto-sampling is enabled for large datasets. If False, no sampling. If True, uses default sampling.
target_rows (Optional[int]) – Sample to approximately this many rows. Mutually exclusive with fraction.
fraction (Optional[float]) – Sample this fraction of the data (0.0-1.0). Mutually exclusive with target_rows.
columns (Optional[list[str]]) – List of specific columns to profile. If None, profiles all columns.
output_format (str) – Output format (“pandas”, “dict”, “json”, “summary”). Default is “pandas”.
include_advanced (bool) – Include advanced statistics (skewness, kurtosis, outliers, etc.)
include_quality (bool) – Include data quality metrics
seed (Optional[int]) – Random seed for reproducible sampling
show_progress (Optional[bool]) – Show progress indicators during analysis. If None, auto-detects based on environment.
- Returns:
Profile results in the requested format:
“pandas”: pandas DataFrame with statistics
“dict”: Python dictionary
“json”: JSON string
“summary”: Human-readable summary string
- Return type:
pandas DataFrame, dict, or str, depending on output_format
Examples
>>> # Basic usage with auto-sampling
>>> profile = analyze(df)
>>> # Disable sampling
>>> profile = analyze(df, sampling=False)
>>> # Sample to 100,000 rows
>>> profile = analyze(df, target_rows=100_000)
>>> # Sample 10% of data
>>> profile = analyze(df, fraction=0.1)
>>> # Profile specific columns only
>>> profile = analyze(df, columns=["age", "salary"])
>>> # Get results as dictionary
>>> profile = analyze(df, output_format="dict")
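The parameter rules above (target_rows and fraction are mutually exclusive, fraction lies in 0.0-1.0, and output_format is one of four strings) can be checked before a Spark job is launched. The following is a minimal sketch of such a pre-flight check. The function check_analyze_args is hypothetical, and the exceptions analyze itself raises for bad arguments are not documented here, so ValueError is an assumption.

```python
from typing import Optional

# The four output formats listed in the parameter documentation.
VALID_OUTPUT_FORMATS = ("pandas", "dict", "json", "summary")


def check_analyze_args(
    target_rows: Optional[int] = None,
    fraction: Optional[float] = None,
    output_format: str = "pandas",
) -> None:
    """Hypothetical pre-flight check mirroring the documented constraints."""
    if target_rows is not None and fraction is not None:
        raise ValueError("target_rows and fraction are mutually exclusive")
    if target_rows is not None and target_rows <= 0:
        raise ValueError("target_rows must be a positive integer")
    # Assumption: both endpoints of the documented 0.0-1.0 range are accepted.
    if fraction is not None and not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {VALID_OUTPUT_FORMATS}")


# Passes silently: only one sampling option is given.
check_analyze_args(target_rows=100_000, output_format="dict")
```

Running a check like this in the driver can surface a bad combination of arguments early, before any data is read.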
Sampling
SamplingConfig
- class pyspark_analyzer.SamplingConfig(enabled=True, target_rows=None, fraction=None, seed=42)
Bases: object
Configuration for sampling operations.
- enabled
Whether to enable sampling. Set to False to disable sampling completely.
- target_rows
Target number of rows to sample. Takes precedence over fraction.
- fraction
Fraction of data to sample (0-1). Only used if target_rows is not set.
- seed
Random seed for reproducible sampling.
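The precedence rule described by these attributes (target_rows wins over fraction, and enabled=False turns sampling off) can be sketched in plain Python. SamplingConfigSketch and effective_fraction below are hypothetical stand-ins written for illustration; they are not the library's classes, and the library's auto-sampling heuristics are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingConfigSketch:
    """Hypothetical mirror of the documented SamplingConfig fields."""
    enabled: bool = True
    target_rows: Optional[int] = None
    fraction: Optional[float] = None
    seed: int = 42


def effective_fraction(config: SamplingConfigSketch, total_rows: int) -> Optional[float]:
    """Return the fraction to sample, or None when no explicit sampling applies."""
    if not config.enabled or total_rows <= 0:
        return None
    # target_rows takes precedence over fraction, as documented.
    if config.target_rows is not None:
        return min(1.0, config.target_rows / total_rows)
    if config.fraction is not None:
        return config.fraction
    # Neither option set: this sketch does not model auto-sampling.
    return None


config = SamplingConfigSketch(target_rows=100_000, fraction=0.5)
print(effective_fraction(config, total_rows=1_000_000))
```

Capping the result at 1.0 is a choice made for this sketch, so that a target larger than the dataset never yields a fraction above 1.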
SamplingMetadata
SamplingDecisionEngine
Statistics
StatisticsComputer
- class pyspark_analyzer.statistics.StatisticsComputer(dataframe, total_rows=None, cache_manager=None)
Bases: object
Computes statistics for DataFrame columns using type-specific calculators.
- __init__(dataframe, total_rows=None, cache_manager=None)
Initialize with a PySpark DataFrame.
Performance
BatchStatisticsComputer
Utility Functions
Utility functions for the DataFrame profiler.
- pyspark_analyzer.utils.escape_column_name(column_name)
Escape column name for safe use in PySpark SQL operations.
Handles special characters and SQL injection attempts by properly escaping backticks and wrapping the column name in backticks.
- Parameters:
column_name (str) – Raw column name that may contain special characters
- Return type:
str
- Returns:
Escaped column name wrapped in backticks
Examples
>>> escape_column_name("normal_column")
'`normal_column`'
>>> escape_column_name("column.with.dots")
'`column.with.dots`'
>>> escape_column_name("col`with`backticks")
'`col``with``backticks`'
>>> escape_column_name("col; DROP TABLE users;--")
'`col; DROP TABLE users;--`'
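The behavior shown in these examples is consistent with a two-step rule: double every embedded backtick, then wrap the result in backticks. The sketch below reproduces that rule under the hypothetical name escape_column_name_sketch. It is a re-implementation inferred from the documented examples, not the library's source.

```python
def escape_column_name_sketch(column_name: str) -> str:
    """Hypothetical re-implementation inferred from the documented examples."""
    # Double each embedded backtick so it cannot close the quoted identifier.
    escaped = column_name.replace("`", "``")
    # Wrap the whole name so dots, spaces, and semicolons are treated literally.
    return f"`{escaped}`"


for name in ("normal_column", "column.with.dots", "col`with`backticks"):
    print(escape_column_name_sketch(name))
```

Because the whole name ends up inside one quoted identifier, an input such as "col; DROP TABLE users;--" is treated as a (strange) column name rather than as SQL.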
Examples
Basic profiling:
from pyspark_analyzer import analyze
# Get profile as pandas DataFrame
profile = analyze(df)
With sampling configuration:
from pyspark_analyzer import analyze
# Sample to 100,000 rows
profile = analyze(df, target_rows=100_000)
# Or sample 10% of data
profile = analyze(df, fraction=0.1)
With sampling explicitly enabled using the default settings (leaving sampling unset auto-samples large datasets):
profile = analyze(df, sampling=True)
Different output formats:
# Get as dictionary
profile_dict = analyze(df, output_format="dict")
# Get as JSON
profile_json = analyze(df, output_format="json")
# Get human-readable summary
summary = analyze(df, output_format="summary")