API Reference

This section provides detailed API documentation for all public classes and functions in pyspark-analyzer.

Main Function

analyze

pyspark_analyzer.analyze(df, *, sampling=None, target_rows=None, fraction=None, columns=None, output_format='pandas', include_advanced=True, include_quality=True, seed=None, show_progress=None)

Analyze a PySpark DataFrame and generate comprehensive statistics.

This is the simplified entry point for profiling DataFrames. It automatically handles sampling configuration based on the provided parameters.

Note: Compatible with PySpark 3.0.0+. Uses native median() function when available (PySpark 3.4.0+) for better performance, with automatic fallback to percentile_approx for older versions.
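
The version gate described in the note can be pictured with a small sketch. This is not the library's code: the function name median_sql_expr is hypothetical, and it only builds a Spark SQL expression string from a version string instead of calling PySpark, so that it runs without a Spark installation.

```python
def median_sql_expr(column: str, spark_version: str) -> str:
    """Return a Spark SQL expression string for the median of `column`.

    Hypothetical helper: picks median() on 3.4+ and percentile_approx otherwise.
    """
    # Compare only the numeric major/minor parts, e.g. "3.4.1" -> (3, 4).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    # Quote the identifier so dots or spaces in the name stay part of it.
    quoted = "`" + column.replace("`", "``") + "`"
    if (major, minor) >= (3, 4):
        return f"median({quoted})"
    # Older versions: approximate the median as the 50th percentile.
    return f"percentile_approx({quoted}, 0.5)"


print(median_sql_expr("age", "3.5.1"))
print(median_sql_expr("age", "3.3.2"))
```

Note that the fallback is approximate, so results on older PySpark versions may differ slightly from an exact median.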

Parameters:

df (DataFrame) – PySpark DataFrame to analyze
sampling (Optional[bool]) – Whether to enable sampling. If None, auto-sampling is enabled for large datasets. If False, no sampling. If True, uses default sampling.
target_rows (Optional[int]) – Sample to approximately this many rows. Mutually exclusive with fraction.
fraction (Optional[float]) – Sample this fraction of the data (0.0-1.0). Mutually exclusive with target_rows.
columns (Optional[list[str]]) – List of specific columns to profile. If None, profiles all columns.
output_format (str) – Output format (“pandas”, “dict”, “json”, “summary”). Default is “pandas”.
include_advanced (bool) – Include advanced statistics (skewness, kurtosis, outliers, etc.)
include_quality (bool) – Include data quality metrics
seed (Optional[int]) – Random seed for reproducible sampling
show_progress (Optional[bool]) – Show progress indicators during analysis. If None, auto-detects based on environment.

Returns:

Profile results in the requested format:

“pandas”: pandas DataFrame with statistics
“dict”: Python dictionary
“json”: JSON string
“summary”: Human-readable summary string

Examples

>>> # Basic usage with auto-sampling
>>> profile = analyze(df)

>>> # Disable sampling
>>> profile = analyze(df, sampling=False)

>>> # Sample to 100,000 rows
>>> profile = analyze(df, target_rows=100_000)

>>> # Sample 10% of data
>>> profile = analyze(df, fraction=0.1)

>>> # Profile specific columns only
>>> profile = analyze(df, columns=["age", "salary"])

>>> # Get results as dictionary
>>> profile = analyze(df, output_format="dict")
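
How the three sampling arguments interact can be sketched in plain Python. The function below is hypothetical and only mirrors the documented rules (sampling is tri-state, and target_rows and fraction are mutually exclusive). Letting sampling=False override a size hint is an assumption here, not confirmed behavior.

```python
from typing import Optional


def resolve_sampling_mode(
    sampling: Optional[bool] = None,
    target_rows: Optional[int] = None,
    fraction: Optional[float] = None,
) -> str:
    """Classify the sampling request (hypothetical helper, not library code)."""
    if target_rows is not None and fraction is not None:
        raise ValueError("target_rows and fraction are mutually exclusive")
    if sampling is False:
        # Assumption: an explicit False disables sampling outright.
        return "off"
    if target_rows is not None:
        return "target_rows"
    if fraction is not None:
        return "fraction"
    # None lets the library decide for large datasets; True uses defaults.
    return "auto" if sampling is None else "default"


print(resolve_sampling_mode())
print(resolve_sampling_mode(fraction=0.1))
```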

Sampling

SamplingConfig

class pyspark_analyzer.SamplingConfig(enabled=True, target_rows=None, fraction=None, seed=42)

Bases: object

Configuration for sampling operations.

enabled: Whether to enable sampling. Set to False to disable sampling completely.

target_rows: Target number of rows to sample. Takes precedence over fraction.

fraction: Fraction of data to sample (0-1). Only used if target_rows is not set.

seed: Random seed for reproducible sampling.

Parameters:

enabled (bool)
target_rows (Optional[int])
fraction (Optional[float])
seed (int)

enabled: bool = True

target_rows: Optional[int] = None

fraction: Optional[float] = None

seed: int = 42

__post_init__()

Validate configuration after initialization.

Return type: None
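
The documentation does not list which checks __post_init__ performs. A minimal sketch of a dataclass with the same fields, assuming the validation rejects non-positive target_rows and fractions outside (0, 1], could look like this. The class name SamplingConfigSketch and the effective_fraction helper are hypothetical. The helper only illustrates the documented rule that target_rows takes precedence over fraction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingConfigSketch:
    enabled: bool = True
    target_rows: Optional[int] = None
    fraction: Optional[float] = None
    seed: int = 42

    def __post_init__(self) -> None:
        # Assumed checks; the real class may validate differently.
        if self.target_rows is not None and self.target_rows <= 0:
            raise ValueError("target_rows must be positive")
        if self.fraction is not None and not (0.0 < self.fraction <= 1.0):
            raise ValueError("fraction must be in (0, 1]")

    def effective_fraction(self, total_rows: int) -> float:
        """Hypothetical helper: fraction of rows that would be kept."""
        if not self.enabled or total_rows <= 0:
            return 1.0
        if self.target_rows is not None:
            # target_rows takes precedence over fraction.
            return min(1.0, self.target_rows / total_rows)
        if self.fraction is not None:
            return self.fraction
        return 1.0


config = SamplingConfigSketch(target_rows=1_000, fraction=0.5)
print(config.effective_fraction(10_000))
```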

__init__(enabled=True, target_rows=None, fraction=None, seed=42)

Parameters:

enabled (bool)
target_rows (Optional[int])
fraction (Optional[float])
seed (int)

SamplingMetadata

class pyspark_analyzer.sampling.SamplingMetadata(original_size, sample_size, sampling_fraction, sampling_time, is_sampled)

Bases: object

Metadata about a sampling operation.

Parameters:

original_size (int)
sample_size (int)
sampling_fraction (float)
sampling_time (float)
is_sampled (bool)

original_size: int

sample_size: int

sampling_fraction: float

sampling_time: float

is_sampled: bool

property speedup_estimate: float

Estimate processing speedup from sampling.
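
The formula behind speedup_estimate is not given here. One plausible reading, shown purely as an assumption, is the ratio of original rows to sampled rows, with 1.0 when no sampling took place. The class name SamplingMetadataSketch is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SamplingMetadataSketch:
    original_size: int
    sample_size: int
    sampling_fraction: float
    sampling_time: float
    is_sampled: bool

    @property
    def speedup_estimate(self) -> float:
        # Assumed formula: rows avoided scale processing time linearly.
        if not self.is_sampled or self.sample_size <= 0:
            return 1.0
        return self.original_size / self.sample_size


meta = SamplingMetadataSketch(
    original_size=1_000_000,
    sample_size=100_000,
    sampling_fraction=0.1,
    sampling_time=0.5,
    is_sampled=True,
)
print(meta.speedup_estimate)
```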

__init__(original_size, sample_size, sampling_fraction, sampling_time, is_sampled)

Parameters:

original_size (int)
sample_size (int)
sampling_fraction (float)
sampling_time (float)
is_sampled (bool)

SamplingDecisionEngine

Statistics

StatisticsComputer

class pyspark_analyzer.statistics.StatisticsComputer(dataframe, total_rows=None, cache_manager=None)

Bases: object

Computes statistics for DataFrame columns using type-specific calculators.

Parameters:

dataframe (DataFrame)
total_rows (Optional[int])
cache_manager (Any)

__init__(dataframe, total_rows=None, cache_manager=None)

Initialize with a PySpark DataFrame.

Parameters:

dataframe (DataFrame) – PySpark DataFrame to compute statistics for
total_rows (Optional[int]) – Cached row count to avoid recomputation
cache_manager (Any) – Optional CacheManager for performance optimization

compute_all_columns_batch(columns=None, include_advanced=True, include_quality=True, progress_tracker=None)

Compute statistics for multiple columns with minimal DataFrame scans.

Parameters:

columns (Optional[list[str]]) – List of columns to profile. If None, profiles all columns.
include_advanced (bool) – Include advanced statistics (always True)
include_quality (bool) – Include data quality metrics
progress_tracker (Any) – Optional progress tracker for reporting progress

Return type:

dict[str, dict[str, Any]]

Returns:

Dictionary mapping column names to their statistics
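
One common way to keep DataFrame scans to a minimum is to collect the aggregate expressions for every column into a single list and evaluate them together. The sketch below only builds the expression strings, so it runs without Spark. The function name build_batch_exprs, the chosen statistics, and the alias scheme are all hypothetical and do not describe how StatisticsComputer is implemented.

```python
def build_batch_exprs(columns):
    """Build one flat list of aggregate expressions covering every column."""
    templates = {
        "count": "count({c})",
        "nulls": "sum(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END)",
        "distinct": "approx_count_distinct({c})",
    }
    exprs = []
    for name in columns:
        # Double embedded backticks so the name is a valid quoted identifier.
        escaped = name.replace("`", "``")
        for stat, template in templates.items():
            expr = template.format(c=f"`{escaped}`")
            exprs.append(f"{expr} AS `{escaped}__{stat}`")
    return exprs


exprs = build_batch_exprs(["age", "salary"])
for expr in exprs:
    print(expr)

# With PySpark available, a call along the lines of
# df.selectExpr(*exprs).first() would evaluate all of these
# in a single aggregation instead of one job per statistic.
```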

Performance

BatchStatisticsComputer

Utility Functions

Utility functions for the DataFrame profiler.

pyspark_analyzer.utils.escape_column_name(column_name)

Escape column name for safe use in PySpark SQL operations.

Doubles any backticks embedded in the name and wraps the result in backticks, so special characters and SQL injection attempts are treated as part of the identifier rather than as SQL.

Parameters: column_name (str) – Raw column name that may contain special characters
Return type: str
Returns: Escaped column name wrapped in backticks

Examples

>>> escape_column_name("normal_column")
'`normal_column`'
>>> escape_column_name("column.with.dots")
'`column.with.dots`'
>>> escape_column_name("col`with`backticks")
'`col``with``backticks`'
>>> escape_column_name("col; DROP TABLE users;--")
'`col; DROP TABLE users;--`'
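
The documented outputs above are consistent with a simple rule: double every backtick, then wrap the name in backticks. A minimal sketch that reproduces those four examples is shown below. The name escape_column_name_sketch is hypothetical, and the real function may handle additional cases.

```python
def escape_column_name_sketch(column_name: str) -> str:
    """Quote a column name as a backtick-delimited identifier."""
    # A doubled backtick inside a quoted identifier stands for one literal backtick.
    return "`" + column_name.replace("`", "``") + "`"


for raw in ["normal_column", "column.with.dots", "col`with`backticks"]:
    print(escape_column_name_sketch(raw))
```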

pyspark_analyzer.utils.format_profile_output(profile_data, format_type='dict')

Format the profile output in different formats.

Parameters:

profile_data (dict[str, Any]) – Raw profile data dictionary
format_type (str) – Output format (“dict”, “json”, “summary”, “pandas”)

Return type:

DataFrame | dict[str, Any] | str

Returns:

Formatted profile data
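
The dispatch on format_type can be sketched with the standard library alone. Everything here is an assumption for illustration: the function name format_profile_sketch, the shape of profile_data (column name mapped to a dict of statistics), and the layout of the summary text. The “pandas” branch is left out so the sketch has no third-party dependencies.

```python
import json
from typing import Any, Union


def format_profile_sketch(
    profile_data: dict[str, Any], format_type: str = "dict"
) -> Union[dict[str, Any], str]:
    """Return profile data as a dict, a JSON string, or a text summary."""
    if format_type == "dict":
        return profile_data
    if format_type == "json":
        # default=str keeps non-JSON values (e.g. dates) serializable.
        return json.dumps(profile_data, indent=2, default=str)
    if format_type == "summary":
        lines = []
        for column, stats in profile_data.items():
            rendered = ", ".join(f"{key}={value}" for key, value in stats.items())
            lines.append(f"{column}: {rendered}")
        return "\n".join(lines)
    raise ValueError(f"Unsupported format_type: {format_type!r}")


sample = {"age": {"count": 3, "nulls": 0}}
print(format_profile_sketch(sample, "summary"))
```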

Examples

Basic profiling:

from pyspark_analyzer import analyze

# Get profile as pandas DataFrame
profile = analyze(df)

With sampling configuration:

from pyspark_analyzer import analyze

# Sample to 100,000 rows
profile = analyze(df, target_rows=100_000)

# Or sample 10% of data
profile = analyze(df, fraction=0.1)

With default sampling explicitly enabled (omit sampling to let auto-sampling decide for large datasets):

profile = analyze(df, sampling=True)

Different output formats:

# Get as dictionary
profile_dict = analyze(df, output_format="dict")

# Get as JSON
profile_json = analyze(df, output_format="json")

# Get human-readable summary
summary = analyze(df, output_format="summary")