API Reference

This section provides detailed API documentation for all public classes and functions in pyspark-analyzer.

Main Function

analyze

pyspark_analyzer.analyze(df, *, sampling=None, target_rows=None, fraction=None, columns=None, output_format='pandas', include_advanced=True, include_quality=True, seed=None, show_progress=None)

Analyze a PySpark DataFrame and generate comprehensive statistics.

This is the simplified entry point for profiling DataFrames. It automatically handles sampling configuration based on the provided parameters.

Note: Compatible with PySpark 3.0.0+. Uses native median() function when available (PySpark 3.4.0+) for better performance, with automatic fallback to percentile_approx for older versions.
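
The version-dependent choice described in the note can be pictured with a small sketch. The helper `pick_median_function` is hypothetical and is not part of pyspark-analyzer; it only illustrates the documented rule (native median() from PySpark 3.4.0, percentile_approx before that), not how the library actually detects the version.

```python
def pick_median_function(pyspark_version: str) -> str:
    """Hypothetical helper: name the median strategy for a PySpark version."""
    # Compare only (major, minor); suffixes such as ".dev0" are ignored.
    major, minor = (int(part) for part in pyspark_version.split(".")[:2])
    if (major, minor) >= (3, 4):
        return "median"  # native function, per the note above
    return "percentile_approx"  # documented fallback for older versions


print(pick_median_function("3.3.2"))
print(pick_median_function("3.5.1"))
```

In a live session the version string would typically come from `pyspark.__version__`.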

Parameters:
  • df (DataFrame) – PySpark DataFrame to analyze

  • sampling (Optional[bool]) – Whether to enable sampling. If None, auto-sampling is enabled for large datasets. If False, sampling is disabled. If True, the default sampling settings are used.

  • target_rows (Optional[int]) – Sample to approximately this many rows. Mutually exclusive with fraction.

  • fraction (Optional[float]) – Sample this fraction of the data (0.0-1.0). Mutually exclusive with target_rows.

  • columns (Optional[list[str]]) – List of specific columns to profile. If None, profiles all columns.

  • output_format (str) – Output format (“pandas”, “dict”, “json”, “summary”). Default is “pandas”.

  • include_advanced (bool) – Include advanced statistics (skewness, kurtosis, outliers, etc.)

  • include_quality (bool) – Include data quality metrics

  • seed (Optional[int]) – Random seed for reproducible sampling

  • show_progress (Optional[bool]) – Show progress indicators during analysis. If None, auto-detects based on environment.

Returns:

Profile results in the requested format:

  • “pandas”: pandas DataFrame with statistics

  • “dict”: Python dictionary

  • “json”: JSON string

  • “summary”: Human-readable summary string

Examples

>>> # Basic usage with auto-sampling
>>> profile = analyze(df)
>>> # Disable sampling
>>> profile = analyze(df, sampling=False)
>>> # Sample to 100,000 rows
>>> profile = analyze(df, target_rows=100_000)
>>> # Sample 10% of data
>>> profile = analyze(df, fraction=0.1)
>>> # Profile specific columns only
>>> profile = analyze(df, columns=["age", "salary"])
>>> # Get results as dictionary
>>> profile = analyze(df, output_format="dict")
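
The sampling parameters interact: target_rows and fraction are mutually exclusive, and sampling accepts None, True, or False. The following is a minimal sketch of a pre-flight check a caller might run before invoking analyze. The helper `check_sampling_args` is hypothetical and not part of pyspark-analyzer; the use of ValueError and the rule that sampling=False overrides any size hint are assumptions, since this reference does not document how the library reacts to conflicting arguments.

```python
from typing import Optional


def check_sampling_args(
    sampling: Optional[bool] = None,
    target_rows: Optional[int] = None,
    fraction: Optional[float] = None,
) -> str:
    """Hypothetical check; returns a label for the selected sampling mode."""
    if target_rows is not None and fraction is not None:
        # Documented as mutually exclusive.
        raise ValueError("target_rows and fraction are mutually exclusive")
    if fraction is not None and not 0.0 <= fraction <= 1.0:
        # Range taken from the parameter description (0.0-1.0).
        raise ValueError("fraction must be between 0.0 and 1.0")
    if sampling is False:
        # Assumption: an explicit False wins over any size hint.
        return "none"
    if target_rows is not None:
        return "target_rows"
    if fraction is not None:
        return "fraction"
    # True selects default sampling; None leaves the decision to auto-sampling.
    return "default" if sampling else "auto"


print(check_sampling_args(fraction=0.1))
```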

Sampling

SamplingConfig

class pyspark_analyzer.SamplingConfig(enabled=True, target_rows=None, fraction=None, seed=42)

Bases: object

Configuration for sampling operations.

enabled

Whether to enable sampling. Set to False to disable sampling completely.

target_rows

Target number of rows to sample. Takes precedence over fraction.

fraction

Fraction of data to sample (0-1). Only used if target_rows is not set.

seed

Random seed for reproducible sampling.

Parameters:
  • enabled (bool)

  • target_rows (Optional[int])

  • fraction (Optional[float])

  • seed (int)

enabled: bool = True
target_rows: Optional[int] = None
fraction: Optional[float] = None
seed: int = 42

__post_init__()

Validate configuration after initialization.

Return type:

None
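
The presence of __post_init__ suggests the class is a dataclass that validates its fields on construction. A minimal sketch of that pattern follows, using a stand-in class `SamplingConfigSketch` with the documented fields and defaults. The specific validation rules (positive target_rows, fraction in (0, 1]) and the ValueError type are assumptions; the library's actual checks may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingConfigSketch:
    """Stand-in with the documented fields; not the library class."""

    enabled: bool = True
    target_rows: Optional[int] = None
    fraction: Optional[float] = None
    seed: int = 42

    def __post_init__(self) -> None:
        # Assumed rules; runs automatically after the generated __init__.
        if self.target_rows is not None and self.target_rows <= 0:
            raise ValueError("target_rows must be positive")
        if self.fraction is not None and not 0 < self.fraction <= 1:
            raise ValueError("fraction must be in (0, 1]")


config = SamplingConfigSketch(target_rows=100_000, seed=7)
print(config)
```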

__init__(enabled=True, target_rows=None, fraction=None, seed=42)

SamplingMetadata

class pyspark_analyzer.sampling.SamplingMetadata(original_size, sample_size, sampling_fraction, sampling_time, is_sampled)

Bases: object

Metadata about a sampling operation.

Parameters:
  • original_size (int)

  • sample_size (int)

  • sampling_fraction (float)

  • sampling_time (float)

  • is_sampled (bool)

original_size: int
sample_size: int
sampling_fraction: float
sampling_time: float
is_sampled: bool
property speedup_estimate: float

Estimate processing speedup from sampling.

__init__(original_size, sample_size, sampling_fraction, sampling_time, is_sampled)
Parameters:
  • original_size (int)

  • sample_size (int)

  • sampling_fraction (float)

  • sampling_time (float)

  • is_sampled (bool)
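
This reference does not state how speedup_estimate is computed. One plausible reading, shown here purely as an assumption, is the ratio of original to sampled row counts. `SamplingMetadataSketch` is a stand-in with the documented fields, not the library class.

```python
from dataclasses import dataclass


@dataclass
class SamplingMetadataSketch:
    """Stand-in with the documented fields; not the library class."""

    original_size: int
    sample_size: int
    sampling_fraction: float
    sampling_time: float
    is_sampled: bool

    @property
    def speedup_estimate(self) -> float:
        # Assumed formula: rows before sampling divided by rows after.
        if not self.is_sampled or self.sample_size == 0:
            return 1.0
        return self.original_size / self.sample_size


meta = SamplingMetadataSketch(
    original_size=1_000_000,
    sample_size=100_000,
    sampling_fraction=0.1,
    sampling_time=0.5,
    is_sampled=True,
)
print(meta.speedup_estimate)
```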

SamplingDecisionEngine

Statistics

StatisticsComputer

class pyspark_analyzer.statistics.StatisticsComputer(dataframe, total_rows=None)

Bases: object

Computes statistics for DataFrame columns using type-specific calculators.

__init__(dataframe, total_rows=None)

Initialize with a PySpark DataFrame.

Parameters:
  • dataframe (DataFrame) – PySpark DataFrame to compute statistics for

  • total_rows (Optional[int]) – Cached row count to avoid recomputation

compute_all_columns_batch(columns=None, include_advanced=True, include_quality=True, progress_tracker=None)

Compute statistics for multiple columns with minimal DataFrame scans.

Parameters:
  • columns (Optional[list[str]]) – List of columns to profile. If None, profiles all columns.

  • include_advanced (bool) – Include advanced statistics (always True)

  • include_quality (bool) – Include data quality metrics

  • progress_tracker (Any) – Optional progress tracker for reporting progress

Return type:

dict[str, dict[str, Any]]

Returns:

Dictionary mapping column names to their statistics
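
The idea behind batching is to gather statistics for every requested column in one pass over the data instead of one pass per column. The sketch below shows that idea in plain Python over a list of row dictionaries; it is a conceptual analogue only and does not use Spark or reflect the library's implementation. The function `batch_stats` and the statistic names it returns are hypothetical.

```python
from typing import Any, Iterable, Optional


def batch_stats(
    rows: Iterable[dict[str, Any]], columns: Optional[list[str]] = None
) -> dict[str, dict[str, Any]]:
    """Collect count, null_count, min, and max per column in a single pass."""
    stats: dict[str, dict[str, Any]] = {}
    for row in rows:
        # With columns=None, profile every key present in the row.
        for name in columns if columns is not None else row:
            col = stats.setdefault(
                name, {"count": 0, "null_count": 0, "min": None, "max": None}
            )
            value = row.get(name)
            if value is None:
                col["null_count"] += 1
                continue
            col["count"] += 1
            col["min"] = value if col["min"] is None else min(col["min"], value)
            col["max"] = value if col["max"] is None else max(col["max"], value)
    return stats


rows = [
    {"age": 30, "salary": 100.0},
    {"age": None, "salary": 50.0},
    {"age": 25, "salary": None},
]
result = batch_stats(rows, columns=["age", "salary"])
print(result["age"])
```

The return value has the same outer shape as the documented one: a dictionary mapping column names to dictionaries of statistics.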

Performance

BatchStatisticsComputer

Utility Functions

Utility functions for the DataFrame profiler.

pyspark_analyzer.utils.escape_column_name(column_name)

Escape column name for safe use in PySpark SQL operations.

Handles special characters and SQL injection attempts by properly escaping backticks and wrapping the column name in backticks.

Parameters:

column_name (str) – Raw column name that may contain special characters

Return type:

str

Returns:

Escaped column name wrapped in backticks

Examples

>>> escape_column_name("normal_column")
'`normal_column`'
>>> escape_column_name("column.with.dots")
'`column.with.dots`'
>>> escape_column_name("col`with`backticks")
'`col``with``backticks`'
>>> escape_column_name("col; DROP TABLE users;--")
'`col; DROP TABLE users;--`'
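
The documented behavior (double every embedded backtick, then wrap the name in backticks) can be reproduced in a few lines. The function below is a sketch named `escape_column_name_sketch` to make clear it is not the library's source; it is written only to match the examples above.

```python
def escape_column_name_sketch(column_name: str) -> str:
    """Sketch of backtick escaping consistent with the documented examples."""
    # Double embedded backticks first, then wrap the whole name.
    return "`" + column_name.replace("`", "``") + "`"


print(escape_column_name_sketch("col`with`backticks"))
```

Because the whole name ends up inside one quoted identifier, characters such as dots and semicolons are treated as part of the column name rather than as SQL syntax.
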

pyspark_analyzer.utils.format_profile_output(profile_data, format_type='dict')

Format the profile output in different formats.

Parameters:
  • profile_data (dict[str, Any]) – Raw profile data dictionary

  • format_type (str) – Output format (“dict”, “json”, “summary”, “pandas”)

Return type:

DataFrame | dict[str, Any] | str

Returns:

Formatted profile data
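
A formatter of this kind is usually a dispatch on the format name. The sketch below covers the "dict", "json", and "summary" branches using only the standard library; the "pandas" branch is left out to avoid the dependency, so the sketch rejects it even though the real function supports it. The name `format_profile_sketch`, the summary layout, and the ValueError for unknown formats are all assumptions.

```python
import json
from typing import Any


def format_profile_sketch(profile_data: dict[str, Any], format_type: str = "dict") -> Any:
    """Hypothetical dispatcher; not the library's implementation."""
    if format_type == "dict":
        return profile_data
    if format_type == "json":
        # default=str keeps non-JSON values (e.g. dates) serializable.
        return json.dumps(profile_data, indent=2, default=str)
    if format_type == "summary":
        # Assumed layout: one line per top-level key.
        return "\n".join(f"{name}: {stats}" for name, stats in profile_data.items())
    # "pandas" is intentionally not handled in this dependency-free sketch.
    raise ValueError(f"Unsupported format type: {format_type}")


data = {"age": {"mean": 30.5, "null_count": 1}}
print(format_profile_sketch(data, "summary"))
```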

Examples

Basic profiling:

from pyspark_analyzer import analyze

# Get profile as pandas DataFrame
profile = analyze(df)

With sampling configuration:

from pyspark_analyzer import analyze

# Sample to 100,000 rows
profile = analyze(df, target_rows=100_000)

# Or sample 10% of data
profile = analyze(df, fraction=0.1)

With default sampling explicitly enabled (when sampling is left as None, auto-sampling applies to large datasets on its own):

profile = analyze(df, sampling=True)

Different output formats:

# Get as dictionary
profile_dict = analyze(df, output_format="dict")

# Get as JSON
profile_json = analyze(df, output_format="json")

# Get human-readable summary
summary = analyze(df, output_format="summary")