Contributing to pyspark-analyzer
We welcome contributions to pyspark-analyzer! This guide will help you get started.
Development Setup
1. Fork and Clone
# Fork the repository on GitHub, then:
git clone https://github.com/YOUR_USERNAME/pyspark-analyzer.git
cd pyspark-analyzer
2. Install Development Dependencies
We use uv for fast dependency management:
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install all dependencies including dev
uv sync --all-extras
3. Set Up Pre-commit Hooks
uv run pre-commit install
Development Workflow
1. Create a Feature Branch
git checkout -b feature/your-feature-name
2. Make Your Changes
Follow these guidelines:
Write clear, self-documenting code
Add type hints to all functions
Include docstrings (Google style)
Write tests for new functionality
3. Run Tests
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=pyspark_analyzer
# Run specific test file
uv run pytest tests/test_profiler.py
4. Check Code Quality
# Run all pre-commit hooks
uv run pre-commit run --all-files
# Or individually:
uv run black pyspark_analyzer/ tests/
uv run ruff check pyspark_analyzer/ tests/
uv run mypy pyspark_analyzer/
5. Update Documentation
If your changes affect the API, update the relevant documentation and build it locally to check the result:
# Build docs locally
cd docs
uv run make html
# View at docs/build/html/index.html
Code Style Guidelines
Python Style
We follow PEP 8 with these additions and adjustments:
Line length: 120 characters
Use Black for formatting
Use type hints everywhere
Docstring Format
Use Google style docstrings:
from typing import Any, Dict, List

from pyspark.sql import DataFrame


def compute_statistics(df: DataFrame, columns: List[str]) -> Dict[str, Any]:
    """
    Compute statistics for specified columns.

    Args:
        df: The input DataFrame
        columns: List of column names to analyze

    Returns:
        Dictionary mapping column names to their statistics

    Raises:
        ValueError: If columns don't exist in DataFrame

    Example:
        >>> stats = compute_statistics(df, ["age", "salary"])
        >>> print(stats["age"]["mean"])
    """
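The Raises entry in the docstring above implies a validation step before any statistics are computed. The following is a minimal sketch of such a check, assuming a hypothetical helper named `validate_columns` that works on plain lists of column names (for a Spark DataFrame these would come from `df.columns`). It is not part of pyspark-analyzer, and the project's actual validation may differ.

```python
from typing import List


def validate_columns(available: List[str], requested: List[str]) -> None:
    """Raise ValueError if any requested column is missing.

    Hypothetical helper for illustration; not part of pyspark-analyzer.

    Args:
        available: Column names present in the DataFrame (e.g. df.columns)
        requested: Column names the caller wants to analyze

    Raises:
        ValueError: If any requested column is not in available
    """
    # Collect every missing name so the error reports all of them at once
    missing = [name for name in requested if name not in available]
    if missing:
        raise ValueError(f"Columns not found in DataFrame: {missing}")
```

Reporting all missing columns in one message saves the caller from fixing them one at a time.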
Import Order
Standard library imports
Third-party imports
Local imports
Each group separated by a blank line.
Testing Guidelines
Test Structure
class TestDataFrameProfiler:
    """Test cases for DataFrameProfiler class."""

    def test_basic_profiling(self, spark_session, sample_df):
        """Test basic profiling functionality."""
        # Arrange
        profiler = DataFrameProfiler(sample_df)

        # Act
        profile = profiler.profile()

        # Assert
        assert "overview" in profile
        assert profile["overview"]["row_count"] == 100
Test Coverage
Aim for >90% test coverage
Test edge cases and error conditions
Include integration tests for Spark functionality
Submitting Pull Requests
1. Commit Messages
Follow conventional commits:
feat: add support for decimal type profiling
fix: handle null values in median calculation
docs: update installation instructions
test: add tests for sampling module
refactor: optimize batch processing logic
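To check a commit subject line against this format before pushing, a small script along these lines may help. It is a sketch only: the allowed types are taken from the examples above (the project may accept others), and `is_conventional` is a hypothetical name, not a tool shipped with the repository.

```python
import re

# Types taken from the examples above; a real project may allow more.
ALLOWED_TYPES = ("feat", "fix", "docs", "test", "refactor")

# Shape: type, optional "(scope)", optional "!" for breaking changes,
# then ": " followed by a non-empty description.
COMMIT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(?:\((?P<scope>[^)]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def is_conventional(subject: str) -> bool:
    """Return True if the commit subject line matches the pattern above."""
    return COMMIT_PATTERN.match(subject) is not None
```

For example, `is_conventional("fix: handle null values in median calculation")` is accepted, while a subject without a type prefix or without the space after the colon is rejected.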
2. Pull Request Template
## Description
Brief description of changes
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
- [ ] Tests pass locally
- [ ] Added new tests
- [ ] Updated documentation
## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Comments added for complex logic
3. Review Process
Submit PR against the main branch
Ensure CI passes
Address review feedback
Squash commits if requested
Adding New Features
1. Discuss First
For major features:
Open an issue for discussion
Get feedback on approach
Consider backward compatibility
2. Feature Structure
pyspark_analyzer/
├── new_feature.py # Core implementation
tests/
├── test_new_feature.py # Comprehensive tests
docs/source/
├── new_feature.md # User documentation
examples/
├── new_feature_demo.py # Usage example
3. Performance Considerations
Profile performance impact
Add benchmarks for significant features
Consider memory usage
Test with large datasets
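One lightweight way to measure performance impact is to time the same call several times and compare the summary before and after a change. The sketch below uses only the standard library. The `benchmark` helper is hypothetical and is not part of the project's test suite.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(func: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run func several times and summarize wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)

    # The median is less sensitive to a slow first run than the mean
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Because Spark evaluates transformations lazily, the callable should trigger an action, for instance something like `benchmark(lambda: DataFrameProfiler(df).profile())` using the class from the test example above. Otherwise the timing covers only the construction of the query plan.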
Reporting Issues
Bug Reports
Include:
Spark version
Python version
Minimal reproducible example
Error messages
Expected vs actual behavior
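The version details can be collected with a short script and pasted into the report. This is a sketch using only the standard library. It assumes the installed distribution is named `pyspark-analyzer`, as the repository name suggests, and `environment_report` is a hypothetical helper rather than something the package provides.

```python
import platform
from importlib import metadata
from typing import Dict


def environment_report() -> Dict[str, str]:
    """Collect version details worth pasting into a bug report."""

    def package_version(name: str) -> str:
        # Fall back to a readable marker when the package is absent
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            return "not installed"

    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "pyspark": package_version("pyspark"),
        # Distribution name assumed from the repository name
        "pyspark-analyzer": package_version("pyspark-analyzer"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```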
Feature Requests
Include:
Use case description
Proposed API
Alternative solutions considered
Potential impact
Community
Code of Conduct
Be respectful and inclusive
Welcome newcomers
Provide constructive feedback
Focus on the issue, not the person
Getting Help
Check existing issues/PRs
Read the documentation
Ask in discussions
Tag maintainers for urgent issues
Release Process
Maintainers handle releases:
Update version in pyproject.toml
Update CHANGELOG.md
Create release tag
GitHub Actions publishes to PyPI
License
By contributing, you agree that your contributions will be licensed under the same license as the project (Apache 2.0).