# Contributing to pyspark-analyzer

We welcome contributions to pyspark-analyzer! This guide will help you get started.

## Development Setup

### 1. Fork and Clone

```bash
# Fork the repository on GitHub, then:
git clone https://github.com/YOUR_USERNAME/pyspark-analyzer.git
cd pyspark-analyzer
```

### 2. Install Development Dependencies

We use `uv` for fast dependency management:

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install all dependencies including dev
uv sync --all-extras
```

### 3. Set Up Pre-commit Hooks

```bash
uv run pre-commit install
```

## Development Workflow

### 1. Create a Feature Branch

```bash
git checkout -b feature/your-feature-name
```

### 2. Make Your Changes

Follow these guidelines:
- Write clear, self-documenting code
- Add type hints to all functions
- Include docstrings (Google style)
- Write tests for new functionality

### 3. Run Tests

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=pyspark_analyzer

# Run specific test file
uv run pytest tests/test_profiler.py
```

### 4. Check Code Quality

```bash
# Run all pre-commit hooks
uv run pre-commit run --all-files

# Or individually:
uv run black pyspark_analyzer/ tests/
uv run ruff check pyspark_analyzer/ tests/
uv run mypy pyspark_analyzer/
```

### 5. Update Documentation

If your changes affect the API, update the relevant documentation and build it locally to check the result:

```bash
# Build docs locally
cd docs
uv run make html
# View at docs/build/html/index.html
```

## Code Style Guidelines

### Python Style

We follow PEP 8 with these project conventions:
- Line length: 120 characters
- Use Black for formatting
- Use type hints everywhere

### Docstring Format

Use Google style docstrings:

```python
from typing import Any, Dict, List

from pyspark.sql import DataFrame


def compute_statistics(df: DataFrame, columns: List[str]) -> Dict[str, Any]:
    """
    Compute statistics for specified columns.

    Args:
        df: The input DataFrame
        columns: List of column names to analyze

    Returns:
        Dictionary mapping column names to their statistics

    Raises:
        ValueError: If columns don't exist in DataFrame

    Example:
        >>> stats = compute_statistics(df, ["age", "salary"])
        >>> print(stats["age"]["mean"])
    """
```

### Import Order

1. Standard library imports
2. Third-party imports
3. Local imports

Separate each group with a blank line.

## Testing Guidelines

### Test Structure

```python
class TestDataFrameProfiler:
    """Test cases for DataFrameProfiler class."""

    def test_basic_profiling(self, spark_session, sample_df):
        """Test basic profiling functionality."""
        # Arrange
        profiler = DataFrameProfiler(sample_df)

        # Act
        profile = profiler.profile()

        # Assert
        assert "overview" in profile
        assert profile["overview"]["row_count"] == 100
```

### Test Coverage

- Aim for >90% test coverage
- Test edge cases and error conditions
- Include integration tests for Spark functionality
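
As an illustration of testing edge cases and error conditions, the sketch below exercises a small pure-Python helper. `null_fraction` is a hypothetical function invented for this example and is not part of pyspark-analyzer; the point is the shape of the tests, not the helper itself.

```python
from typing import List, Optional


def null_fraction(values: List[Optional[float]]) -> float:
    """Hypothetical helper: fraction of None entries in a list."""
    if not isinstance(values, list):
        raise TypeError("values must be a list")
    if not values:
        # Edge case: treat an empty input as containing no nulls.
        return 0.0
    return sum(v is None for v in values) / len(values)


def test_empty_input():
    assert null_fraction([]) == 0.0


def test_all_nulls():
    assert null_fraction([None, None]) == 1.0


def test_mixed_values():
    assert null_fraction([1.0, None, 3.0, None]) == 0.5


def test_rejects_non_list_input():
    # Error condition: the helper should refuse input that is not a list.
    try:
        null_fraction("not a list")
    except TypeError:
        return
    raise AssertionError("expected TypeError")
```

In a real test module, `pytest.raises(TypeError)` is the more common way to write the last check; the `try`/`except` form is used here only to keep the example free of third-party imports.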

## Submitting Pull Requests

### 1. Commit Messages

Follow conventional commits:

```
feat: add support for decimal type profiling
fix: handle null values in median calculation
docs: update installation instructions
test: add tests for sampling module
refactor: optimize batch processing logic
```
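
If you want to check a commit subject before pushing, a small script along these lines can help. It is a sketch, not project tooling: the allowed types are limited to those shown in the examples above, and the optional `(scope)` and `!` marker follow the Conventional Commits format. The function name `is_conventional` is made up for this example.

```python
import re

# Types taken from the examples above; the project may accept others.
ALLOWED_TYPES = ("feat", "fix", "docs", "test", "refactor")

# type, optional "(scope)", optional "!" for breaking changes, then ": description"
COMMIT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(\((?P<scope>[\w-]+)\))?"
    r"!?: (?P<description>\S.*)$"
)


def is_conventional(subject: str) -> bool:
    """Return True if the commit subject line matches the pattern above."""
    return COMMIT_PATTERN.match(subject) is not None


print(is_conventional("feat: add support for decimal type profiling"))
print(is_conventional("Added some stuff"))
```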

### 2. Pull Request Template

```markdown
## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Tests pass locally
- [ ] Added new tests
- [ ] Updated documentation

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Comments added for complex logic
```

### 3. Review Process

1. Submit PR against `main` branch
2. Ensure CI passes
3. Address review feedback
4. Squash commits if requested

## Adding New Features

### 1. Discuss First

For major features:
- Open an issue for discussion
- Get feedback on approach
- Consider backward compatibility

### 2. Feature Structure

```
pyspark_analyzer/
└── new_feature.py      # Core implementation
tests/
└── test_new_feature.py # Comprehensive tests
docs/source/
└── new_feature.md      # User documentation
examples/
└── new_feature_demo.py # Usage example
```

### 3. Performance Considerations

- Profile performance impact
- Add benchmarks for significant features
- Consider memory usage
- Test with large datasets
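
A quick way to get first numbers for the points above is a small timing and memory helper like the sketch below. `measure` is a hypothetical name, and the workload passed to it is a stand-in. Note that `tracemalloc` only sees Python allocations in the driver process, so it says nothing about memory used by Spark executors or the JVM.

```python
import time
import tracemalloc
from typing import Callable, Tuple


def measure(fn: Callable[[], object], repeats: int = 5) -> Tuple[float, int]:
    """Return (best wall-clock seconds, peak traced bytes) for calling fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)

    # Measure memory in a separate run so tracing overhead does not skew timings.
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return best, peak


# Stand-in workload; replace with a call into the feature being measured.
seconds, peak_bytes = measure(lambda: sorted(range(100_000), reverse=True))
print(f"best of 5: {seconds:.4f}s, peak Python memory: {peak_bytes} bytes")
```

Taking the best of several runs reduces noise from warm-up and other processes; for Spark jobs, make sure the measured callable triggers an action so the work is actually executed.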

## Reporting Issues

### Bug Reports

Include:
- Spark version
- Python version
- Minimal reproducible example
- Error messages
- Expected vs actual behavior
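
To gather the version details for a bug report, a short script such as the following may be convenient. It is a sketch that assumes the installed distribution names are `pyspark` and `pyspark-analyzer`; anything that is not installed is reported as such rather than raising.

```python
import platform
from importlib import metadata
from typing import Dict


def collect_environment() -> Dict[str, str]:
    """Collect Python, platform, and package versions for a bug report."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    # Assumption: these are the distribution names as installed by pip/uv.
    for package in ("pyspark", "pyspark-analyzer"):
        try:
            info[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            info[package] = "not installed"
    return info


for name, value in collect_environment().items():
    print(f"{name}: {value}")
```

Paste the output into the issue alongside the minimal reproducible example.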

### Feature Requests

Include:
- Use case description
- Proposed API
- Alternative solutions considered
- Potential impact

## Community

### Code of Conduct

- Be respectful and inclusive
- Welcome newcomers
- Provide constructive feedback
- Focus on the issue, not the person

### Getting Help

- Check existing issues/PRs
- Read the documentation
- Ask in discussions
- Tag maintainers for urgent issues

## Release Process

Maintainers handle releases:

1. Update version in `pyproject.toml`
2. Update CHANGELOG.md
3. Create release tag
4. GitHub Actions publishes to PyPI

## License

By contributing, you agree that your contributions will be licensed under the same license as the project (Apache 2.0).