Skip to content

Latest commit

 

History

History
313 lines (234 loc) · 7.39 KB

File metadata and controls

313 lines (234 loc) · 7.39 KB

MVCS Design Document

Overview

MVCS (Minimal Version Control System) is an educational implementation of a content-addressable version control system inspired by Git. It demonstrates the core concepts of modern VCS without the complexity of a production system.

Architecture

Core Principles

  1. Content-Addressable Storage: All objects are stored by their SHA-256 hash
  2. Immutability: Once created, objects never change
  3. Three Object Types: Blobs (files), Trees (directories), and Commits (snapshots)
  4. Directed Acyclic Graph: Commits form a DAG structure through parent relationships

Repository Structure

.mvcs/
├── objects/           # Content-addressable object store
│   └── ab/           # First 2 chars of hash as subdirectory
│       └── cdef...   # Remaining hash chars as filename
├── refs/
│   └── heads/
│       └── master    # Branch pointer (contains commit hash)
├── HEAD              # Symbolic ref to current branch
└── index             # Staging area

Why This Structure?

  • objects/: Split into subdirectories to avoid filesystem limits on files per directory
  • refs/heads/: Separates branches from tags and other refs (though we only implement branches)
  • HEAD: Points to current branch, allows switching (in future implementation)
  • index: Simple text file for staging area

Object Format

All objects follow the same storage format:

<type> <size>\0<content>

Example:

blob 13\0Hello, world!

The SHA-256 hash is computed over the entire object (header + content).

Blob Objects

Stores file contents exactly as-is.

blob <size>\0<file content>

Tree Objects

Represents a directory, containing entries:

tree <size>\0<mode> <name>\0<hash><mode> <name>\0<hash>...

Each entry:

  • mode: Octal file mode (e.g., 100644 for regular file)
  • name: File/directory name
  • \0: Null separator
  • hash: 32 bytes of SHA-256 hash (binary)

Commit Objects

Represents a snapshot of the repository:

commit <size>\0tree <tree-hash>
parent <parent-hash>
author <author>
time <timestamp>

<commit message>

Fields:

  • tree: Root tree hash
  • parent: Parent commit hash (optional, omitted for first commit)
  • author: Author name/email
  • time: Unix timestamp
  • Message after blank line

Staging Area (Index)

Simple text format:

<hash> <mode> <path>
<hash> <mode> <path>
...

Each line represents a staged file with its hash, mode, and path.

Reference Format

References (branches) are simple text files containing a commit hash:

<commit-hash>\n

HEAD can be:

  1. Symbolic: ref: refs/heads/master\n
  2. Detached: <commit-hash>\n

Currently only symbolic refs are used.

Algorithms

Adding a File

  1. Read file contents
  2. Create blob object with header
  3. Compute SHA-256 hash
  4. Write object to .mvcs/objects/<ab>/<cdef...>
  5. Update index entry with hash, mode, and path
  6. Write updated index

Creating a Commit

  1. Read index (staging area)
  2. Create tree entries from index
  3. Write tree object
  4. Get current HEAD commit (if exists)
  5. Create commit object with:
    • Tree hash
    • Parent hash (if not first commit)
    • Author from environment
    • Current timestamp
    • User's message
  6. Write commit object
  7. Update current branch reference

Showing Log

  1. Read HEAD to get current commit
  2. Read commit object
  3. Display commit info
  4. If commit has parent:
    • Load parent commit
    • Repeat from step 3

Hash Computation

SHA256(header + content) -> 32-byte hash

Stored as hex string in refs/index:

64 hex characters (0-9, a-f)

Stored in binary in tree entries:

32 bytes of raw hash

Memory Management

  • All object data is malloc'd
  • Caller responsible for freeing returned data
  • Objects are read entirely into memory (no streaming)
  • Maximum file size: 10MB (configurable via MAX_CONTENT_SIZE)

Error Handling

  • Functions return 0 on success, -1 on error
  • Errors printed to stderr
  • No exceptions (C doesn't have them)
  • Simple errno-style error propagation

Limitations

Intentional (Educational Focus)

  • No branching/merging
  • No diff computation
  • No compression (could add zlib)
  • No network operations
  • No garbage collection
  • No file permissions beyond basic mode
  • No symbolic links
  • No large file support
  • Single-threaded only

Technical Constraints

  • Maximum 1000 tree entries per tree
  • Maximum 1000 index entries
  • Maximum 10MB file size
  • Maximum 4096 character paths
  • No Unicode normalization
  • No line-ending conversion

Performance Characteristics

  • Init: O(1) - just create directories
  • Add: O(1) - hash file + update index
  • Commit: O(n) where n = files in index
  • Log: O(d) where d = depth of history
  • Hash Lookup: O(1) - direct file access by hash

Security Considerations

SHA-256 Collision Resistance

Using SHA-256 (not SHA-1) provides:

  • 2^256 possible hashes
  • Collision resistance for our use case
  • Industry standard cryptographic hash

File System Safety

  • Objects stored in subdirectories (first 2 chars)
  • No user input in file paths (hashes only)
  • No symlink following
  • Simple permission model (0755 for dirs)

Input Validation

  • Hash strings validated (64 hex chars)
  • File size limits enforced
  • Path length limits enforced
  • Array bounds checked

Comparison with Git

Similarities

  • Content-addressable storage
  • Same object types (blob/tree/commit)
  • DAG structure for commits
  • Staging area concept
  • Similar directory layout

Differences

  • Git uses SHA-1 (now SHA-256 optional)
  • Git compresses objects with zlib
  • Git uses pack files for efficiency
  • Git has complex branch/merge
  • Git has network protocols
  • Git has index caching
  • Git has extensive optimization

Future Enhancements

Possible additions while maintaining simplicity:

  1. Compression: Add zlib compression to objects
  2. Branches: Implement branch creation/switching
  3. Tags: Add lightweight/annotated tags
  4. Diff: Basic file diff computation
  5. Status: Show working directory status
  6. Reset: Reset HEAD to specific commit
  7. Checkout: Restore files from commits
  8. Ignore: Support .mvcsignore file

Code Organization

  • mvcs.h: All type definitions and function declarations
  • repository.c: Init and repository checks
  • object.c: Low-level object read/write
  • tree.c: Tree object operations
  • commit.c: Commit object operations
  • index.c: Staging area management
  • refs.c: Reference operations
  • commands.c: High-level command implementations
  • main.c: CLI and command dispatch

Testing Strategy

Manual testing focuses on:

  1. Creating repositories
  2. Adding various file types
  3. Multiple commits
  4. History traversal
  5. Object inspection
  6. Edge cases (empty files, large files, special characters)

Educational Use

This implementation is designed to teach:

  1. How content-addressable storage works
  2. Why hashing provides data integrity
  3. How commits form a graph structure
  4. What a staging area does
  5. How references point to commits
  6. Filesystem-based storage design

Students can:

  • Read the entire codebase (~1000 LOC)
  • Understand all data structures
  • Trace command execution
  • Modify and extend functionality
  • Build from scratch

References