DataGator
Portable, forensic-capable indexing and deduplication for filesystems, archives, containers, and raw devices.
What Makes DataGator Unique?
- Forensic Precision: Cryptographically fingerprints every file, stream, and path segment using MD5 + SHA256
- Unicode-Aware Indexing: Handles normalization, control chars, raw vs canonical paths, and metadata-safe comparisons
- Deduplication Engine: Detects and deduplicates identical content across nested archives, volumes, and snapshots
- Composite Archive Scanning: Recursively explores .zip, .iso, .dmg, .vhd, MDOS/QIC, even inside disk images
- Time-Stamped, Immutable Snapshots: For audit logs, forensic trails, backup validation, and SBOM generation
- Human-Friendly Dashboards: Local HTML summaries of indexed content, snapshots, diffs, and drift
- ACL-Aware Permissions: Captures and compares ACLs across NTFS, APFS, ext4, and NFSv4
- Test-Driven Shell & AWK Codebase: Every core module ships with unit tests for sanity and longevity
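As a rough illustration of content-based duplicate detection, the sketch below groups files by SHA-256 digest. It is not DataGator's engine: the function name `find_duplicates` is hypothetical, it assumes GNU coreutils `sha256sum` is installed, and it works on whole files only (no archives, no chunking).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DataGator): print each SHA-256 digest that
# is shared by more than one file under a directory, followed by those files.
# Assumes GNU coreutils sha256sum; filenames containing newlines are not handled.
find_duplicates() {
    local root="$1"
    find "$root" -type f -exec sha256sum {} + |
        sort |   # stable order of paths inside each group
        awk '
            {
                hash = $1
                # sha256sum prints "<hash>  <path>": skip the hash and the two separator characters.
                path = substr($0, length(hash) + 3)
                count[hash]++
                if (hash in paths) paths[hash] = paths[hash] "\n  " path
                else               paths[hash] = path
            }
            END {
                for (h in count)
                    if (count[h] > 1)
                        print h "\n  " paths[h]
            }
        '
}
```

Calling `find_duplicates /your/data` prints one block per duplicated digest, with the matching paths indented beneath it.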
DataGator Core vs Pro
DataGator Core is built for power users, researchers, and forensic tinkerers who want full control using CLI tools and containers.
Pro adds real-time indexing, dashboards, syncing, and team workflows, built for compliance, operations, and distributed environments.
Feature | Core (Open Source) | Pro (Enterprise) |
Cryptographic FS Indexing | ✅ | ✅ |
Deduplication | Planned (external) | ✅ Chunk-level + overlay |
Archive + Disk Image Support | ✅ | ✅ + auto extraction |
Snapshots & Diffs | Manual only | ✅ Visual + timed |
HTML Dashboards | ✅ Static export | ✅ Interactive + live |
ACL + Metadata Comparison | ✅ | ✅ + extended tagging |
Web UI | 🚫 | ✅ |
P2P Segment Sync | 🚫 | ✅ Gossip protocol |
Watch Mode (live indexing) | 🚫 | ✅ |
Audit-Ready Compliance Reports | Manual | ✅ Exportable + signed |
Encryption | BYO only | ✅ Integrated |
Support | Community | ✅ SLA + direct access |
Use Cases Across Domains
- Digital forensics and chain-of-custody proof
- Smart backup with dedup + drift detection + rehydration
- Legal compliance (21 CFR Part 11, GDPR, audit diffs)
- Container and firmware reverse engineering
- Academic and AI dataset lineage attestation
- Long-term archival + digital preservation (NARA-grade)
- Reproducible build tracing (distro package walker)
Comparison: DataGator vs the World
- Versioned Index: Unlike rsync or basic backup tools, DG tracks exact changes and content hashes
- Cross-Platform: POSIX-native, works in containers, bind-mounts, and loopback devices
- Filesystem-Agnostic: NTFS, ext4, APFS, HFS+, FAT, ZFS, raw .img/.dsk/.iso, all ingestible
- Audit-Ready: All snapshots include cryptographic identity, timestamp, and provenance chains
- No Vendor Lock-In: 100% local, no cloud required, open output formats
Get Started in 3 Steps
All you need is a bind-mount and Docker or Podman.
docker run --rm \
-v /your/data:/mnt:ro \
-v ~/.DataGator.id:/etc/DataGator.id:ro \
datagator scan /mnt
Want to detect changes over time?
./drift_detector.sh
Need a web report?
open __postgresql/dashboard/index.html
How DataGator Sees Your Filesystem
DataGator doesn't just index file paths; it models your storage with cryptographic and structural fidelity.
- Canonical Paths: Paths are stored as UTF-8 encoded, percent-escaped, and normalized segments
- PUA-stripping and normalization: Ensures consistency across case/encoding quirks
- Hashing Policy: Each file, segment, stream, and directory gets an MD5 + SHA256 identity
- Raw vs Canonical: Stores both for forensic transparency
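The canonical-path and hashing bullets above can be sketched roughly as follows. This is a minimal illustration, not DataGator's actual encoder: the function names `pct_encode_segment` and `file_identity` are hypothetical, the set of bytes left unescaped (the RFC 3986 unreserved characters) is an assumption, no Unicode normalization or PUA stripping is attempted, and GNU coreutils `md5sum` and `sha256sum` are assumed to be available.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: percent-escape one path segment byte by byte.
# Assumption: only RFC 3986 "unreserved" ASCII bytes are left as-is.
pct_encode_segment() {
    printf '%s' "$1" | od -An -v -t u1 | awk '
        {
            for (i = 1; i <= NF; i++) {
                n = $i + 0                      # byte value as a number
                c = sprintf("%c", n)
                if (n < 128 && c ~ /^[A-Za-z0-9_.~-]$/) out = out c
                else out = out sprintf("%%%02X", n)
            }
        }
        END { print out }
    '
}

# Hypothetical sketch: print "<md5> <sha256> <path>" for one file.
# Assumes GNU coreutils md5sum and sha256sum.
file_identity() {
    local md5 sha
    md5=$(md5sum < "$1" | cut -d' ' -f1)
    sha=$(sha256sum < "$1" | cut -d' ' -f1)
    printf '%s %s %s\n' "$md5" "$sha" "$1"
}
```

For example, `pct_encode_segment 'a b'` prints `a%20b`. Escaping at the byte level keeps the output pure ASCII regardless of the locale the script runs in.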
Test-Driven from Day One
Every core shell or AWK module in DataGator is paired with a unit test. No test, no merge. Run everything at once:
find scripts -name '*_test.sh' -exec bash {} \;
This ensures stability across updates, formats, and edge cases.
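For readers unfamiliar with shell unit tests, a `*_test.sh` file might look roughly like the sketch below. Everything in it is hypothetical: `count_segments` stands in for a module under test, and `assert_eq` and `run_tests` are illustrative helpers rather than DataGator's real test harness.

```shell
#!/usr/bin/env bash
# Hypothetical module under test: count the non-empty segments of a path.
count_segments() {
    printf '%s\n' "$1" |
        awk -F/ '{ n = 0; for (i = 1; i <= NF; i++) if ($i != "") n++; print n }'
}

failures=0

# assert_eq <expected> <actual> <label>: report one result and count failures.
assert_eq() {
    if [ "$1" = "$2" ]; then
        printf 'ok   %s\n' "$3"
    else
        printf 'FAIL %s: expected %s, got %s\n' "$3" "$1" "$2"
        failures=$((failures + 1))
    fi
}

run_tests() {
    assert_eq 3 "$(count_segments '/mnt/data/file.txt')" 'absolute path'
    assert_eq 2 "$(count_segments 'a//b/')" 'empty segments are ignored'
    assert_eq 0 "$(count_segments '')" 'empty path'
    [ "$failures" -eq 0 ]   # return status reflects the overall result
}

run_tests
```

One caveat on running everything at once: `find ... -exec bash {} \;` does not propagate the exit status of the scripts it runs, so in CI a wrapper loop that records each script's status may be a safer choice.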
Our Philosophy
"We do not store files - we store truth.
Names are not paths - they are sequences of intention."
-- DataGator Design Memo #001
Edge Cases and Real-World Gotchas
- Filenames with control characters are safely encoded, and the raw form is preserved
- Deleted files are never dropped from the index; they are tracked as disappeared
- NTFS alternate data streams are treated as child files
- Snapshots and USB mounts are treated as distinct indexable volumes
- Sync logic maps chunk hashes and path segments, not whole files
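The "tracked as disappeared" behavior can be illustrated by comparing two manifests. The sketch below is an assumption-laden stand-in, not the logic of `drift_detector.sh`: the function name `manifest_drift` is hypothetical, and both manifests are assumed to be in `sha256sum` output format (`<hash>  <path>`).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify paths as added, changed, or disappeared
# between an older and a newer manifest in sha256sum format.
manifest_drift() {
    # $1 = older manifest, $2 = newer manifest
    awk '
        { hash = $1; path = substr($0, length(hash) + 3) }
        FILENAME == ARGV[1] { old[path] = hash; next }
        {
            seen[path] = 1
            if (!(path in old))          print "added\t" path
            else if (old[path] != hash)  print "changed\t" path
        }
        END {
            # Paths only present in the older manifest are reported, not forgotten.
            for (p in old)
                if (!(p in seen)) print "disappeared\t" p
        }
    ' "$1" "$2" | sort
}
```

Usage would be along the lines of `manifest_drift monday.sha256 tuesday.sha256`, where both file names are placeholders.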
Who Is DataGator For?
- Home Lab Architects: Multi-device ZFS setups, Pi clusters, NAS sprawl
- Researchers & Developers: Indexing legacy data and academic corpora
- Digital Forensics Pros: Path+hash snapshots for provable evidence trails
- Backup & Archival Tinkerers: Smart deduping, drift detection, and restoration
- Compliance Officers: 21 CFR Part 11, GDPR, and audit traceability
What DataGator Replaces
- Dev Environments: replaces Dockerfiles and Nix with rehydrated snapshots
- Package Managers: segment-based deduped installs instead of dependency hell
- SBOM Tools: cryptographic manifests at source, not post-facto guesswork
- Vaults/KMS: signed secrets with ACL lineage and timestamped provenance
- ML Registries: hash-chain verified corpora and models for reproducibility
- Backup/Sync Tools: smarter than rsync, more forensic than NAS snapshots
CLI Command Overview
Command | Description | Key Flags |
scan | Ingest volumes from scan.yaml into the indexed DB | --volume, --path |
condense | Deduplicate, archive, and compress old sessions | --older-than, --archive-out |
export | Export sessions or host views as JSON/YAML | --session, --out |
doctor | Run diagnostics on system, mountpoints, and schemas | --details, --json |
Built for Compliance and Courtroom
- Immutable, time-sealed manifests for 21 CFR Part 11, GDPR
- Signed content blobs for tamper-evident trails
- Full ACL lineage and permission diffs for NTFS/APFS/ext4
- Court-admissible snapshots with hash+path chains
- Machine-verifiable diff reports over time
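To make the idea of a tamper-evident hash chain concrete, here is a minimal sketch in which each snapshot's link depends on the previous link and on the digest of the current manifest. It illustrates the general technique only; the function name `chain_link`, the all-zero starting value, and the link format are assumptions, not DataGator's documented scheme, and no signing is performed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a hash chain over snapshot manifests.
# Assumes GNU coreutils sha256sum.
chain_link() {
    # $1 = previous link (for example 64 zeros for the first snapshot)
    # $2 = manifest file for the current snapshot
    local prev="$1" digest
    digest=$(sha256sum < "$2" | cut -d' ' -f1)
    # The new link commits to both the history so far and the current manifest.
    printf '%s %s\n' "$prev" "$digest" | sha256sum | cut -d' ' -f1
}
```

Because every link feeds into the next one, altering an earlier manifest changes all later links, which is what makes after-the-fact edits detectable.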
Beyond Indexing: What Sets DataGator Apart
- Recursive archive extraction inside disk images and formats like QIC/MDOS
- Sparse file detection using native system APIs
- Overlay fingerprinting for containers, VMs, and firmware
- Metadata normalization for broken encodings and legacy filesystems
- Entropy heatmaps to visualize compressed or encrypted regions
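The entropy idea behind the heatmaps can be sketched with standard tools: compressed or encrypted data tends toward 8 bits of Shannon entropy per byte, while repetitive data sits far lower. The function name `file_entropy` is hypothetical, and this sketch scores a whole file rather than per-region windows as a heatmap would.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: Shannon entropy of a file in bits per byte (0 to 8).
file_entropy() {
    od -An -v -t u1 "$1" | awk '
        { for (i = 1; i <= NF; i++) { count[$i]++; total++ } }
        END {
            if (total == 0) { print "0.000"; exit }
            h = 0
            for (b in count) {
                p = count[b] / total
                h -= p * log(p) / log(2)
            }
            printf "%.3f\n", h
        }
    '
}
```

A file containing a single repeated byte scores `0.000`, and a file with four equally frequent byte values scores `2.000`.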
AI Dataset Integrity & Model Provenance
- Legal authorship tracking with input, model, and export hashes
- GDPR/EU AI Act compliance via full source traceability
- Poisoned dataset detection with diffable manifest snapshots
- Reproducibility for academic or benchmark use
Extend DataGator with Custom Parsers
Want to analyze a weird format or ancient archive? You can build your own extractor.
- Add support with infogator add-parser my_format.py
- Use declarative format schemas for reproducible exports
- Join a growing community of open-source plugin developers
Future & Expansion Plans
- Web UI Explorer with overlay tree viewer
- AI summarization of suspicious archives or entropy zones
- PostgreSQL/SQLite backend for diffable index snapshots
- FUSE-based virtual mount of indexed archives (planned)
- Entropy heatmap per file and segment (with SVG export)