Tiny MapReduce

This project is an independent implementation of a MapReduce-like distributed framework for learning purposes.

Overview

Q: Whats MapReduce?

A: MapReduce is a distributed programming model developed by Jeffrey Dean and Sanjay Ghemawat at Google in 2004 for processing and generating large data sets with a parallel algorithm on a cluster. It consists of two main steps:

the Map step, where the input data is processed and transformed into intermediate key-value pairs.
the Reduce step, where the intermediate key-value pairs are aggregated to produce the final output.

Although MapReduce as a system is considered a legacy framework today, its programming model's design principles continue to have a significant impact on both industry and academia. The model has influenced the development of modern distributed data processing systems such as Hadoop, Spark, and Flink, and remains a foundational concept for large-scale parallel data processing.

Q: What does MapReduce solve?

A: MapReduce provides a framework for processing and generating large data sets on a distributed cluster. It abstracts away the complexities of a distributed system such as parallelization, fault tolerance, data distribution, and load balancing, allowing developers to focus on the map and reduce functions (business core logic).

In other words, with this kinda framework, teams can build scalable distributed data processing pipelines without requiring every engineer to be an expert in distributed systems.

Example to Run (Sequential Version)

# Build the plugin
go build -buildmode=plugin -o build/wc.so ./plugins/wc/

# Build sequential runner
go build -o build/seq ./cmd/sequential/

# Run
cd build
rm -f mr-out*
./seq wc.so ../testdata/pg*.txt

# View sorted output
sort mr-out-0

References

MapReduce: Simplified Data Processing on Large Clusters
MIT 6.5840 Distributed Systems (previously known as "6.824: Distributed Systems")

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
cmd		cmd
plugins		plugins
testdata		testdata
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
coordinator.go		coordinator.go
go.mod		go.mod
image.png		image.png
rpc.go		rpc.go
worker.go		worker.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tiny MapReduce

Overview

Example to Run (Sequential Version)

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Tiny MapReduce

Overview

Example to Run (Sequential Version)

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages