Have you ever worked with Node.js streams? What was your experience like?
When I first tried to work with streams, I was confused, to say the least. The concept was completely new to me. I thought I could just ignore them, but it turns out they're everywhere in Node.js. Even core modules like fs and http use streams under the hood. So, I had to learn them and understand how they work.
What helped me was building a strong mental model that consists of multiple concepts. In this article, we'll explore these concepts and build a mental model of Node.js streams together.
What are Node.js Streams?
The main idea behind streams is that they take pieces of data from one place and transfer them to another. There are 4 important parts that I want to highlight based on this definition:
Streams transfer data in pieces, not as a whole
Streams transfer pieces of data in a specific size
Streams aren't interested in the transferred data
Streams simply provide a mechanism for data transfer
A common analogy used to describe streams is a pipe. However, this analogy often misses 2 crucial parts: the producer and the consumer. Let's use the same analogy but make it more complete.
Imagine a huge reservoir of water, and you have a house nearby. To supply water to your house, you need to build a pipe from the reservoir to your home.
P.S. I'm not a plumber, so don't take this drawing too seriously.
This analogy illustrates the three key parts of a stream:
The water reservoir is a producer of water
The pipe is a stream that transfers water from the reservoir to your home
Your home is a consumer of water
Coming back to Node.js streams, let's compare the pipe analogy to how they behave:
The pipe doesn't transfer the entire reservoir of water all at once
The pipe transfers water in pieces, each of a specific size that it can handle
The pipe is not interested in the water itself; it's just a way to transfer it
The pipe is just a mechanism to transfer water from one place to another
Looks pretty similar to Node.js streams, right?
When are Node.js streams used?
Before going into the specific details of what streams are and how they work, let's first understand when they’re used.
Real-time data processing
Streams are highly effective for processing data that is generated incrementally or received in parts over time.
An ideal example of this is the WebSocket protocol. In short, it's a protocol that allows you to establish a two-way communication channel between a client and a server.
We'll get into more details on this protocol in upcoming articles. For now, let's take the WS library as an example. It uses streams heavily; for instance, its Sender abstraction implements a backpressure mechanism. We'll talk about backpressure in an upcoming section. And that is just one example; you can explore the library further and find other use cases.
Network interactions
Every time you create a server using the Node.js API, you're creating a duplex stream. The HTTP module in Node.js uses an abstraction called Socket to create a connection with a network socket. This Socket abstraction extends the Duplex stream.
ObjectSetPrototypeOf(Socket.prototype, stream.Duplex.prototype);
ObjectSetPrototypeOf(Socket, stream.Duplex);
Whenever you see a construction like the following:
import { createServer } from 'http';
const server = createServer();
Know that under the hood, you're creating a duplex stream.
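You can verify this yourself by inspecting the socket attached to an incoming request. Here is a minimal sketch (the port and the log output are just for illustration):
import { createServer } from 'node:http';
import { Duplex } from 'node:stream';
const server = createServer((req, res) => {
  // The underlying network socket for this connection is a duplex stream
  console.log(req.socket instanceof Duplex); // true
  res.end('ok');
});
server.listen(3000);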
Working with large datasets
Imagine that you have a file that is 100GB in size. You need to parse it and process some data. How would you do it?
If you try to read the file using an API like readFileSync or readFile, you'll crash your program.
import { readFileSync } from 'node:fs';
import { readFile } from 'node:fs/promises';
const largeFilePath = 'path/to/large/file.txt';
// Both of these will crash your program
const data = readFileSync(largeFilePath);
const asyncData = await readFile(largeFilePath);
The problem is that you're trying to load the whole file content into memory using these read APIs. Doesn't sound efficient at all. What we can do instead is to process the file's content in chunks.
import { createReadStream } from 'fs';
const largeFilePath = 'path/to/large/file.txt';
const stream = createReadStream(largeFilePath);
stream.on('data', (chunk) => {
// Process the chunk here
});
With this approach, we're not waiting for the whole file to be loaded into memory. Whenever a chunk of data is ready, we're processing it.
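As a concrete example, here is a small sketch that counts how many bytes a file contains without ever holding the whole file in memory (the counting logic is just an illustration):
import { createReadStream } from 'node:fs';
const largeFilePath = 'path/to/large/file.txt';
const stream = createReadStream(largeFilePath);
let totalBytes = 0;
stream.on('data', (chunk) => {
  // Only this chunk lives in memory, not the whole file
  totalBytes += chunk.length;
});
stream.on('end', () => {
  console.log(`Processed ${totalBytes} bytes`);
});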
Data transformation
All previous examples were about the cases where we either read data from somewhere or write data to somewhere. But we can also use streams to transform data that we already have in memory.
A good example of this is data compression/decompression. Here is an example taken from the Node.js documentation for the zlib module.
import { createGzip } from 'node:zlib';
import { createReadStream, createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream';
import { promisify } from 'node:util';
const pipe = promisify(pipeline);
async function do_gzip(input, output) {
  const gzip = createGzip();
  // Create a read stream to read data from the input
  const source = createReadStream(input);
  // Create a write stream to write data to the output
  const destination = createWriteStream(output);
  // Pipe the source stream to the gzip stream,
  // then to the destination stream
  await pipe(source, gzip, destination);
}
In this code snippet, we're creating a read stream, and whenever data comes from it, we pass it down to the gzip stream. When the gzip stream compresses the data, we pass it down to the write stream. You don't have to understand how this code works just yet; just know that streams can be used to transform data.
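To give you a feel for what a custom transformation looks like, here is a small sketch of a Transform stream that upper-cases whatever text flows through it; this stream is my own example, not part of the snippet above:
import { Transform } from 'node:stream';
// A transform stream receives chunks, modifies them, and passes them on
const toUpperCase = new Transform({
  transform(chunk, encoding, callback) {
    callback(null, chunk.toString().toUpperCase());
  },
});
process.stdin.pipe(toUpperCase).pipe(process.stdout);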
Don't use streams in this case
You don't want to use streams when the data you're working with is already in memory. There is little to no benefit to gain from using streams in that case. So please, avoid reaching for streams when all the pieces of data you need are already in memory. Don't make your life harder.
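For instance, if the data is already sitting in memory as a string or a buffer, a synchronous call does the job without any stream machinery. A minimal sketch using zlib's in-memory API:
import { gzipSync } from 'node:zlib';
// The data is already in memory, so there is no need for a stream pipeline
const data = Buffer.from('some data we already have in memory');
const compressed = gzipSync(data);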
Core concepts on Node.js streams
You understand what streams are, when to use them, and when not to. Now, you're ready to dive deeper into some of the core concepts of streams in Node.js.
Event-driven architecture
You know that streams are like pipes. But what exactly makes them work this way? It is all thanks to the event-driven concepts that streams are built upon. In particular, all streams in Node.js extend the EventEmitter class. The way EventEmitter works is very simple: it has some internal state where it stores all events and the listeners of these events.
class EventEmitter {
// Map of events and their listeners
// Each event can have multiple listeners
#events = new Map<string, (() => void)[]>();
// Register a new listener for the event
on(eventName: string, callback: () => void) {
if (!this.#events.has(eventName)) {
this.#events.set(eventName, []);
}
this.#events.get(eventName).push(callback);
}
// Triggers all listeners related to the event.
emit(eventName: string) {
const listeners = this.#events.get(eventName);
if (!listeners) {
return;
}
listeners.forEach((listener) => listener());
}
}
It is a very simplified version, but it gives you an idea of how EventEmitter works. You can read the full implementation in the Node.js source code.
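Here is how the simplified class above could be used; this usage example is mine, not something from the Node.js source:
const emitter = new EventEmitter();
// Register two listeners for the same event
emitter.on('data', () => console.log('first listener called'));
emitter.on('data', () => console.log('second listener called'));
// Triggers both listeners
emitter.emit('data');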
When you work with streams, you can add a listener to some predefined set of events.
stream.on('data', () => {});
In this example, we add a listener to the data event. Whenever a chunk of data is ready, the stream calls emit with the data event name, and all the listeners are called.
It's the exact mechanism that makes streams work like pipes, where we get data from one end and pass it through to the other end.
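The data event is just one of a predefined set. A quick sketch of the most common events you'll listen to on a readable stream:
import { createReadStream } from 'node:fs';
const stream = createReadStream('path/to/file.txt');
stream.on('data', (chunk) => console.log(`received ${chunk.length} bytes`));
stream.on('end', () => console.log('no more data to read'));
stream.on('error', (error) => console.error('something went wrong', error));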
Backpressure
Streams can be used to process large datasets efficiently. But there is a catch: what if the rate of data production is so high that at some point in time, we have more data in our program than allocated memory can handle? Right, the program will crash.
This means that just the abstraction of a stream is not enough to prevent such cases from happening. Streams have a backpressure mechanism in place for such cases.
Backpressure might sound like a fancy term, but in reality, it is quite simple. The main idea of backpressure is that we have some limit on how much data we can process at a time.
Let's get back to the example of reading a large file. There are 2 parts of this process that we're interested in: the producer of the data and the consumer of the data. The producer is the underlying OS mechanism that reads the file and produces chunks of data; the consumer is your program, which processes those chunks.
If the producer tries to push too much data, a stream can signal to the producer that it needs to slow down because it can't take any more data at the moment. But how does the stream know when it's full?
Each stream has an internal buffer, and as new data comes in and old data goes out, the "buffering" mechanism comes into play.
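With writable streams, you can observe this signal directly: write() returns false once the internal buffer is full, and the drain event tells the producer it can continue. A minimal sketch (the output path is just an example):
import { createWriteStream } from 'node:fs';
const destination = createWriteStream('path/to/output.txt');
function writeChunk(chunk) {
  // write() returns false when the internal buffer is above the highWaterMark
  const canTakeMore = destination.write(chunk);
  if (!canTakeMore) {
    // Slow down and wait until the buffer is drained
    destination.once('drain', () => {
      console.log('buffer drained, safe to write again');
    });
  }
}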
Buffering
Each stream has an internal buffer. If we work with an API that enables the backpressure mechanism, this buffer is used to store the data that comes into the stream.
If data comes into the stream but doesn't come out of it, the buffer steadily gets filled until it reaches the cap. The cap, in this case, is the highWaterMark property set for each individual stream.
Here is an example of how we can set the highWaterMark property when reading a file.
import { createReadStream } from 'node:fs';
const filePath = 'path/to/file.txt';
const readStream = createReadStream(filePath, { highWaterMark: 1024 });
The highWaterMark is set to 64KB by default for the createReadStream function. When the internal buffer frees up some space, the stream can start reading more data from the source.
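You can also inspect the buffer at runtime. This sketch reuses the stream from the example above and logs the cap along with how much data is currently buffered:
import { createReadStream } from 'node:fs';
const filePath = 'path/to/file.txt';
const readStream = createReadStream(filePath, { highWaterMark: 1024 });
readStream.once('readable', () => {
  console.log(readStream.readableHighWaterMark); // 1024, the cap of the buffer
  console.log(readStream.readableLength); // how many bytes are buffered right now
});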
Piping and chaining
In more or less complex Node.js applications, you'll need to transform data that comes from a stream or send this data to some other destination. In cases like this, a concept called "piping" comes in handy. You can create a chain of streams where one stream is connected to another, and whenever data comes into the first stream in the chain, it goes through the whole chain. If you're familiar with reactive programming and things like RxJS, this concept should feel familiar.
import { createReadStream, createWriteStream } from 'node:fs';
import { createGzip } from 'node:zlib';
import { pipeline } from 'node:stream/promises';
const source = createReadStream('path/to/file.txt');
const destination = createWriteStream('path/to/file.txt.gz');
const gzip = createGzip();
await pipeline(source, gzip, destination);
In this example, the source stream triggers the whole pipeline. It goes like this:
The source stream reads data from the file
The source stream passes this data to the gzip stream
The gzip stream compresses the data
The gzip stream passes the compressed data to the destination stream
The destination stream writes the compressed data to the file
The whole pipeline is finished
Every stage of the pipeline has its own internal buffer and backpressure mechanism. It means that if the gzip stream can't handle the data that comes from the source stream, it can signal to the source stream to slow down. The same thing goes for the destination stream.
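For comparison, the same chain can be written with the older pipe() method. It still propagates backpressure through the chain, but unlike pipeline it doesn't forward errors or clean up the streams for you, which is why pipeline is usually the safer choice:
import { createReadStream, createWriteStream } from 'node:fs';
import { createGzip } from 'node:zlib';
createReadStream('path/to/file.txt')
  .pipe(createGzip())
  .pipe(createWriteStream('path/to/file.txt.gz'));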
Conclusion
Streams are at the heart of any Node.js application, whether you use them explicitly or not. They are also one of the most powerful features in Node.js and are used in many different places, from network interactions to file processing.
They are especially useful when you need to process large datasets or work with real-time data. The core mental model of streams is built around the following concepts:
Data over time
Event-driven architecture
Backpressure
Buffering
Piping and chaining
By understanding these concepts and having a clear picture of how streams operate at a conceptual level, you can build more efficient Node.js apps.