Building a Mental Model of Node.js Streams

Have you ever worked with Node.js streams? What was your experience like?

When I first tried to work with streams, I was confused, to say the least. The concept was completely new to me. I thought I could just ignore them, but it turns out they're everywhere in Node.js. Even core modules like fs and http use streams under the hood. So, I had to learn them and understand how they work.

What helped me was building a strong mental model that consists of multiple concepts. In this article, we'll explore these concepts and build a mental model of Node.js streams together.

What are Node.js Streams?

The main idea behind streams is that they take pieces of data from one place and transfer them to another. There are 4 important parts that I want to highlight based on this definition:

  • Streams transfer data in pieces, not as a whole

  • Streams transfer pieces of data of a specific size

  • Streams aren't interested in the transferred data

  • Streams simply provide a mechanism for data transfer

A common analogy used to describe streams is a pipe. However, this analogy often misses 2 crucial parts: the producer and the consumer. Let's use the same analogy but make it more complete.

Imagine a huge reservoir of water, and you have a house nearby. To supply water to your house, you need to build a pipe from the reservoir to your home.

Reservoir of water connected to a house through a pipe

P.S. I'm not a plumber, so don't take this drawing too seriously.

This analogy illustrates the three key parts of a stream:

  • The water reservoir is a producer of water

  • The pipe is a stream that transfers water from the reservoir to your home

  • Your home is a consumer of water

Coming back to Node.js streams, let's compare the pipe analogy to how they behave:

  • The pipe doesn't transfer the entire reservoir of water all at once

  • The pipe transfers water in pieces, each of a specific size that it can handle

  • The pipe is not interested in the water itself; it's just a way to transfer it

  • The pipe is just a mechanism to transfer water from one place to another

Looks pretty similar to Node.js streams, right?

When are Node.js streams used?

Before going into the specific details of what streams are and how they work, let's first understand when they’re used.

Real-time data processing

Streams are highly effective for processing data that is generated incrementally or received in parts over time.

An ideal example of this is the WebSocket protocol. In short, it's a protocol that allows you to establish a two-way communication channel between the client and the server.

We'll get into more details on this protocol in upcoming articles. For now, let's take the ws library as an example, since it uses streams heavily. Here is an example where the abstraction called Sender implements a backpressure mechanism.

We'll talk about backpressure in an upcoming section. This is just one example; you can explore the library further and find other use cases.
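To make the idea of "data over time" concrete, here is a toy sketch of a readable stream that produces chunks incrementally. This is not how ws implements its Sender; it's just a hypothetical producer that mimics data arriving in parts, the way WebSocket frames do:

import { Readable } from 'node:stream';

// A toy producer that pushes a chunk every 100ms,
// simulating data that arrives incrementally over time
const realTimeSource = new Readable({
  read() {}, // data is pushed by the timer below instead
});

const timer = setInterval(() => {
  realTimeSource.push(`reading: ${Date.now()}\n`);
}, 100);

setTimeout(() => {
  clearInterval(timer);
  realTimeSource.push(null); // signal that no more data is coming
}, 1000);

realTimeSource.on('data', (chunk) => process.stdout.write(chunk));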

Network interactions

Every time you create a server using the Node.js API, you're creating a duplex stream. The http module in Node.js uses an abstraction called Socket to create a connection with a network socket. This Socket abstraction extends the Duplex stream.

ObjectSetPrototypeOf(Socket.prototype, stream.Duplex.prototype);
ObjectSetPrototypeOf(Socket, stream.Duplex);

Whenever you see a construction like the following:

import { createServer } from 'http';

const server = createServer();

Know that under the hood, you're creating a duplex stream.
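You can see this for yourself: the request is a readable stream, the response is a writable stream, and the socket underneath them is a duplex stream. A minimal sketch (the port and the echo behavior are arbitrary choices for illustration):

import { createServer } from 'node:http';
import { Duplex } from 'node:stream';

const server = createServer((req, res) => {
  // req is a readable stream, res is a writable stream,
  // and the underlying socket is a duplex stream
  console.log(req.socket instanceof Duplex); // true

  // Echo the request body back to the client
  req.pipe(res);
});

server.listen(3000);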

Working with large datasets

Imagine that you have a file that is 100GB in size. You need to parse it and process some data. How would you do it?

If you try to read the file using an API like readFileSync or readFile, you'll crash your program.

import { readFileSync, readFile } from 'fs';

const largeFilePath = 'path/to/large/file.txt';

// Both of these will crash your program
const data = readFileSync(largeFilePath);
const asyncData = await readFile(largeFilePath);

The problem is that you're trying to load the whole file content into memory using these read APIs. Doesn't sound efficient at all. What we can do instead is to process the file's content in chunks.

import { createReadStream } from 'fs';

const largeFilePath = 'path/to/large/file.txt';
const stream = createReadStream(largeFilePath);

stream.on('data', (chunk) => {
  // Process the chunk here
});

With this approach, we're not waiting for the whole file to be loaded into memory. Whenever a chunk of data is ready, we're processing it.
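Readable streams are also async iterable, so the same chunk-by-chunk processing can be written with for await...of. Here is a sketch that counts newline characters without ever holding the whole file in memory (it assumes an ES module, so top-level await is available, and the file path is just a placeholder):

import { createReadStream } from 'node:fs';

const largeFilePath = 'path/to/large/file.txt';

let lineCount = 0;

// Each chunk is a Buffer; we only ever hold one chunk in memory at a time
for await (const chunk of createReadStream(largeFilePath)) {
  for (const byte of chunk) {
    if (byte === 0x0a) lineCount++; // 0x0a is '\n'
  }
}

console.log(`Lines: ${lineCount}`);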

Data transformation

All previous examples were about the cases where we either read data from somewhere or write data to somewhere. But we can also use streams to transform data that we already have in memory.

A good example of this is data compression/decompression. Here is an example taken from the Node.js documentation for the zlib module.

import { createGzip } from 'node:zlib';
import { pipeline } from 'node:stream';
import { createReadStream, createWriteStream } from 'node:fs';
import { promisify } from 'node:util';

const pipe = promisify(pipeline);

async function do_gzip(input, output) {
  const gzip = createGzip();

  // Create a read stream to read data from the input
  const source = createReadStream(input);

  // Create a write stream to write data to the output
  const destination = createWriteStream(output);

  // Pipe the source stream to the gzip stream,
  // then to the destination stream
  await pipe(source, gzip, destination);
}

In this code snippet, we're creating a read stream, and whenever data comes from it, we pass it down to the gzip stream. Once the gzip stream compresses the data, it passes the result down to the write stream.

You don't have to understand how this code works just yet. Just understand that streams can be used to transform different data.
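Transformation isn't limited to built-in streams like gzip. You can also write your own Transform stream; here is a minimal sketch that upper-cases whatever text flows through it:

import { Transform } from 'node:stream';

// A tiny Transform stream that upper-cases incoming text chunks
const upperCase = new Transform({
  transform(chunk, encoding, callback) {
    // Pass the transformed chunk downstream
    callback(null, chunk.toString().toUpperCase());
  },
});

// Pipe stdin through the transform and out to stdout
process.stdin.pipe(upperCase).pipe(process.stdout);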

Don't use streams in this case

You don't want to use streams when the data you're working with is already in memory. There is little to no benefit to gain from using streams in that case.

So please, try to avoid using streams when all pieces of data that you need are already in memory. Don't make your life harder.
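As a contrived illustration, compare wrapping an in-memory array in a stream with simply iterating over it. The second version is shorter, faster, and easier to reason about:

import { Readable } from 'node:stream';

const names = ['alice', 'bob', 'carol']; // already fully in memory

// Overkill: wrapping in-memory data in a stream just to iterate over it
Readable.from(names).on('data', (name) => console.log(name.toUpperCase()));

// Simpler: the data is already here, so just use it directly
for (const name of names) {
  console.log(name.toUpperCase());
}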

Core concepts on Node.js streams

You understand what streams are, when to use them, and when not to. Now, you're ready to dive deeper into some of the core concepts of streams in Node.js.

Event-driven architecture

You know that streams are like pipes. But what exactly makes them work this way? It is all thanks to the event-driven concepts that streams are built upon. In particular, all streams in Node.js extend the EventEmitter class.

The way EventEmitter works is very simple. It has some internal state where it stores all events and listeners of these events.

class EventEmitter {
  // Map of events and their listeners
  // Each event can have multiple listeners
  #events = new Map<string, (() => void)[]>();

  // Register a new listener for the event
  on(eventName: string, callback: () => void) {
    if (!this.#events.has(eventName)) {
      this.#events.set(eventName, []);
    }

    this.#events.get(eventName).push(callback);
  }

  // Triggers all listeners related to the event.
  emit(eventName: string) {
    const listeners = this.#events.get(eventName);

    if (!listeners) {
      return;
    }

    listeners.forEach((listener) => listener());
  }
}

It is a very simplified version, but it gives you an idea of how EventEmitter works. You can read the full implementation in the Node.js source code.

When you work with streams, you can add listeners to a predefined set of events.

stream.on('data', () => {});

In this example, we add a listener to the data event. Whenever a chunk of data is ready, the stream calls emit with the data event name, and all registered listeners are called.

It's the exact mechanism that makes streams work like pipes, where we get data from one end and pass it through to the other end.
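Besides data, streams emit a handful of other predefined events you can subscribe to, such as end and error. A quick sketch (the file path is a placeholder):

import { createReadStream } from 'node:fs';

const stream = createReadStream('path/to/file.txt');

stream.on('data', (chunk) => console.log(`Received ${chunk.length} bytes`));
stream.on('end', () => console.log('No more data'));
stream.on('error', (err) => console.error('Something went wrong:', err));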

Backpressure

Streams can be used to process large datasets efficiently. But there is a catch: what if the rate of data production is so high that at some point in time, we have more data in our program than allocated memory can handle? Right, the program will crash.

Example of how memory can overflow while transferring all data at once

This means that the stream abstraction alone is not enough to prevent such cases from happening. That's why streams have a backpressure mechanism in place.

Backpressure might sound like a fancy term, but in reality, it is quite simple. The main idea of backpressure is that we have some limit on how much data we can process at a time.

Let's get back to the example of reading a large file. There are 2 parts of this process that we're interested in: the producer of data and the consumer of data. The producer of data is the underlying OS mechanism that reads the file and produces the data, and the consumer is our code that processes each chunk as it arrives.

If the producer tries to push too much data, a stream can signal to the producer that it needs to slow down because it can't take any more data at the moment. But how does the stream know when it's full?

Each stream has an internal buffer, and whenever new data comes in or old data goes out, the "buffering" mechanism comes into play.
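For writable streams, that slow-down signal is easy to observe in code: write() returns false once the internal buffer is full, and the drain event tells the producer when it's safe to continue. A minimal sketch (the output path is a placeholder):

import { createWriteStream } from 'node:fs';

const destination = createWriteStream('path/to/output.txt');

function writeChunk(chunk) {
  // write() returns false once the internal buffer is above highWaterMark
  const canContinue = destination.write(chunk);

  if (!canContinue) {
    console.log('Backpressure: pausing the producer');

    // 'drain' fires when the buffer has been flushed
    // and it's safe to write again
    destination.once('drain', () => {
      console.log('Buffer drained, resuming the producer');
    });
  }
}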

Buffering

Each stream has an internal buffer. If we work with an API that enables the backpressure mechanism, this buffer is used to store the data that comes into the stream.

Illustration of how a stream internal buffer changes when data comes in and out

If data comes into the stream but doesn't come out of it, the buffer steadily fills up until it reaches its cap. The cap, in this case, is the highWaterMark property set for each individual stream.

Here is an example of how we can set the highWaterMark property when reading a file.

import { createReadStream } from 'node:fs';

const filePath = 'path/to/file.txt';

const readStream = createReadStream(filePath, { highWaterMark: 1024 });

The highWaterMark is set to 64 KB by default for the createReadStream function. When the internal buffer frees up some space, the stream can start reading more data from the source.
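You can also inspect the buffer at runtime: readableHighWaterMark is the cap we configured, and readableLength shows how much data is currently sitting in the internal buffer. A small sketch (the file path is a placeholder):

import { createReadStream } from 'node:fs';

const readStream = createReadStream('path/to/file.txt', { highWaterMark: 1024 });

readStream.on('data', () => {
  // The cap we configured above
  console.log(readStream.readableHighWaterMark); // 1024

  // How much data is currently buffered inside the stream
  console.log(readStream.readableLength);
});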

Piping and chaining

In any reasonably complex Node.js application, you'll need to transform data that comes from a stream or send this data to some other destination. In cases like this, a concept called "piping" comes in handy.

You can create a chain of streams where one stream is connected to another, and whenever data comes into the first stream in the chain, it flows through the whole chain. If you're familiar with reactive programming and libraries like RxJS, this concept should feel familiar.

import { createReadStream, createWriteStream } from 'node:fs';
import { createGzip } from 'node:zlib';
import { pipeline } from 'node:stream/promises';

const source = createReadStream('path/to/file.txt');
const destination = createWriteStream('path/to/file.txt.gz');
const gzip = createGzip();

await pipeline(source, gzip, destination);

In this example, the source stream triggers the whole pipeline. It goes like this:

  1. source stream reads data from the file

  2. source stream passes this data to the gzip stream

  3. gzip stream compresses the data

  4. gzip stream passes the compressed data to the destination stream

  5. destination stream writes the compressed data to the file

  6. The whole pipeline is finished

Every stage of the pipeline has its own internal buffer and backpressure mechanism. It means that if the gzip stream can't handle the data that comes from the source stream, it can signal to the source stream to slow down. The same thing goes for the destination stream.
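The older .pipe() method builds the same kind of chain and also respects backpressure at every stage. Unlike pipeline(), though, it doesn't propagate errors or clean up the other streams on failure, so each stream needs its own error handler. A sketch using the same hypothetical file paths:

import { createReadStream, createWriteStream } from 'node:fs';
import { createGzip } from 'node:zlib';

const source = createReadStream('path/to/file.txt');
const gzip = createGzip();
const destination = createWriteStream('path/to/file.txt.gz');

const handleError = (err) => console.error('Chain failed:', err);

// Each .pipe() call returns the destination stream,
// which is what allows the chaining syntax
source
  .on('error', handleError)
  .pipe(gzip)
  .on('error', handleError)
  .pipe(destination)
  .on('error', handleError);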

Conclusion

Streams are at the heart of any Node.js application, whether you use them explicitly or not. They are also one of the most powerful features in Node.js. Streams are used in many different places in Node.js, from network interactions to file processing.

They are especially useful when you need to process large datasets or work with real-time data. The core mental model of streams is built around the following concepts:

  1. Data over time

  2. Event-driven architecture

  3. Backpressure

  4. Buffering

  5. Piping and chaining

By understanding these concepts and having a clear picture of how streams operate at a conceptual level, you can build more efficient Node.js apps.