Exploring the Core Concepts of Node.js Readable Streams

In Node.js we have different types of streams, and one of them is the Readable stream. You may have heard of it, or perhaps even used it a few times.
But do you know how to use it effectively? This question of efficiency arises when we're dealing with cases that go beyond the basics. In such cases, a deeper understanding of the underlying mechanisms is important for making informed decisions.
This article explores the core concepts of Node.js Readable streams. After reading it you'll deepen your understanding of how they work and when they can be used. As a bonus, we'll see why you should be careful when playing with the `highWaterMark` property of readable streams.
Use cases of readable streams
Here are a few examples of how readable streams can be used.
Streaming data from database
If we have a large dataset in a database, or each single document is large by itself, we might want to stream documents from the database instead of trying to load them all into memory at once.
Here is an example of how we can do so using a `Readable` stream and MongoDB.
```javascript
import { Readable } from 'node:stream';

// MongoDB configuration and connection setup (creating the `client`
// and getting the reference to the `db` object) is omitted here.

// `find()` returns a cursor over the collection
const cursor = db.collection('documents').find();

const collectionStream = new Readable({
  // We're streaming objects, not buffers
  objectMode: true,
  async read(size) {
    try {
      const result = await cursor.next();
      if (result) {
        this.push(result);
      } else {
        this.push(null); // Signal the end of the stream
        await client.close();
      }
    } catch (err) {
      this.destroy(err); // Handle errors by destroying the stream
    }
  },
});
```
And after that we can use it in the following way:
```javascript
collectionStream.on('data', (doc) => {
  // Process the document
});
```
Diagram of the workflow.
Streaming file from S3 bucket directly into the application
Most applications have some kind of workflow that involves files. Often, it is built by leveraging cloud storage services like AWS S3.
If at some point you need to download a file from S3 and process it in your application, readable streams are one of the best options for doing so.
```javascript
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Configure AWS credentials and region
const s3Client = new S3Client({
  region: 'YOUR_AWS_REGION',
  credentials: {
    accessKeyId: 'YOUR_ACCESS_KEY_ID',
    secretAccessKey: 'YOUR_SECRET_ACCESS_KEY',
  },
});

const command = new GetObjectCommand({
  Bucket: 'my-bucket',
  Key: 'path/to/file.txt',
});

const response = await s3Client.send(command);
const s3Stream = response.Body;

s3Stream.on('data', (chunk) => {
  // Process the data chunk
});
```
This approach makes downloading files from S3 more efficient, since we're not waiting for the whole file to be transferred over the network before we start processing it.
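A common next step is to persist the downloaded object to disk. The sketch below does this with `pipeline` from `node:stream/promises`; `Readable.from` stands in for `response.Body` so the snippet is self-contained, and the temp-file path is arbitrary. With a real S3 response you would pass the `Body` stream directly.

```javascript
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { Readable } from 'node:stream';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Stand-in for `response.Body` from the S3 example above
const body = Readable.from(['file contents from S3']);
const target = join(tmpdir(), 'downloaded-file.txt');

// pipeline() moves chunks from the source into the file as they arrive,
// handling backpressure and stream cleanup for us.
await pipeline(body, createWriteStream(target));
```

`pipeline` also destroys both streams if either side fails, which is easy to get wrong when wiring `data` events by hand.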
Zlib compression and decompression
As Node.js documentation states:
Compression and decompression are built around the Node.js Streams API.
This means that by using the zlib API, you're working with streams. You can find the following example in the official Node.js documentation.
```javascript
import { createGzip } from 'node:zlib';
import { pipeline } from 'node:stream';
import { createReadStream, createWriteStream } from 'node:fs';

const gzip = createGzip();
const source = createReadStream('input.txt');
const destination = createWriteStream('input.txt.gz');

pipeline(source, gzip, destination, (err) => {
  if (err) {
    // Handle the error
  }
});
```
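Because compression and decompression are both streams, they compose naturally. Here is a sketch that round-trips a string entirely in memory using `createGunzip`, the counterpart of `createGzip`, and collects the result with an async iterator:

```javascript
import { createGzip, createGunzip } from 'node:zlib';
import { Readable } from 'node:stream';

// Pipe an in-memory readable through gzip and straight back through
// gunzip, then collect the decompressed chunks.
async function gzipRoundTrip(text) {
  const decompressed = Readable.from([text])
    .pipe(createGzip())
    .pipe(createGunzip());

  const chunks = [];
  for await (const chunk of decompressed) {
    chunks.push(chunk);
  }
  return Buffer.concat(chunks).toString('utf8');
}
```

Calling `gzipRoundTrip('hello streams')` resolves to the original string, confirming that the two transforms are inverses.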
There are other Node.js APIs and modules that leverage readable streams:
TCP sockets
HTTP requests and responses
Process `stdin`
Now that we've explored some common use cases, let's dive deeper into how readable streams work under the hood and learn about different reading modes and flowing states.
Reading modes and flowing states
Every readable stream in Node.js operates in one of two modes: flowing or paused. These modes dictate how you receive data from a readable stream, much like how you might control water flow in a plumbing system.
P.S. If you're not familiar with the analogy of pipes and plumbing system, check out the previous article where we build a mental model of how streams work in Node.js using the pipe analogy.
Flowing mode: The automatic approach
In flowing mode, data is read from the underlying system automatically and provided to your application as quickly as possible. This is similar to water flowing freely through an open pipe. One way to turn the flowing mode on is to attach the `data` event listener to the stream:
```javascript
import { createReadStream } from 'node:fs';

const filePath = 'path/to/a/file.txt';
const stream = createReadStream(filePath);

// Once we attach this listener, data starts flowing automatically
stream.on('data', (chunk) => {
  // Process the data chunk
});
```
This approach is perfect for scenarios where you want to process data as quickly as possible, such as streaming log files or processing real-time data.
Paused mode: The manual approach
In paused mode, you control the flow of data. One way to explicitly request each chunk of data is by using the `stream.read()` method. Think of paused mode like a water dispenser with a button; you press the button only when you want water.
```javascript
import { createReadStream } from 'node:fs';

const filePath = 'path/to/a/file.txt';
const stream = createReadStream(filePath);

// Later in your code, when you need to read data
const chunk = stream.read();
if (chunk !== null) {
  // Process the data chunk
}
```
Warning: don't mix these two modes on the same stream. It leads to unexpected behavior that is hard to debug.
Readable flowing states
These two reading modes are a simplified view of the underlying abstraction that Node.js operates with, called the readable flowing state. This state is represented by the readableFlowing property of the readable stream.
The readableFlowing state can contain one of three values: `null`, `false`, and `true`.
Null: The initial state of a newly created stream. No consumers are attached to the stream, so it's not actively reading data.
```javascript
import { Readable } from 'node:stream';

const stream = new Readable({ read(size) {} });
console.log(stream.readableFlowing); // null
```
False (Paused): The stream has consumers but is temporarily paused. Data might be available but won't be delivered until the stream is resumed. This is common when using pause() or switching to manual mode.
```javascript
stream.on('data', (chunk) => {
  // Handle data
});

stream.pause(); // Enters paused state
console.log(stream.readableFlowing); // false
```
True (Flowing): The stream is actively delivering data to consumers. Data events are being emitted automatically. This is common when using event-based consumption.
```javascript
stream.on('data', (chunk) => {
  // Handle data
});

// Enters flowing state
console.log(stream.readableFlowing); // true
```
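All three values can be observed on a single stream as consumers attach and pause it. A minimal sketch, using a readable that produces nothing so we can focus purely on the state:

```javascript
import { Readable } from 'node:stream';

// A stream with a no-op read(); we only inspect its state transitions.
const stream = new Readable({ read() {} });
console.log(stream.readableFlowing); // null — no consumer attached yet

stream.on('data', () => {});
console.log(stream.readableFlowing); // true — the 'data' listener switched it on

stream.pause();
console.log(stream.readableFlowing); // false — explicitly paused
```

Calling `resume()` at this point would flip the state back to `true`.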
Note that `readableFlowing` is managed internally by the stream; treat it as a read-only indicator. Bouncing a stream between these states by mixing consumption styles leads to unexpected behavior that is hard to debug.
Consuming readable streams data
The sole purpose of a readable stream is to deliver data to consumers. In Node.js, there are several ways to consume data from a readable stream, each with its own characteristics and use cases.
In this article we'll review only the methods used with a single readable stream: the `data` event, the `readable` event, and async iterators. In later articles we'll get familiar with the `pipe` and `pipeline` functions.
Using the `data` event
This is probably the most common approach to consuming data from a readable stream. All we have to do is attach an event listener for the `data` event.
```javascript
import { createReadStream } from 'node:fs';

const stream = createReadStream(
  'path/to/a/file/text.txt',
  { encoding: 'utf8' },
);

stream.on('data', (chunk) => {
  // Process the data chunk
});
```
Whenever the internal buffer is filled with data, the readable stream offloads it as a chunk, and you receive that chunk in the callback.

And this happens regardless of what your callback is doing. For example, if you process the data chunks asynchronously, this behavior might not be ideal for you.
```javascript
import { createReadStream } from 'node:fs';

const stream = createReadStream(
  'path/to/a/file/text.txt',
  { encoding: 'utf8' },
);

stream.on('data', async (chunk) => {
  // Imitation of data processing
  await new Promise((resolve) => setTimeout(resolve, 3000));
  console.log('Data chunk: ', chunk);
});
```
In this example you'll see most of the console logs appear at roughly the same time. The reason is that as soon as the buffer is drained, the stream starts reading the next chunk of data, regardless of whether the processing of the previous one has finished.
Using the readable event
The `readable` event is somewhat similar to the `data` event in that it is emitted when the internal buffer of the readable stream is filled with data.
However, the `readable` handler doesn't drain the buffer automatically. It only signals to the listener that the internal buffer is loaded and ready to be read. To explicitly read from the internal buffer, you can use the `read` method.
```javascript
import { createReadStream } from 'node:fs';

const stream = createReadStream(
  'path/to/a/file/text.txt',
  { encoding: 'utf8' },
);

// When the stream emits this event, the internal buffer is filled with data
stream.on('readable', () => {
  // We have to read the data manually using the `read` method
  const chunk = stream.read();
  // Process the data chunk
});
```
Here is a diagram that shows how data flows into and out of the buffer when dealing with the `readable` event.

As you can see, the buffer is still filled with data after the event fires.
Such manual control over when we read the data can be quite handy when you want to manage the data flow explicitly. For example, you can read the stream data only after some asynchronous processing has finished its work.
```javascript
import { createReadStream } from 'node:fs';

const stream = createReadStream(
  'path/to/a/file/text.txt',
  { encoding: 'utf8' },
);

stream.on('readable', async () => {
  await new Promise((resolve) => setTimeout(resolve, 3000));
  const chunk = stream.read();
  console.log('Readable chunk: ', chunk);
});
```
In this example you'll see console logs print one by one with an interval of approximately 3 seconds.
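One detail worth knowing: a single `read()` call may not drain the whole buffer, so the pattern recommended in the Node.js documentation is to loop until `read()` returns `null`. A minimal sketch of that pattern, wrapped in a helper for illustration:

```javascript
import { once } from 'node:events';

// Drain the internal buffer completely on every 'readable' event.
async function readAll(stream) {
  const chunks = [];
  stream.on('readable', () => {
    let chunk;
    // read() returns null once the internal buffer is empty
    while ((chunk = stream.read()) !== null) {
      chunks.push(chunk);
    }
  });
  await once(stream, 'end');
  return chunks;
}
```

Without the loop, leftover data can sit in the buffer until the next `readable` event, which may delay the `end` event.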
Using async iterator
Readable streams implement the async iterator interface. It is the most recent API for consuming streams, and members of the core Node.js team recommend using it over the `data` and `readable` events most of the time.
The benefit of working with async iterator is that you don't have to deal with any events by yourself. Everything is handled internally.
```javascript
import { createReadStream } from 'node:fs';

const stream = createReadStream(
  `${import.meta.dirname}/new-text.txt`,
  { encoding: 'utf8' },
);

for await (const chunk of stream) {
  // Process the data chunk
}
```
We're using a `for await...of` loop to iterate over the data the stream emits.
This approach combines the best of both the `data` and `readable` events in a single API. We receive a chunk of data whenever it is ready, without having to call the `read` method manually as we do with the `readable` event.
At the same time, we can perform asynchronous handling of each chunk, and the async iterator won't rush to read all of the data the way the `data` event does. It waits until the processing of a single chunk is finished and only then moves forward.
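To make that waiting behavior concrete, here is a sketch that does asynchronous work on each chunk inside the loop. `work` is a hypothetical stand-in for real processing, and `Readable.from` stands in for a file stream:

```javascript
import { Readable } from 'node:stream';

// The async iterator delivers the next chunk only after the loop body —
// including any awaited work — has finished.
async function processInOrder(stream, work) {
  const results = [];
  for await (const chunk of stream) {
    results.push(await work(chunk)); // the stream waits here between chunks
  }
  return results;
}

// Usage with an in-memory stand-in for a file stream:
const ordered = await processInOrder(
  Readable.from(['1', '2', '3']),
  (chunk) => new Promise((resolve) => setTimeout(resolve, 50, chunk)),
);
```

Compare this with the `data` event example earlier, where every chunk's timer started almost simultaneously.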
You now know at least three different ways to consume data from a readable stream. Another thing that affects how fast you get data from a stream is the `highWaterMark` property.
Impact of `highWaterMark` on readable streams performance
While the Node.js documentation states that streams in flowing mode emit data as quickly as possible, there's a nuance to this behavior. Data is emitted only after an internal buffer is filled. This buffer size is controlled by the `highWaterMark` property of the stream.
If you set a smaller `highWaterMark` value, the first chunk will be emitted faster because the buffer fills up more quickly. However, this doesn't necessarily mean the overall execution time will be faster. In fact, for larger files, a smaller `highWaterMark` can lead to slower processing time.
```javascript
import { createReadStream } from 'node:fs';

const filePath = 'data.txt';

// Stream with a small highWaterMark
const stream1 = createReadStream(filePath, { highWaterMark: 16 });

// The first chunk is emitted sooner, but the overall processing is slower
stream1.on('data', (chunk) => {
  // Process the data
});

// Stream with the default highWaterMark (64 KiB)
const stream2 = createReadStream(filePath);

// The first chunk is emitted slightly later, but the overall processing is faster
stream2.on('data', (chunk) => {
  // Process the data
});
```
Here are the reasons why:
Increased overhead: With a smaller buffer, the stream needs to process and emit chunks more frequently, resulting in increased overhead.
Reduced throughput: The constant filling and emptying of a small buffer can limit the overall data throughput compared to a larger buffer.
Therefore, while a smaller `highWaterMark` might provide a quicker initial response, the default `highWaterMark` is generally optimized for efficient processing of larger data streams.
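A quick way to see the overhead for yourself is to count how many `data` events each setting produces for the same file. The temp-file name and the 256 KiB size below are arbitrary choices for the experiment:

```javascript
import { createReadStream, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Create a 256 KiB sample file in the temp directory.
const filePath = join(tmpdir(), 'hwm-demo.txt');
writeFileSync(filePath, 'x'.repeat(256 * 1024));

// Count how many chunks a stream with the given options emits.
function countChunks(options) {
  return new Promise((resolve, reject) => {
    let count = 0;
    createReadStream(filePath, options)
      .on('data', () => { count += 1; })
      .on('end', () => resolve(count))
      .on('error', reject);
  });
}

const small = await countChunks({ highWaterMark: 16 }); // thousands of tiny chunks
const dflt = await countChunks({});                     // a handful of 64 KiB chunks
console.log({ small, dflt });
```

Every emitted chunk is a callback invocation plus buffer management, so the 16-byte stream does orders of magnitude more per-chunk work to move the same data.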
Conclusion
Node.js readable streams are not as simple as you might have thought initially, especially when it comes to understanding how they behave in certain use cases, under high load, or during complex manipulation. Of course, we haven't touched on every detail, but this should be enough to improve your overall understanding of readable streams and to start your own research.



