Lucene, Not As You Know It — Part 1: Indexing Without Tears — The Prologue
“Lucene is just a library, not a search engine.”
“It builds inverted indices.”
“There’s an Analyzer, Directory, and IndexWriter involved…”
You’ve probably seen those phrases before. But if you’re like me, they didn’t really mean anything at first. They were just vocabulary. This post is my attempt to build a mental model of Lucene that sticks: not surface-level docs, but actually understanding how the pieces fit. And since no one wants to read a 20-minute post anymore, I’m breaking it down into parts. Let’s see how many parts it takes…
Let’s get into Part 1.
🔧 What Even Is Lucene?
Lucene is a Java-based indexing and search library. Think of it as the engine that powers tools like:
Elasticsearch
Solr
OpenSearch
Vespa (some parts)
It doesn’t come with a REST API, cluster management, or dashboards.
Lucene just gives you the core ingredients to:
Index documents
Search those documents
Score results based on relevance
You can wire it into your own app or use bindings like PyLucene to hook it into Python.
🧱 Core Concepts
Here’s the real-world analogy I built to understand Lucene’s internals:
You feed raw text through an Analyzer → wrap it in a Document → send it to IndexWriter → and it writes index files into a Directory.
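Here’s that pipeline in code. A minimal sketch, assuming a recent Lucene release; the class name, index path, field name, and text are just placeholders I made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class HelloLucene {
    public static void main(String[] args) throws Exception {
        // Analyzer: chops raw text into searchable tokens
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Directory: where the index files will live
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));

        // IndexWriter: ties the analyzer and the directory together
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // Document: a bag of fields to index
            Document doc = new Document();
            doc.add(new TextField("body", "Lucene is just a library, not a search engine.", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit(); // flush segment files into the Directory
        }
    }
}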
But “Directory” is where things get really interesting.
🗂️ Directories in Lucene (Yes, Plural — Because One Is Never Enough)
BaseDirectory
This is an abstract class. You won’t use it directly. Think of it like the concept of “Storage” — not a specific place until you say where. Kind of like saying “I’m going to store my snacks,” but not saying if it’s the fridge or your secret drawer.
FSDirectory and RAMDirectory
FSDirectory → writes index files to disk (durable but slower)
RAMDirectory → keeps everything in memory (fast but volatile, like your short-term memory on Monday mornings). Heads up: RAMDirectory was deprecated in Lucene 8 and removed in 9; ByteBuffersDirectory is its modern replacement.
The in-memory option is great for local testing or small setups where you don’t want to wait forever. A quick taste of both below.
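Here’s how each one gets created. A quick sketch (the paths are illustrative; ByteBuffersDirectory assumes Lucene 8+):

import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

// Durable, on-disk index: survives restarts, pays for disk I/O
Directory onDisk = FSDirectory.open(Paths.get("/var/data/index"));

// Ephemeral, in-memory index: gone when the JVM exits
Directory inMemory = new ByteBuffersDirectory();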
CompoundDirectory – The Storage Combiner
Lucene creates many files per segment (.fdt, .fdx, .tim, .doc, .nvd, etc.). That’s cool, until you try uploading to cloud storage like S3 or dealing with remote filesystems. More files = more network calls = bad latency (and nobody likes waiting).
Enter CompoundDirectory, the Marie Kondo of index files — it bundles lots of little files into one neat package, a big .cfs file. You don’t call it yourself; it sneaks in via codecs like Lucene84Codec.
Why care? Imagine indexing a 1GB file that spits out hundreds of files. CompoundDirectory tidies up that mess like a pro.
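Want to see this for yourself? Directory.listAll() shows exactly which files Lucene wrote. A tiny sketch, reusing the dir from the earlier example:

// With compound format you'll see _0.cfs, _0.cfe, and _0.si
// instead of a pile of .fdt/.fdx/.tim/.doc files.
for (String file : dir.listAll()) {
    System.out.println(file);
}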
Okay, sounds awesome, right? So why not always use it? Because life’s not that simple. It’s a trade-off.
Pros:
Fewer files on disk → avoids running out of file descriptors and inodes (Linux users, you feel me? Windows users: “inode what?”)
Simpler file management
Better for cloud storage (S3, GCS, you name it)
Less file fragmentation (yeah, that’s a win: fewer scattered reads across the disk)
But reality check:
Extra merge cost → slower indexing
Large segments with one giant .cfs can be slower to read than a handful of small files
Less granular caching — you can’t cache small files like .tim separately anymore
Hinders parallelism — multi-disk or multi-thread read/write gets tricky
Default Lucene behavior:
Small segments = compound format (.cfs), tidy but a bit slower
Large segments = raw format (many files; clutter is acceptable if it buys speed)
This decision is made per segment, not globally. With the default TieredMergePolicy, a segment bigger than roughly 10% of the total index skips the compound format. So you’ll have some tidy segments and some messy ones coexisting.
Of course, you can override this. Here’s roughly what that looks like, though the deeper merge-tuning rabbit hole is one even I haven’t dared to fully explore. 🤫
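A minimal sketch of those knobs, reusing the analyzer from the first example and assuming the standard IndexWriterConfig / TieredMergePolicy APIs (the 0.1 shown is just the documented default, spelled out for illustration):

import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

IndexWriterConfig config = new IndexWriterConfig(analyzer);

// Ask for compound files in general...
config.setUseCompoundFile(true);

// ...but let the merge policy opt big segments out: with noCFSRatio = 0.1,
// a merged segment larger than 10% of the total index stays in raw format.
TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setNoCFSRatio(0.1);
config.setMergePolicy(mergePolicy);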
FileSwitchDirectory – The File Router
Now this one is powerful. Imagine you want to store:
Some files in RAM (fast reads)
The rest on disk (space saver)
Maybe even some in cloud storage via a custom Directory impl (because why not?)
One catch: FileSwitchDirectory routes between exactly two Directory instances, picking by file extension; for a third tier you’d nest another FileSwitchDirectory, or slot your custom S3-backed Directory in as one of the two sides. You hand it the set of extensions that belong to the primary directory, and everything else falls through to the secondary, like a boss:
// extra imports: org.apache.lucene.store.FileSwitchDirectory, java.util.Set
Set<String> inMemoryExtensions = Set.of("tim", "dvd"); // term dictionary + doc values stay hot
Directory smartStorage = new FileSwitchDirectory(
    inMemoryExtensions,
    new ByteBuffersDirectory(),           // primary: in-memory
    FSDirectory.open(Paths.get("/disk")), // secondary: on disk
    true                                  // close both when smartStorage closes
);
⚙️ But Who Tells Lucene What To Write?
Spoiler: You don’t need to sweat filenames like segments_1, _0.fdt, _0.nvd, or any of that cryptic stuff. Lucene is the boss here — it decides what to write, when to merge files, and how to manage the whole file circus.
Your job? Just feed it:
An Analyzer — the text chef that chops your raw input into searchable tokens
A Document — the dish you prepare and serve to Lucene to index
A Directory — the storage spot where Lucene drops its index files
That’s it. Lucene handles the rest like a pro.
Quick peek at Lucene’s magic files — the raw ingredients behind the scenes:
Lucene doesn’t create a single index file per input document. Instead, it generates multiple internal files per segment. A segment might look like this:
_0.fdt / _0.fdx → stored field data and its index (your original field values)
_0.tim / _0.tip → term dictionary and term index
_0.doc → postings (which documents contain which terms)
_0.nvd / _0.nvm → norms, the length data used for scoring
_0.si → segment metadata
segments_1 → the commit point for the whole index (one per commit, not per segment)
Per segment, you’ll see anywhere from 8 to 20+ of these files.
Think of it like a well-organized kitchen — lots of ingredients prepped and ready, so Lucene can serve up your search results fast and fresh.
Let’s talk numbers
Got a huge log file? Imagine 1TB of raw text — Lucene can index about 800GB per hour on modern hardware. That means your entire log is searchable in just over an hour.
And it can run with as little as about 1MB of heap. No massive servers required.
Plus, the index takes up just 20–30% of the original size — saving storage while making searches lightning fast.
Incremental indexing? Just as fast as batch.
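“Incremental” is doing a lot of work in that sentence, so here’s what it looks like: you just keep feeding the same IndexWriter. A sketch reusing the writer from the first example (the “id” field and “log-42” value are made-up placeholders):

// extra imports: org.apache.lucene.document.StringField, org.apache.lucene.index.Term

// updateDocument atomically deletes any existing doc matching the term,
// then indexes the new version. No full rebuild required.
Document updated = new Document();
updated.add(new StringField("id", "log-42", Field.Store.YES)); // exact-match key field
updated.add(new TextField("body", "the corrected log line", Field.Store.YES));
writer.updateDocument(new Term("id", "log-42"), updated);
writer.commit();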
That’s why Lucene powers some of the biggest search systems out there.
You might ask: 20–30% storage overhead? That’s like 200–300GB just for index files. Is it even worth it?
Great question! Whether that overhead is “worth it” depends on your priorities:
Why it’s worth it:
Lightning-fast search: Instead of scanning 1TB of raw data, Lucene finds relevant info in milliseconds.
Rich querying: Full-text search, filtering, scoring — impossible or painfully slow on raw logs.
Incremental updates: The index grows smoothly without full reprocessing.
Storage optimization: 20–30% is quite good for the search power you get.
Why it might not be:
Storage cost: If storage is super tight or expensive, that overhead matters.
Complexity: You need infrastructure to build, maintain, and optimize the index.
Not all use cases: For simple lookups, a full-text index may be overkill.
Bottom line: If you need powerful, scalable search on huge data, paying 20–30% extra storage is a small price for huge performance and capabilities. If your data is small or queries are simple, simpler approaches might be better.
So yeah, for most big data search cases, it’s absolutely worth it.
While many use Elasticsearch or Solr out of the box, companies like DoorDash build their own custom search engines on top of Lucene’s powerful core — architecting everything else themselves for maximum flexibility and performance.