Sarah Ting

Meilisearch & Laravel Scout (Part 1)

This week’s pet project has been moving an awful fulltext index on a 19-million-row table into Meilisearch with Laravel Scout.


The first thing to do was verify whether my backup server with 32GB of RAM could host a Meilisearch instance and keep up with day-to-day indexing and searching. The memory is probably too low, sitting well below the actual size of the index (the existing fulltext index accounts for about 300GB, and I’d guess the Meilisearch equivalent will end up similarly sized, but who knows! We’ll see at the end of this).

According to the documentation, Meilisearch works better when the entire database fits in memory, but since I have no idea how much better we’re talking about, I wanted to at least try it out (also because a search feature is not so valuable to me that I’m willing to pay for 300GB of RAM).

I set up the empty index and plugged in Laravel Scout to get an idea of whether it could keep up with regular site activity, and immediately ran into problems.

The native Laravel Scout driver attaches an observer to each Searchable model, which queues a request to Meilisearch with every model update. This means a single Meilisearch task is fired every time a model is created, updated, or deleted. It’s possible to pare this down with searchIndexShouldBeUpdated(), but the problem remains that each change is sent and queued as an individual task by Meilisearch.
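
For instance, a model can skip the index update entirely when only search-irrelevant columns change. A minimal sketch, where the Post model and its title/body columns are just placeholders:

class Post extends Model
{
    use Searchable;

    // Only queue an index update when the record is new or a column that
    // actually feeds the search document has changed.
    public function searchIndexShouldBeUpdated()
    {
        return $this->wasRecentlyCreated
            || $this->wasChanged(['title', 'body']);
    }
}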

Meilisearch does not handle single-document updates very well. The fresh index was backlogged with indexing tasks within a few minutes, and during peak hours this backed up into a 60+ minute queue that would only slowly drain off-peak.

Checking the contents of the http://meilisearch/tasks?status=enqueued&limit=1000 endpoint —

{
    "results": [
        {
            "status": "succeeded",
            "type": "documentDeletion",
            "details": {
                "receivedDocumentIds": 1,
                "deletedDocuments": 0
            },
            "duration": "PT0.018378150S",
            "enqueuedAt": "2022-11-30T08:50:29.231052541Z",
            "startedAt": "2022-11-30T09:45:12.624531907Z",
            "finishedAt": "2022-11-30T09:45:12.642910057Z"
        },
        {
            "status": "succeeded",
            "type": "documentAdditionOrUpdate",
            "details": {
                "receivedDocuments": 1,
                "indexedDocuments": 1
            },
            "duration": "PT2.080888276S",
            "enqueuedAt": "2022-11-30T08:50:28.647606627Z",
            "startedAt": "2022-11-30T09:45:10.540158361Z",
            "finishedAt": "2022-11-30T09:45:12.621046637Z"
        },
        {
            "status": "succeeded",
            "type": "documentAdditionOrUpdate",
            "details": {
                "receivedDocuments": 1,
                "indexedDocuments": 1
            },
            "duration": "PT2.080887604S",
            "enqueuedAt": "2022-11-30T08:50:28.180774421Z",
            "startedAt": "2022-11-30T09:45:10.540158361Z",
            "finishedAt": "2022-11-30T09:45:12.621045965Z"
        },
        {
            "status": "succeeded",
            "type": "documentAdditionOrUpdate",
            "details": {
                "receivedDocuments": 1,
                "indexedDocuments": 1
            },
            "duration": "PT2.080886922S",
            "enqueuedAt": "2022-11-30T08:50:28.168919717Z",
            "startedAt": "2022-11-30T09:45:10.540158361Z",
            "finishedAt": "2022-11-30T09:45:12.621045283Z"
        },
        {
            "status": "succeeded",
            "type": "documentAdditionOrUpdate",
            "details": {
                "receivedDocuments": 1,
                "indexedDocuments": 1
            },
            "duration": "PT2.080884466S",
            "enqueuedAt": "2022-11-30T08:50:24.764755509Z",
            "startedAt": "2022-11-30T09:45:10.540158361Z",
            "finishedAt": "2022-11-30T09:45:12.621042827Z"
        },
        {
            "status": "succeeded",
            "type": "documentDeletion",
            "details": {
                "receivedDocumentIds": 1,
                "deletedDocuments": 0
            },
            "duration": "PT0.021788875S",
            "enqueuedAt": "2022-11-30T08:50:24.04404996Z",
            "startedAt": "2022-11-30T09:45:10.516279034Z",
            "finishedAt": "2022-11-30T09:45:10.538067909Z"
        },
        {
            "status": "succeeded",
            "type": "documentDeletion",
            "details": {
                "receivedDocumentIds": 1,
                "deletedDocuments": 0
            },
            "duration": "PT0.021943263S",
            "enqueuedAt": "2022-11-30T08:50:24.040298171Z",
            "startedAt": "2022-11-30T09:45:10.492173874Z",
            "finishedAt": "2022-11-30T09:45:10.514117137Z"
        },
        ...
    ],
    "limit": 10000,
    "from": 484651,
    "next": null
}

Thousands of single-document update/deletion tasks, with each operation taking 0-3 seconds to run. At this point the index only had around 200k entries and was barely making a blip in RAM or CPU, so I felt the indexing behaviour could be improved here.


Batching

I wanted to try some naive Redis-backed batching. Below are POC MakeSearchableBatched & RemoveFromSearchBatched jobs that replace Scout’s default MakeSearchable & RemoveFromSearch 1:1. Instead of making an immediate call to Meilisearch, we’re going to push each task onto a Redis list.

use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Database\Eloquent\Model;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Collection;
use Illuminate\Support\Facades\Redis;

class MakeSearchableBatched implements ShouldQueue
{
    use Dispatchable, SerializesModels;

    public array $toQueue;

    public function __construct(Collection $models)
    {
        // Store only the morph class, key, and a "make searchable" flag,
        // so the queued payload stays tiny.
        $this->toQueue = $models
            ->map(fn (Model $modelInstance) => json_encode([$modelInstance->getMorphClass(), $modelInstance->getKey(), 1]))
            ->toArray();
    }

    public function handle()
    {
        // Push onto a Redis list instead of calling Meilisearch immediately.
        Redis::rPush('searchable', $this->toQueue);
    }
}

class RemoveFromSearchBatched
{
    use Dispatchable, SerializesModels;

    public array $toQueue;

    public function __construct(Collection $models)
    {
        $this->toQueue = $models
            ->map(fn (Model $modelInstance) => json_encode([$modelInstance->getMorphClass(), $modelInstance->getKey(), 0]))
            ->toArray();
    }

    public function handle()
    {
        Redis::rPush('searchable', $this->toQueue);
    }
}

Register the new classes —

Scout::makeSearchableUsing(MakeSearchableBatched::class);
Scout::removeFromSearchUsing(RemoveFromSearchBatched::class);
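
These registrations just need to run at boot. A minimal sketch, assuming the job classes live under App\Jobs and the hookup goes in AppServiceProvider:

use App\Jobs\MakeSearchableBatched;
use App\Jobs\RemoveFromSearchBatched;
use Illuminate\Support\ServiceProvider;
use Laravel\Scout\Scout;

class AppServiceProvider extends ServiceProvider
{
    public function boot()
    {
        // Swap Scout's default indexing jobs for the Redis-batched versions.
        Scout::makeSearchableUsing(MakeSearchableBatched::class);
        Scout::removeFromSearchUsing(RemoveFromSearchBatched::class);
    }
}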

FWIW, I also overrode Scout’s Searchable trait to dispatch after the response instead of dispatching onto the queue (it seemed a little silly to push a job onto a Redis queue whose only purpose is to add stuff to another Redis list… lol).

use Laravel\Scout\Scout;

trait Searchable
{
    use \Laravel\Scout\Searchable;

    public function queueMakeSearchable($models)
    {
        if ($models->isEmpty()) {
            return;
        }

        $job = new Scout::$makeSearchableJob($models);

        // Run the push inline when in console; otherwise defer it until
        // after the response has been sent.
        if (app()->runningInConsole()) {
            dispatch_sync($job);
        } else {
            dispatch($job)->afterResponse();
        }
    }

    public function queueRemoveFromSearch($models)
    {
        if ($models->isEmpty()) {
            return;
        }

        $job = new Scout::$removeFromSearchJob($models);

        if (app()->runningInConsole()) {
            dispatch_sync($job);
        } else {
            dispatch($job)->afterResponse();
        }
    }
}
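
Models then pull in this overriding trait instead of Scout’s directly. A quick sketch, with the Post model and App\Traits namespace as stand-ins:

use App\Traits\Searchable;
use Illuminate\Database\Eloquent\Model;

class Post extends Model
{
    // The overridden trait above (not Laravel\Scout\Searchable), so index
    // updates flow through the batched jobs.
    use Searchable;
}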

And lastly, the job behind the 5-minute cron that processes each batch (it pops chunks of 10,000 by default off the Redis list and packs them off to Meilisearch). Duplicate entries for the same model always get overwritten by the latest one.

use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\Relation;
use Illuminate\Support\Facades\Redis;
use Laravel\Scout\Jobs\MakeSearchable;
use Laravel\Scout\Jobs\RemoveFromSearch;

class ProcessSearchableBatch
{
    private int $chunk;

    public function __construct(int $chunk = 10000)
    {
        $this->chunk = $chunk;
    }

    public function handle()
    {
        $queueLength = Redis::lLen('searchable');
        if (!$queueLength) {
            return;
        }

        $modelMap = [];
        $queue = Redis::lPop('searchable', $this->chunk);

        /** @var string $queueEntry */
        foreach ($queue as $queueEntry) {
            // A first pass is needed to dedupe the queue; later entries for
            // the same model overwrite earlier ones.
            [$modelMorph, $modelKey, $isSearchable] = json_decode($queueEntry, true);
            $modelMap[$modelMorph] ??= [];
            $modelMap[$modelMorph][$modelKey] = $isSearchable;
        }

        foreach ($modelMap as $modelMorph => $modelIds) {
            $modelClass = Relation::getMorphedModel($modelMorph);

            // Skip anything that isn't a Searchable model
            if (!method_exists($modelClass, 'searchable')) {
                continue;
            }

            /** @var Model $modelInstance */
            $modelInstance = new $modelClass;
            $getModels = fn (bool $searchable) => $modelInstance->newQuery()
                ->whereIn($modelInstance->getKeyName(), collect($modelIds)->filter(fn ($x) => $searchable ? $x : !$x)->keys())
                ->select($modelInstance->getKeyName())
                ->get();

            $searchableModels = $getModels(true);
            if ($searchableModels->count()) {
                dispatch_sync(new MakeSearchable($searchableModels));
            }

            $unsearchableModels = $getModels(false);
            if ($unsearchableModels->count()) {
                dispatch_sync(new RemoveFromSearch($unsearchableModels));
            }
        }

        // If there's more than one chunk's worth of backlog, immediately
        // queue another run to keep draining.
        if ($queueLength > $this->chunk) {
            dispatch(new ProcessSearchableBatch($this->chunk));
        }
    }
}
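
The five-minute cadence itself can come from the standard Laravel scheduler. A rough sketch, assuming the job class lives under App\Jobs:

use App\Jobs\ProcessSearchableBatch;
use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Drain the Redis list every five minutes and forward the deduped
        // batch to Meilisearch.
        $schedule->job(new ProcessSearchableBatch())->everyFiveMinutes();
    }
}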

With the indexing jobs swapped over, the task list now looks more like this —

{
    "results": [
        {
            "status": "succeeded",
            "type": "documentAdditionOrUpdate",
            "details": {
                "receivedDocuments": 347,
                "indexedDocuments": 347
            },
            "duration": "PT44.055584527S",
            "enqueuedAt": "2022-11-30T15:35:03.591953331Z",
            "startedAt": "2022-11-30T15:35:03.593991274Z",
            "finishedAt": "2022-11-30T15:35:47.649575801Z"
        },
        {
            "status": "succeeded",
            "type": "documentDeletion",
            "details": {
                "receivedDocumentIds": 64,
                "deletedDocuments": 3
            },
            "duration": "PT0.007997731S",
            "enqueuedAt": "2022-11-30T15:30:03.331089225Z",
            "startedAt": "2022-11-30T15:31:09.918568002Z",
            "finishedAt": "202-11-30T15:31:09.926565733Z"
        },
        {
            "status": "succeeded",
            "type": "documentAdditionOrUpdate",
            "details": {
                "receivedDocuments": 302,
                "indexedDocuments": 302
            },
            "duration": "PT66.603082926S",
            "enqueuedAt": "2022-11-30T15:30:03.311123328Z",
            "startedAt": "2022-11-30T15:30:03.313418096Z",
            "finishedAt": "2022-11-30T15:31:09.916501022Z"
        },
        {
            "status": "succeeded",
            "type": "documentDeletion",
            "details": {
                "receivedDocumentIds": 86,
                "deletedDocuments": 4
            },
            "duration": "PT0.024325655S",
            "enqueuedAt": "2022-11-30T15:25:02.937239433Z",
            "startedAt": "2022-11-30T15:26:07.619418279Z",
            "finishedAt": "2022-11-30T15:26:07.643743934Z"
        },
        ...
    ]
}

The job now batches every 5 minutes, with each batch taking around 60 seconds. Everything looks to be running smoothly with no backlog, though a document can take a minute or two to index (on top of waiting up to 5 minutes to be batched). Under regular site load we’re looking at a few hundred updates per batch, and I found that batches of ~5,000 documents would still only take about 3-5 minutes.

Improvements I’d probably make here —

  • Clean up the code since it’s currently brittle
  • Deliver the batch immediately once it hits the chunk limit instead of waiting 5 minutes (currently it just chunks by immediately re-running the job if it’s over the chunk limit); a rough sketch of this is below
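
For the second point, one option would be to check the list length during the push and kick off a batch run as soon as it crosses the chunk size, e.g. by swapping MakeSearchableBatched::handle() for something like this (illustrative only, not what’s currently running):

public function handle()
{
    Redis::rPush('searchable', $this->toQueue);

    // If the list has already grown past one chunk's worth of entries,
    // process a batch now instead of waiting for the next cron tick.
    if (Redis::lLen('searchable') >= 10000) {
        dispatch(new ProcessSearchableBatch());
    }
}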

Next up is handling the import 🚛 💨