boyter 5 hours ago [-]
I read this when it came out, having written similar things for searchcode.com (back when it was a spanning code search engine), and while it's interesting I have questions about this:
> We routinely see rg invocations that take more than 15 seconds
The only way that works is if you are running it over repos 100-200 gigabytes in size, or they are sitting on a spinning-rust HDD, or it's matching so many lines that printing them is the dominant part of the runtime, and it's still over a very large codebase.
Now I totally believe codebases like this exist, but surely they aren't that common? I could understand this being for a single customer though!
Where this does fall down though is having to maintain that index. That's actually why, when I was working on my own local code search tool boyter/cs on GitHub, I also just brute forced it. No index, no problems, and with desktop CPUs coming out with 200 MB of cache these days it seems increasingly like a winning approach.
siva7 7 hours ago [-]
I don't get grep in agentic settings for natural language queries. You want to optimize for best results with as few tokens/round trips as possible, not for speed.
mpalmer 9 hours ago [-]
> No matter how fast ripgrep can match on the contents of a file, it has one serious limitation: it needs to match on the contents of all files.
The omission of rg's `-g` parameter is unsurprising in one sense, because it would mostly obviate this entire exercise. How often do you need to search what sounds like hundreds of millions of lines of source for a complex pattern, with zero constraints on paths searched?
> We routinely see rg invocations that take more than 15 seconds
I'm trying to understand the monorepo that is so large that ripgrep takes 15 seconds to return results, when it's benchmarked as searching for a literal in a 9.3GB file in 600ms, or 1.08s to search for `.*` in the entire Linux repo.
And again, that's without using `-g`.
piker 9 hours ago [-]
> -g GLOB, --glob=GLOB
> Include or exclude files and directories for searching that match the given glob. This always overrides any other ignore logic. Multiple glob flags may be used. Globbing rules match .gitignore globs. Precede a glob with a ! to exclude it. If multiple globs match a file or directory, the glob given later in the command line takes precedence.
> As an extension, globs support specifying alternatives: -g 'ab{c,d}*' is equivalent to -g abc -g abd. Empty alternatives like -g 'ab{,c}' are not currently supported. Note that this syntax extension is also currently enabled in gitignore files, even though this syntax isn't supported by git itself. ripgrep may disable this syntax extension in gitignore files, but it will always remain available via the -g/--glob flag.
> When this flag is set, every file and directory is applied to it to test for a match. For example, if you only want to search in a particular directory foo, then -g foo is incorrect because foo/bar does not match the glob foo. Instead, you should use -g 'foo/*'.
https://man.archlinux.org/man/rg.1.en

(for those who were, like me, unfamiliar with the switch)
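The foo vs foo/* gotcha quoted above can be demonstrated with Python's fnmatch. (This is only an approximation: fnmatch's * also crosses /, unlike gitignore-style globs, but it's enough to show why -g foo fails to match files under foo/.)

```python
from fnmatch import fnmatch

# The glob "foo" only matches the path "foo" itself, not files inside it.
print(fnmatch("foo/bar", "foo"))    # False
# "foo/*" matches paths under the foo directory.
print(fnmatch("foo/bar", "foo/*"))  # True
```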
mpalmer 9 hours ago [-]
Has the Cursor team considered (for instance) that on MDM-managed machines, binaries which do pervasive arbitrary FS reads may be monitored, throttled, or otherwise controlled by overseer programs? That kitchen-sink Electron apps like Cursor using those binaries might compound the red-flag signals?
open-paren 8 hours ago [-]
The creator of fff.nvim[0], Dmitriy Kovalenko, had an interesting analysis of this on Xitter[1]. The TL;DR of this is that Anysphere/Cursor is being somewhat disingenuous and does not include the index-creation and recreation time in the comparison nor do they include the CPU or memory overhead, where rg (and his tool, fff.nvim) are indexless.
This is interesting and I’d like to see a follow-up from Cursor, but the tone is unbearable and egregiously misrepresents the Cursor blog post, I guess for a circle of followers who won’t bother to check the original anyway and are just there for the dunking.
> So how cursor came up with such a beautiful solution only in 2026? Is everyone around dumb and never did anything like this before?
The Cursor post doesn’t claim anything original; they attribute every approach discussed to someone else, including the one they claim to have settled on:
> Here's another very smart idea. You may have seen it used in ClickHouse for their regular expression operator, and also at GitHub, in the new Code Search feature that shipped a couple years ago and which does allow matching regular expressions. It's called Sparse N-grams, and it is the sweetest of the middle grounds.
The very next sentence in the fff article is amusing:
> No, actually all the theory in the blog post they made (that makes sense) is coming from the paper https://swtch.com/~rsc/regexp/regexp4.html that is stated behind google code search project.
Because 1. the paper is prominently cited in the original, and 2. no it doesn’t cover all the subsequent optimizations discussed. “That makes sense” is doing a lot of work apparently.
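(For readers unfamiliar with that paper, here is a toy sketch of its core idea: build posting lists of trigrams per file, intersect the lists for a query's trigrams to get candidate files, and only then run the real match on those candidates. This is plain trigrams, not Cursor's sparse n-grams, and the file contents are hypothetical.)

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of doc ids
        self.docs = {}                    # doc id -> full text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for t in trigrams(text):
            self.postings[t].add(doc_id)

    def search_literal(self, literal):
        # Intersect posting lists for the literal's trigrams to get
        # candidates, then confirm with a real scan (the "grep" step).
        tris = trigrams(literal)
        if not tris:
            candidates = set(self.docs)  # too short to filter; scan all
        else:
            candidates = set.intersection(
                *(self.postings.get(t, set()) for t in tris))
        return sorted(d for d in candidates if literal in self.docs[d])

idx = TrigramIndex()
idx.add("a.c", "#define MAX_FILE_SIZE 4096")
idx.add("b.c", "int open_file(const char *path);")
print(idx.search_literal("MAX_FILE_SIZE"))  # -> ['a.c']
```

Regex queries work the same way, except the trigram set is derived from the regex's required literals before the confirmation scan.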
Now, the main claims in the fff article are:
- Few/no people need to search entire repos that large;
- For large repos (no one needs to search), fff’s index is smaller (~100MB for chromium vs ~1GB for Cursor) and faster to create (~8s vs ~4m) and still fast (~100ms vs ?).
But all the comparisons are weirdly fixated on the MAX_FILE_SIZE query used for algorithm demonstration purposes in the original. That’s hardly a fucking regex search. Readers have no idea how, say, MAX_.+_SIZE fares after reading that rebuttal.
So, again, interesting, unbearable tone and egregious misrepresentation, would like a follow up.
---
0: https://github.com/dmtrKovalenko/fff.nvim
1: http://x.com/i/article/2036558670528651264
Disclosure: no affiliation, not using either now.