
Speedup indexing interactions with sqlite #86

Merged
fendor merged 9 commits into haskell:master from crtschin:crtschin/speedup-sqlite-insertions
Jan 9, 2026

@crtschin (Contributor) commented Nov 29, 2025

I noticed that the sqlite interactions could be improved somewhat, and landed on this set of changes. Benchmarks are included below.

  • The first change is to use prepared statements. I added statement preparation for all queries that are repeated, which required shuffling some code around. The most obvious way to do this led me to this setup, which unfortunately also means that the database setup is now always run, not just on the index and init commands as before. I could take a closer look and avoid this on request, if it's better to keep the old behavior for backwards compatibility.
  • The second change is to set PRAGMA synchronous = NORMAL. The default is FULL. Since the journal mode was already set to WAL, the only difference is a loss of durability. Practically, this means that a committed but not yet fsync'd transaction may be rolled back on system failure. I think this is fine in hiedb's case, as it would be fixed automatically on the follow-up run.
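As a rough illustration of the two changes described above, here is a minimal sketch using sqlite-simple. The table `demo_refs`, the helper `insertAll`, and the sample rows are hypothetical and are not taken from hiedb; the point is only that the INSERT is prepared once and then bound repeatedly, with `PRAGMA synchronous = NORMAL` set on the connection first.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (forM_, void)
import Database.SQLite.Simple

-- Hypothetical table name and rows; this is not hiedb's schema.
insertAll :: Connection -> [(String, String)] -> IO ()
insertAll conn rows =
  withTransaction conn $
    -- Prepare the INSERT once, outside the loop.
    withStatement conn "INSERT INTO demo_refs (file, occ) VALUES (?, ?)" $ \stmt ->
      forM_ rows $ \row ->
        -- withBind binds the parameters and resets the statement afterwards;
        -- stepping an INSERT yields no row, so nextRow returns Nothing.
        withBind stmt row (void (nextRow stmt :: IO (Maybe (Only Int))))

demo :: IO Int
demo = withConnection ":memory:" $ \conn -> do
  -- FULL is SQLite's default; NORMAL syncs less often.
  execute_ conn "PRAGMA synchronous = NORMAL;"
  execute_ conn "CREATE TABLE demo_refs (file TEXT, occ TEXT)"
  insertAll conn [("A.hie", "foo"), ("A.hie", "bar"), ("B.hie", "baz")]
  [Only n] <- query_ conn "SELECT COUNT(*) FROM demo_refs"
  pure n

main :: IO ()
main = demo >>= print
```

The sketch uses an in-memory database, so it does not exercise WAL or fsync behavior; it only shows the shape of the calls.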
Benchmarks

These were run on ghc 9.6.7 on hie files generated from hls at commit 88ccebe0649f7c41be97d49a986bbfd4185982f6. Benchmarks were set up with hyperfine with warmup and 10 runs, dropping page caches between runs. Runs are in reverse chronological order; (3) is the baseline.
Benchmark 1: Set `synchronous = NORMAL`
  Time (mean ± σ):      3.991 s ±  0.285 s    [User: 2.239 s, System: 0.782 s]
  Range (min … max):    3.694 s …  4.504 s    10 runs

Benchmark 2: Use prepared statements when indexing
  Time (mean ± σ):      4.652 s ±  0.336 s    [User: 2.272 s, System: 0.785 s]
  Range (min … max):    4.444 s …  5.465 s    10 runs

Benchmark 3: Use unreserved tag name when creating NameCache to prevent collisions with wired-in names (#85)
  Time (mean ± σ):      5.992 s ±  0.578 s    [User: 3.570 s, System: 0.769 s]
  Range (min … max):    5.718 s …  7.625 s    10 runs

Summary
  Set `synchronous = NORMAL` ran
    1.17 ± 0.12 times faster than Use prepared statements when indexing
    1.50 ± 0.18 times faster than Use unreserved tag name when creating NameCache to prevent collisions with wired-in names (#85)
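
The ratios in the summary can be checked from the means and standard deviations reported above. The sketch below assumes first-order error propagation for a quotient of two independent measurements; that assumption reproduces the reported figures, but it is not confirmed here as hyperfine's exact method. The names `Timing` and `speedup` are made up for this example.

```haskell
import Text.Printf (printf)

-- | Mean and standard deviation of one benchmark, in seconds.
data Timing = Timing { mean :: Double, sigma :: Double }

-- | Speedup of @fast@ over @slow@, with the propagated uncertainty
-- ratio * sqrt(rel_slow^2 + rel_fast^2).
speedup :: Timing -> Timing -> (Double, Double)
speedup fast slow = (ratio, ratio * sqrt (sq relSlow + sq relFast))
  where
    ratio   = mean slow / mean fast
    relSlow = sigma slow / mean slow
    relFast = sigma fast / mean fast
    sq x    = x * x

report :: String -> (Double, Double) -> IO ()
report label (r, e) = printf "%s: %.2f +/- %.2f\n" label r e

main :: IO ()
main = do
  let normal   = Timing 3.991 0.285  -- Benchmark 1
      prepared = Timing 4.652 0.336  -- Benchmark 2
      baseline = Timing 5.992 0.578  -- Benchmark 3
  report "vs prepared statements" (speedup normal prepared)
  report "vs baseline" (speedup normal baseline)
```

Rounded to two decimals, this gives 1.17 and 0.12 against the prepared-statements run, and 1.50 and 0.18 against the baseline, matching the summary.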

This should be fine for hiedb's purposes. The loss of durability means
that a committed but not yet fsync'd transaction, may be rolled back.
In hiedb's case, this would be automatically fixed on the follow-up
run.
@fendor fendor requested review from fendor, jhrcek and wz1000 December 1, 2025 08:23
@jhrcek (Collaborator) left a comment

Nice! I like the performance improvements you were able to squeeze out 👍

But can we do that without changing the public api of the library?


{-| Initialize database schema for given 'HieDb'.
-}
initConn :: HieDb -> IO ()
Collaborator

We should probably not change the library's api functions without good reason.
E.g. this function is exposed in multiple versions of the library (https://hackage-content.haskell.org/package/hiedb-0.7.0.0/docs/HieDb-Create.html#v:initConn) and people using hiedb are probably using it.

Contributor Author

I'm not sure there's an easy way to avoid changing the visible API of either initConn or deleteInternalTables. I think the types are wrong relative to setting up the prepared statements.

In this case, HieDb is the handle used to do operations on the sqlite file, so it's the likeliest place to put the prepared statements. But initConn is the function that sets up the tables, and it also takes a HieDb, so that HieDb can't already contain the prepared statements (without doing something like lazy IO): sqlite doesn't allow statement preparation on tables that don't exist yet. There's a loop I need to break here.

WDYT? Am I overlooking an obvious option here?
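
The ordering constraint mentioned above (a statement cannot be prepared against a table that does not exist yet) can be seen in a small sqlite-simple sketch. The table `demo_decls` and the helper `canPrepare` are hypothetical and not part of hiedb.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (try)
import Database.SQLite.Simple

-- | Try to prepare a statement and report whether preparation succeeded.
canPrepare :: Connection -> Query -> IO Bool
canPrepare conn q = do
  r <- try (openStatement conn q) :: IO (Either SQLError Statement)
  case r of
    Left _     -> pure False                      -- e.g. "no such table"
    Right stmt -> closeStatement stmt >> pure True

demo :: IO (Bool, Bool)
demo = withConnection ":memory:" $ \conn -> do
  before <- canPrepare conn "SELECT occ FROM demo_decls"
  execute_ conn "CREATE TABLE demo_decls (occ TEXT)"
  after <- canPrepare conn "SELECT occ FROM demo_decls"
  pure (before, after)

main :: IO ()
main = demo >>= print
```

Preparation fails before the CREATE TABLE and succeeds after it, which is why a handle carrying prepared statements can only be built once the schema exists.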

Contributor Author

Just realized after looking at deleteInternalTables that I could also keep this function for API purposes but otherwise not use it. I've pushed a commit that re-adds both functions.

execute_ conn "PRAGMA optimize;"
changes conn

deleteInternalTables :: Connection -> FilePath -> IO ()
Collaborator

Same as above - probably not a good idea to change the public API (unless we want to do a major version bump, which I don't think is necessary).

https://hackage-content.haskell.org/package/hiedb-0.7.0.0/docs/HieDb-Create.html#v:deleteInternalTables

Contributor Author

Approximately the same problem occurs here as with initConn. In this case I could keep this function for API purposes and have it keep calling the non-prepared deletions, but otherwise not call it ourselves in the library? And also perhaps decorate it with a DEPRECATED pragma?
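
For reference, this is roughly what keeping a function for compatibility and marking it with a DEPRECATED pragma looks like. The function `deleteLegacy` and the message text are hypothetical stand-ins, not the actual change to deleteInternalTables.

```haskell
-- | Hypothetical stand-in for an exported function that is kept only so
-- that existing callers continue to compile.
deleteLegacy :: FilePath -> IO ()
deleteLegacy path = putStrLn ("deleting rows for " ++ path)
{-# DEPRECATED deleteLegacy "Kept for API compatibility; no longer used internally." #-}

main :: IO ()
main = putStrLn "modules that import and use deleteLegacy get a deprecation warning"
```

GHC reports the warning when another module imports and uses the deprecated name, so existing callers keep working but are told about the change at compile time.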

@crtschin (Contributor Author) left a comment

Leaving the function signatures aside, is it acceptable that setupHieDb/initConn is always called?

The code in this PR implies that Init is essentially a no-op relative to the other commands: the tables will always be created if the db file didn't already exist (looking at runCommand).


It's already called as part of `withHieDb` that sets up and provides the
`HieDb` handle.
@jhrcek (Collaborator) commented Dec 2, 2025

Sorry, I'm just a secondary maintainer and I'm kind of busy at work. I'd need to dive deeper to understand the implications of these changes. I'll reserve some time to look deeper into this PR later this week.

In the meantime, I have a couple of questions:

> the database setup is always ran, not just on index and init commands

  • Is there a way we could make the change more isolated to get 80% of the benefit with fewer changes (e.g. only use prepared statements in indexing, which we know is generally the bottleneck, unlike other places)?

  • The use of ContT seems to introduce an inconsistency with some api methods. It may clean up code locally, but overall it seems to add another thing to the api mix, which makes it feel a bit more chaotic.
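
For readers unfamiliar with the pattern under discussion, the following sketch shows how ContT can flatten nested bracket-style `with...` functions. The resource names and the `withResource` helper are hypothetical and only log what happens; this is not the code from the PR.

```haskell
import Control.Exception (bracket_)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Cont (ContT (..), evalContT)
import Data.IORef

-- | Hypothetical bracket-style resource that logs acquisition and release.
withResource :: IORef [String] -> String -> (String -> IO a) -> IO a
withResource logRef name use =
  bracket_ (record ("open " ++ name)) (record ("close " ++ name)) (use name)
  where
    record msg = modifyIORef logRef (msg :)

demo :: IO [String]
demo = do
  logRef <- newIORef []
  -- Each ContT line wraps the rest of the block in one with-style callback,
  -- so two resources need no extra nesting.
  evalContT $ do
    a <- ContT (withResource logRef "insertStmt")
    b <- ContT (withResource logRef "deleteStmt")
    liftIO (modifyIORef logRef (("use " ++ a ++ " and " ++ b) :))
  reverse <$> readIORef logRef

main :: IO ()
main = demo >>= mapM_ putStrLn
```

Resources are released in reverse order of acquisition, as with nested brackets. The trade-off raised in the comment still applies: callers now see a second style alongside plain `with...` callbacks.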

@crtschin (Contributor Author) commented Dec 2, 2025

> why do we need to use direct-sqlite - is the prepared statement api provided by sqlite-simple not sufficient to implement this? See https://hackage.haskell.org/package/sqlite-simple-0.4.19.0/docs/Database-SQLite-Simple.html#g:17

Good question! I did this to get rid of the argument checking that sqlite-simple does that's useful when writing queries, but less so when running queries. I realize I didn't benchmark this change. I'll do so!

> > the database setup is always ran, not just on index and init commands
>
> Is there a way we could make the change more isolated to get 80% of the benefit with fewer changes (like only do this prepared statement stuff in indexing, which we know is generally the bottleneck, unlike other places)

I'd have to experiment a bit, but I think it'll be hard without changing the API more than I already did. But I can probably try only adding more things.

> Sorry, I'm just a secondary maintainer and I'm kind of busy at work. I'd need to dive deeper to understand the implications of these changes. I'll reserve some time to look deeper into this PR later this week.

No worries! No rush needed at all. I'm also not super satisfied with the API changes, so I'm also keen to improve it.

Benchmarking shows a negligible difference between binding
via sqlite-simple, which does additional checks on binding
parameters, and direct-sqlite, which only calls the underlying
sqlite3 function. Since there is no meaningful difference, stick
to using helpers from the same library.
@fendor (Collaborator) commented Dec 3, 2025

> Leaving the function signatures aside, is it acceptable that setupHieDb/initConn is always called?

Always called for each query individually or just whenever hiedb is used, i.e., when the db connection is opened? (EDIT: it is the latter)
Afaict, this should be completely fine, our most important use case is HLS, which opens the connection once and then keeps it open, and running the code once is no issue for us.
Perhaps ghciwatch folks call hiedb on the cli for each request, then this change might be relevant to them. I doubt that this change has a huge effect in this case either, as it is only a few queries?

> I'd have to experiment a bit, but I think it'll be hard without changing the API more than I already did.

Keeping the API is in my opinion less of an issue. Sure, let's try to avoid unnecessary changes, but otherwise, a noticeable performance win is worth a breaking change to me :)

@crtschin (Contributor Author) commented Dec 4, 2025

> I'd have to experiment a bit, but I think it'll be hard without changing the API more than I already did.

For the curious, I gave this an attempt at crtschin@7cfe675. Though I (subjectively) like that less than this PR. So I'd prefer we stick to the setup here if acceptable.

> > why do we need to use direct-sqlite - is the prepared statement api provided by sqlite-simple not sufficient to implement this? See https://hackage.haskell.org/package/sqlite-simple-0.4.19.0/docs/Database-SQLite-Simple.html#g:17
>
> Good question! I did this to get rid of the argument checking that sqlite-simple does that's useful when writing queries, but less so when running queries. I realize I didn't benchmark this change. I'll do so!

I did so, it didn't make much difference, so I removed the reference to direct-sqlite. Though do note that hiedb already depends on it indirectly.

> Perhaps ghciwatch folks call hiedb on the cli for each request, then this change might be relevant to them. I doubt that this change has a huge effect in this case either, as it is only a few queries?

I think the only noticeable behavioral change is if hiedb is called with a database filepath that doesn't exist, it will create that database file (even if doing only a readonly query), with the appropriate tables. In the previous scenario hiedb wouldn't create the database file, only execute the queries on that non-existing database.

Running ref-graph on a non-existing db file
# Before (file didn't exist prior, also does not exist after call):
> hiedb -D /tmp/non-existing-file ref-graph
hiedb: SQLite3 returned ErrorError while attempting to perform prepare "SELECT  mods.mod,    decls.hieFile,    decls.occ,    decls.sl,    decls.sc,    decls.el,    decls.ec,rmods.mod, ref_decl.hieFile, ref_decl.occ, ref_decl.sl, ref_decl.sc, ref_decl.el, ref_decl.ec FROM decls JOIN refs              ON refs.hieFile  = decls.hieFile JOIN mods          ON mods.hieFile  = decls.hieFile JOIN mods  AS rmods    ON rmods.mod = refs.mod AND rmods.unit = refs.unit AND rmods.is_boot = 0 JOIN decls AS ref_decl ON ref_decl.hieFile = rmods.hieFile AND ref_decl.occ = refs.occ WHERE ((refs.sl > decls.sl) OR (refs.sl = decls.sl AND refs.sc >  decls.sc)) AND ((refs.el < decls.el) OR (refs.el = decls.el AND refs.ec <= decls.ec))": no such table: decls

# After (file didn't exist prior, does exist after call):
> hiedb -D /tmp/non-existing-file ref-graph
<no output>

The total number of queries didn't actually increase; it's actually lower. The previous setup used prepared statements as well, but re-prepared them for every .hie file instead of once globally.

> Keeping the API is in my opinion less of an issue. Sure, let's try to avoid unnecessary changes, but otherwise, a noticeable performance win is worth a breaking change to me :)

Performance is a feature after all :D

@fendor (Collaborator) left a comment

LGTM, thank you very much for optimising this crucial piece of the IDE infrastructure!
I merely have documentation requests, if you still have the head for it!

Waiting for @jhrcek's review, assuming he finds the time for it :)

@wz1000 (Collaborator) left a comment

Nice performance improvement, a few stylistic suggestions but otherwise this seems ready.

newtype HieDb = HieDb { getConn :: Connection }
runStatementFor :: (ToRow a, FromRow b) => StatementFor a -> a -> IO (Maybe b)
{-# INLINE runStatementFor #-}
runStatementFor (StatementFor statement) params = do
Collaborator

This doesn't seem to be used. Perhaps we should delete it. If we are to keep it, ideally the return variable wouldn't be unconstrained and we would have something like StatementFor a b -> a -> IO b, but we don't seem to need that functionality right now.

Contributor Author

I do use it in this line.

I agree on including the output variable. In a different branch where I played around, I had that exact setup, but I omitted it for simplicity's sake. Might be good to explore that if the functions in Query.hs get the same benefit from this treatment.
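
One possible shape for the variant discussed here, where the statement type also records its result row type, is sketched below with sqlite-simple. This is an assumption about how it could look, not the definition in the PR or in the other branch; the table `demo_mods` is hypothetical, and the sketch returns at most one row as `Maybe b`.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Database.SQLite.Simple

-- | A prepared statement tagged with its parameter type @a@ and its row
-- type @b@. Both type parameters are phantom. Sketch only.
newtype StatementFor a b = StatementFor Statement

-- | Bind the parameters and fetch at most one row; withBind resets the
-- statement afterwards so it can be reused.
runStatementFor :: (ToRow a, FromRow b) => StatementFor a b -> a -> IO (Maybe b)
runStatementFor (StatementFor stmt) params = withBind stmt params (nextRow stmt)

demo :: IO (Maybe (Only Int))
demo = withConnection ":memory:" $ \conn -> do
  execute_ conn "CREATE TABLE demo_mods (name TEXT, n INTEGER)"
  execute conn "INSERT INTO demo_mods VALUES (?, ?)" ("Main" :: String, 7 :: Int)
  withStatement conn "SELECT n FROM demo_mods WHERE name = ?" $ \stmt -> do
    -- The annotation fixes both the parameter and the result type once,
    -- so call sites no longer choose the result type freely.
    let lookupN = StatementFor stmt :: StatementFor (Only String) (Only Int)
    runStatementFor lookupN (Only "Main")

main :: IO ()
main = demo >>= print
```

The tag is only as trustworthy as the place where the statement is wrapped, since nothing checks it against the SQL text.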

- Adds some needed documentation
- Cleans up some helper function names
- Splits off statements into its own datatype.
@jhrcek (Collaborator) left a comment

Thank you, tested this out with hls master and didn't notice any issues, so +1 to merge from me.

@crtschin crtschin requested a review from wz1000 December 10, 2025 09:35
@crtschin (Contributor Author) commented Jan 8, 2026

Heya, gentle ping. Do I need to do anything else? I changed the API so I can make a follow-up to HLS perhaps?

@fendor fendor merged commit 1544d7c into haskell:master Jan 9, 2026
0 of 14 checks passed
@fendor (Collaborator) commented Jan 9, 2026

Thank you for the ping, merged :)

A follow up PR would be much appreciated! @jhrcek Do we have a policy for releasing hiedb? Should we just do a release?

@jhrcek (Collaborator) commented Jan 9, 2026

I'm not aware of any policy. I can work on releasing a new version to hackage tomorrow.
