Speedup indexing interactions with sqlite#86
Conversation
This should be fine for hiedb's purposes. The loss of durability means that a committed but not yet fsync'd transaction, may be rolled back. In hiedb's case, this would be automatically fixed on the follow-up run.
jhrcek
left a comment
There was a problem hiding this comment.
Nice! I like the performance improvements you were able to squeeze out 👍
But can we do that without changing the public api of the library?
|
|
||
| {-| Initialize database schema for given 'HieDb'. | ||
| -} | ||
| initConn :: HieDb -> IO () |
There was a problem hiding this comment.
We should probably not change the library's api functions without good reason.
E.g. this functions is exposed in multiple versions of the library (https://hackage-content.haskell.org/package/hiedb-0.7.0.0/docs/HieDb-Create.html#v:initConn) and people using hiedb are probably using it.
There was a problem hiding this comment.
I'm not sure there's an easy way to avoid changing the visible API of either initConn or deleteInternalTables. I think the types are wrong relative to setting up the prepared statements.
In this case, HieDb is the handle used to do operations on the sqlite file, so it's the likeliest place to put the prepared statements. But initConn, is the function that sets up the tables, but also takes a HieDb, so it can't contain the prepared statements (without doing something like lazy IO). Sqlite doesn't allow statement preparation on tables that don't exist yet. There's a loop I need to break here.
WDYT? Am I missing overlooking an obvious option here?
There was a problem hiding this comment.
Just realized after looking at deleteInternalTables, I could also just keep this function for API purposes but also otherwise not use it. I've pushed a commit that re-adds back both functions.
| execute_ conn "PRAGMA optimize;" | ||
| changes conn | ||
|
|
||
| deleteInternalTables :: Connection -> FilePath -> IO () |
There was a problem hiding this comment.
Same as above - probably not a good idea to change public api (unless we want to do major version bump, which I don't think is necessary).
There was a problem hiding this comment.
Approximately the same problem occurs here as with initConn. In this case I could keep this function for API purposes and have it keep calling the non-prepared deletions, but otherwise not call it ourselves in the library? And also perhaps decorate it with a DEPRECATED pragma?
crtschin
left a comment
There was a problem hiding this comment.
Leaving the function signatures aside, is it acceptable that setupHieDb/initConn is always called?
The code in this PR implies that Init is essentially a noop relative to the other commands, the tables will always be instantiated if the db file didn't already exist (looking at runCommand).
|
|
||
| {-| Initialize database schema for given 'HieDb'. | ||
| -} | ||
| initConn :: HieDb -> IO () |
There was a problem hiding this comment.
I'm not sure there's an easy way to avoid changing the visible API of either initConn or deleteInternalTables. I think the types are wrong relative to setting up the prepared statements.
In this case, HieDb is the handle used to do operations on the sqlite file, so it's the likeliest place to put the prepared statements. But initConn, is the function that sets up the tables, but also takes a HieDb, so it can't contain the prepared statements (without doing something like lazy IO). Sqlite doesn't allow statement preparation on tables that don't exist yet. There's a loop I need to break here.
WDYT? Am I missing overlooking an obvious option here?
| execute_ conn "PRAGMA optimize;" | ||
| changes conn | ||
|
|
||
| deleteInternalTables :: Connection -> FilePath -> IO () |
There was a problem hiding this comment.
Approximately the same problem occurs here as with initConn. In this case I could keep this function for API purposes and have it keep calling the non-prepared deletions, but otherwise not call it ourselves in the library? And also perhaps decorate it with a DEPRECATED pragma?
It's already called as part of `withHieDb` that sets up and provides the `HieDb` handle.
|
Sorry, I'm just a secondary maintainer and I'm kind of busy at work. I'd need to dive deeper to understand the implications of these changes. I'll reserve some time to look deeper into this PR later this week. In the meantime, I have couple questions:
Is there a way we could make the change more isolated to get 80% of the benefit with fewer changes (like only do this prepared statement stuff in indexing, which we know is generally the bottleneck, unlike other places)
|
Good question! I did this to get rid of the argument checking that sqlite-simple does that's useful when writing queries, but less so when running queries. I realize I didn't benchmark this change. I'll do so!
I'd have to experiment a bit, but I think it'll be hard without changing the API more than I already did. But I can probably try only adding more things.
No worries! No rush needed at all. I'm also not super satisfied with the API changes, so I'm also keen to improve it. |
Benchmarking gives a negligible difference between binding via sqlite-simple, which does additional checks on binding parameters, and direct-sqlite that only calls the underlying sqlite3 function. Considering there is no difference stick to using helpers from the same library.
Keeping the API is in my opinion less of an issue. Sure, let's try to avoid unnecessary changes, but otherwise, a noticeable performance win is worth a breaking change to me :) |
For the curious, I gave this an attempt at crtschin@7cfe675. Though I (subjectively) like that less than this PR. So I'd prefer we stick to the setup here if acceptable.
I did so, it didn't make much difference, so I removed the reference to
I think the only noticeable behavioral change is if hiedb is called with a database filepath that doesn't exist, it will create that database file (even if doing only a readonly query), with the appropriate tables. In the previous scenario hiedb wouldn't create the database file, only execute the queries on that non-existing database. Running ref-graph on a non-existing db file# Before (file didn't exist prior, also does not exist after call):
> hiedb -D /tmp/non-existing-file ref-graph
hiedb: SQLite3 returned ErrorError while attempting to perform prepare "SELECT mods.mod, decls.hieFile, decls.occ, decls.sl, decls.sc, decls.el, decls.ec,rmods.mod, ref_decl.hieFile, ref_decl.occ, ref_decl.sl, ref_decl.sc, ref_decl.el, ref_decl.ec FROM decls JOIN refs ON refs.hieFile = decls.hieFile JOIN mods ON mods.hieFile = decls.hieFile JOIN mods AS rmods ON rmods.mod = refs.mod AND rmods.unit = refs.unit AND rmods.is_boot = 0 JOIN decls AS ref_decl ON ref_decl.hieFile = rmods.hieFile AND ref_decl.occ = refs.occ WHERE ((refs.sl > decls.sl) OR (refs.sl = decls.sl AND refs.sc > decls.sc)) AND ((refs.el < decls.el) OR (refs.el = decls.el AND refs.ec <= decls.ec))": no such table: decls
# After (file didn't exist prior, does exist after call):
> hiedb -D /tmp/non-existing-file ref-graph
<no output>The total number of queries didn't actually increase, it's actually less. The previous setup used prepared statements as well, but re-prepared them for every
Performance is a feature after all :D |
fendor
left a comment
There was a problem hiding this comment.
LGTM, thank you very much for optimising this crucial piece of the IDE infrastructure!
I merely have documentation requests, if you still have the head for it!
Waiting for @jhrcek's review, assuming he finds the time for it :)
wz1000
left a comment
There was a problem hiding this comment.
Nice performance improvement, a few stylistic suggestions but otherwise this seems ready.
| newtype HieDb = HieDb { getConn :: Connection } | ||
| runStatementFor :: (ToRow a, FromRow b) => StatementFor a -> a -> IO (Maybe b) | ||
| {-# INLINE runStatementFor #-} | ||
| runStatementFor (StatementFor statement) params = do |
There was a problem hiding this comment.
This doesn't seem to be used. Perhaps we should delete it. If we are to keep it, ideally the return variable wouldn't be unconstrained and we would have something like StatementFor a b -> a -> IO b, but we don't seem to need that functionality right now.
There was a problem hiding this comment.
I do use it in this line.
I agree on including the output variable. In a different branch where I played around, I had that exact setup, but I omitted it for simplicity's sake. Might be good to explore that if the functions in Query.hs get the same benefit from this treatment.
- Adds some needed documentation - Cleans up some helper function names - Splits off statements into its own datatype.
jhrcek
left a comment
There was a problem hiding this comment.
Thank you, tested this out with hls master and didn't notice any issues, so +1 to merge from me.
|
Heya, gentle ping. Do I need to do anything else? I changed the API so I can make a follow-up to HLS perhaps? |
|
Thank you for the ping, merged :) A follow up PR would be much appreciated! @jhrcek Do we have a policy for releasing hiedb? Should we just do a release? |
|
I'm not aware of any policy. I can work on releasing new version to hackage tomorrow. |
Noticed that sqlite interactions could be improved somewhat I think. Landed on these set of changes. I included benchmarks.
indexandinitcommands, which it previously was. I could take a better look at this and avoid this behavior on request, if it's better to keep the old behavior for backwards-compatibility purposes.PRAGMA synchronous = NORMAL. The default here isFULL. Considering the journal was set to already use WAL, the difference is the loss of durability. Practically this means that a committed but not yet fsync'd transaction, may be rolled back on system failure. I think this is fine in hiedb's case, as this would be automatically fixed on the follow-up run.Benchmarks
These were run on ghc 9.6.7 on hie files generated from hls on commit 88ccebe0649f7c41be97d49a986bbfd4185982f6. Benchmarks were setup with hyperfine with warmpup and 10 runs, dropping page caches in between each run. Runs are in reverse chronological order, (3) is baseline.