Conversation
This is, however, a very breaking change. Due to the design of unboxed vectors (which I consider a mistake) this change will break all unboxed vectors. They require defining data instances and …
@Shimuuar what exactly would break? I tried a couple of packages with manual instances for unboxed vectors; they seem to compile fine.
Sorry, I missed the mutually recursive …

```haskell
step (I# i)
  | I# i >= n = return Done
  | otherwise = case basicUnsafeIndexM v i of
                  Box x -> return $ Yield x (i+1)
```
Here and everywhere else in this PR, when we pattern-match on `I#` but then construct it again in the function's body, is GHC 100% guaranteed to optimise away the allocation of a fresh `Int` and just reuse whatever we pattern-matched on?
GHC should be quite good at this, so it should eliminate the `Int` allocations.
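A reduced, standalone version of the pattern under discussion can make the question concrete (the function `next` and its types are illustrative, not code from this PR): we match on `I#` and then rebuild `I# i` in the body, which is exactly the shape GHC's reboxing/CSE analysis is expected to clean up.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int(I#))

-- Illustrative reduction of the pattern in question: pattern-match on I#,
-- then reconstruct `I# i` in the body.  The hope is that GHC reuses the
-- original box rather than allocating a fresh Int.
next :: Int -> Int -> Maybe Int
next n (I# i)
  | I# i >= n = Nothing           -- rebuilds the Int we matched on
  | otherwise = Just (I# i + 1)   -- and again here

main :: IO ()
main = print (next 3 1, next 3 5)
```

Whether the rebox is actually eliminated can be checked by dumping Core with `-ddump-simpl` and looking for `I#` allocations in the loop.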
I tried to measure the impact of this optimization. To do so I added a simple benchmark which computes the variance of a vector of doubles (branches …):

```haskell
varianceNoInline :: (VG.Vector v Double) => v Double -> Double
{-# NOINLINE varianceNoInline #-}
varianceNoInline xs
  = VG.sum (VG.map (\x -> (x - s)^(2::Int)) xs) / n
  where
    n = fromIntegral $ VG.length xs
    s = VG.sum xs / n
```

The function specialized by GHC runs in constant space. Changes to indexing make no difference and performance is stable for all GHC versions at 6 cycles/element. The non-specialized situation is much more interesting. Adding strictness (#485) makes no difference at all in this benchmark. It seems the programs are identical, at least the number of instructions is exactly the same, so I'll compare against the baseline (…).

**Allocations**

Below are allocations per array element. This PR is a clear improvement for GHC<=9.4, but somehow it becomes a pessimization for GHC>=9.6. I haven't looked into the core so I have no idea why.

**Performance**

Runtime performance follows the same pattern: a 25% win for GHC<=9.4 and a 100% loss for GHC>=9.6. And the latter performs worse even without this optimization. It looks like we have some regression in the GHC optimizer. But I haven't looked into the core yet, and the answer lies there.
First of all, some estimations. We need to perform indexing twice, which at least means allocating 2 …

Cleaned-up core for the nonspecialized function can be seen in the gist. The core is very similar between GHC versions and with/without this optimization. Notable differences:

**GHC 9.4 → 9.8**

GHC changed how the worker-wrapper transformation is done. In GHC 9.4 the wrapper unpacked the dictionary and passed the individual methods to the worker; in 9.8 the whole dictionary is passed.

9.4:

```haskell
varianceNoInline :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
  = \ (@(v_ :: * -> *))
      ($dVector_s2fQ :: Vector v_ Double)
      (xs_s2g2 :: v_ Double) ->
      case $dVector_s2fQ of
      { C:Vector ww_s2fS ww1_s2fT ww2_s2fU ww3_s2fV ww4_s2fW ww5_s2fX
                 ww6_s2fY ww7_s2fZ ww8_s2g0 ->
      case $wvarianceNoInline @v_ ww3_s2fV ww6_s2fY xs_s2g2 of ww9_s2g6
      { __DEFAULT -> D# ww9_s2g6 }}
```

9.8:

```haskell
-- RHS size: {terms: 10, types: 9, coercions: 0, joins: 0/0}
varianceNoInline
  :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
  = \ (@(v_ :: * -> *))
      ($dictVec :: Vector v_ Double)
      (vec0 :: v_ Double) ->
      case $wvarianceNoInline @v_ $dictVec vec0 of ww_sfOR
      { __DEFAULT -> D# ww_sfOR }
```

Apparently the lookup of functions in the dictionary caused the performance degradation (44 CYC/elt → 65 CYC/elt). But the allocations picture is a mystery.
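The two worker shapes can be pictured by hand-writing them over a plain record of methods (a class dictionary is essentially such a record at runtime; all names below are illustrative, not from the actual core):

```haskell
-- A stand-in for a class dictionary: just a record of methods.
data Dict = Dict
  { dLength :: [Double] -> Int
  , dIndex  :: [Double] -> Int -> Double
  }

-- "9.4-shaped" worker: the wrapper extracted the two methods once,
-- so the loop never touches the dictionary.
sumW94 :: ([Double] -> Int) -> ([Double] -> Int -> Double) -> [Double] -> Double
sumW94 len ix xs = go 0 0
  where
    n = len xs
    go acc i | i >= n    = acc
             | otherwise = go (acc + ix xs i) (i + 1)

-- "9.8-shaped" worker: the whole dictionary is passed and the methods
-- are looked up via field selectors inside the loop.
sumW98 :: Dict -> [Double] -> Double
sumW98 d xs = go 0 0
  where
    n = dLength d xs
    go acc i | i >= n    = acc
             | otherwise = go (acc + dIndex d xs i) (i + 1)

main :: IO ()
main = do
  let d = Dict length (!!)
  print (sumW94 length (!!) [1,2,3])
  print (sumW98 d [1,2,3])
```

Both compute the same result; the difference is only where the field selections happen, which matches the measured hot-loop cost.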
P.S. The `Box` trick seems to be terribly wasteful in the case when GHC can't specialize. It doubles allocations for small values.
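For reference, `Box` is a plain one-field wrapper (vector defines the same shape in `Data.Vector.Fusion.Util`; the local copy and the `indexM` helper below are just a self-contained sketch). The point about doubled allocation is visible directly in the types: without inlining, each call must allocate the `Box` and, for a small type like `Int`, the boxed element inside it.

```haskell
-- Local copy of the Box wrapper; vector's Data.Vector.Fusion.Util has
-- the same shape.  It lets indexing look monadic while staying lazy.
data Box a = Box { unBox :: a }

-- Illustrative indexing in Box: when GHC cannot specialize/inline,
-- every call allocates a Box *and* the boxed element -- two
-- allocations where one would suffice.
indexM :: [a] -> Int -> Box a
indexM xs i = Box (xs !! i)

main :: IO ()
main = print (unBox (indexM [10, 20, 30 :: Int] 1))
```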
In the stddev benchmark with NOINLINE it gives quite significant improvements across all compiler versions:

- 3-10% reduction in CPU cycles depending on GHC version
- 2 fewer branches per indexing in all cases

No change for the inlined version. Overall this is a cheap and nice change.
I found that using … No changes for the case when specialization happens.
#485 did only half of the job: while GHC now knows that the index is used strictly, it still would not necessarily unpack it, because `basicUnsafeIndexM` must receive `Int`, not `Int#`. Only after inlining does an opportunity to erase the boxing arise.

This patch introduces `basicUnsafeIndexM#` to help GHC further. If it looks good, I'll go for `basicUnsafeRead#`/`basicUnsafeWrite#` in another PR.
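The general shape of such an addition can be sketched with a toy class (the class, the `Box` copy, and the `ListVec` instance below are all illustrative stand-ins, not vector's real API): the `#` variant takes an unboxed `Int#`, and a default implementation falls back to the boxed method so that existing instances keep compiling.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int#, Int(I#))

-- Stand-in for vector's Box wrapper.
data Box a = Box { unBox :: a }

-- Toy analogue of the Vector class: the new #-method receives the
-- index already unboxed, so no Int must be allocated at the call
-- site; the default falls back to the boxed method.
class VectorLike v where
  basicUnsafeIndexM  :: v a -> Int  -> Box a
  basicUnsafeIndexM# :: v a -> Int# -> Box a
  basicUnsafeIndexM# v i = basicUnsafeIndexM v (I# i)

newtype ListVec a = ListVec [a]

-- An instance only needs the boxed method; the #-variant comes for free.
instance VectorLike ListVec where
  basicUnsafeIndexM (ListVec xs) i = Box (xs !! i)

main :: IO ()
main = print (unBox (basicUnsafeIndexM# (ListVec "abc") 1#))
```

The default method is what makes the change non-breaking for instances that never heard of `Int#`; instances that want the speedup can override it.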