For layer sparsity regularization, why not use an L1 loss?

In your paper you write:
“we aim to penalize the L0 norm of these components, i.e., $\|[W_1, \dots, W_K]\|_0$. Since the L0 norm is not differentiable, we design a specialized soft counting norm”.
But the L1 norm is the standard convex relaxation of the L0 norm, and it is much easier to optimize than the L0 norm. Why not use the L1 norm for this regularization?
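
To make the comparison concrete, here is a minimal sketch of what I mean. It is my own illustration, not code from the paper: each component $W_k$ is reduced to a single scalar magnitude, and the function names `l0_norm`, `l1_norm`, and `l1_subgradient` are hypothetical.

```python
def l0_norm(ws, tol=0.0):
    """Count the entries whose magnitude exceeds tol.

    The count is piecewise constant in ws, so its gradient is zero
    wherever it exists and gives no training signal.
    """
    return sum(1 for w in ws if abs(w) > tol)


def l1_norm(ws):
    """Sum of absolute values: convex, and differentiable away from zero."""
    return sum(abs(w) for w in ws)


def l1_subgradient(ws):
    """One valid subgradient of the L1 norm: sign(w), with 0 chosen at w == 0."""
    return [(w > 0) - (w < 0) for w in ws]


# Illustrative values only (assumption: one scalar per component).
weights = [0.5, 0.0, -2.0, 0.0]
scaled = [0.1 * w for w in weights]

# Shrinking every weight leaves the L0 count unchanged but lowers the L1 norm.
print(l0_norm(weights), l0_norm(scaled))
print(l1_norm(weights), l1_norm(scaled))
print(l1_subgradient(weights))
```

The L0 count does not change when the weights are scaled down, while the L1 norm does, and the L1 norm has a usable (sub)gradient everywhere. That is why I would expect an L1 penalty to be the simpler choice here.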