This explanation is intended as a general introduction to the techniques used by La. Note that simply reading this will not provide you with enough information to write an encoder with comparable compression to La, however may help you get started. With lossless audio compression, the devil is in the details, and there are a lot of details which I haven't even begun to cover here.
Also, it should be noted that the most of the ideas presented here are not original to La, but were already present in existing coders such as Wavpack and Monkey's Audio.
Lossless audio compression is split into two main parts - filtering and entropy coding. This explanation will focus mainly on the filtering aspect, as this is
the most important part.
A filter essentially takes a set of numbers and returns a new set. For the purposes of lossless audio compression, the transformation must be done in such a way that it is reversible and hopefully the transformation will reduce the range of numbers so that they compress better.
In La, most of the filters used are constructed out of predictors. A predictor here being a function which is passed the previous sample and returns a prediction of the next. The predictor may of course internally store some state / history. A filter can thus be created out of any predictor, such that the output value (residual) is the difference between the actual sample and the prediction, i.e.
residual = sample - predictor(lastSample)
and then to recover the original sample when decoding, we simply use:
sample = residual + predictor(lastSample)
A number of different predictors are used in La. The simplest is simply a delta filter, that is prediction = lastSample.
In order to begin building more complicated filters, the first step is to adapt the 'correlation' between the lastSample and the prediction. That is, we use a formula of the form prediction = lastSample * weight, and we adaptively adjust the value weight. A simple method to adapt weight would be to increase it when the last prediction was too low, and decrease it when the last prediction was too high (taking into account the possibilities of negative correlation of course).
From here we can improve things in a number of ways. The first step is to use multiple filters, applying them to the data successively (i.e. the second filter takes the output (that is, the residual and not the prediction!) of the first filter as input).
We can also create predictors which instead of correlating the previous sample correlate the sample n samples previous. Or, instead of using the previous sample as input, we could use the delta of the previous two, or we could group the previous n samples and use them as one input. We can further also play a lot with our weight change rule.
Another approach to improving upon this predictor is to create a single predictor which takes into account the past n samples. It then needs to store an corresponding array of n weights, and will require loops to adapt the weights and to calculate the prediction.
Most of La's filters are based on these ideas. It might be clear to the reader that a predictor using such an array as input is in some ways similar to a neural network. This is true, and one can transfer some principles involved in neural networks to the predictor. However, in my experience actual neural networks tend to perform relatively badly, and speed is obviously also an issue when using them.
Higher level adaptation
La uses around 8 layers of filters. One problem which has to solved in using so many layers is that adding an extra filter can sometimes actually decrease performance. In La this was solved by adapting the weight given to each layer's prediction. One basic method of doing this is to use the following formula adapted_prediction = prediction * weight. Here, similar weight update rules can be applied as above.
The method used here can have a very large impact on performance, and La uses some quite complicated methods to achieve its high levels of compression. Note that it is also a hard to define the problem or find a perfect solution here, as in early layers we may wish to reduce the weight of a filter even if it is currently reducing the error, as it may be increasing the complexity of the data and thus inhibiting the performance of later predictors! As always, trial and error tends to be the best way of determining what works best here.
Most lossless audio compressors, La included, try to take into account the similarity between channels in stereo audio to improve compression performance. The standard way to do this is to convert the L+R signals to X+Y, where X = L - R and Y = R + (X / 2).
La doesn't generally do this, as for files with low correlation between channels, this tend to decrease performance. A simple example is a file where one channel is silent - after the X+Y transformation both channels would contain the signal, thus potentially doubling the resultant filesize.
La works by having some of its predictors take into account samples from both channels. Thus, more complex correlation between the channels is better taken into account of, and it adapts better to the actual level of correlation existing in the signal rather than simply assuming that the channels are correlated as with the X+Y transformation.
The whole purpose behind the filters reducing the range of the samples is the assumption that smaller numbers can be stored more efficiently. How to actually efficiently store these numbers is the part of the entropy coder.
Typically, lossless audio coders have been based on an approach where an estimate is formed for each sample regarding the number of bits required to store it. The number is then split into the base (number%bits) and residual (number/bits). The base is stored using the estimated number of bits, and then the value of the residual is stored using Rice codes or similar. For a good explanation of this, see http://www.firstpr.com.au/audiocomp/lossless/.
La works a bit differently. Instead of estimating the number of bits, it estimates an actual value. So instead of estimating say 12 bits we would estimate a range of say 4938. Thus, sample/4938 would be stored as the base and sample%4938 as the residual. An arithmetic coder is used to efficiently store both the base and residual. The advantage offered here is that for both the base and residual we are not forced to round everything to use bits (we could estimate a range of 2731 for example and store that using effectively 11.2 bits or whatever). This form of entropy coding is thus significantly more efficient than the standard methods.
One further issue is the estimate of the number of bits required, or in La's case the range. Some encoders calculate the optimum value for a set of n numbers, and store this estimate in the output. La on the other hand adaptively estimates this value, updating the estimate after each sample. For example we could use the formula estimate = lastSample + estimate*c, where c is some constant between 0 and 1.