Updated Proposal: Chaos Labs - Risk & Simulation Platform

Thanks for bringing this up, @Alok_StandardCrypto. One major tenet of risk management, in both traditional finance and DeFi, is statistical rigor when choosing high-value parameters. While we agree that there are many benefits to having multiple risk assessments (provided their methodologies can be aggregated in a statistically coherent manner), this proposal describes neither a coherent statistical methodology nor, in the case studies provided, a level of rigor befitting billions of dollars of AUM. Let’s walk through why this is harmful; we would argue that any additional risk vendor should be required to provide a real risk methodology rather than summary statistics.

Risk Analysis methodology shares similarities with Smart Contract security

The Aave DAO has employed a number of continuous auditing entities, such as Certora, to provide formal verification services. A number of open-source verification and analysis tools (such as the K framework and Slither) exist, so one may naturally ask, “Why do we need to pay Certora? Can’t we provide bounties to developers to write the formal verification code themselves?” The answer is that it is extremely hard to construct a comprehensive set of tests, and missing even one can be deadly.

Let’s see why with a simple example.

There can technically be an unbounded set of properties that one needs to prove for a contract to be safe under all circumstances (especially if off-chain interaction is involved [0]). Suppose there exists a set of 10,000 properties (or invariants) that a smart contract needs to satisfy in order to provide the expected operation under all expected circumstances. Now suppose a new developer, not knowing that this was the requisite universe of properties, instead constructed a set of 100 properties that they were able to prove hold in the contracts using an open-source tool. This developer might then publish these tests, add them to a continuous integration toolchain, and lead everyone to believe that the contracts were safe despite not actually having full coverage. The reason one employs OpenZeppelin or Certora is that they have a better sense of the universe of possible properties and tests for a formal verifier and can ensure that their committed properties (including those found via contest) come closer to covering 10,000 than 100.
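
To make this concrete, here is a minimal, entirely hypothetical sketch (the `ToyPool` class and its two-property test suite are invented for illustration): a randomized harness that checks a small set of properties, passes every run, and still says nothing about the solvency invariant it never encodes.

```python
# A deliberately tiny "pool" with a deliberately tiny property suite. All names
# here are hypothetical and exist only to illustrate the coverage problem.
import random

class ToyPool:
    """Hypothetical lending pool with an obvious flaw."""
    def __init__(self):
        self.deposits = 0
        self.borrows = 0

    def deposit(self, amount):
        self.deposits += amount

    def borrow(self, amount):
        # Bug: nothing stops borrows from exceeding deposits.
        self.borrows += amount

# The two properties the hypothetical developer thought to check...
def prop_deposits_nonnegative(pool):
    return pool.deposits >= 0

def prop_borrows_nonnegative(pool):
    return pool.borrows >= 0

# ...and the solvency property that actually protects lenders, never encoded:
def prop_solvent(pool):
    return pool.borrows <= pool.deposits

checked_properties = [prop_deposits_nonnegative, prop_borrows_nonnegative]

for _ in range(1_000):
    pool = ToyPool()
    for _ in range(10):
        if random.random() < 0.5:
            pool.deposit(random.randint(1, 100))
        else:
            pool.borrow(random.randint(1, 100))
    # Every run passes, and the CI badge turns green...
    assert all(prop(pool) for prop in checked_properties)
    # ...while prop_solvent(pool) is routinely violated, because nobody checks it.
```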

Agent-based simulation (ABS) is in a similar position to formal verification: it is a tool whose output is only as good as its inputs. Like formal verification, there is a set of properties or expected behaviors that are tested based on the definition of the agents in the system and the context and/or environment in which those agents evolve. Unlike formal verification, however, the statistical assumptions made dramatically affect the quality of the results. For instance, one may naively think that high liquidation volume on a particular (collateral asset, borrowed asset) pair is a sign that a protocol parameter needs to change. However, liquidations are a healthy part of any leveraged system, and they are often anti-correlated with the worst behavior in the system: insolvencies. As such, an agent-based simulation that changes borrower behavior based on liquidations as a signal, rather than one constrained by insolvencies, would deliver completely inaccurate parameter recommendations (much like covering only 100 tests instead of 10,000).
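
As a minimal sketch of that distinction (all names and thresholds below are hypothetical, not anyone's production agent logic), consider the difference between flagging risk off liquidation volume versus constraining on realized insolvencies:

```python
# All class names, fields, and thresholds here are hypothetical; the point is the
# shape of the signal, not the numbers.
from dataclasses import dataclass

@dataclass
class SimStep:
    liquidation_volume: float  # value liquidated this step (a healthy mechanism)
    insolvent_debt: float      # debt left uncovered after collateral is seized

def naive_risk_signal(steps):
    """Flags any step with 'too many' liquidations, even if nothing went insolvent."""
    return any(step.liquidation_volume > 1e6 for step in steps)

def insolvency_constrained_signal(steps, tolerance=0.0):
    """Flags only steps where liquidations failed to prevent bad debt."""
    return any(step.insolvent_debt > tolerance for step in steps)

# A stress path in which liquidators deleverage the system and keep it solvent:
healthy_stress = [SimStep(liquidation_volume=5e6, insolvent_debt=0.0)]

print(naive_risk_signal(healthy_stress))              # True  -> would wrongly tighten parameters
print(insolvency_constrained_signal(healthy_stress))  # False -> correctly reports no bad debt
```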

The original post’s linked case studies make many mistakes of this form, suggesting that the authors are treating ABS as a form of directed fuzzing rather than as a statistically accurate, Bayesian learning tool. Aspects of the case study that severely undermine confidence in the statistical significance of the results include, but are not limited to:

  1. The authors depeg stETH by a large deviation and measure whether liquidations occur as a metric of protocol health. This misses several major statistical effects, which need to be constantly re-estimated from off-chain and on-chain data:
    a. Such a price level is usually unsustainable if other venues have enough liquidity and arbitrageurs can profit from converging prices. The model seems to ignore this effect, as well as the volatility (or any higher-order moments) of the price process.
    b. There is no description of the liquidity model used. Liquidation profitability is extremely sensitive to liquidity on both off-chain and on-chain venues, and these have different empirical elasticities because flash loans make on-chain liquidity cheaper for risk-averse liquidators. We demonstrate all of these effects in our Aave market risk assessment from 2020 and urge the original poster (or anyone who wants to do risk for a large protocol) to read it carefully. Their post suggests that they have not.
    c. There is no stability analysis. The impact of such a shock on the system depends on volatility conditions throughout DeFi; the smooth curves rendered in the case study assume purely deterministic behavior, which is very much not what is observed on-chain or in the mempool.
    d. The assumptions are wholly unrealistic:
    1. “In this simulation, we ignore the effect of stETH de-peg on other asset prices.”
    - This is simply not true in practice; moreover, the liquidity and/or slippage curves of other assets are also correlated with stETH/ETH!
    2. “We do not simulate any stETH buy pressure on Curve in order to speed up the cascading liquidation effect.”
    - This buy pressure was in fact what ensured that Aave remained safe; the original posters clearly did not look at on-chain data during the large liquidation events.

  2. The simulations run for a period of 150 blocks, which is too short to capture realistic arbitrageur behavior:
    a. The actual stETH depeg took place over a much longer time frame, and liquidity conditions changed dramatically within that interval. For reference, Gauntlet runs simulations over a minimum of 1 day (5,760 blocks), and we run over 40,000 simulations per day sampling different statistical configurations. It is much harder to obtain convergence and meaningful confidence intervals when you are running roughly 0.00000065x as much simulation.
    b. The authors do not seem to understand that assuming infinite CEX liquidity and running for only 150 blocks implicitly chooses a volatility (which they never specify to the reader!). Choosing a stopping time for a simulation directly impacts the statistical quality of the results, as the sketch following this list illustrates.
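
To illustrate point 2(b) with a minimal sketch (hypothetical numbers, simple Brownian approximation): forcing a deviation of a given size over a fixed, short horizon is equivalent to assuming an enormous volatility, whether or not it is ever stated.

```python
# Hypothetical numbers; a simple Brownian-motion approximation in which a typical
# move over time t scales like sigma * sqrt(t).
import math

BLOCKS_PER_YEAR = 365 * 5760   # using the 5,760 blocks/day figure cited above
depeg_move = 0.25              # an assumed 25% stETH/ETH deviation, for illustration

def implied_annualized_vol(move, horizon_blocks):
    """Volatility implicitly assumed when `move` is forced over `horizon_blocks`."""
    t_years = horizon_blocks / BLOCKS_PER_YEAR
    return move / math.sqrt(t_years)

print(implied_annualized_vol(depeg_move, 150))    # ~30, i.e. roughly 3000% annualized volatility
print(implied_annualized_vol(depeg_move, 5760))   # ~4.8, still huge but an explicit, inspectable choice
```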

In general, the above mistakes force anyone with even a modestly trained eye to question the conclusions and parameter choices made. While the original posters’ dashboards are a good way to visualize high-level statistics, their inferences get enough basic things wrong, in ways that dramatically affect parameter recommendations, that one should be nervous about using them. Access to data sources is not the same as being able to derive insights from that data with high confidence.

High value parameters are hard to “throw machine learning at blindly”

At Gauntlet, we pride ourselves on our two main contributions to the DeFi space: developing the core research needed to construct good models of DeFi protocols [1] and ensuring that our continuously retrained and optimized models match real-life outcomes. We spend much more time evaluating the out-of-sample error of our parameter recommendations than we do writing new agent logic, and there are a number of good reasons for this:

  1. If you can predict out-of-sample insolvencies with a strong precision-recall curve, then you can be confident that your parameter recommendations are close to optimal (see the sketch after this list).
  2. Being able to test deviations from no-arbitrage assumptions allows one to improve liquidity models throughout DeFi (which, in turn, affects liquidators’ realized costs, the most important part of the economic model).
  3. Understanding the trade-off between prediction quality for liquidations, revenue generated, and insolvencies and how the trade-off changes as the Aave loan book evolves is crucial to being able to automatically submit recommendations.
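
As a minimal sketch of point 1 (synthetic labels and scores, not our production pipeline), this is the kind of out-of-sample scoring we mean:

```python
# Synthetic labels and scores; the point is the evaluation, not the model.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)

# y_true: did the position actually become insolvent in a held-out window?
y_true = rng.binomial(1, 0.05, size=10_000)
# y_score: the model's predicted insolvency probability (noisy, for illustration)
y_score = np.clip(0.6 * y_true + rng.normal(0.1, 0.2, size=10_000), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("average precision (out of sample):", average_precision_score(y_true, y_score))
# A parameter recommendation is only defensible if this curve stays strong out of
# sample, across retrains, as the loan book and liquidity conditions evolve.
```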

We run over 300,000 simulations of the Aave protocol a week, with an eye towards how model quality changes when we use a large-scale hyperparameter optimization algorithm to choose financially important values like loan-to-value. Being able to measure model quality against realized on-chain behavior is one of the most beautiful things about DeFi; in traditional finance you would not have the entire data set at your fingertips. In DeFi, model quality translates directly into dollars earned for tokenholders and lenders.
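
A highly simplified sketch of that loop (the simulator, objective, and penalty weight below are entirely hypothetical stand-ins) looks roughly like this:

```python
# The simulator, objective, and penalty weight below are hypothetical stand-ins.
import random

def simulate_market(ltv, seed):
    """Stand-in for a full agent-based simulation; returns (revenue, insolvencies)."""
    rng = random.Random(seed)
    revenue = ltv * 100 + rng.gauss(0, 5)                           # more leverage, more fees...
    insolvencies = max(0.0, (ltv - 0.75) * 400 + rng.gauss(0, 2))   # ...until positions start going under
    return revenue, insolvencies

def objective(ltv, n_sims=200, insolvency_penalty=10.0):
    """Average penalized score across many sampled simulation configurations."""
    scores = []
    for seed in range(n_sims):
        revenue, insolvencies = simulate_market(ltv, seed)
        scores.append(revenue - insolvency_penalty * insolvencies)
    return sum(scores) / n_sims

best_ltv = max((x / 100 for x in range(50, 86)), key=objective)
print("best simulated LTV:", best_ltv)   # the recommendation, plus the data to defend it
```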

Unfortunately, the OP does not seem to understand this. Other risk management entities in DeFi, such as Risk DAO and Block Analitica, which differ from us in methodology and precision requirements, clearly do understand it, based on their track records of published research (of which Chaos appears to have none, certainly not at the caliber expected for a protocol managing billions in assets).

A prior version of the OP’s proposal also mentioned that they would compute “VaR” without describing any methodology for doing so. We have written a number of articles, from our Aave market risk report to our description of VaR and LaR calculations, that provide a precise methodology for computing these values. Chaos does not seem to understand that VaR is sensitive to the distributional assumptions made in a simulation. After all, VaR is simply a probabilistic tail bound; but how do you compute that tail? The empirical cumulative distribution function depends not only on your agents’ logic but on the statistical assumptions you make about the environment (prices, liquidity, the behavior of other protocols) and needs to be constrained by them. Having a risk provider who does not understand such basic facts (especially when there are at least three others who do!) seems like a disservice to users of the protocol.
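
As a minimal sketch (synthetic losses, not any particular protocol), two simulations with identical agent logic but different tail assumptions report very different VaR numbers:

```python
# Synthetic loss distributions; both are rescaled to unit variance so only the
# tail shape differs.
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

normal_losses = rng.normal(0.0, 1.0, size=n)                 # thin-tailed environment assumption
heavy_losses = rng.standard_t(df=3, size=n) / np.sqrt(3.0)   # heavy-tailed environment assumption

def empirical_var(losses, level=0.999):
    """Empirical VaR: the loss exceeded with probability 1 - level."""
    return np.quantile(losses, level)

print("99.9% VaR, normal tails      :", empirical_var(normal_losses))  # ~3.1
print("99.9% VaR, Student-t(3) tails:", empirical_var(heavy_losses))   # ~5.9
```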

Finally, one other major flaw in the OP’s post is that they optimize some protocol parameters and not others. In particular, they omit all of the liquidation parameters, which are crucial to incentivizing safe operation of the protocol. Moreover, the OP does not seem to understand that LTVs need to be optimized jointly with the liquidation parameters (i.e. searching over the O(n²) LTV parameters and the O(n) liquidation parameters simultaneously). If you do not, you end up with an LTV that avoids insolvencies only at an unrealistic liquidation threshold (e.g. 50%, which would be unacceptable to users). Such an oversight suggests the OP has not even thought through how to do the task they are asking $500,000 for!
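
A toy sketch of that joint search (all functions and numbers below are hypothetical stand-ins for a real simulation) shows why "zero insolvencies" alone is not a parameter recommendation:

```python
# All functions and numbers are hypothetical stand-ins for a real simulation.
import itertools

def insolvencies(liq_threshold):
    """Stand-in for a simulation under an assumed price-shock and liquidity model:
    the later liquidations trigger, the more bad debt survives the shock."""
    return max(0.0, liq_threshold - 0.70) * 1_000

def revenue(ltv):
    """Stand-in for borrower demand / protocol revenue, increasing in LTV."""
    return ltv * 100

# Feasible (LTV, liquidation threshold) pairs: LTV must sit below the threshold.
pairs = [
    (ltv / 100, lt / 100)
    for ltv, lt in itertools.product(range(30, 91), range(35, 96))
    if ltv / 100 <= lt / 100 - 0.05
]

# Insolvency-only view: "no bad debt" is satisfied by configurations with tiny
# LTVs and thresholds that no user would accept.
safest = min(pairs, key=lambda p: insolvencies(p[1]))

# Joint view: penalize insolvencies while accounting for what the protocol and
# its users actually need.
joint = max(pairs, key=lambda p: revenue(p[0]) - 10.0 * insolvencies(p[1]))

print("'zero insolvency' configuration :", safest)   # e.g. (0.30, 0.35): safe and useless
print("jointly optimized configuration :", joint)    # e.g. (0.65, 0.70)
```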

Track record

Much as formal verification companies earn a track record by proving that they can cover the space of properties well, so too should risk assessors. We have one of the longest on-chain track records of providing risk recommendations to protocols throughout DeFi. Moreover, we are putting our money where our mouth is: risk parameter changes impact billions of dollars of AUM and user welfare, and we do not take that lightly. We are committed to putting up capital to cover any losses incurred from bad predictions. The OP does not seem to recognize this, which perhaps explains their flippant modeling style. We note that this was also raised in @bgdlabs’s comments on the original proposal (and that the current proposal does not address all of BGD’s prior considerations).

We take integrations, and the statistical reliability of our simulations (roughly seven orders of magnitude more simulation per day than what the OP seems comfortable using to manage billions of dollars), seriously, and we make sure our model quality meets the highest standards before we support a new version of a protocol. Given the careful rollout of Aave V3, we have spent an enormous amount of effort ensuring that we model the liquidation mechanics and liquidity impact of e-mode to the highest quality (as we, of course, care about our track record).

Conclusion

Decentralization brings a lot of amazing things to the risk management community. First, free access to data from every protocol and liquidity source allows one to create rich models of user behavior using tools such as agent-based modeling. Moreover, many entities can develop their own tools and insights on this data, and the aggregate can be better than any individual contribution. These tools and this aggregated insight are things actuaries in traditional finance can only dream of! But at the same time, one can only extract as much from one’s tooling as the quality of the inputs allows. As they say in statistics and machine learning, “garbage in, garbage out.”

The above proposal amounts to setting parameters via fuzzing. The lack of care toward prior art, the seemingly large omissions, and the dramatic changes from the last proposal do not inspire confidence. Given that there are a number of other organizations, Gauntlet included, that are much more rigorous about this, it behooves tokenholders to be careful when considering such a proposal.

If the goal of Aave tokenholders and @AaveLabs is to help grow the protocol, bring in institutional capital flows (via mechanisms such as Aave Arc), and take advantage of the benefits of DeFi, then we need risk management that can be trusted. Institutions demand a high level of rigor in risk management, especially statistical rigor and a careful calibration of predictions to realized outcomes. To get DeFi to that level, we as a community need to continue investing in better risk methodologies and in research and tools that help the community understand the inherently complex properties of risk in DeFi. This proposal achieves none of that, making it impossible for us to support it or to recommend that anyone else support it.


[0] In fact, the name ‘oracle’ derives from the notion of an oracle machine: an arbitrary Turing machine that provides the output for a given input. This, of course, includes the halting problem, and hence the state space is technically unbounded.

[1] We are coauthors of over 24 papers in DeFi, including the highest-cited paper in DeFi. We have also written numerous papers on how to analyze lending protocols, including modeling how LTVs/collateral factors should change when one borrows capital against an LP share.
