We clobbered ourselves last night with Chia logging

Well, it looks like our anti-Chia tool clobbered everything last night. We had a clever way of detecting the plots, and without going into a lot of detail, it required us to log a whole lot of specifics about the size of read requests. The logging itself wasn't the problem, but the database it was going into became an issue at about 1am.

When the database failed, the local logging daemons we'd written (one on each storage server) fell back into their debug mode and started logging to syslog, rapidly filling every /var/log partition on each storage node.

We're cleaning them out now. We rushed this tool out to combat Chia without testing it as heavily as we should have, and without removing some debugging options. I still think it was the right thing to do; we HAVE to keep the plots off the network, but in the future we'll make it handle a full database more elegantly.

Expect service to return to normal in 2 hours.

2 thoughts on “We clobbered ourselves last night with Chia logging”

  1. Thanks for fixing the servers on a holiday! Happy 4th of July.

    Suggestion about the Chia code: after the holidays, maybe consider morphing it into abuse-prevention / resource-allocation code.

    It all boils down to Chia users consuming significantly more resources than any other customer, essentially taking more than their fair share.

    I think there's either one very large customer or a lot of smaller customers who have their daily backup set for 12:00 a.m. California time (US cluster). That in itself is not a bad idea, since fewer people are using the service then. But for about 30 to 45 minutes, listing a directory takes more than 30 seconds, if it works at all.

    Then, it clears up and everything starts moving again.

    If you had some load-balancing code where everyone got a share of the resources, so that one person or a few people couldn't consume all of them, I think it would be good. This would slow down the Chia people if they did slip through the cracks, and legitimate users who happen to consume a lot of resources wouldn't accidentally cause the same problem.

    I think that your servers have plenty of RAM/SSD/CPU to give every customer reasonable resources at the same time, but there doesn't seem to be a mechanism preventing one customer, or a small group of them, from consuming all of the resources at once. (I may be wrong about this, and I could have something else wrong on my end.) Maybe something that lowers their priority when they are using more than a fixed threshold? That way, if the servers are not busy, a late-night backup can run at ludicrous speed (haha), but if another customer tries to restore a backup, they get their fair share of resources instead of it being a consumption battle.

    This would deal with Chia derivatives as well, since new forks of the crypto can change how the files are stored. It would also cover other storage-based cryptos like BHD or MASS, where the plots are less well defined and the reads more random (but still consume an unfair amount of resources).

    Just a suggestion.
    Also, thanks for fixing this on the holidays. Happy 4th.

    1. Thanks Robert! We're about 75% back online now, and it certainly has been an uphill fight this holiday weekend.
      We're working as fast as we can to recover from our self-inflicted Chia wound here.

      Longer response to follow, but for now it's back to repairing Ceph nodes for me.
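The fair-share suggestion above could be sketched as a per-customer token bucket: each customer accrues I/O budget at a steady rate up to a burst cap, so an idle account can run its late-night backup flat out, while an account that keeps hammering the cluster drains its bucket and gets deprioritized instead of starving everyone else. This is a hypothetical illustration of the commenter's idea (the class, rates, and costs are all invented), not how any real scheduler here works:

```python
import time

class CustomerTokenBucket:
    """Per-customer I/O budget: refills at `rate` tokens/sec up to `burst`.
    A request costing more tokens than are available is not rejected,
    just flagged for a low-priority queue. Illustrative sketch only."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = burst          # start with a full budget
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def admit(self, cost):
        """True -> serve at full speed; False -> queue at low priority."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Because the bucket refills continuously, a throttled customer recovers on their own once their load drops, with no operator intervention needed.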
