Monday, November 16, 2020

Moving a Golang project from panics to errors. Wouldn't wish this on my enemies.

When Tim, Aaron, and I started working on this problem in August 2018, we immediately began playing with Noms. It was an open source project that gave us a lot of what Aaron and I had been talking about in order to deliver the features we felt were critical to our success: Noms gave us the ability to efficiently diff across commits and to merge them. Though Noms had become an inactive project, we felt that using it as our storage layer would tremendously improve our go-to-market time, so we decided it was worth diving deeper.

I built a demo of a command line tool that used the Git commands people already know in order to version structured table data. As we played with the demo, we felt like we were onto something and decided to really go after the "Git for Data" product that is Dolt today. After additional investment in delivering a fully featured Git-like command line, we began to think about DoltHub, a place for our users to find, collaborate on, and store data projects. This was the first time we needed to access data stored using Noms on backend servers, and it was then that we discovered issues with how Noms handled errors.

The Problem with Panics

I'm not going to go into detail on the language design decisions behind Go's panics and errors. If you would like that background before going further, I'd take a look at this article, which covers it in depth. However, a nasty property of panics is that a panic without a deferred recover on any goroutine will crash your program. If you are writing a server handling many concurrent requests, any panic that escapes on any goroutine will cause every single one of the concurrently executing requests to fail: all those connections will be closed, and the server will restart. In a sufficiently complicated server, where goroutines handling requests spawn additional goroutines and use CSP to do work, using panics for flow control is a recipe for disaster. With so much invested in Noms, we decided to "hard fork" the project and eliminate its heavy use of panics.
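To see why, here is a minimal, self-contained sketch (the names are made up, not from our codebase). A deferred recover only catches panics on its own goroutine, so a panic raised on a goroutine spawned by a request handler takes the whole process down:

package main

import (
    "fmt"
    "time"
)

// handleRequest stands in for a server request handler. Its deferred recover
// only protects against panics on this goroutine.
func handleRequest() {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println("recovered:", r)
        }
    }()

    // The panic happens on a different goroutine, so the recover above never
    // sees it and the entire process crashes, killing every other in-flight
    // request along with it.
    go func() {
        panic("unrecoverable")
    }()

    time.Sleep(time.Second)
}

func main() {
    handleRequest()
}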

Eliminating Panics

The process of eliminating panics started out pretty straightforward: search the codebase for a panic, change the signature of the method to return an error, and replace the panic. Then find all the call sites of that method and either handle the error or return it up the call stack. I thought I'd be able to migrate from panics to errors in about a week. Boy, was I wrong.
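As a hedged illustration of that mechanical transformation (the functions here are invented, nothing from Noms):

package main

import (
    "errors"
    "fmt"
)

// Before: the function panics on bad input. If that escapes a goroutine
// without a deferred recover, the whole process dies.
func mustDouble(n int) int {
    if n < 0 {
        panic("n must be non-negative")
    }
    return n * 2
}

// After: the signature gains an error return and the panic becomes an error.
func double(n int) (int, error) {
    if n < 0 {
        return 0, errors.New("n must be non-negative")
    }
    return n * 2, nil
}

// Every call site then either handles the error or passes it up the call stack.
func doubleAndIncrement(n int) (int, error) {
    doubled, err := double(n)
    if err != nil {
        return 0, err
    }
    return doubled + 1, nil
}

func main() {
    fmt.Println(doubleAndIncrement(3))
    fmt.Println(doubleAndIncrement(-1))
}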

Bro, I'm straight up not having a good time

As soon as I realized that Noms could panic when calculating the hash of its types, I knew I was way off. In the end I think I spent closer to six weeks on this. It was time-consuming and boring, but it couldn't be automated, and every one of the changes was an opportunity to introduce bugs. Today, searching the updated Noms codebase for if err != nil returns 1370 results, and I'm guessing more than half of those were introduced while removing panics. It affected hundreds of function signatures, and because Noms lies below our main application's code, the changes bubbled up to all the code which depends on Noms directly and indirectly.

The interesting cases

Go advocates using CSP to write concurrent code: new goroutines are spawned to do parallel processing, and channels are used to communicate between them. It was in these cases that panics were causing the biggest problems, and they are the more interesting cases to talk about, because converting them from panics to errors improperly can introduce deadlocks.

Let's start by defining a simple function we want to run concurrently:

func process(ctx context.Context, workCh chan WorkItem, resCh chan WorkItemResult) error {
    for workItem := range workCh {
        res, err := doWork(workItem)

        if err != nil {
            return err
        }       

        resCh <- res
    }

    return nil
}

The function takes a channel that it reads work items from and a channel that it writes results to. In the event that an error is encountered, the function returns the error. A straightforward but erroneous way to run it concurrently would be:

func myConcurrentFunc(ctx context.Context, workItems []WorkItem) ([]WorkItemResult, error) {
    workCh := make(chan WorkItem, len(workItems))
    resCh := make(chan WorkItemResult, len(workItems))
    wg := &sync.WaitGroup{}
    wg.Add(NumProcessors)

    var err error
    for i := 0; i < NumProcessors; i++ {
        go func() {
            defer wg.Done()
            // BUG: every goroutine writes to the same err variable without
            // any synchronization.
            err = process(ctx, workCh, resCh)
        }()
    }

    for _, workItem := range workItems {
        workCh <- workItem
    }

    close(workCh)
    wg.Wait()
    close(resCh)

    if err != nil {
        return nil, err
    }

    var workItemResults []WorkItemResult
    for result := range resCh {
        workItemResults = append(workItemResults, result)
    }

    return workItemResults, nil
}

The first problem here is that the err variable is written and read in an unsafe manner. This can be resolved using a mutex or Go's atomic primitives, and this is a simplified version of how I handled these errors in many cases: I created an AtomicError type that made these operations thread safe. But this example still has other issues.
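Here is a minimal sketch of what such a type might look like. It is not the exact AtomicError from our codebase, just the general idea, and it assumes "sync" is imported:

// AtomicError is a sketch of a thread-safe error holder. It records the
// first non-nil error it is handed and ignores the rest.
type AtomicError struct {
    mu  sync.Mutex
    err error
}

// SetIfError stores err if it is non-nil and no error has been stored yet.
func (ae *AtomicError) SetIfError(err error) {
    if err == nil {
        return
    }
    ae.mu.Lock()
    defer ae.mu.Unlock()
    if ae.err == nil {
        ae.err = err
    }
}

// Get returns the stored error, or nil if none has been set.
func (ae *AtomicError) Get() error {
    ae.mu.Lock()
    defer ae.mu.Unlock()
    return ae.err
}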

If process() errors on one of the goroutines, all the others will continue running until they all error or all the work items have been completed, even though the extra work has no impact (no matter what happens we are going to return nil, err). Depending on the use, maybe this behavior is acceptable, but in cases where the operation takes a long time, it does not work. This can be fixed by testing the state of the atomic error in the processing loop, but it makes the code pretty ugly.
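With the hypothetical AtomicError sketched above, for example, the loop ends up checking for a prior failure on every iteration, which is exactly the clutter I mean:

func process(ctx context.Context, workCh chan WorkItem, resCh chan WorkItemResult, ae *AtomicError) error {
    for workItem := range workCh {
        // Give up early if another goroutine has already recorded an error.
        if err := ae.Get(); err != nil {
            return err
        }

        res, err := doWork(workItem)

        if err != nil {
            ae.SetIfError(err)
            return err
        }

        resCh <- res
    }

    return nil
}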

Another suboptimal characteristic of this code is that you need to create the channels with buffers large enough to hold every work item and every result. It is possible to avoid this too, but it requires additional synchronization to close the result channel exactly once, either when an error is encountered or when all work is done and the last processing goroutine exits.

Finally, when this runs on a server as part of a request, there is a very real possibility that the request gets cancelled for some reason, such as exceeding a pre-defined deadline. To handle this and the other errors in a cleaner way, I wish I had known about errgroup.

errgroup

errgroup is an extremely simple package, essentially a single small file. It builds on the context.Context used throughout Go: errgroup.WithContext returns an errgroup.Group along with a derived context, and the group cancels that context as soon as any goroutine started with its Go method returns an error. Our updated code looks like this:

func process(ctx context.Context, workCh chan WorkItem, resCh chan WorkItemResult) error {
    for {
        var workItem WorkItem
        select {
        case wi, ok := <-workCh:
            if !ok {
                // workCh has been closed; all work is done.
                return nil
            }
            workItem = wi
        case <-ctx.Done():
            return ctx.Err()
        }

        res, err := doWork(workItem)

        if err != nil {
            return err
        }

        select {
        case resCh <- res:
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}

func myConcurrentFunc(ctx context.Context, workItems []WorkItem) ([]WorkItemResult, error) {
    workCh := make(chan WorkItem, BufferSize)
    resCh := make(chan WorkItemResult, BufferSize)
    eg, ctx := errgroup.WithContext(ctx)

    // Spawn the processing goroutines.
    for i := 0; i < NumProcessors; i++ {
        eg.Go(func() error {
            return process(ctx, workCh, resCh)
        })
    }

    // Feed the work channel, giving up if the context is cancelled.
    eg.Go(func() error {
        defer close(workCh)
        for _, workItem := range workItems {
            select {
            case workCh <- workItem:
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return nil
    })

    // Collect results on the calling goroutine while the workers run.
    var workItemResults []WorkItemResult
    for len(workItemResults) < len(workItems) {
        select {
        case result := <-resCh:
            workItemResults = append(workItemResults, result)
        case <-ctx.Done():
            return nil, eg.Wait()
        }
    }

    if err := eg.Wait(); err != nil {
        return nil, err
    }

    return workItemResults, nil
}

This code handles all the issues enumerated above, and it also allows the results to be handled concurrently with the processing. In the cases where you are doing something expensive with each result this could provide a performance improvement.
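As a usage sketch (handleRequest and handleResult are invented for illustration, and "context" and "time" are assumed to be imported), a request handler can now bound the whole pipeline with a deadline and get back a single error:

func handleRequest(ctx context.Context, workItems []WorkItem) error {
    // Bound the whole pipeline with a deadline. When it expires, the derived
    // errgroup context is cancelled and myConcurrentFunc returns the error.
    ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    results, err := myConcurrentFunc(ctx, workItems)
    if err != nil {
        return err
    }

    for _, res := range results {
        handleResult(res) // hypothetical consumer of each result
    }

    return nil
}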

Conclusion

Don't Panic

There seems to be only one reason to use panics for flow control: littering your code with explicit error checking after every function call is tedious and ugly. You're right, but the beauty of your code won't improve your product. Go's forced, explicit error handling results in programs that do the appropriate thing when error conditions are encountered. It may be ugly, but I hope my pain and suffering serve as yet another reason to return errors instead of panicking.



