The architecture was inspired by the paper "U-Net: Convolutional Networks for Biomedical Image Segmentation".
The training data contains only 30 images of 512*512 pixels, which is far too few to train a deep neural network.
The network is implemented in PyTorch, which makes it easy to experiment with different architectures.
The network outputs a 512*512 map that represents the mask to be learned. A sigmoid activation ensures that every mask pixel lies in the [0, 1] range.
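As an illustration of the output stage described above, here is a minimal sketch of a sigmoid mask head in PyTorch. The class name `MaskHead`, the 1x1 convolution, and the 64 input channels are assumptions made for the example (they follow the usual U-Net layout); the actual final layer of this model may differ.

```python
import torch
import torch.nn as nn


class MaskHead(nn.Module):
    """Hypothetical final layer: maps decoder features to a one-channel mask."""

    def __init__(self, in_channels: int = 64):  # 64 is an assumed channel count
        super().__init__()
        # A 1x1 convolution reduces the feature channels to a single logit per pixel.
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Sigmoid squashes each logit into the [0, 1] range.
        return torch.sigmoid(self.conv(features))


head = MaskHead(in_channels=64)
features = torch.randn(1, 64, 512, 512)  # random stand-in for the decoder output
with torch.no_grad():
    mask = head(features)

print(mask.shape)  # one 512*512 mask per input image
```

Thresholding the result, for example `mask > 0.5`, would turn the per-pixel probabilities into a binary mask; the threshold value is likewise an assumption.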
The model is trained for 25 epochs.
After 25 epochs, the computed mIoU is 65% and the overall validation accuracy is about 91%.
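To show how the two metrics relate, here is a small sketch in plain Python that computes mIoU and pixel accuracy on a toy example. It assumes that mIoU is averaged over the two classes (background 0 and foreground 1) of an already thresholded mask; the exact averaging used for the numbers above is not stated, and the function names are made up for the example.

```python
def iou(pred, target, cls):
    """IoU of one class over two equally sized flat lists of labels."""
    intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    # A class absent from both prediction and target has no defined IoU.
    return intersection / union if union else None


def mean_iou(pred, target, classes=(0, 1)):
    """Average the per-class IoU over all classes that actually occur."""
    scores = [iou(pred, target, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


# Toy flattened masks (1 = foreground, 0 = background).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]

print(mean_iou(pred, target))        # 0.5
print(pixel_accuracy(pred, target))  # 4 of 6 pixels match
```

In this example the accuracy (about 0.67) is higher than the mIoU (0.5), because IoU penalizes every mismatched pixel in both the prediction and the target of a class. The same effect explains why an accuracy of about 91% can coexist with an mIoU of 65%.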
The loss function used for training is plain cross-entropy.
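Since the network ends in a sigmoid, the cross-entropy is presumably the binary form, applied per pixel. That is an assumption; only "cross-entropy" is stated. The sketch below computes it with PyTorch's `nn.BCELoss` on made-up values and checks the result against the formula written out by hand.

```python
import torch
import torch.nn as nn

# BCELoss expects probabilities in [0, 1], i.e. values after the sigmoid.
criterion = nn.BCELoss()

# Made-up logits and ground-truth mask for a tiny 2*2 "image".
logits = torch.tensor([[2.0, -1.0], [0.0, 3.0]])
target = torch.tensor([[1.0, 0.0], [1.0, 1.0]])

probs = torch.sigmoid(logits)
loss = criterion(probs, target)

# The same value from the definition:
# -mean(y * log(p) + (1 - y) * log(1 - p))
manual = -(target * probs.log() + (1 - target) * (1 - probs).log()).mean()

print(loss.item(), manual.item())
```

If the model returned raw logits instead of sigmoid outputs, `nn.BCEWithLogitsLoss` would give the same value while being numerically more stable.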