I tried resnext-101 (32x4d) on 160x160 crops and also the torchvision resnet-101. For me, the ResNeXt needs nearly double the time to train compared to the ResNet. (The only thing I changed is the final global average pooling and the classifier layer.)
Is this expected behavior? Here it says the required FLOPs are the same, but obviously a framework like PyTorch may be slower because of the more complicated network structure.
Yeah, I already thought about upsampling, but came to the conclusion that it most probably makes more sense to use dilated convolutions instead: my input is then not interpolated, and the network still maintains high-resolution feature maps.
The low-level layers can still stay the same and use the ImageNet weights as initialization, since these are pretty much just simple filters (which react to edges, for example, like a Sobel filter). The mid- to high-level filters will probably have to change a lot, though.
I have not tried any of this yet; these are just my thoughts. I actually do not know of a state-of-the-art paper on classification with dilated convolutions, but in semantic segmentation they are often used to achieve a larger field of view. Intuitively, the field of view can generally be increased either by a pooling operation/strided convolution (which reduces resolution, which is why a 224 input is better than 160) or by dilated convolutions (resolution stays the same, but computation increases).
Since I assume that the low-level filters are not an accuracy bottleneck, but the mid-/high-level features are, it may make sense to use dilated convolutions there.
So maybe I will try to use the normal architecture for 160 -> 80 -> 40 -> 20, then remove the remaining two striding/pooling operations and use dilated convolutions on the 20x20 maps.