Empirical research on End-to-End FCOS
Inspired by YOLOv10, I recently conducted an empirical study on FCOS to evaluate the End-to-End detection paradigm.
Experiments
Notably, the three FCOS variants run at almost the same FPS, with or without NMS.
| Model | Scale | FPS (FP32, RTX 4060) | AP val 0.5:0.95 | AP val 0.5 | Weight | Logs |
|---|---|---|---|---|---|---|
| FCOS_RT_R18_3x | 512,736 | 56 | 35.8 | 53.3 | ckpt | log |
| FCOS_RT_R18_3x (O2O) | 512,736 | 56 | 30.9 | 48.8 | ckpt | log |
| FCOS_E2E_R18_3x | 512,736 | 56 | 34.1 | 50.6 | ckpt | log |
For FCOS_RT_R18_3x, we train FCOS-RT-R18-3x with only the one-to-many assigner and evaluate it with NMS.
For FCOS_RT_R18_3x (O2O), we train FCOS-RT-R18-3x with only the one-to-one assigner and evaluate it without NMS.
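
The two evaluation modes above differ only in post-processing. The following is a minimal, dependency-free sketch of that difference, not the code used in this repo: the function names, the thresholds (`score_thresh`, `iou_thresh`, `topk`), and the class-agnostic simplification are all assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def _sorted_candidates(scores, score_thresh):
    """Indices of predictions above the threshold, best score first."""
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)


def postprocess_with_nms(boxes, scores, score_thresh=0.05, iou_thresh=0.6):
    """Greedy NMS: keep a box only if it does not overlap a kept, higher-scored box.

    Thresholds are illustrative assumptions, not values from this repo.
    """
    keep = []
    for i in _sorted_candidates(scores, score_thresh):
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


def postprocess_without_nms(boxes, scores, score_thresh=0.05, topk=100):
    """NMS-free: threshold the scores and keep the top-k, with no box comparisons."""
    return _sorted_candidates(scores, score_thresh)[:topk]


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(postprocess_with_nms(boxes, scores))     # duplicate box 1 is suppressed
print(postprocess_without_nms(boxes, scores))  # all three boxes are returned
```

A model trained with one-to-many assignment tends to emit several high-scoring boxes per object, so it relies on the first function; a model trained with one-to-one assignment is expected to emit a single confident box per object, which is why the second function can skip the pairwise comparison.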
For FCOS_E2E_R18_3x, we deploy two parallel detection heads, one trained with the one-to-many assigner (o2m head) and the other with the one-to-one assigner (o2o head). To avoid conflicts between the gradients from the o2o head and the o2m head, we stop the gradients of the o2o head from flowing back to the backbone and neck, so that only the gradients from the o2m head update the backbone and neck. This operation is consistent with the practice of YOLOv10. For evaluation, we remove the o2m head and use only the o2o head, without NMS.
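
The gradient truncation described above can be sketched in PyTorch with `Tensor.detach()`. This is a toy illustration under stated assumptions, not the repo's implementation: `DualHeadDetector`, the single-conv "backbone + neck", the single-conv heads, and `num_classes=80` are hypothetical stand-ins, and the summed outputs are placeholders for the real o2m and o2o losses.

```python
import torch
import torch.nn as nn


class DualHeadDetector(nn.Module):
    """Toy detector with parallel o2m and o2o heads (hypothetical stand-in)."""

    def __init__(self, in_ch=3, feat_ch=16, num_classes=80):
        super().__init__()
        # Stand-in for the real backbone and neck.
        self.backbone_neck = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.o2m_head = nn.Conv2d(feat_ch, num_classes, 3, padding=1)
        self.o2o_head = nn.Conv2d(feat_ch, num_classes, 3, padding=1)

    def forward(self, x):
        feat = self.backbone_neck(x)
        if not self.training:
            # Evaluation: the o2m head is dropped, only the o2o head runs.
            return {"o2o": self.o2o_head(feat)}
        return {
            # Gradients from the o2m head reach the backbone and neck.
            "o2m": self.o2m_head(feat),
            # detach() cuts the graph, so o2o gradients stop at the head.
            "o2o": self.o2o_head(feat.detach()),
        }


torch.manual_seed(0)
model = DualHeadDetector().train()
out = model(torch.randn(2, 3, 64, 64))
backbone_weight = model.backbone_neck[0].weight

# Placeholder for the o2o loss: it updates the o2o head only.
out["o2o"].sum().backward()
print(backbone_weight.grad is None)            # backbone untouched by o2o
print(model.o2o_head.weight.grad is not None)  # o2o head still trains

# Placeholder for the o2m loss: it updates the o2m head, backbone and neck.
out["o2m"].sum().backward()
print(backbone_weight.grad is not None)
```

Because the o2o branch shares the backbone features at inference time and the o2m head is simply not executed, the evaluated model has the same cost as a single-head detector.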