This work proposes a multiple instance learning (MIL) method that jointly learns instance- and bag-level embeddings in a single framework. The method predicts instance labels accurately and uses robust hierarchical pooling of features to obtain bag-level features without sacrificing accuracy.
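To make the relationship between instance-level and bag-level representations concrete, the following is a minimal NumPy sketch, not the architecture proposed in this work. It assumes a single ReLU layer for the instance embeddings, a logistic head for the instance labels, and a two-stage pooling (max within fixed-size groups of instances, then mean across groups) as one possible form of hierarchical pooling. All function names, shapes, weights, and the `group_size` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def embed_instances(bag, W):
    """Map raw instances (n, d_in) to instance embeddings (n, d_emb) with a ReLU layer."""
    return np.maximum(bag @ W, 0.0)

def instance_scores(emb, w):
    """Per-instance probability from a logistic head on the instance embeddings."""
    return 1.0 / (1.0 + np.exp(-(emb @ w)))

def hierarchical_pool(emb, group_size):
    """Assumed two-stage pooling: max within consecutive groups, then mean across groups."""
    n = emb.shape[0]
    groups = [emb[i:i + group_size] for i in range(0, n, group_size)]
    group_feats = np.stack([g.max(axis=0) for g in groups])  # (n_groups, d_emb)
    return group_feats.mean(axis=0)                          # (d_emb,)

# Toy bag of 10 instances with 5 raw features each (random, untrained weights).
rng = np.random.default_rng(0)
bag = rng.normal(size=(10, 5))
W = rng.normal(size=(5, 8))   # hypothetical embedding weights
w = rng.normal(size=8)        # hypothetical instance-classifier weights

emb = embed_instances(bag, W)                    # instance-level embeddings
scores = instance_scores(emb, w)                 # instance-level predictions
bag_feat = hierarchical_pool(emb, group_size=4)  # bag-level feature vector
```

In this sketch the same instance embeddings feed both the instance classifier and the pooled bag feature, which is the sense in which the two levels share one framework; in a trained model the weights would be learned jointly rather than sampled at random.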