The field of 3D object detection and perception in autonomous driving is shifting toward more efficient and integrated solutions. Recent work emphasizes multimodal alignment, efficient view transformation, and adaptive input aggregation as levers for improving both accuracy and computational efficiency. Advances in temporal modeling and query-based approaches are also raising the performance ceiling, particularly in handling dynamic scenes and reducing computational overhead. Notably, the integration of state space models and novel transformer designs is producing promising results on detection and segmentation tasks. Together, these developments point to a trend toward sophisticated yet efficient methods that exploit multiple data modalities and temporal information.
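To make the query-based paradigm mentioned above concrete, here is a minimal sketch of a DETR-style detection head in which a fixed set of learned object queries cross-attends to flattened scene features and regresses 3D boxes. This is an illustrative example, not any specific paper's architecture; the class name, dimensions, and box parameterization are all hypothetical.

```python
import torch
import torch.nn as nn

class QueryBasedDetectionHead(nn.Module):
    """Illustrative DETR-style head: learned object queries cross-attend
    to flattened scene features, then predict class logits and 3D boxes."""

    def __init__(self, num_queries=100, d_model=256, num_classes=10):
        super().__init__()
        # A fixed budget of learned object queries (hypothetical size).
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(d_model, num_classes)  # per-query class logits
        self.box_head = nn.Linear(d_model, 7)            # (x, y, z, w, l, h, yaw)

    def forward(self, scene_feats):
        # scene_feats: (B, N, d_model) flattened BEV or image features.
        b = scene_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        q = self.decoder(q, scene_feats)  # queries attend to the scene
        return self.cls_head(q), self.box_head(q)

# Usage with dummy features standing in for a real backbone's output.
feats = torch.randn(2, 1024, 256)
logits, boxes = QueryBasedDetectionHead()(feats)
print(logits.shape, boxes.shape)  # (2, 100, 10) and (2, 100, 7)
```

The appeal of this design, and part of why query-based methods recur in the work surveyed here, is that it removes dense anchor grids and hand-tuned non-maximum suppression: the fixed query set bounds the per-frame computation regardless of scene complexity.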