Description
When initializing a VILA-HD PS3 model for inference, build_vision_tower raises an AttributeError because of a typo: it reads the attribute s3_scales, but the PS3VisionEncoder class defines the attribute as ps3_scales.
Error Log
Traceback (most recent call last):
  File "<vila-infer-executable>", line 6, in <module>
    sys.exit(main())
  ...
  File "<builder.py>", line 75, in build_vision_tower
    config.mm_scale_num = len(vision_tower.vision_tower.vision_model.s3_scales)
  ...
AttributeError: 'PS3VisionEncoder' object has no attribute 's3_scales'. Did you mean: 'ps3_scales'?
Root Cause
The PS3VisionEncoder class explicitly defines the multi-scale configuration attribute as ps3_scales (consistent with the PS3 model naming convention), but the code in builder.py incorrectly uses s3_scales (missing the "p" prefix), leading to a failed attribute lookup.
Suggested Fix
In the build_vision_tower function of multimodal_encoder/builder.py (line 75):
Change:
config.mm_scale_num = len(vision_tower.vision_tower.vision_model.s3_scales)
To:
config.mm_scale_num = len(vision_tower.vision_tower.vision_model.ps3_scales)
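For anyone who wants to confirm the behaviour without loading a checkpoint, here is a minimal sketch that reproduces the failed lookup and the corrected one. The PS3VisionEncoder class below is a hypothetical stand-in, not the real class: only the attribute name ps3_scales comes from the traceback. The scale values and the helper names count_scales and count_scales_buggy are assumptions made for illustration.

```python
from types import SimpleNamespace


class PS3VisionEncoder:
    """Hypothetical stand-in; only the attribute name mirrors the traceback."""

    def __init__(self, ps3_scales):
        self.ps3_scales = ps3_scales


def count_scales_buggy(vision_model):
    # Same lookup as the current builder.py line 75 (typo: s3_scales).
    return len(vision_model.s3_scales)


def count_scales(vision_model):
    # Corrected lookup using the attribute the class actually defines.
    return len(vision_model.ps3_scales)


# Illustrative scale values, not taken from any real PS3 config.
encoder = PS3VisionEncoder(ps3_scales=[378, 756, 1512])

config = SimpleNamespace()
config.mm_scale_num = count_scales(encoder)

try:
    count_scales_buggy(encoder)
    error_message = None
except AttributeError as exc:
    # The misspelled attribute fails exactly as in the error log.
    error_message = str(exc)
```

With the corrected name, config.mm_scale_num is set to the number of entries in ps3_scales, while the misspelled lookup raises the AttributeError shown in the error log.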
Additional Context
This typo blocks the initialization of VILA-HD PS3 models during inference.
The PS3VisionEncoder class uses ps3_scales consistently for multi-scale processing configuration (e.g., defining resolution scales for the vision encoder).
Correcting this single typo resolves the AttributeError and allows the model to load successfully.
Thanks for maintaining this great project! Let me know if any additional details are needed to validate this fix.