Vision Language Models, that is. Does anyone make extensive use of them? I don't mean the occasional image upload to ChatGPT, but rather some automated flow that takes constant visual input and produces language output.
What type of application is it for, and how have you found the results/quality?
Do you use a stock model, or do you fine-tune something on a custom dataset?
If you haven't used VLMs yet but are curious or interested, what would you try using them for?

