Microsoft Research makes great strides in automatic image captioning
Microsoft Research is home to all manner of interesting projects and experiments, and one of the latest the team is keen to share news about is automatic image captioning. There are no prizes for guessing what this is -- it's very much what it says on the tin -- and the technology has now reached a stage where the automatically generated captions for an image are at least as good as those written by people.
A team of just 12 worked on the project, and the results are pretty impressive. The system analyzes an image and identifies its key components. Once the objects and their characteristics have been determined, they can be evaluated in relation to one another to help decide what is important and what can be ignored.
A series of possible image descriptions is then created, and these are ranked according to how much sense they make in natural language, and whether importance has been attributed to the correct components. A post on the Machine Learning Blog gives you the opportunity to see if you can determine which image captions were written by hand and which came from a machine. You might be surprised at just how difficult it is to distinguish between the two!
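The pipeline described above can be sketched in miniature. To be clear, this is a hypothetical illustration, not Microsoft's actual system: the object detector and language-model scorer below are stand-in functions, and the template-based candidate generation is a simplification of what a real captioning model does.

```python
from itertools import permutations

def detect_objects(image):
    # Stand-in detector: a real system would run a vision model here.
    # We assume it returns (object, importance) pairs for the image.
    return image  # e.g. [("dog", 0.9), ("frisbee", 0.8), ("tree", 0.3)]

def generate_candidates(objects, importance_threshold=0.5):
    # Keep only the components judged important, then build simple
    # template-based descriptions from orderings of those objects.
    salient = [name for name, score in objects if score >= importance_threshold]
    return ["a " + " with a ".join(order) for order in permutations(salient)]

def language_score(caption, preferred=("dog with a frisbee",)):
    # Stand-in for a language model: reward candidates that contain
    # phrasings which "make sense" in natural language.
    return sum(1 for phrase in preferred if phrase in caption)

def caption_image(image):
    # Detect components, generate candidate descriptions, then rank
    # them and return the most natural-sounding one.
    objects = detect_objects(image)
    return max(generate_candidates(objects), key=language_score)

print(caption_image([("dog", 0.9), ("frisbee", 0.8), ("tree", 0.3)]))
# → a dog with a frisbee
```

Note how the importance threshold discards the low-scoring "tree" before any captions are generated, mirroring the step where the system decides what can be ignored.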
There are lots of potential applications for this technology, but two of the most obvious are voice control and accessibility options. At the moment it is possible to conduct web searches for images by controlling a computer with your voice, but this is reliant on images having been appropriately tagged and captioned already. If software can be used to automatically caption images on the fly, voice-activated image searches can cast a wider net.
But perhaps a more interesting use of the technology would be to make computers in general, and the internet specifically, more accessible to people with sight problems. Text-to-speech is great for hearing what has been written on a particular page, but what about the accompanying imagery? With automatic captioning, pictures that have been added to an article could be described by text-to-speech software regardless of whether a descriptive caption had been added by the author.
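That fallback logic is simple to sketch. The `auto_caption` function below is a hypothetical stand-in for a captioning service; in practice a screen reader would speak whichever description this returns.

```python
def auto_caption(src):
    # Hypothetical captioning model; here just a placeholder string.
    # A real implementation would send the image to a vision model.
    return f"an automatically generated description of {src}"

def describe_image(src, alt=None):
    # Prefer the author's own caption; fall back to the generated one
    # only when no descriptive text was provided.
    if alt:
        return alt
    return auto_caption(src)

print(describe_image("cat.jpg", alt="a tabby cat asleep on a sofa"))
# → a tabby cat asleep on a sofa
print(describe_image("beach.jpg"))
# → an automatically generated description of beach.jpg
```

The design point is that automatic captions supplement rather than replace human-written ones: author-supplied text always wins when it exists.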