Overview

Automatic creation of minutes of meetings using OpenAI’s Whisper API and Google Cloud Speech-to-Text API.

Business Challenge

In a multinational organization like SHIFT ASIA, where staff come from different countries, we often communicate in Vietnamese, Japanese, or English, which for many of us is not a native language. As a result, miscommunication among members can become a major issue, both at offshore development companies like SHIFT ASIA and at many other companies and organizations.

Our solution

Recent advances in AI have opened up new opportunities. By applying AI, we can develop a tool that automatically creates highly accurate meeting minutes, reducing miscommunication in conversations to some extent.

Specific initiatives

Using the Whisper API and the Google Cloud Speech-to-Text API, we built our own automatic minutes generation system on a trial basis. Meetings with multiple participants are recorded, and minutes are automatically generated as text data from the recordings.
We then compared the output of the two APIs to assess the characteristics and accuracy of each model.
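The source does not say which metric was used for this comparison. One common choice for scoring a transcript against a hand-corrected reference is word error rate (WER), so the following is a minimal sketch under that assumption. The function names, engine labels, and sample sentences are hypothetical and are not taken from the SHIFT ASIA system. It also assumes whitespace-separated words, which suits English but not unsegmented Japanese text, where a character-level variant would be needed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rank_transcripts(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Return (engine label, WER) pairs sorted from most to least accurate."""
    scores = [(name, word_error_rate(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    reference = "we will release the build on friday"
    outputs = {
        "engine_a": "we will release the build on friday",
        "engine_b": "we will release a bill on friday",
    }
    for name, wer in rank_transcripts(reference, outputs):
        print(f"{name}: WER = {wer:.2f}")
```

A lower WER means the transcript is closer to the reference. A single number like this hides qualitative differences such as language detection or dialect handling, so it would complement a manual review of the outputs rather than replace it.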

Research results and future prospects

At present, the output accuracy of both the Whisper API and the Google Cloud Speech-to-Text API is not high enough to be used for minutes of public meetings.

The Whisper API and the Google Cloud Speech-to-Text API differ clearly in how they handle automatic language detection and speaker dialects. We found that these two major speech recognition services still have a long way to go before they can automatically generate accurate meeting minutes from conversations involving multiple speakers.

However, both can generate simple memos, and the output is useful enough to serve as a reference document from which the minutes can then be updated manually or with generative AI.

Generating official-quality text still requires higher accuracy, which depends on improved technical capabilities and system customization. Even so, SHIFT ASIA can transcribe audio data with high precision and use that text data to create even more sophisticated transcriptions. We will continue to research and develop AI transcription using speech recognition models to enable comprehensive analysis.