This repo contains the server developed for the DigiTala in Action project.
Project requirements:
- Develop a server that processes speech sent by a mobile app and returns five speech rating scores:
  fluency, pronunciation, range, accuracy, and holistic.

Our main goals are as follows:
- The server is secure and reliable: it should run with minimal maintenance for the next 4-5 years. If the instance restarts, the server should start again automatically and no data should be lost.
- Simplicity of setup and maintenance: the maintainer may not be familiar with server development. We may also need to migrate the server to a better (or worse) CSC instance in the future, so ease of setup is important.
- For these reasons we prefer Docker/Podman, but if it is possible to run without any container (which is simpler), we are also fine with that solution.
- Please comment your code.
- We would like a script to set up the server (see Server_setup.md as an example) and another script to rebuild the container when we make changes (see the SaySvenska server).
- Processing 45 s of audio with an AI model usually takes quite long (we are looking at 10-20 s here), so take that into account. (What happens if two people speak at the same time on the mobile app? Does the server crash?)
- We also need the server to store some pseudonymized user data: UUID (generated by the mobile app), consent, timestamp, the speech itself (depending on consent), the scores, and so on. You can use MongoDB, but for simplicity we prefer saving the data in a CSV file (with the audio files in a folder).
- Other requirements may come up during the project, but they must fit within your timeline (a complex feature request at the end of the project timeline is a no-go).
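On the concurrency question above: one simple way to keep two simultaneous speakers from crashing the server is to serialize AI inference through a single worker, so the second request waits in a queue instead of running a second model inference at the same time. A minimal sketch using only the standard library; the `fake_ai_process` function and its 0.1 s sleep are stand-ins for the real model (which would take 10-20 s):

```python
import concurrent.futures
import time

# A single-worker pool: if two users submit audio at the same time,
# the second request simply waits in the queue instead of crashing
# or running two inferences at once (which could exhaust GPU memory).
inference_pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def fake_ai_process(wav_bytes: bytes) -> dict:
    """Stand-in for the real AI model (hypothetical); sleeps to mimic latency."""
    time.sleep(0.1)  # the real model would take ~10-20 s for 45 s of audio
    return {"fluency": 4.3, "pronunciation": 2.4, "range": 3.3,
            "accuracy": 4.9, "holistic": 4.0}

def assess(wav_bytes: bytes) -> dict:
    """Submit one audio clip and block until the single worker has processed it."""
    return inference_pool.submit(fake_ai_process, wav_bytes).result()

if __name__ == "__main__":
    # Two "simultaneous" requests: both succeed, the second just waits its turn.
    f1 = inference_pool.submit(fake_ai_process, b"user-1 audio")
    f2 = inference_pool.submit(fake_ai_process, b"user-2 audio")
    print(f1.result()["holistic"], f2.result()["holistic"])
```

The trade-off is latency, not correctness: with a 20 s model, the second user waits up to 40 s, but neither request is dropped.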
For more information, you can look at the SaySvenska server: https://github.com/Usin2705/SaySvenska/tree/main/Server
You can also look at an example of the API (a bit old now) in the SaySuomi README file: https://github.com/Usin2705/CaptainA_unity/tree/main
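The pseudonymized-data requirement above could be as simple as appending one row per assessment to a CSV file with Python's standard `csv` module. A sketch; the file name `ratings.csv` and the exact column set are assumptions to be adjusted for the real project:

```python
import csv
import os
from datetime import datetime, timezone

CSV_PATH = "ratings.csv"  # assumed file name
FIELDS = ["uuid", "consent", "timestamp",
          "fluency", "pronunciation", "range", "accuracy", "holistic"]

def save_record(user_uuid: str, consent: bool, scores: dict,
                csv_path: str = CSV_PATH) -> None:
    """Append one pseudonymized record; write a header row on first use."""
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({"uuid": user_uuid,
                         "consent": consent,
                         "timestamp": datetime.now(timezone.utc).isoformat(),
                         **scores})
```

Because each call opens the file in append mode and closes it again, a crash or instance restart loses at most the row being written, which fits the reliability goal; audio files (when consented) would be saved separately under the UUID in a folder.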
Input and Output format:
The server expects to receive the following data via a RESTful request:
- audio data: as an attached file, in .wav format
- guid: as text, to identify the user
- other text (or numeric) data collected from user feedback

The server expects to return, at minimum, the following data:
```python
from flask import Flask, jsonify, request
# You will use FastAPI, so the code will be slightly different

def func_assess_speech():
    # Function for assessing the user's speaking skill
    wav_file = request.files['file']         # We receive the attached audio file here
    guid = request.form["guid"]              # The GUID sent to us, to know who this user is
    other_info = request.form["other_info"]  # Other info needed/collected

    # This is just an example that uses an AI model to process the audio file.
    # For example, you might get:
    #   fluency = 4.3, pronunciation = 2.4, range_score = 3.3,
    #   accuracy = 4.9, holistic = 4.0
    fluency, pronunciation, range_score, accuracy, holistic = ai_model.process(wav_file)

    return jsonify({
        "fluency": fluency,
        "pronunciation": pronunciation,
        "range": range_score,
        "accuracy": accuracy,
        "holistic": holistic,
    }), 200
```