This repo contains the server developed for the DigiTala in Action project.
Project requirements:
- Develop a server that processes speech sent by a mobile app and returns five speech rating scores:
  fluency, pronunciation, range, accuracy, and holistic.

Our main goals are as follows:
- The server is secure and reliable: it should run with minimal maintenance for the next 4-5 years. If the instance restarts, the server should start again automatically and no data should be lost.
- Simplicity of setup and maintenance: the maintainer may not be familiar with server development. We may also need to migrate the server to a better (or worse) CSC instance in the future, so ease of setup is important.
- For these reasons we prefer Docker/Podman, but if it is possible to run without any container (which is simpler), we are also fine with that solution.
- Please comment your code.
- We would like a script to set up the server (see Server_setup.md as an example) and another script to rebuild the container when we make changes (see the SaySvenska server).
- Processing 45 s of audio with an AI model usually takes quite long (we are looking at 10-20 s here), so take that into account. (What happens if two people speak at the same time on the mobile app? Does the server crash?)
- We also need the server to store some pseudonymized user data: UUID (generated by the mobile app), consent, timestamp, the speech itself (depending on consent), the scores, and so on. You can use MongoDB, but for simplicity we prefer saving the data in a CSV file (with the audio files in a folder).
- Other requirements may come up during the project, but they must fit within your timeline (a complex feature request at the end of the project timeline is a no-go).
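On the concurrency question above: one simple way to keep two simultaneous speakers from crashing the server is to serialize AI inference through a single worker, so the second request waits in a queue instead of running a second model inference at the same time. A minimal sketch using only the standard library; the `fake_ai_process` function and its 0.1 s sleep are stand-ins for the real model (which would take 10-20 s):

```python
import concurrent.futures
import time

# A single-worker pool: if two users submit audio at the same time,
# the second request simply waits in the queue instead of crashing
# or running two inferences at once (which could exhaust GPU memory).
inference_pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def fake_ai_process(wav_bytes: bytes) -> dict:
    """Stand-in for the real AI model (hypothetical); sleeps to mimic latency."""
    time.sleep(0.1)  # the real model would take ~10-20 s for 45 s of audio
    return {"fluency": 4.3, "pronunciation": 2.4, "range": 3.3,
            "accuracy": 4.9, "holistic": 4.0}

def assess(wav_bytes: bytes) -> dict:
    """Submit one audio clip and block until the single worker has processed it."""
    return inference_pool.submit(fake_ai_process, wav_bytes).result()

if __name__ == "__main__":
    # Two "simultaneous" requests: both succeed, the second just waits its turn.
    f1 = inference_pool.submit(fake_ai_process, b"user-1 audio")
    f2 = inference_pool.submit(fake_ai_process, b"user-2 audio")
    print(f1.result()["holistic"], f2.result()["holistic"])
```

The trade-off is latency, not correctness: with a 20 s model, the second user waits up to 40 s, but neither request is dropped.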
For more information, you can look at the SaySvenska server: https://github.com/Usin2705/SaySvenska/tree/main/Server
You can also look at an example of the API (a bit old now) in the SaySuomi README file: https://github.com/Usin2705/CaptainA_unity/tree/main
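The pseudonymized-data requirement above could be as simple as appending one row per assessment to a CSV file with Python's standard `csv` module. A sketch; the file name `ratings.csv` and the exact column set are assumptions to be adjusted for the real project:

```python
import csv
import os
from datetime import datetime, timezone

CSV_PATH = "ratings.csv"  # assumed file name
FIELDS = ["uuid", "consent", "timestamp",
          "fluency", "pronunciation", "range", "accuracy", "holistic"]

def save_record(user_uuid: str, consent: bool, scores: dict,
                csv_path: str = CSV_PATH) -> None:
    """Append one pseudonymized record; write a header row on first use."""
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({"uuid": user_uuid,
                         "consent": consent,
                         "timestamp": datetime.now(timezone.utc).isoformat(),
                         **scores})
```

Because each call opens the file in append mode and closes it again, a crash or instance restart loses at most the row being written, which fits the reliability goal; audio files (when consented) would be saved separately under the UUID in a folder.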
Input and Output format:
The server expects to receive the following data via a RESTful request:
- audio data: as an attached file, in .wav format
- guid: as text, to identify the user
- other text (or numeric) data collected from user feedback

The server expects to return, at minimum, the following data:
```python
from flask import Flask, jsonify, request
# You will use FastAPI, so the code will be slightly different

def func_assess_speech():
    # Function for assessing the user's speaking skill
    wav_file = request.files['file']         # We receive the attached audio file here
    guid = request.form["guid"]              # The GUID sent to us, to know who this user is
    other_info = request.form["other_info"]  # Other info needed/collected

    # This is just an example that uses an AI model to process the audio file.
    # For example, you might get:
    #   fluency = 4.3, pronunciation = 2.4, range_score = 3.3,
    #   accuracy = 4.9, holistic = 4.0
    fluency, pronunciation, range_score, accuracy, holistic = ai_model.process(wav_file)

    return jsonify({
        "fluency": fluency,
        "pronunciation": pronunciation,
        "range": range_score,
        "accuracy": accuracy,
        "holistic": holistic,
    }), 200
```