Introduction
At Tarteel, we’re building the world’s first AI-powered Quran application.
We enable Muslims to improve their recitation and memorization workflows with the Quran by using AI to follow along with their recitation, highlight any mistakes they might have made, and automatically track their progress.
The technology that powers Tarteel has gone through years of development until it reached a mature stage where it can be deployed in the manner it’s being used today. It’s no exaggeration that it took blood, sweat, and tears to bring this technology to fruition.
Quran recitation is unlike any other spoken language: it has multiple layers of vocal nuance that are extremely difficult to identify, even by ordinary humans. As a sacred text, it is paramount that the AI be accurate to avoid reinforcing incorrect recitation.
Many before us have tried to achieve what we have done, but couldn’t due to the resource or technological limitations of their time.
Despite all the obstacles that lay ahead of us, it is by the bounty of Allah - SWT - that our team was able to build and deliver this technology to the Muslim Ummah. The Speech-To-Text technology we’ve built is a gateway to many technical solutions that can uplift Muslims’ spirituality and bring them closer to their faith.
With this background in mind, I’m going to tell the story of our ML journey and dive a bit into the technical weeds, starting from the first steps, and highlighting our ML process till the date this article is published.
I also want to highlight what this article is and what it isn’t.
What this article is:
- A historical reference to the technical journey behind Tarteel’s Speech-To-Text technology.
- A resource on technical decisions made with context to help other hard-tech startups frame and scope their initiatives.
- A technical deep dive into what worked (and what didn’t) and what processes we put in place to achieve our goals.
What this article isn’t:
- A “Best Practices for ML in prod.” We are far from it, and I doubt there is a one-size-fits-all framework for ML startups (we don’t have that luxury like other SaaS companies, but I think HuggingFace is getting there!)
- An optimal path for developing your STT model for your next/hobby startup. Take from this article what meets your use case. Rarely should you copy it verbatim.
- Disregarding or calling out the work the OG Tarteel contributors did. While there were many sub-optimal decisions made (mostly by me 🙃), they made sense at the time given our background, expertise, and deadlines (we were all working in a volunteer capacity on this project).
This article will be broken up into multiple parts since there is a lot of information and it doesn’t make sense to distill it in one go.
The order of these sections is not necessarily the chronological order we went through. Instead, the article is split up into the logical sections of an end-to-end machine learning project. Sometimes we’d go back and forth between stages when we realized there were bugs or product changes that required refactoring the way we collect data, train models, etc.
Before we start, I’d like to acknowledge some important contributors who helped us along this journey:
- Abubakar Abid
- Abdulrahman Al-Fozan
- Ali El-Gabri
- Fahim Dalvi
- Hamzah Khan
- Hanad Sharmake
- Subha Pushpita
- Uzair Shafiq
And many others! If I missed you, you know how to reach me!
Data Collection
All deep learning systems start through the collection of their source of knowledge: data. Tarteel’s AI is no different.
Recently, SOTA self-supervised learning methods for ASR have come up, but they still require some form of data collection and annotation.
Tarteel actually started out as a data collection challenge, as part of the “Muslim Hacks 2.0” Hackathon in Millbrae, CA back in 2018. The OG hackers set out to build a platform that collects and curates Quran recitation data that can be used to train AI. The platform was called “Tarteel Challenge” as it was a challenge to collect 10,000 recitations of Quran verses from around the world.
How we did it
The first iteration of this platform was a very simple Django web app that would randomly select a verse from the Quran and display it for a user. The user would hit a record button, recite the verse, then hit the stop button. When done, they’d hit “Next verse” and repeat. Very minimalistic. No automation.
The web app was modeled after Mozilla’s Common Voice web site. It may or may not have been an almost exact ripoff 👀.
The (Django) data model for the recording was very simple as well. It looked something like this:
from django.db.models import CharField, DateTimeField, FileField, IntegerField, Model


class Recording(Model):
    # The audio file, stored in S3
    file = FileField(blank=True, null=True)
    # Chapter (Surah) and Verse (Ayah) numbers
    surah_num = IntegerField(blank=True, null=True)
    ayah_num = IntegerField(blank=True, null=True)
    # A hash to prevent naming collisions for files
    hash_string = CharField(max_length=32)
    # Continuous or Analog, more on that later
    recitation_mode = CharField(max_length=32, blank=True, null=True)
    timestamp = DateTimeField(auto_now_add=True)
    session_id = CharField(max_length=32)
It’s pretty self-explanatory, but if you want more info on Django field types, check out the docs.
The audio file had a naming scheme of <surah_num>_<ayah_num>_<hash_string>.wav.
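To make the naming scheme concrete, here is a minimal sketch of building and parsing such a file name. The helper names are hypothetical, and generating the hash with uuid4().hex is an assumption (it happens to fit the 32-character hash_string field); the original app may have produced the hash differently.

```python
import uuid


def recording_filename(surah_num: int, ayah_num: int, hash_string: str) -> str:
    """Build a name following <surah_num>_<ayah_num>_<hash_string>.wav."""
    return f"{surah_num}_{ayah_num}_{hash_string}.wav"


def parse_recording_filename(name: str) -> tuple[int, int, str]:
    """Recover (surah_num, ayah_num, hash_string) from a file name."""
    stem = name.rsplit(".", 1)[0]
    # Split at most twice so any underscores inside the hash are preserved.
    surah, ayah, hash_string = stem.split("_", 2)
    return int(surah), int(ayah), hash_string


# Assumption: a random 32-character hex string stands in for the hash.
example_hash = uuid.uuid4().hex
example_name = recording_filename(2, 255, example_hash)
```

Note that the scheme encodes exactly one verse per file, which is one of the assumptions that later turned out to be limiting.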
The recitation mode allowed us to know whether this recording was recorded as an individual verse or multiple verses. This was a toggle available to the user on the web site.
Instead of asking people to recite any verse of the Quran, we decided to exploit the structure of the Quran and hardcode the task for them. This way, we have an (almost) accurate transcript of what was being recited without doing a lot of manual labeling (an assumption we later realized was invalid).
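The idea of hardcoding the task can be sketched as follows: the server picks the verse, so the label is known before any audio is recorded. This is an illustration only, not the original implementation. The function name and the small VERSE_COUNTS table are assumptions made for the example (a real table would cover all 114 chapters).

```python
import random

# Verse counts for a handful of chapters, for illustration only.
VERSE_COUNTS = {1: 7, 112: 4, 113: 5, 114: 6}


def pick_random_verse(rng: random.Random) -> tuple[int, int]:
    """Pick a (surah_num, ayah_num) pair uniformly over all verses in the table."""
    all_verses = [
        (surah, ayah)
        for surah, count in VERSE_COUNTS.items()
        for ayah in range(1, count + 1)
    ]
    return rng.choice(all_verses)


# The prompted verse doubles as the (assumed) transcript of the recording.
surah_num, ayah_num = pick_random_verse(random.Random())
```

The weak point is the same one described above: the label is only as good as the assumption that the user recited exactly the verse they were shown.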
This was great for an MVP, but bring in any data scientist and they’d probably scold you for the lack of information/metadata, naming scheme, and format, most of which limited the ability to generalize and make future changes. Moreover, there was basically no input validation being done client or server side, which led to a bunch of problems in the future. There is also an interesting flaw in the business logic of the data model, which I'll describe in a few paragraphs.
Regardless of the technical issues, we hit 10k audio recordings in less than a month alhamdulilah and decided to “upgrade” the Tarteel Challenge to 2.0 where we asked people to collect 50,000 recitations.
We also realized that we needed more metadata about the reciters of the audio recordings, so we introduced a new model:
class DemographicInformation(Model):
    session_id = CharField(max_length=32, blank=True)
    # This could be used to store different platforms such as android,
    # ios, web if different identification methods are used for each one.
    platform = CharField(max_length=256, default="web")
    gender = CharField(max_length=32)
    # Recitation style
    qiraah = CharField(max_length=32, blank=True, null=True)
    age = CharField(max_length=32)
    ethnicity = CharField(max_length=32, blank=True, null=True)
    timestamp = DateTimeField(auto_now_add=True)
This was linked back to the recording using a foreign key relationship.
While this was a good amount of information on the reciter of the audio, we never properly surfaced a way for users to update this information, nor did we encourage them to do so.
We were basically left with a table that had a bunch of null values, and the number of users that actually updated their demographic information was very, very limited (no fault of their own though).
How we should have done it
If you're an experienced data scientist/ML practitioner and you give this data collection task a bit more thought, you can see multiple opportunities to improve the collection process. The bigger issue at hand though was actually the user interface! Something many ML/DS folks neglect all too often.
While the website was great and minimal, it was too minimal. Here are a few things we should’ve included that would’ve improved the data collection process.
Frontend
1. Request user demographic information on first visit
When a new user visits the site, we should prompt the user to enter their demographic information, with the option to skip entering their information. This would have significantly improved the amount of metadata we could collect for each recording.
2. Have a simple login interface
The implementation of the app relied on persistent sessions in the cookies to store user information (django-sessions). Unfortunately, this isn’t the best way to retain user info.
And no, we don’t have to use OAuth, Social Auth, or any of that fancy stuff.
The simplest way we could have gone about this was adding email login or exposing the session ID to the user so that they can enter it the next time they visit the site to populate their information.
3. Audio standardization
Something that caused issues later on is that the audio format we used to record was not standardized across devices. If we had invested a bit more time into the MediaStream API and done some server-side validation, we could have saved ourselves a lot of headache (more on that in the next section).
As a backend engineer, I hope I never have to deal with the trouble of handling all the audio library permutations on all the browsers, OSes, and devices out there.
Backend
1. Simplify the Recording model
The Recording model made way too many assumptions about the format of the recording and the fields:
- It assumed the recording was of a single verse
- It assumed the client would populate information correctly
- It did not use the proper fields (Boolean, UUID, etc.)
Instead, the model should’ve probably looked something like this:
import uuid

from django.contrib.postgres.fields import ArrayField
from django.db.models import (
    SET_NULL, BigAutoField, CharField, DateTimeField, FileField, ForeignKey,
    Model, PositiveIntegerField, TextChoices, UUIDField,
)


class Recording(Model):
    class RecitationModeTypes(TextChoices):
        CONTINUOUS = "CONTINUOUS"
        NON_CONTINUOUS = "NON_CONTINUOUS"

    pkid = BigAutoField(primary_key=True, editable=False)
    # We can use the UUID to generate a unique name for the audio file
    id = UUIDField(default=uuid.uuid4, editable=False, unique=True)
    # Use a custom Storage to separate the recordings from other files.
    # django-storages has an option to prevent file collisions by adding a hash
    # so use that instead of our own `hash_string`
    file = FileField(storage=MyS3Storage())
    # List of chapter/verse keys in the format they are in the Quran
    # (ChapterNum:VerseNum)
    verses = ArrayField(CharField(max_length=10))
    # Or if you have a "Verse" model
    # verses = ManyToManyField(Verse, related_name="recording")
    recitation_mode = CharField(max_length=32, choices=RecitationModeTypes.choices)
    created_at = DateTimeField(auto_now_add=True)
    updated_at = DateTimeField(auto_now=True)
    session_id = CharField(max_length=32)
    demographic = ForeignKey(
        DemographicInformation, on_delete=SET_NULL, null=True
    )
    duration_milliseconds = PositiveIntegerField(default=0)
Notice how we can now not only generate unique recordings that are queryable, but also extend the model to support future use cases.
The most important field to me is duration_milliseconds, because it allows us to know the size of our dataset without writing complex queries that require intensive file I/O operations.
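As a rough sketch of how such a field could be populated at upload time, the snippet below reads the duration of a WAV file with Python's standard wave module and sums stored durations into hours. Both function names are hypothetical, and the sketch assumes the upload is valid WAV data; other containers would need a different reader.

```python
import io
import wave


def wav_duration_milliseconds(data: bytes) -> int:
    """Return the duration of in-memory WAV data, rounded to the nearest ms."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    return round(frames * 1000 / rate)


def total_hours(durations_ms: list[int]) -> float:
    """Dataset size in hours, computed from stored per-recording durations."""
    return sum(durations_ms) / 3_600_000
```

With the duration stored on each row, the dataset size becomes a single database aggregate (in Django, something along the lines of Recording.objects.aggregate(total=Sum("duration_milliseconds"))) instead of a pass over every file in storage.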
We could also potentially extend the data model to include more audio metadata (such as channels, sampling rate, bitrate, etc.). However, I think those fields are not needed in the database because there’s no use case for querying them.
They’re almost exclusively used when the data is being processed at train/pre-processing time.
If we were a big company aggregating audio from 10+ different sources, then I could make the argument for including that metadata along with other fields.
2. Input sanitization
While this applies to the frontend as well, it’s much more important when done server side.
Initially, we did no input sanitization whatsoever - which of course led to a lot of errors, mostly business/logical ones. More importantly though, we should’ve done some basic validation on the integrity of the audio file.
Later on when we wanted to process the data for training we realized:
- A lot of audio files were empty files
- All files were named with a wav extension, but the audio file was actually an mp4 file
- Some files had no sound/data and/or were corrupted somehow.
Had we done input sanitization when we received the audio files, our API might have been a bit slower, but it would have saved us a lot of headache in the future.
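A minimal sketch of what such a check could look like, using only the file header: WAV files start with a RIFF header whose form type is WAVE, and MP4 files normally carry an ftyp box at bytes 4 to 8. The function names are hypothetical, and this is an illustration rather than the validation we actually run. It catches empty files and mismatched extensions, but not silent or partially corrupted audio, which requires decoding the file.

```python
def sniff_audio_container(data: bytes) -> str:
    """Classify raw bytes as 'empty', 'wav', 'mp4', or 'unknown' from the header."""
    if not data:
        return "empty"
    # WAV: a RIFF container whose form type is WAVE.
    if len(data) >= 12 and data[0:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # MP4 / ISO base media: the first box is normally 'ftyp' at bytes 4..8.
    if len(data) >= 8 and data[4:8] == b"ftyp":
        return "mp4"
    return "unknown"


def validate_upload(filename: str, data: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    kind = sniff_audio_container(data)
    if kind == "empty":
        problems.append("file is empty")
    elif kind == "unknown":
        problems.append("unrecognized audio container")
    elif not filename.lower().endswith("." + kind):
        problems.append(f"extension does not match content ({kind})")
    return problems
```

Rejecting (or at least flagging) an upload when this list is non-empty would have surfaced the first two problems above at request time rather than at training time.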
How we do it now
I won’t go into too much detail about the inner workings of the current system for the sake of user security, privacy, and our IP (maybe in the future, or if I get sign-off that it’s fine).
But the high level approach is mainly based off how users currently use the app.
Whenever someone starts a recitation session in Tarteel, we maintain a buffer of their audio in-memory. Every 20 seconds, we upload a portion of the audio and link it to their current session and user in the backend.
Why 20 seconds? This is a long enough timeframe that there’s a good amount of audio in it for us to annotate in the future, but not so long that it might overload our server or upload times.
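To illustrate the general buffering idea (and only the idea, since the real system is not described here), the sketch below accumulates raw PCM bytes and releases fixed-duration chunks. The class name, the 16 kHz, 16-bit, mono defaults, and the feed/flush interface are all assumptions made for the example. Under those defaults a 20-second chunk is 16000 × 2 × 1 × 20 = 640,000 bytes.

```python
class AudioChunker:
    """Buffer raw PCM bytes and release fixed-duration chunks."""

    def __init__(self, sample_rate=16000, sample_width=2, channels=1, chunk_seconds=20):
        # Number of bytes that make up one chunk of `chunk_seconds` seconds.
        self.chunk_bytes = sample_rate * sample_width * channels * chunk_seconds
        self._buffer = bytearray()

    def feed(self, pcm: bytes) -> list[bytes]:
        """Add audio; return any complete chunks that are ready to upload."""
        self._buffer.extend(pcm)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[: self.chunk_bytes]))
            del self._buffer[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever audio is left when the session ends."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

A caller would invoke feed() as audio arrives, upload each returned chunk along with the session and user identifiers, and call flush() once when the session ends.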
When a user requests a copy of their file, we stitch it together for them and keep a record of the stitched recording so they can share it with others.
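Stitching can be sketched with Python's standard wave module, assuming every chunk is a WAV file with the same channel count, sample width, and sample rate. This is a stand-in for illustration, not our production code, and both function names are hypothetical.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, sample_width: int = 2, channels: int = 1) -> bytes:
    """Wrap raw PCM bytes in a WAV container."""
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return out.getvalue()


def stitch_wav_chunks(chunks: list[bytes]) -> bytes:
    """Concatenate WAV chunks that share one audio format into a single WAV."""
    if not chunks:
        raise ValueError("no chunks to stitch")
    expected = None
    frames = []
    for chunk in chunks:
        with wave.open(io.BytesIO(chunk), "rb") as reader:
            fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
            if expected is None:
                expected = fmt
            elif fmt != expected:
                raise ValueError("chunks have different audio formats")
            frames.append(reader.readframes(reader.getnframes()))
    channels, sample_width, sample_rate = expected
    return pcm_to_wav(b"".join(frames), sample_rate, sample_width, channels)
```

Refusing to stitch chunks with mismatched formats is a deliberate choice in this sketch: silently concatenating audio recorded at different sample rates would produce a file that plays back at the wrong speed.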
The contents of the file are verified using PyDub and the request is sanitized using Django Rest Framework serializers.
I'll update this post with a link to the next part - Annotation - once it's ready!