Adding a score to each pre-annotation result on Label Studio

Mon, 09 Feb 2026 14:06:27 +0100

Label Studio is an annotation tool that comes in really handy when dealing with object detection datasets. A major feature in my workflow is the ability to upload “pre-annotations”, which are used when first opening a task: a draft is automatically created with all objects present in the pre-annotation.

To speed up labeling, I often use this pre-annotation feature to label all images using a zero-shot model (such as YOLO-World or SAM 3). Once I’ve annotated enough images, I train a first object detection model, run this model on the full dataset again, and import these predictions as pre-annotations.
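
As a rough illustration of the import step, here is a minimal sketch that turns model detections into a Label Studio task with one pre-annotation and a score on each result. It assumes a labeling config with a `RectangleLabels` tag named `label` attached to an `Image` tag named `image`, and an `image` key in the task data; the function name `to_label_studio_task`, the `detections` input layout, and the `draft-model` version string are hypothetical and only serve the example. Box coordinates are converted from pixels to percentages of the image size, which is what Label Studio expects for rectangles.

```python
import json


def to_label_studio_task(image_url, image_width, image_height, detections,
                         from_name="label", to_name="image",
                         model_version="draft-model"):
    """Build one Label Studio task dict holding a single prediction.

    `detections` is a hypothetical list of dicts with keys:
      "box":   (x_min, y_min, x_max, y_max) in pixels
      "label": class name
      "score": model confidence in [0, 1]
    """
    results = []
    for det in detections:
        x_min, y_min, x_max, y_max = det["box"]
        results.append({
            # Must match the tag names of your labeling config (assumed here).
            "from_name": from_name,
            "to_name": to_name,
            "type": "rectanglelabels",
            "original_width": image_width,
            "original_height": image_height,
            "value": {
                # Rectangle coordinates are percentages of the image size.
                "x": 100.0 * x_min / image_width,
                "y": 100.0 * y_min / image_height,
                "width": 100.0 * (x_max - x_min) / image_width,
                "height": 100.0 * (y_max - y_min) / image_height,
                "rotation": 0,
                "rectanglelabels": [det["label"]],
            },
            # Per-result score, one per detected object.
            "score": det["score"],
        })

    scores = [det["score"] for det in detections]
    return {
        "data": {"image": image_url},
        "predictions": [{
            "model_version": model_version,
            # Overall prediction score: here simply the mean of the object scores.
            "score": sum(scores) / len(scores) if scores else 0.0,
            "result": results,
        }],
    }


example_task = to_label_studio_task(
    "https://example.com/cat.jpg", 200, 100,
    [{"box": (20, 10, 120, 60), "label": "cat", "score": 0.9}],
)
print(json.dumps([example_task], indent=2))
```

The printed JSON list can be saved to a file and imported as tasks. Using the mean object score as the overall prediction score is a choice made for this sketch, not something the workflow above prescribes.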

Never Trust Your Dataset

Mon, 10 May 2021 12:23:40 +0200

Datasets are ubiquitous in machine learning. There is literally nothing to learn without datasets, labeled or unlabeled. Lack of datasets has impeded progress in NLP for low-resource languages: most academic work in NLP focuses on English and, to a lesser extent, on a handful of high-resource languages (Spanish, German, Japanese, French, …).

Recently, a diverse team of NLP researchers studied the quality of the web-crawled corpora that are behind most of the progress in NLP in the last few years (Caswell et al. 2021). More specifically, they studied three parallel corpora used for machine translation (CCAligned, ParaCrawl, WikiMatrix) and two monolingual corpora (OSCAR and mC4) used to train language-specific language models.

How many layers of my BERT model should I freeze? ❄️

Fri, 04 Dec 2020 15:56:37 +0100

Since the advent of the Transformer architecture (Vaswani et al. 2017) and of BERT models (Devlin et al. 2019), Transformer models have become ubiquitous in NLP, achieving SOTA results on most NLP datasets.

Before Sesame Street puppets flooded arXiv, the de facto method to train an NLP model leveraged word embeddings pre-trained using GloVe or word2vec. These word embeddings were used to initialize the first embedding layer of your model, and you just had to plug the rest of your architecture on top of this first layer.

Prodigy, a must-have in the Data Scientist toolbox

Sat, 07 Nov 2020 17:41:06 +0100

If you sometimes find yourself annotating data for machine learning projects and you’ve never heard of Prodigy, it’s definitely a tool you would be interested in.

The project was initiated by the makers of spaCy - the well-known Python NLP library - after they realized that while supervised learning works well, data collection was broken. To this day, many data collection projects still rely on Amazon Mechanical Turk, a crowdsourcing platform with low wages, questionable UX, and low incentives for quality. Given the major impact of data quality on the performance of supervised learning models, data collection is too important to be outsourced to Mechanical Turk. Whenever possible, annotations should be done in-house and at least partially by the scientist in charge of the research project. Involving the scientist ensures that any quality or annotation issues that could impact training are noticed beforehand.

About

Fri, 06 Nov 2020 15:33:01 +0100

I’m a machine learning engineer passionate about deep learning, machine learning and data science in general. I’m mainly focused on natural language processing, even if I enjoy computer vision as well.

Beyond data science, I’m fascinated by design and UX. I’m convinced that a good UX is sometimes a serious alternative to machine learning-based solutions.

I have a tendency to fall down the productivity-tool rabbit hole, so I may occasionally post articles about productivity tools from a data science perspective.