Type hints first appeared in Python 3.5, and libraries like Pydantic soon followed, making it easy to use type hints for data validation. Any API involves a fair amount of data validation, so it wasn't long before frameworks like FastAPI appeared that use Pydantic to handle validation through type annotations.

Django Ninja is so far the most compelling library to offer type-based validation in the Django world.

In this post we'll build a movie API using the IMDb non-commercial dataset. We'll use Django Ninja to create the API and to validate requests.

The end goal is to have an API that can answer questions like:

  • What are the highest rated movies in a specific genre?
  • What was the best movie in a given year?
  • Given a specific writer, what other movies did this writer work on?

Setting up the Django Project

In this section we'll set up the Django project and verify that all prerequisites are in place.

Begin by using the django-admin tool to create a new Django project.


django-admin startproject moviedex

Now create a new application within the moviedex directory.


cd moviedex
python3 manage.py startapp movie

Next create a file named requirements.txt in the project root.


Django==4.2.4
django-ninja==0.22.2
psycopg2-binary==2.9.1

We're going to use Docker and Compose to make getting started easier. Start by creating a new file named Dockerfile in the project root and entering the following.


FROM python:3.8

ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

WORKDIR /usr/src/app

COPY requirements.txt /usr/src/app/
RUN pip install --no-cache-dir -r requirements.txt

COPY . /usr/src/app/

Finally, create a file named docker-compose.yml in the project root with the following.


version: '3.8'

services:
  web:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: movie_app
    command: python manage.py runserver 0.0.0.0:8000
    volumes:
      - ${PWD}:/usr/src/app
    ports:
      - "8000:8000"
    env_file:
      app.env

  database:
    image: postgres:15
    restart: always
    volumes:
      - pg_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    env_file:
      app.env

volumes:
  pg_data:
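Both services read their credentials from a file named app.env, which the Compose file references but the post never shows. Here's a minimal example to get started (the variable names match the settings.py lookups; the values are placeholders, and POSTGRES_HOST is the database service name from the Compose file):

```
POSTGRES_DB=moviedex
POSTGRES_USER=moviedex
POSTGRES_PASSWORD=changeme
POSTGRES_HOST=database
POSTGRES_PORT=5432
```

The postgres image uses the POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD variables to initialize the database on first start, so sharing one env file between both services keeps the credentials in sync.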

Since we're using PostgreSQL for this project, you'll need to edit the settings.py file to update the connection settings.


import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("POSTGRES_DB", ""),
        "USER": os.environ.get("POSTGRES_USER"),
        "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        "HOST": os.environ.get("POSTGRES_HOST", ""),
        "PORT": os.environ.get("POSTGRES_PORT", ""),
    }
}

You should now be able to use Compose to bring up the app and database containers.


docker-compose up -d

Downloading the IMDb Movie Data

The IMDb dataset is available as a group of compressed, tab-delimited files. Since we're just interested in movies, we only need to work with 4 of the 7 available files.

The name.basics.tsv file contains data on anyone involved in a movie production, including actors, directors, writers, and composers. The title.basics.tsv file contains the movie titles and limited metadata about each of the movies. The title.ratings.tsv file is fairly self-explanatory. Finally, title.principals.tsv links the names and titles together.
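One quirk worth knowing before writing any import code: IMDb marks missing values with the literal two-character string \N. Here's a quick sketch of how a row parses with the csv module (the first sample row mirrors the dataset's real first record; the second is invented):

```python
import csv
import io

# Two sample rows in the name.basics.tsv format; \N marks a missing value.
sample = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000001\tFred Astaire\t1899\t1987\n"
    "nm9999999\tExample Person\t\\N\t\\N\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)

# "\N" arrives as the literal string backslash-N, not as None, so the
# import code has to translate it explicitly before hitting the database.
birth_years = [
    None if row["birthYear"] == "\\N" else int(row["birthYear"])
    for row in rows
]
print(birth_years)  # [1899, None]
```

This explicit translation shows up repeatedly in the import command below.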

If you're working on Linux, you can use the following script to download and decompress the data files.


#!/bin/bash

# List of URLs to download
urls=(
  "https://datasets.imdbws.com/name.basics.tsv.gz"
  "https://datasets.imdbws.com/title.basics.tsv.gz"
  "https://datasets.imdbws.com/title.principals.tsv.gz"
  "https://datasets.imdbws.com/title.ratings.tsv.gz"
)

# Loop through each URL and download and decompress it
for url in "${urls[@]}"; do
  echo "Downloading $url ..."

  # Download the file using wget
  wget "$url"

  # Extract the filename from the URL
  filename=$(basename "$url")

  # Decompress the downloaded gzip file
  echo "Decompressing $filename ..."
  gzip -d "$filename"

  echo "$filename has been downloaded and decompressed."
done

echo "All files have been downloaded and decompressed."

Implementing the Database Models

Next we'll implement the database models to represent the movies, people, and their relationships. The Name model corresponds to the name.basics.tsv file and the Movie model to the title.basics.tsv file. The schema includes a separate Genre model because each movie can have many associated genres.

The MovieCredit model serves to tie it all together, mapping names to movies, and includes what role they played in the movie production.

The original IMDb ID values are imported as well, to make it easier to populate the MovieCredit model later on. Without preserving these ID values we'd either have to keep everything in memory during the import, or do a huge number of database lookups, which would be far too slow.


from django.db import models

ROLE_CHOICES = (
    ("director", "Director"),
    ("producer", "Producer"),
    ("actor", "Actor"),
    ("writer", "Writer"),
)

class Genre(models.Model):
    id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=255, unique=True)

class Name(models.Model):
    id = models.AutoField(primary_key=True)
    imdb_id = models.CharField(max_length=16, unique=True, null=True)
    name = models.CharField(max_length=255)
    birth_year = models.IntegerField(null=True, blank=True)
    death_year = models.IntegerField(null=True, blank=True)

class Movie(models.Model):
    id = models.AutoField(primary_key=True)
    imdb_id = models.CharField(max_length=16, unique=True, null=True)
    title = models.CharField(max_length=512)
    release_year = models.IntegerField(null=True)
    genres = models.ManyToManyField(Genre, related_name="movies")

class Rating(models.Model):
    id = models.AutoField(primary_key=True)
    imdb_movie_id = models.CharField(max_length=16, db_index=True, null=True)
    rating = models.FloatField(null=True)
    num_ratings = models.IntegerField(null=True)

class MovieCredit(models.Model):
    id = models.AutoField(primary_key=True)

    imdb_movie_id = models.CharField(max_length=16, db_index=True, null=True)
    imdb_name_id = models.CharField(max_length=16, db_index=True, null=True)
    role = models.CharField(max_length=20, choices=ROLE_CHOICES, null=True)

Importing the IMDb Data

We'll create a custom Django management command to perform the data import. Before running it, make sure the movie app is listed in INSTALLED_APPS and that you've generated and applied migrations for the models above (python manage.py makemigrations movie followed by python manage.py migrate, run inside the web container).

To give an idea of scale, the movie_name table will end up containing 12,850,210 names. After limiting the import to movies released in 1980 or later, the movie_movie table will have 492,064 rows.

The import performs batched inserts with the bulk_create method as an optimization, but it's still a time-consuming process. Your mileage may vary, but it took around 10 minutes on my hardware.
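Every importer in the command follows the same pattern: accumulate rows, flush with bulk_create once 5,000 have piled up, then flush whatever remains. Factored out on its own, the pattern looks like this (a sketch for illustration; the command below inlines the logic rather than reusing a helper):

```python
def batches(rows, size=5000):
    """Yield lists of at most `size` items from any iterable.

    Mirrors the flush-every-5000 pattern used by the import command:
    each yielded list would be handed to bulk_create.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# e.g. 12 items in batches of 5 produce batches of sizes 5, 5, and 2
sizes = [len(b) for b in batches(range(12), size=5)]
print(sizes)  # [5, 5, 2]
```

Batching matters here because a separate INSERT per row would mean tens of millions of round trips to PostgreSQL.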

To add the management command, first create the management directory within the movie app.


mkdir -p movie/management/commands
touch movie/management/__init__.py movie/management/commands/__init__.py

Now we'll create the import_imdb_data.py script inside movie/management/commands.


from collections import defaultdict
import csv

from django.core.management.base import BaseCommand

from movie.models import Genre, Rating, Movie, Name, MovieCredit


class Command(BaseCommand):
    help = "Import IMDb data into the database."

    def handle(self, *args, **options):
        self.import_names("name.basics.tsv")
        self.import_movies("title.basics.tsv")
        self.import_principals("title.principals.tsv")
        self.import_ratings("title.ratings.tsv")

        self.stdout.write(self.style.SUCCESS("Successfully imported IMDb data!"))

    def import_names(self, file_name):
        self.stdout.write("Importing names...")

        total_names = 0
        with open(file_name, "r", encoding="utf-8") as tsvfile:
            reader = csv.DictReader(tsvfile, delimiter="\t")

            name_batch = []
            for row in reader:
                birth_year = (
                    None if row["birthYear"] == "\\N" else int(row["birthYear"])
                )
                death_year = (
                    None if row["deathYear"] == "\\N" else int(row["deathYear"])
                )

                name_batch.append(
                    Name(
                        imdb_id=row["nconst"],
                        name=row["primaryName"],
                        birth_year=birth_year,
                        death_year=death_year,
                    )
                )

                if len(name_batch) >= 5000:
                    Name.objects.bulk_create(name_batch)
                    total_names += len(name_batch)

                    self.stdout.write(f"Processed {total_names} names.")
                    name_batch = []

            if name_batch:
                Name.objects.bulk_create(name_batch)

    def import_movies(self, file_name):
        self.stdout.write("Importing movies...")

        movie_to_genre = defaultdict(list)
        total_movies = 0
        with open(file_name, "r", encoding="utf-8") as tsvfile:
            reader = csv.DictReader(tsvfile, delimiter="\t")

            movie_batch = []
            genre_records = {}
            movie_id = 1

            for row in reader:
                if row["titleType"] != "movie":
                    continue

                release_year = (
                    None if row["startYear"] == "\\N" else int(row["startYear"])
                )
                if release_year and release_year < 1980:
                    continue

                movie_batch.append(
                    Movie(
                        id=movie_id,
                        imdb_id=row["tconst"],
                        title=row["primaryTitle"],
                        release_year=release_year,
                    )
                )

                genres = row["genres"].split(",")
                for genre in genres:
                    genre = genre.strip()
                    if genre == "\\N":
                        continue

                    if genre not in genre_records:
                        genre_records[genre] = Genre.objects.create(name=genre)
                    movie_to_genre[movie_id].append(genre_records[genre])

                if len(movie_batch) >= 5000:
                    Movie.objects.bulk_create(movie_batch)
                    total_movies += len(movie_batch)

                    self.stdout.write(f"Processed {total_movies} movies.")
                    movie_batch = []

                movie_id += 1

            if movie_batch:
                Movie.objects.bulk_create(movie_batch)

        # Link genres through the M2M table in bulk, rather than one
        # Movie.objects.get() per movie, which would mean ~500k queries.
        MovieGenre = Movie.genres.through
        MovieGenre.objects.bulk_create(
            [
                MovieGenre(movie_id=movie_id, genre_id=genre.id)
                for movie_id, genres in movie_to_genre.items()
                for genre in genres
            ],
            batch_size=5000,
        )

    def import_principals(self, file_name):
        self.stdout.write("Importing principals...")

        total_credits = 0
        with open(file_name, "r", encoding="utf-8") as tsvfile:
            reader = csv.DictReader(tsvfile, delimiter="\t")

            credit_batch = []
            for row in reader:
                tconst = row["tconst"]
                nconst = row["nconst"]
                role = row["category"]

                if role == "actress":
                    role = "actor"

                if role not in ("actor", "writer", "producer", "director"):
                    continue

                credit_batch.append(
                    MovieCredit(
                        imdb_movie_id=tconst,
                        imdb_name_id=nconst,
                        role=role,
                    )
                )

                if len(credit_batch) >= 5000:
                    MovieCredit.objects.bulk_create(credit_batch)
                    total_credits += len(credit_batch)

                    self.stdout.write(f"Processed {total_credits} movie credits.")
                    credit_batch = []

            if credit_batch:
                MovieCredit.objects.bulk_create(credit_batch)

    def import_ratings(self, file_name):
        self.stdout.write("Importing ratings...")

        total_ratings = 0
        with open(file_name, "r", encoding="utf-8") as tsvfile:
            reader = csv.DictReader(tsvfile, delimiter="\t")

            rating_batch = []
            for row in reader:
                tconst = row["tconst"]
                rating = row["averageRating"]
                num_votes = row["numVotes"]

                rating_batch.append(
                    Rating(
                        imdb_movie_id=tconst,
                        rating=rating,
                        num_ratings=num_votes,
                    )
                )

                if len(rating_batch) >= 5000:
                    Rating.objects.bulk_create(rating_batch)
                    total_ratings += len(rating_batch)

                    self.stdout.write(f"Processed {total_ratings} ratings.")
                    rating_batch = []

            if rating_batch:
                Rating.objects.bulk_create(rating_batch)

Make sure all of the decompressed IMDb data files are present in the project root before running the import. To execute the import management command, run the following.


docker exec -it movie_app python manage.py import_imdb_data

Give it around 10 minutes, and you should have a fully populated database.

Writing the Django Ninja API

Now that the database is populated, we'll write the API using Django Ninja.

Here's an example of a query we should be able to make.


curl "http://localhost:8000/api/movies?rating=9.2&num_ratings=2000&sort_by=num_ratings&sort_dir=asc"

This query requests movies that have at least a 9.2 rating with 2000 or more votes, and sorts them according to the number of votes. We can also filter on movie title, release year, and more.

We'll start by writing the view function and dig into each part as we go.


@api.get("/movies", response=List[MovieSchema])
@paginate(PageNumberPagination)
def movies(
    request, filters: MovieFilterSchema = Query(...), sorting: MovieSorting = Query(...)
):
    movies = Movie.objects.exclude(rating__isnull=True)
    movies = filters.filter(movies)

    sort_by = sorting.sort_by
    sort_dir = sorting.sort_dir

    direction = "" if sort_dir == SortDirection.ASC else "-"

    if sort_by == MovieSortBy.TITLE:
        movies = movies.order_by(f"{direction}title")
    elif sort_by == MovieSortBy.RELEASE_YEAR:
        movies = movies.order_by(f"{direction}release_year")
    elif sort_by == MovieSortBy.NUM_RATINGS:
        movies = movies.order_by(f"{direction}rating__num_ratings")
    elif sort_by == MovieSortBy.RATING:
        movies = movies.order_by(f"{direction}rating__rating")

    return movies
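One thing to note: the rating__isnull and rating__rating lookups above require an actual relation between Movie and Rating, while the models shown earlier only share IMDb ID strings. One way to bridge that gap (an assumption on my part, not something the schema above defines) is to add a nullable OneToOneField from Rating to Movie with related_name="rating" and populate it once after the import. The matching itself is a plain dict lookup:

```python
def match_ratings(imdb_to_pk, rating_pairs):
    """Pair each (rating_pk, imdb_movie_id) with its Movie primary key.

    Ratings whose title never made it into the Movie table (e.g.
    pre-1980 movies skipped during import) have no match and are dropped.
    """
    return [
        (rating_pk, imdb_to_pk[imdb_id])
        for rating_pk, imdb_id in rating_pairs
        if imdb_id in imdb_to_pk
    ]

# With a hypothetical `movie` OneToOneField on Rating in place, a
# one-off linking pass could feed this from the ORM (sketch only):
#   imdb_to_pk = dict(Movie.objects.values_list("imdb_id", "id"))
#   pairs = Rating.objects.values_list("id", "imdb_movie_id")
#   matched = match_ratings(imdb_to_pk, pairs)
#   ...then bulk_update the movie_id column from `matched`.

print(match_ratings({"tt1": 10}, [(1, "tt1"), (2, "tt2")]))  # [(1, 10)]
```

This is the same trick the import relies on: preserving the IMDb IDs lets us stitch tables together after the fact instead of holding everything in memory.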

The first thing to notice is the @api.get decorator. This decorator defines the function as a view, specifies the URL path, and sets the shape of the expected response.

The List[MovieSchema] response argument uses type annotations to define that this view returns a list of movie objects. That's why we can simply return movies in the function without worrying about serialization logic.

Schemas are a Django Ninja concept that define the expected attributes in a response. The ModelSchema subclass makes it easy to serialize Django models.


class MovieSchema(ModelSchema):
    rating: float = Field(None, alias="rating.rating")
    num_ratings: int = Field(None, alias="rating.num_ratings")

    class Config:
        model = Movie
        model_fields = ("id", "imdb_id", "title", "release_year")

Most fields can be defined as part of the model_fields attribute. More complex fields, such as rating and num_ratings, are defined as class attributes using Field objects. Here, the Field aliases pull in data from the related Rating model.

The @paginate(PageNumberPagination) decorator not only adds pagination, but specifies the kind of pagination we want. In this case, we specify that we'd like to pass in page numbers, as opposed to limit and offset values.

The arguments to the function itself define the expected query parameters. The MovieFilterSchema looks similar to the MovieSchema but defines the expected input instead of the output.


class MovieFilterSchema(FilterSchema):
    title: Optional[str] = Field(None, q="title__icontains")
    rating: Optional[float] = Field(None, q="rating__rating__gte")
    num_ratings: Optional[int] = Field(None, q="rating__num_ratings__gte")
    release_year: Optional[int] = None

Here we use Fields to customize filtering behavior.

The title field, for example, is set to perform a contains search instead of an exact match. The release_year attribute, however, is fine as a standard field and doesn't require any customization. Every field is marked Optional since the end user isn't required to add filters to their query.

By inheriting from the FilterSchema class, we get filtering logic for free. When filters.filter(movies) is called, Django Ninja converts the query parameters into ORM filters that are added to our existing query.

The MovieSorting class defines the sort_by and sort_dir query parameters and their expected types.


class MovieSortBy(enum.Enum):
    TITLE = "title"
    RELEASE_YEAR = "release_year"
    NUM_RATINGS = "num_ratings"
    RATING = "rating"

class SortDirection(enum.Enum):
    ASC = "asc"
    DESC = "desc"

class MovieSorting(Schema):
    sort_by: MovieSortBy = MovieSortBy.TITLE
    sort_dir: SortDirection = SortDirection.DESC

Unlike the FilterSchema, however, we must add the logic for sorting to the function itself.
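One piece the snippets above take for granted is the glue: a NinjaAPI instance for the @api.get decorator to hang off, and a URL configuration entry that mounts it. A minimal sketch follows (the movie/api.py layout is my assumption, not prescribed anywhere above):

```python
# movie/api.py -- the imports the earlier snippets rely on
import enum
from typing import List, Optional

from ninja import Field, FilterSchema, ModelSchema, NinjaAPI, Query, Schema
from ninja.pagination import PageNumberPagination, paginate

from movie.models import Movie

api = NinjaAPI()

# ...MovieSchema, MovieFilterSchema, MovieSorting, and the movies
# view from this post follow here...
```

The API is then mounted in moviedex/urls.py with path("api/", api.urls), which puts the endpoint at /api/movies and Django Ninja's interactive docs at /api/docs.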

Conclusion

I hope this has been a fun example of Django Ninja in action! My own experience with it has been great. It's easy to use and integrates seamlessly with Django.

Of course, there's much more to building a solid API, such as authentication and rate limiting, but hopefully this has been a useful introduction to creating APIs with Django Ninja.