One of the biggest professional changes this year was that I started appreciating machine learning again. Sometime during my undergrad, I remember getting annoyed at the field in general - the pace of it, how much of it seemed random, and how, despite the amazing results many ML pipelines produced, I was losing interest in the underlying logic. I learnt two things this year that changed my view (neural radiance fields and score distillation sampling) and helped me better appreciate ML's raw potential. More on those two things later.
ML Education is overly structured
The underlying issue, I think, is that I followed an overly structured view of ML - one repeated a thousand times in almost identical phrasing across the internet: you have a dataset, you want to fit a model on that dataset to predict some value or generate new, similar data, and you want to avoid the awful overfitting problem with a bunch of tricks. Hear machine learning pipelines described this way a thousand times, and you start to think that this is all machine learning truly is:
- you need a dataset
- you need to have a validation set
- you **need** to train in epochs
- … etc
I remember a teaching assistant once even saying that when you think of images in ML, you should think of convolutional neural networks (CNNs). I probably repeated that when TAing myself. That's a narrow view.
These are essential components of most ML pipelines, for sure - but when motivated without any alternatives, they can be limiting. This is where the first thing I learnt this year comes in: neural radiance fields.
The problem: build a 3D model of objects or general environments from pictures captured on phones or cameras.
If asked before how ML might factor into this, I would've said something like: "build a massive 3D dataset, and train a model that accepts a set of images as input and somehow predicts the 3D model". That's definitely a solid idea with many benefits, but it's still a conventional way of looking at ML and far from the only way of tackling the problem. That pipeline answers the question: given a set of images, what does the 3D model look like? (There are non-ML methods to answer this question too, but we're focusing on ML here.) Is that the only question we can ask to solve our problem, though? What I appreciate about the NeRF technique is that it tackles a different but equivalent question:
Q: Given an (x, y, z) 3D coordinate, what is the colour of that point (or is it just empty)?
Naturally, if I ask this question a million times - for all 3D coordinates in some grid - I have all the information I need to build a 3D model; if I tell you the colour of every pixel in an image, you can build the image back, and similarly for 3D. There's still the question of how to aggregate this information, but computer graphics has a solution for that: volume rendering. Volume rendering simply aggregates 3D information along your line of sight to produce the actual colour that you see. For instance, when looking at an apple, you'll see the red skin of the apple but not its white interior (the red skin is closer to you than the white interior along that line of sight).
Credit: Jon Barron
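The aggregation along a line of sight can be sketched as standard front-to-back alpha compositing. This is a minimal NumPy toy, not NeRF's actual implementation - the sample values (a dense red "skin" in front of a white "interior") are made up to mirror the apple example:

```python
import numpy as np

def composite_along_ray(colors, densities, deltas):
    """Alpha-composite samples along one ray, front to back.

    colors:    (N, 3) RGB at each sample point along the ray
    densities: (N,)   volume density at each sample point
    deltas:    (N,)   distance between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)            # opacity of each segment
    # Transmittance: how much light survives to reach sample i
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)        # final pixel colour

# Dense red sample in front of a white one: the red skin dominates the pixel,
# because almost no light survives past the first dense sample.
colors = np.array([[1.0, 0.0, 0.0],    # red "skin"
                   [1.0, 1.0, 1.0]])   # white "interior"
densities = np.array([50.0, 50.0])
deltas = np.array([0.1, 0.1])
pixel = composite_along_ray(colors, densities, deltas)
```

The white interior contributes almost nothing to `pixel`, exactly the occlusion effect described above.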
Now, to answer this question, we naturally need supervision of a different form - since the question takes a 3D coordinate as input and produces a colour as output, we need supervision that maps between the two. It turns out we can do away with 3D supervision entirely and simply use the input images as supervision, via volume rendering. I defer a deep explanation of this concept to other resources (such as https://dtransposed.github.io/blog/2022/08/06/NeRF/) but, effectively, training a NeRF involves:
- Asking what the colour of (x, y, z) coordinates is across a grid
- Aggregating this information from the viewpoint of the camera that we took our picture with (volume rendering)
- Comparing how well this aggregate / render of the 3D model aligns with the picture taken (the actual supervision for our model)
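The three steps above can be sketched with a drastically simplified 1D analogue - this is my own toy, not NeRF's architecture: the "scene" is one greyscale value per cell, a "render" just averages the cells a ray crosses, and the only supervision is the rendered pixel values, never the cells themselves:

```python
import numpy as np

# Toy 1D analogue of NeRF training. We never observe the per-cell values
# directly; we only observe "pictures" (rendered ray averages) of the scene.
rng = np.random.default_rng(0)
cells = rng.random(4)                     # the "model": one value per coordinate

rays = [np.array([0, 1]),                 # which cells each camera ray crosses
        np.array([1, 2]),
        np.array([2, 3])]
observed = np.array([0.2, 0.5, 0.8])      # the pictures we took of the scene

lr = 0.5
for _ in range(500):
    for ray, pixel in zip(rays, observed):
        render = cells[ray].mean()                 # query coords + aggregate
        grad = 2 * (render - pixel) / len(ray)     # d(MSE)/d(cell) on this ray
        cells[ray] -= lr * grad                    # compare to picture, update
```

After training, each rendered ray matches its observed pixel, even though no cell was ever supervised directly - the same trick, in miniature, that lets NeRF learn a 3D representation from 2D images.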
As a result, the ML pipeline ends up looking quite different from the prototypical pipeline:
- This pipeline has no real dataset of 3D models, but you are still teaching a model to answer a question - so it is machine learning. Further, the fact that there is no dataset makes it self-supervised, in a way.
- The domain of the ML model is different from that of the input data. We were given pictures of a 3D scene, but the ML model ended up taking 3D coordinates as input. This shows how reframing a problem can entirely change how we approach it.
- There is no real generalization in this model. Since the question is only about 3D coordinates, it's usually impractical to train a single model to represent every possible 3D object (there's also the question of what coordinate system you'd use). Effectively, the neural network used in NeRF ends up representing a single object.
- Many classic ML ideas don’t fit into this context - such as data augmentation (why would it help?). Other concepts such as overfitting and underfitting take on slightly different meanings here.
All this is to say that if I had to teach machine learning today, I would probably motivate it simply as the art of function approximation. The first step is identifying what function you want to approximate (this should be an actual exercise too!) and why in the first place - doing so would also let one understand ML in domains beyond generative modelling (such as NeRF, which brings physics-inspired ideas into machine learning via rendering). Without such motivation, I fear that ML becomes synonymous with data-driven ML. Still, this pipeline resembles a conventional ML pipeline in that there is a loss function and some data to work with (the input images). We can break that assumption too.
Code Restricts Thinking
When learning about optimization in ML, you typically start by implementing the forward pass and the backward pass manually, to get a sense of how gradients are computed and to understand the role of autograd. However, after relying on PyTorch's autograd for a long while, I internalized the idea that you needed to define a loss function to differentiate in order to update your parameters. What if you could skip the loss function and directly provide the gradient?
That's where score distillation sampling from DreamFusion (https://dreamfusion3d.github.io/) and Score Jacobian Chaining (https://pals.ttic.edu/p/score-jacobian-chaining) come in. The two papers tackle generating 3D objects from text prompts alone - describe a squirrel and obtain a 3D squirrel. Refer to the papers for an explanation of how they do this, but for this discussion, note that they use diffusion models, which naturally predict the gradient of a distribution's log-density with respect to an input (the score). This means that there is no real loss function to work with - just the gradient, directly!
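The "no loss, just a gradient" idea can be sketched in a few lines. This is a deliberately tiny stand-in, not the actual method: where DreamFusion queries a pretrained image diffusion model for the score, here a hand-written Gaussian score (centred at a made-up `mu = 3.0`) plays that role:

```python
import numpy as np

# Hypothetical "pretrained model": the score of a Gaussian, i.e. the gradient
# of log p(x). We never write down p(x) or any loss - only its gradient.
mu, sigma = 3.0, 1.0

def score(x):
    return -(x - mu) / sigma**2    # grad of log-density; no loss function exists

x = np.float64(-5.0)               # the "parameter" being optimized
lr = 0.1
for _ in range(200):
    x = x + lr * score(x)          # gradient ascent using the supplied gradient
```

The parameter converges toward the high-density region (`mu`) even though no loss was ever defined or differentiated - the gradient was simply handed to the update rule, which is the spirit of score distillation.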
Further, you typically need a dataset in machine learning - but DreamFusion just uses another pretrained model to generate 3D objects!
There are also recent works that question whether the model you optimize needs to remain static throughout - what if you could manually tweak parts of it, remove parts, or extend it, and then continue your optimization? (Gaussian Splatting - https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/) That's also something that code tutorials don't naturally encourage! When your perception of ML is directly tied to the conventions of a certain software framework, it's too easy to think that's how things **have to** be. But the past year of research has shown me that many of these things are just choices - sometimes principled, sometimes arbitrary. Truly understanding the fundamentals means being able to envision possibilities that these frameworks don't encourage. That's pretty exciting.
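To make the "edit the model mid-optimization" idea concrete, here is a toy in the loose spirit of Gaussian Splatting's densification - heavily simplified and entirely my own construction (the target curve, bump width, and schedule are made up): fit a curve with a growing list of Gaussian bumps, adding a new bump wherever the residual is currently worst and simply continuing the same loop:

```python
import numpy as np

xs = np.linspace(0, 1, 100)
target = np.sin(2 * np.pi * xs)

centers, heights = [0.5], [0.0]     # the mutable "model": a plain Python list

def predict():
    return sum(h * np.exp(-((xs - c) ** 2) / 0.005)
               for c, h in zip(centers, heights))

lr = 0.5
for step in range(300):
    residual = predict() - target
    # Gradient descent on bump heights (centres kept fixed for simplicity)
    for i, c in enumerate(centers):
        basis = np.exp(-((xs - c) ** 2) / 0.005)
        heights[i] -= lr * (residual * basis).mean()
    # "Densify": periodically grow the model where the error is largest,
    # then keep optimizing as if nothing happened.
    if step % 30 == 0 and len(centers) < 12:
        centers.append(xs[np.abs(residual).argmax()])
        heights.append(0.0)
```

Because the model is just data (two lists), growing it mid-run is trivial - nothing about gradient descent requires the parameter set to be fixed up front, even though most framework tutorials implicitly assume it is.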