Why do we See? | Josheta Srinivasan

Note that the question isn’t how do we see, but why. In this post I aim to survey theories about the purpose of vision and the importance of the inquiry, focusing on the ideas of David Marr in his book, Vision.

Why is it important to understand the purpose of vision?

D. Marr summarizes this quite beautifully in his book. His explanation is framed around “Understanding Complex Information-Processing Systems”; that is, the book is written with Cognitive Science and AI in view, and seeks to understand complex information-processing systems such as our brains, computers or, more specifically, our vision (a subset of the ‘brain’, so to speak). He elucidates, quite fascinatingly, why understanding the purpose of an information-processing system matters for determining how it works. He then applies this to Vision: we need to understand the purpose of vision in order to understand its workings, and the importance of understanding how (human) vision works is well known, with applications in diagnosis and computer vision, among others.

So perhaps the question “Why is it important to understand the purpose of vision?” can be more specifically rephrased as “Why is it important to understand the purpose of vision in order to understand how vision works?”

This is what we’ll tackle now.

Marr divides ‘understanding’ an information-processing system into 3 levels:

Level 1: This level is called “Computational theory” and involves answering the question of purpose. In Marr’s words, it answers the what and why of the system. Stating the purpose of the system answers the why (pretty straightforwardly) but also the what. Think about it: the goal of a system is also just what it does. So, more simply, we can refer to this level as the Purpose level.

Level 2: This level is called “Representation and Algorithm”. Though Marr defines representation and algorithm more specifically, we can just look at this as the how: how the system does what it does. Essentially, this would involve a series of steps (or maybe just one step). I will call this the Method level.

Level 3: This level is called “Hardware implementation”. This is another layer of how but deals with it less abstractly; it considers the physical materialization of the system: how does the hardware of the system work? I will call this the Hardware level.

Note that the difference between level 2 and level 3 is essentially one of abstractness: there may be many different hardware implementations for the same ‘Method’. For instance, take the simple example of turning on a lightbulb. The ‘Method’ remains the same (flick the switch on the wall), but the hardware may differ: the color and design of the switch can vary, and so can the direction in which it is flicked, depending on whether you are in the United States or elsewhere in the world. The lightbulb itself may also differ: is it an LED or a filament bulb?

To solidify the concept of these ‘Three levels’, I have created the table below that provides an example for each of the levels.

	A cash register	The Visual System
Level 1 (Purpose)	To provide the total cost of a customer’s purchase.	This is essentially the question we are trying to figure out.
Level 2 (Method)	Addition (more specifically, the kind that machines do)	This is informed by results from certain studies in which patients with agnosia (caused by damage to a certain part of the brain) can perform certain functions and not others, suggesting that vision might have a hierarchical structure, or that there may be two visual pathways, etc.
Level 3 (Hardware)	The inner wiring of the cash register, with logic gates, resistors and such, that allows the Method to work physically.	This would point to the exact biological pathways and circuitry that make up the visual system.
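
To make the cash register column concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not Marr’s formulation: the function names (`satisfies_purpose`, `running_total`, `digitwise_total`) are hypothetical, and prices are assumed to be non-negative whole numbers of cents. The Purpose level is written as a specification, and two different Methods are shown that both satisfy it. The Hardware level is whatever physical machine happens to run them.

```python
from typing import Callable, List


def satisfies_purpose(total_fn: Callable[[List[int]], int], prices: List[int]) -> bool:
    """Level 1 (Purpose): the reported total must equal the sum of the prices."""
    return total_fn(prices) == sum(prices)


def running_total(prices: List[int]) -> int:
    """Level 2 (Method), option A: accumulate using the machine's native integers."""
    total = 0
    for price in prices:
        total += price
    return total


def add_decimal_strings(a: str, b: str) -> str:
    """Add two non-negative decimal numerals column by column, carrying as needed."""
    digits = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0
        db = int(b[j]) if j >= 0 else 0
        carry, digit = divmod(da + db + carry, 10)
        digits.append(str(digit))
        i -= 1
        j -= 1
    return "".join(reversed(digits)) or "0"


def digitwise_total(prices: List[int]) -> int:
    """Level 2 (Method), option B: a different representation (decimal digit
    strings) and a different algorithm (schoolbook addition with carries)."""
    total = "0"
    for price in prices:
        total = add_decimal_strings(total, str(price))
    return int(total)


basket = [199, 250, 5]  # hypothetical prices in cents
print(satisfies_purpose(running_total, basket))
print(satisfies_purpose(digitwise_total, basket))
```

Both Methods answer to the same Purpose, which is the point of separating the levels: knowing the Purpose tells you what any candidate Method must achieve, whichever representation and steps it uses.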

Now, let’s circle back to our question: “Why is it important to understand the purpose of vision in order to understand how vision works?”. Essentially, this question asks: why is level 1 understanding important for achieving level 2 understanding?

The answer is that approaching level 2 from level 1 affords a more complete understanding of level 2 than approaching it from level 3. That is, if we are attempting to understand level 2 of a system (which is often the case), such as ‘why does a cash register follow this series of steps every time I scan all the items for a customer’, we can derive a more complete understanding from answering the question ‘what is the purpose of a cash register?’ than we can from answering the question ‘what are the inner mechanisms of the hardware of the cash register?’.

Marr makes the point that most of the work on understanding the Method level of information-processing systems, be it the visual system or some artificial one, has used a level 3 to 2 approach rather than level 1 to 2. Hence, he makes an emphatic claim for the importance of level 1 understanding; not because it is in and of itself any more important than understanding at any other level, but because it has been comparatively neglected.

That being said, what is the purpose of Vision?

This is a complicated question to answer just because our Vision allows us to do so many different things: recognize things, read, obtain food, defend ourselves, navigate, grasp things… Its purpose, then, seems to be so all-encompassing (more accurately, just super vast) that it at once seems to say everything and nothing.

Marr takes the route of arranging all these purposes into some sort of hierarchical structure (he doesn’t physically do so of course, but conceptually speaking) which then allows for the isolation of a ‘primary’ function. For Marr, this was

“building a description of the shapes and positions of things from images”.

He acknowledges that this is not the only thing that the system can do, but that everything else falls under the umbrella of vision’s ability to do this. He comes to this conclusion by considering the work of Elizabeth Warrington, whose study of patients with parietal lesions showed that those with right parietal lesions ‘recognized’ objects and were able to name their purpose and uses, albeit only when the objects were presented from a conventional perspective (meaning a head-on angle, or a view and illumination in which it was clear enough what the object was). However, they vehemently denied that the object was what it was when it was presented at odd angles or under odd illumination. Those with left parietal lesions, on the other hand, showed no signs of ‘recognizing’ an object (naming it or its uses), despite being able to describe its shape and general 3-dimensional geometry, even when it was presented at odd, obscure angles and illumination. This gave Marr the impression that our Vision, at a more primitive level, could identify the geometries of objects without any higher level of comprehension; hence he defined its more general purpose to be creating an object-centered description of objects from the visual input (images).

What strikes me as interesting in this account is how similar the behavior of patients with right parietal lesions is to the behavior of our current computer vision systems. In fact, a major ‘problem’ with our current computer vision is its inability to recognize objects from an ‘unconventional’ angle, or when they are partially obscured in some way. The biological ability to recognize the geometries of objects regardless of how obscured they are (some were quite complexly obscured) is, by contrast, quite a remarkable feat.

In that regard, one can quite clearly see how isolating the level 1 understanding of the visual system informs our understanding of the methods/algorithms it uses, which we might then aim to mimic electronically.

Considerations from later discoveries

The book by Marr was published posthumously in 1982 and since then, there have been numerous insightful discoveries about the visual system. I want to focus on the discovery of the dorsal and ventral systems, which divide vision into two capabilities: Vision for action and Vision for perception (Goodale and Milner, 2006). The most illuminating results regarding this came from two kinds of patients:

1. Visual agnosia without optic ataxia: Patients seemed unable to ‘recognize’ objects or describe their orientation, but could perfectly guide motor responses such as grasping an object or fitting it into a shaped box.

2. Optic ataxia without visual agnosia: Patients, though able to ‘recognize’ objects, were unable to guide accurate motor responses towards the object.

From this other incredible body of work, we gather that Marr’s purpose of Vision isn’t comprehensive; there seem to be, broadly, two different functions of Vision. The first is pretty much what Marr elucidates. The second, however, seems to be:

guiding motor-responses to the visual input (images) such as navigation or grasping

This is characterised by the building of a viewer-centered description of objects (differing from the first purpose which, if you recall, was the provision of an object-centered description).
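
The difference between the two kinds of description can be sketched with a little coordinate geometry. This is a simplified 2D illustration under my own assumptions, not a model taken from Marr or from Goodale and Milner: the function names, the mug example and the convention that the viewer’s x-axis points straight ahead are all hypothetical choices made for the example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def object_centered(points_world: List[Point], object_origin: Point) -> List[Point]:
    """Describe points relative to the object's own origin.
    The result does not depend on where the viewer is."""
    ox, oy = object_origin
    return [(x - ox, y - oy) for x, y in points_world]


def viewer_centered(points_world: List[Point], viewer_pos: Point,
                    viewer_heading: float) -> List[Point]:
    """Describe points relative to the viewer: translate to the viewer's
    position, then rotate by minus the heading (radians) so that the
    viewer's x-axis points straight ahead."""
    vx, vy = viewer_pos
    c, s = math.cos(-viewer_heading), math.sin(-viewer_heading)
    result = []
    for x, y in points_world:
        dx, dy = x - vx, y - vy
        result.append((c * dx - s * dy, s * dx + c * dy))
    return result


# Hypothetical scene: a mug centered at (2, 1) whose handle tip is at (3, 1).
mug_origin = (2.0, 1.0)
handle_world = [(3.0, 1.0)]

print(object_centered(handle_world, mug_origin))
print(viewer_centered(handle_world, (0.0, 0.0), 0.0))
print(viewer_centered(handle_world, (3.0, -1.0), math.pi / 2))
```

In this sketch the handle always sits at (1.0, 0.0) in the object-centered description, while its viewer-centered coordinates change whenever the viewer moves or turns. A description of the first kind suits recognition; reaching and grasping need the second kind, since the hand has to be guided relative to the body.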
