Africa has a data problem, and we need to be open about it.

The Problem

The claim that Africa has a data problem is not new; it has been an ongoing conversation for decades. Organizations like Stears are doing excellent work to close this gap as much as possible and ease the current data dilemma.

However, it is not enough, and in recent times, nothing has exposed this gap as much as the use of Large Language Models (LLMs). There have been issues of bias, misrepresentation, and underrepresentation.

Now, human factors play a part in these problems, but often the data is missing or inadequate simply because it does not exist, or does not exist in a form that can be used (whether structured or unstructured).

Machines are ultimately that – machines.

They require information to be put into them, and can only give back what has been put in. Even machine learning, a subset of AI built on complex systems that learn on their own by studying human interactions, ultimately works only with the data it has been given.

We tell humans that what you don’t know, you don’t know, and if we’re expecting computers to reason like humans, albeit faster, the same principle applies to them too.

If data does not exist, it is not possible to learn from it, and by extension, provide knowledge on it.

Now, we see an issue and have identified what the problem is. The next step is: how can we solve it?

To solve it, or at least begin the process, we first need to identify why we have this problem. Some of the causes are listed below, based on observation and work experience.

Causes: Why do we have this problem?

Lack of trust: It is not uncommon to learn that organizations are unwilling to share data with others because they are uncertain about the extent of use or are just resistant to the idea of sharing.
Little to no digitization: As a continent, we are miles behind our counterparts elsewhere in the Global South when it comes to digitization. Yes, more people use smartphones and are able to connect to the internet, but our systems are still largely non-digitized. In Nigeria, for example, government institutions and hospitals still keep their records on paper. Aside from the risk of natural disasters wiping out important individual or patient records, paper storage also makes it impossible to process the data or structure it in a way that can be anonymized for future use.
Sharing culture: Similar to our lack of trust, we do not have a culture of sharing or continuity. More recently, people have become able to talk through their processes and explain their activities, but that happens more on an individual level than system-wide. To ensure we have data and information that people can use and assess, information about these actions has to be shared.
Lack of reproducibility: This is more of an effect than a cause. We do not share a lot, so there’s limited information to replicate. Even so, a great deal of information is already out there, yet when something needs to be built, the first instinct is usually to build from scratch rather than check whether an open-source version or boilerplate exists to build on.

Potential Solutions: How Can We Solve This?

Create data sharing policies: Create policies and agreements between partners and stakeholders. This sets clear expectations for how data is collected, processed, and used. It can also explain which tools are being used and how. In the event of a breach, there are clear indications of the measures to be taken and the appropriate personnel or authority to contact. An example of a project that has this is the Data Science Without Borders project, which has representation from three pathfinder countries and the African Population and Health Research Center (APHRC).
Digitize: We should have a system of digitization. There is sometimes a suspicion that records become susceptible to breaches or security threats once digitized, and while that is a valid concern, there are tools and measures designed to secure data and systems. The risks of leaving records on pen and paper and other manual tools outweigh the benefits, and it is advisable to digitize records, especially if we hope to anonymize them and make them publicly available.
Share more: We need to develop better sharing cultures. A balance should be found between protecting sensitive information and sharing the practices or processes being implemented for public benefit. This leads to my next point…
Encourage and practice reproducibility: Build on and use existing solutions. This creates a culture of building and refining, reduces duplication, and enables collaboration across different demographics. When people build on and improve an existing system, those pre-existing solutions get reviewed and improved. The initial creators gain better insight into their work, which leads to more refinement and easier detection of what might be wrong or need updating. It also increases the chances of collaboration, which is important for building systems and tools that adequately serve the needs of the continent on a large scale.
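
To make the digitization point a little more concrete, here is a minimal sketch of what preparing a digitized record for sharing could look like. Everything in it is an assumption for illustration: the field names (`patient_id`, `name`, `phone`, `age`), the sample record, and the key are hypothetical, and this is pseudonymization rather than full anonymization, so a real project would still need the data sharing policies described above.

```python
import hashlib
import hmac

# Hypothetical field names; a real records system would define its own schema.
DIRECT_IDENTIFIERS = {"name", "phone"}


def pseudonymize(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record that is safer to share.

    - Direct identifiers (name, phone) are dropped.
    - The patient ID is replaced by a keyed hash, so the same patient
      always maps to the same token without exposing the original ID.
    - The exact age is reduced to a 10-year band.
    """
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    digest = hmac.new(
        secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256
    )
    cleaned["patient_id"] = digest.hexdigest()[:16]

    decade = (record["age"] // 10) * 10
    cleaned["age"] = f"{decade}-{decade + 9}"
    return cleaned


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    sample = {
        "patient_id": "LAG-000123",
        "name": "Example Patient",
        "phone": "0800-000-0000",
        "age": 34,
        "diagnosis": "malaria",
    }
    print(pseudonymize(sample, secret_key=b"replace-with-a-real-secret"))
```

None of this is possible while the record only exists on paper, which is the point: once records are digital, steps like these can be scripted, reviewed, and reused by others.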

“If you want to go fast, go alone. If you want to go far, go with others.”

With respect to this topic, I’d rephrase this to say that if we want to last longer and build stronger systems, we need tools, policies, and practices that support this.

The beauty of open source is that by default, it already propagates the practices of accessibility, transparency, and collaboration.

To solve our data problem to a considerable degree, we need to implement these practices at different scales, from proprietary to fully open source organizations. Talk to the Open Source Program Offices (OSPOs) in your region, and learn how you can do your part.