The User Agent Project

User agents are one of my more arcane areas of interest. I have been compiling a comprehensive list of unique user agent strings since 2006. The resulting list is probably the largest ever made available to the general public. This research project is accessible as a specialized search engine for everyone working in web analytics. This FAQ collects the answers to the most common questions people have asked me about this project.

General Questions

What are “User Agents”?
What does a user agent look like?
Who is interested in user agents?
Why is such a search engine useful?

Data

What is the source of all these millions of agents?

Privacy Issues

Am I personally identifiable through the user agent of my browser?
Are user agents never unique?
Is this user agent data a privacy risk?

General Questions

What are “User Agents”?

User agents are short data strings transmitted with most Hypertext Transfer Protocol (HTTP) requests. They are sent by clients (for example, your web browser) to servers (like the one hosting this homepage). In this process, the client usually states its type (Internet Explorer, Firefox, etc.) and version via the User-Agent request header. With the help of the user agent string, the server can in turn provide the client with the specific data it needs to properly render a web page.
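To make the request header more concrete, here is a minimal sketch in Python that attaches a user agent string to an HTTP request object. The URL and the agent string "ExampleClient/1.0" are made up for illustration, and nothing is actually sent over the network.

```python
from urllib.request import Request

# Hypothetical URL and agent string, chosen only for illustration.
request = Request(
    "http://example.com/",
    headers={"User-Agent": "ExampleClient/1.0"},
)

# urllib stores header names in capitalized form ("User-agent").
# The request is only built here; it would be sent by passing it
# to urllib.request.urlopen().
user_agent = request.get_header("User-agent")
print(user_agent)
```

A server receiving such a request would find the same string in the User-Agent header and could log it or use it to decide what to send back.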

What does a user agent look like?

A user agent string consists of separate parts with different meanings. It is divided into tokens that identify elements such as the application type, operating system, software vendor, or software revision. It looks something like this:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/55.0.2883.87 Chrome/55.0.2883.87 Safari/537.36

For more information on the history of user agents and how to decode such strings, see the Wikipedia article on user agents.
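As a rough illustration of how such a string breaks down into tokens, the following Python sketch splits it into product tokens (name plus optional version) and parenthesised comments. It is a simplified approach of my own for this FAQ, not a complete parser; the function name tokenize_user_agent is hypothetical, and unusual agent strings (for example with nested parentheses) will not be handled correctly.

```python
import re

# Either a parenthesised comment, or a product token with an optional
# "/version" part. Nested parentheses are deliberately not supported.
TOKEN_RE = re.compile(r"\(([^()]*)\)|([^\s()/]+)(?:/([^\s()]+))?")


def tokenize_user_agent(ua):
    """Split a user agent string into ("product", ...) and ("comment", ...) tuples."""
    tokens = []
    for comment, product, version in TOKEN_RE.findall(ua):
        if product:
            tokens.append(("product", product, version or None))
        else:
            # Comment parts are conventionally separated by semicolons.
            parts = [part.strip() for part in comment.split(";")]
            tokens.append(("comment", parts))
    return tokens


EXAMPLE = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Ubuntu Chromium/55.0.2883.87 "
    "Chrome/55.0.2883.87 Safari/537.36"
)

tokens = tokenize_user_agent(EXAMPLE)
for token in tokens:
    print(token)
```

For the example above, the first token is the product "Mozilla" with version "5.0", followed by a comment holding the platform details "X11" and "Linux x86_64".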

Who is interested in user agents?

One main reason to deal with user agents is web analytics. Different browsers support web technologies differently, and the way the content of a website is displayed varies between them. Naturally, the operator of a website wants to attract as many viewers as possible and not scare away potential customers. For this reason, it is vital for operators to know which browsers are actually used to display their website. One way to find out the type of browser and its supported technologies is to check the user agent strings of the visitors. This is not always reliable because the user agent could be tampered with. Yet this method is still sufficient for noncritical website features like design gimmicks.
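A very naive version of such a check could look like the following Python sketch. The marker list and the function name guess_browser are my own assumptions for illustration; real detection libraries use far larger rule sets, and, as said, the string can be faked.

```python
# Order matters: Chromium-based agents also contain "Chrome/" and
# "Safari/", so the more specific markers have to be tested first.
MARKERS = [
    ("Firefox/", "Firefox"),
    ("Chromium/", "Chromium"),
    ("Chrome/", "Chrome"),
    ("Safari/", "Safari"),
    ("MSIE ", "Internet Explorer"),
]


def guess_browser(ua):
    """Return a rough browser name for a user agent string, or "unknown"."""
    for marker, name in MARKERS:
        if marker in ua:
            return name
    return "unknown"


example = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Ubuntu Chromium/55.0.2883.87 "
    "Chrome/55.0.2883.87 Safari/537.36"
)
print(guess_browser(example))
```

Because the result rests on an easily changed string, it should only drive cosmetic decisions, never anything security-relevant.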

Another reason to be interested in lists of real user agents is privacy. Some users try to conceal their true browser versions with the help of browser extensions, which need fresh lists time and again to function properly.

One last reason to deal with this topic is the complex area of automated website crawling. Changing the user agent between requests is a widespread method of bypassing countermeasures against automated requests. Knowledge of the behavior and identification of different types of clients is crucial to the development of both website protection systems and web crawlers.

Why is such a search engine useful?

Comprehensive lists of User-Agents are needed to develop better web-tracking solutions and to get a broader understanding of web requests as such. Furthermore, they are useful for programming smarter tools against automated web-requests.

Last but not least, for someone interested in web mining, it is a lot of fun to process hundreds of gigabytes of data to compile such a specialized database, unique in size and data quality. 😉

Data

What is the source of all these millions of agents?

In 2006, while researching the source of some suspicious website requests, I realized that Google had indexed huge amounts of raw Apache access logs. (You could search for them with queries like this: inurl:access filetype:log) That year I started to monitor search engines for this raw data and to process some of these log files. I also included the user agent data from visitors to several of my private websites. No personal information was stored in this process; see the topic “Privacy Issues” in this FAQ.
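To give an idea of what processing such a log file involves, here is a small Python sketch that pulls the user agent out of a single access log line. It assumes the logs use Apache's "combined" log format, which is not the only possible layout; the sample line, including its address and agent string, is invented for illustration, and the function name extract_user_agent is hypothetical.

```python
import re

# A double-quoted field that may contain backslash-escaped characters.
QUOTED = r'"((?:[^"\\]|\\.)*)"'

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user agent"
COMBINED_RE = re.compile(
    r"^\S+ \S+ \S+ \[[^\]]+\] " + QUOTED + r" \d{3} \S+ " + QUOTED + r" " + QUOTED + r"$"
)


def extract_user_agent(line):
    """Return the user agent field of a combined-format log line, or None."""
    match = COMBINED_RE.match(line.strip())
    return match.group(3) if match else None


# Invented sample line; 192.0.2.1 is an address reserved for documentation.
sample_line = (
    '192.0.2.1 - - [10/Oct/2006:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" "ExampleBrowser/1.0"'
)
print(extract_user_agent(sample_line))
```

Lines that do not follow the assumed format simply yield None and can be skipped.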

Privacy Issues

Am I personally identifiable through the user agent of my browser?

In general, the answer is no. By design, user agent strings are not meant to identify anyone personally, but rather the software used to connect to a server. In fact, this identifier changes with every software or browser update.

Are user agents never unique?

There are circumstances under which a single computer, and thus its users, might become uniquely identifiable via its user agent string. This is primarily the result of spyware and mostly an issue affecting Microsoft Internet Explorer up to version 8.0 (about 2011). Sometimes well-intentioned network administrators also configure firewalls to uniquely identify single computers via agent strings. With much more privacy awareness in IT in general, this case is much less common nowadays.

Is this user agent data a privacy risk?

A clear NO! In general, the raw data processed here is freely visible to everyone and, for example, already indexed by public search engines. The user agents of most web requests are not unique and are in principle transmitted unprotected many thousands of times a day per user via the Internet as part of the HTTP protocol. They are already processed and stored on millions of servers around the world every day.

In the preparation of the raw data for this search engine, privacy issues were still taken seriously: during the processing of log files, the IP addresses, the requested hosts and URLs, and the precise request times were discarded and not stored.
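The idea of keeping only the agent strings can be sketched in Python as follows. This is a simplified illustration and not the actual processing pipeline: it assumes the user agent is the last double-quoted field on each log line, and the function name unique_user_agents and the sample lines are invented.

```python
import re

# Assumption: the user agent is the last double-quoted field of the line.
LAST_QUOTED_FIELD = re.compile(r'"((?:[^"\\]|\\.)*)"\s*$')


def unique_user_agents(lines):
    """Collect the distinct user agents and keep nothing else from the lines."""
    agents = set()
    for line in lines:
        match = LAST_QUOTED_FIELD.search(line)
        # "-" is how Apache logs a missing header; skip it and empty values.
        if match and match.group(1) not in ("", "-"):
            agents.add(match.group(1))
    return agents


# Invented sample lines with addresses reserved for documentation.
sample_lines = [
    '192.0.2.1 - - [10/Oct/2006:13:55:36 +0000] "GET / HTTP/1.1" 200 12 "-" "ExampleBrowser/1.0"',
    '198.51.100.7 - - [11/Oct/2006:08:01:02 +0000] "GET /a HTTP/1.1" 200 34 "-" "ExampleBrowser/1.0"',
    '203.0.113.9 - - [11/Oct/2006:09:15:40 +0000] "GET /b HTTP/1.1" 404 0 "-" "-"',
]
agents = unique_user_agents(sample_lines)
print(agents)
```

Since only the matched string is added to the set, the IP address, URL, and timestamp of each line are never retained, and repeated agents collapse into a single entry.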

But the efforts did not stop at this precaution, as it became clear that a certain percentage of the resulting unique user agents still MIGHT have some privacy-related issues. Although this primarily concerned the early years of the data, where the connection to real users was already lost, the raw data was additionally refined. The data was filtered and obfuscated with manually created rules and the help of machine learning. In total, one third of the agent data was removed to further protect the data originators.

If you think you have found a problematic user agent or search result, please inform me via the contact form in the menu and I will remove the data as soon as possible if it seems appropriate! 🙂
