The 5 Ws of Data Validation – Part 1

As web developers, the applications you write are complex data processing engines.  They try and convince your users to enter good, meaningful data and to respond in solid, predictable ways based on what was entered. Robust data validation will allow the rest of you application to work effectively.

What is Data Validation?

Data validation is the process of making sure that the data you are using is

  • Clean
  • Complete
  • And useful.

Data is scrutinized in various ways to make sure that it adheres to basic restrictions and to fundamental properties.  It’s no good receiving a sandwich when you expect a car.  In a more practical sense, if you expect an integer, a string simply won’t do, and you had better know it before your code dies an ugly death. PHP can handle some basic magic due to it being loosely typed. That said, “foo” will never be a useful integer.

Once your data passes an initial validation, it is wise to run your data by your business rules to ensure that everything falls within acceptable limits. Data outside these limits can be considered low quality or possibly an outright error and will be of very little use. In some settings it may still be fine to accept this data, but most of the time your application will need to find a way of responding to these problem values in a practical way.

A good example would be a user (say, Joe) is submitting a date for when he would like his car (a Ford Pinto) to be serviced. The incoming value is first checked to verify that it really is a date. With that done, it now needs to be evaluated against your business rules. In this case, the date clearly cannot be in the past, since it wouldn’t make sense for requesting an appointment. Likewise, the shop (The Car Repair Warehouse) does not accept dates more than 2 weeks in advance. Data outside these limits are not useful. The date Joe entered was the day he bought his Pinto (June 9, 1976), well outside the acceptable range. Now that you know the data is out of range, it can be dealt with properly, most likely by informing the user of the problem.

Why is Data Validation Important?

Without validating your data, horrible, ugly, bad things can happen. Very bad.

In a bad case, a malicious user could destroy your database. That’s very inconvenient, as long as you’re diligently backing up your data. You are, right?

Worse than that, they could manipulate your data in ways that are hard to identify, that benefit them and that are detrimental to your other users. Most of the time, you won’t even know this is happening.

Rising out of the really glum cases, there are a whole lot of usability reasons why validation is a good idea. For instance, it will force you to reflect on what data you should be expecting from your users. You analyze your assumptions and codify them in code. This will help you communicate to users what the expectations are. It is pretty poor when a user enters a value that they assume should be allowed by a form just to have the thing error out on them. Suddenly, they end up on a blank screen, or worse, a screen showing an exception.

Validating your data will also ensure that you have a higher overall consistency in the data that you’re collecting. As they say, “Garbage in, garbage out”. It’s important to make sure that you’re not the trash collector. This is where your business rules will really come in handy. They will ensure that the well typed data coming in is in fact good, useable, consistent data; data that has excellent business value.

Where Should I be Validating Data?

Validation should occur at trust boundaries. That is to say, at any point which the application can no longer trust the data that it is receiving.

In a perfect world, the data should be validated at every tier of your application stack; anywhere the layer loses control over the data it is using. Here are some common boundary points that a somewhat common PHP stack would encounter:

  • Database
    • All incoming data from queries and parameters. In most cases this would need to be done at the driver and/or in stored procedures.
  • Application server
    • Any data being retrieved from external APIs such as web service calls, cURL calls, or data loaded from files.
    • Any data being submitted back from the user or client via http (get/post requests, etc).
  • View/Client
    • Any data being entered by the user (client-side validation of forms)
    • Any data being received from the server via AJAX calls

Based on the kind of development work you’re doing, there could certainly be many more trust boundaries.

For most people the data being returned from a database is considered highly trusted, since it tends to be the ultimate data repository (the “if we can’t trust the database, we can’t trust anything” scenario). This would be the case for smaller development projects, where the database is relatively simple and it’s being developed by the same team that is developing the rest of the application and the database resides on the same subnet as the application. In a larger environment where the data/persistence layer gets complicated (perhaps all database work is passed to a different team for development, or the database traffic is being passed across networks outside the team’s control), the level of trust could be significantly lower.

For the focus of these articles, I’ll stick to scenarios where the data is being validated via the PHP code (with perhaps a bit of JavaScript thrown in).

When Should My Data Be Validated?

All incoming data is considered untrusted until it’s been validated; until you know it’s good, it’s bad. As data is being received from a lower trust boundary, it should either be validated or dropped. The data shouldn’t be used in any way until it’s been proofed and deemed trustable. Once it’s been properly validated, it can be used in the same way any other internally available data is used

With that in mind, it means you can’t use any incoming user data until it has passed a validation and verification process. Some of this may occur at very low levels within your application, such as using request data for routing (identifying which controller and action to use) within your application. If you’re using a framework, it’s quite possible that this is out of your control.

Most validation will occur at a higher level, such as within your controller. At this point, as a developer, validation becomes your responsibility.

Who is Responsible for Validation?

You are, of course. It doesn’t matter if your framework is handling parts of it internally, –new sentence– as the developer you’re still ultimately responsible for the trustworthiness of the data that are used.

How you architect that validation into reusable and manageable methods will be central to how secure your application will be. If your validation methods are easy to use and are light on boilerplate code, it will help you be more consistent in your validation attempts.

How Do I Validate My Data?

Ok, so I’ve thrown an H into my 5 Ws article. This will be a very high level “how”, looking at strategies rather than implementations. For this, I fall back onto OWASP’s recommendations. There are four strategies to validating data. Ranked from the best way to the worst, they are:

  • White-listing – Accepting known good values
  • Black-listing – Rejecting known bad values
  • Sanitizing the data
  • No validation

These strategies are discussed in the OWASP Data Validation Guide (https://www.owasp.org/index.php/Data_Validation#Data_Validation_Strategies).

White-listing has obvious benefits. You literally only accept data that matches a known list of valid values. This works great when you know what all the possible valid data can be. For example, let’s say you are expecting an integer value between 1 and 5. In that case it’s easy to say that it must be a value contained in this list: 1, 2, 3, 4, 5.

What if the value you’re expecting is a float between 1 and 5? The possible values have just become infinite. Obviously, white-listing isn’t going to help you anymore, you’ll have to settle for the next best thing: black-listing. With black-listing, you will identify all the values that are not allowed. In this case, you can say that in the incoming value must be >= 1 and <= 5=”” you=”” may=”” want=”” to=”” put=”” other=”” reasonable=”” limitations=”” on=”” the=”” data=”” say=”” only=”” decimal=”” places=”” are=”” allowed=”” p=””>

Sanitizing is so often used and so very flawed. Sanitizing is over used by PHP programmers to protect against SQL injection attacks (mysql_real_escape_string() anyone?).  In order for this method to work, you would need to both have a perfect knowledge of all the possible bad data and be able to code against them. You end up making assumptions about how you think data will be interpreted in your system or database. In reality, everything you didn’t anticipate gets by you with sanitizing; and really, the people trying to get things by you can be pretty inventive.

Don’t rely on sanitizing. Just don’t do it. It will let you down like a plummeting elevator.

If you choose not to validate your data, fear the reaper.

Conclusion

In the real world data is ugly, crazy, and untrustworthy. Your only hope to taming the data beast is to diligently, methodically validate your data. Strong data validation combined with rigorous business rules will ensure that the data you use is clear of security problems and as useable as possible.

In my next article, I’ll be looking at the how to do basic data validation in PHP.

 

Leave a Reply

Your email address will not be published. Required fields are marked *