The 5 Ws of Data Validation – Part 1

As a web developer, you write applications that are complex data processing engines. They try to convince your users to enter good, meaningful data, and they need to respond in solid, predictable ways based on what was entered. Robust data validation will allow the rest of your application to work effectively.

What is Data Validation?

Data validation is the process of making sure that the data you are using is

  • Clean
  • Complete
  • And useful.

Data is scrutinized in various ways to make sure that it adheres to basic restrictions and to fundamental properties.  It’s no good receiving a sandwich when you expect a car.  In a more practical sense, if you expect an integer, a string simply won’t do, and you had better know it before your code dies an ugly death. PHP can handle some basic magic due to it being loosely typed. That said, “foo” will never be a useful integer.
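As a minimal sketch of that kind of type check, PHP's built-in filter_var() with FILTER_VALIDATE_INT can tell a usable integer from “foo”. The helper name parseInteger() is hypothetical and only here for illustration.

```php
<?php
// Hypothetical helper: returns the integer value, or null when the input
// is not a well-formed integer.
function parseInteger($raw)
{
    // filter_var() returns false on failure, so compare strictly:
    // a valid "0" must not be mistaken for a failure.
    $value = filter_var($raw, FILTER_VALIDATE_INT);
    return ($value === false) ? null : $value;
}

var_dump(parseInteger("42"));  // int(42)
var_dump(parseInteger("foo")); // NULL
```

Returning null rather than false keeps a failed check clearly distinct from a legitimate zero.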

Once your data passes an initial validation, it is wise to run your data by your business rules to ensure that everything falls within acceptable limits. Data outside these limits can be considered low quality or possibly an outright error and will be of very little use. In some settings it may still be fine to accept this data, but most of the time your application will need to find a way of responding to these problem values in a practical way.

A good example would be a user (say, Joe) submitting a date for when he would like his car (a Ford Pinto) to be serviced. The incoming value is first checked to verify that it really is a date. With that done, it now needs to be evaluated against your business rules. In this case, the date clearly cannot be in the past, since that wouldn’t make sense for an appointment request. Likewise, the shop (The Car Repair Warehouse) does not accept dates more than 2 weeks in advance. Data outside these limits is not useful. The date Joe entered was the day he bought his Pinto (June 9, 1976), well outside the acceptable range. Now that you know the data is out of range, it can be dealt with properly, most likely by informing the user of the problem.
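Joe's appointment could be checked along these lines. This is only a sketch: the function name checkServiceDate(), the YYYY-MM-DD input format, and the message wording are all assumptions made for the example.

```php
<?php
// Hypothetical helper: validates a requested service date.
// Returns null when the date is acceptable, otherwise a message for the user.
function checkServiceDate($raw, DateTime $today)
{
    // Step 1: is it really a date? (The Y-m-d format is an assumption.)
    // The round-trip comparison rejects overflow dates such as 2010-02-30.
    $date = DateTime::createFromFormat('!Y-m-d', $raw);
    if ($date === false || $date->format('Y-m-d') !== $raw) {
        return 'Please enter a valid date (YYYY-MM-DD).';
    }

    // Step 2: business rules. Not in the past, at most 2 weeks ahead.
    $earliest = clone $today;
    $earliest->setTime(0, 0, 0);
    $latest = clone $earliest;
    $latest->modify('+14 days');

    if ($date < $earliest) {
        return 'The appointment date cannot be in the past.';
    }
    if ($date > $latest) {
        return 'Appointments can be booked at most 2 weeks in advance.';
    }
    return null;
}

// Joe enters the day he bought his Pinto.
$today = new DateTime('2010-11-01');
echo checkServiceDate('1976-06-09', $today), "\n";
```

Passing “today” in as a parameter, rather than reading the clock inside the function, keeps the business rules easy to test.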

Why is Data Validation Important?

Without validating your data, horrible, ugly, bad things can happen. Very bad.

In a bad case, a malicious user could destroy your database. That’s merely inconvenient, as long as you’re diligently backing up your data. You are, right?

Worse than that, they could manipulate your data in ways that are hard to identify, that benefit them and that are detrimental to your other users. Most of the time, you won’t even know this is happening.

Rising out of the really glum cases, there are a whole lot of usability reasons why validation is a good idea. For instance, it will force you to reflect on what data you should be expecting from your users. You analyze your assumptions and codify them in code. This will help you communicate to users what the expectations are. It is pretty poor when a user enters a value that they assume should be allowed by a form just to have the thing error out on them. Suddenly, they end up on a blank screen, or worse, a screen showing an exception.

Validating your data will also ensure that you have a higher overall consistency in the data that you’re collecting. As they say, “Garbage in, garbage out”. It’s important to make sure that you’re not the trash collector. This is where your business rules will really come in handy. They will ensure that the well typed data coming in is in fact good, useable, consistent data; data that has excellent business value.

Where Should I be Validating Data?

Validation should occur at trust boundaries. That is to say, at any point at which the application can no longer trust the data that it is receiving.

In a perfect world, the data should be validated at every tier of your application stack; anywhere the layer loses control over the data it is using. Here are some common boundary points that a somewhat common PHP stack would encounter:

  • Database
    • All incoming data from queries and parameters. In most cases this would need to be done at the driver and/or in stored procedures.
  • Application server
    • Any data being retrieved from external APIs such as web service calls, cURL calls, or data loaded from files.
    • Any data being submitted back from the user or client via HTTP (GET/POST requests, etc.).
  • View/Client
    • Any data being entered by the user (client-side validation of forms)
    • Any data being received from the server via AJAX calls

Based on the kind of development work you’re doing, there could certainly be many more trust boundaries.

For most people, the data being returned from a database is considered highly trusted, since it tends to be the ultimate data repository (the “if we can’t trust the database, we can’t trust anything” scenario). This would be the case for smaller development projects, where the database is relatively simple, it’s being developed by the same team that is developing the rest of the application, and it resides on the same subnet as the application. In a larger environment where the data/persistence layer gets complicated (perhaps all database work is passed to a different team for development, or the database traffic is being passed across networks outside the team’s control), the level of trust could be significantly lower.

For the focus of these articles, I’ll stick to scenarios where the data is being validated via the PHP code (with perhaps a bit of JavaScript thrown in).

When Should My Data Be Validated?

All incoming data is considered untrusted until it’s been validated; until you know it’s good, it’s bad. As data is received from a lower trust boundary, it should either be validated or dropped. The data shouldn’t be used in any way until it’s been checked and deemed trustworthy. Once it’s been properly validated, it can be used in the same way any other internally available data is used.

With that in mind, it means you can’t use any incoming user data until it has passed a validation and verification process. Some of this may occur at very low levels within your application, such as using request data for routing (identifying which controller and action to use) within your application. If you’re using a framework, it’s quite possible that this is out of your control.

Most validation will occur at a higher level, such as within your controller. At this point, as a developer, validation becomes your responsibility.

Who is Responsible for Validation?

You are, of course. It doesn’t matter if your framework is handling parts of it internally. As the developer, you’re still ultimately responsible for the trustworthiness of the data that is used.

How you architect that validation into reusable and manageable methods will be central to how secure your application will be. If your validation methods are easy to use and are light on boilerplate code, it will help you be more consistent in your validation attempts.

How Do I Validate My Data?

Ok, so I’ve thrown an H into my 5 Ws article. This will be a very high level “how”, looking at strategies rather than implementations. For this, I fall back onto OWASP’s recommendations. There are four strategies to validating data. Ranked from the best way to the worst, they are:

  • White-listing – Accepting known good values
  • Black-listing – Rejecting known bad values
  • Sanitizing the data
  • No validation

These strategies are discussed in the OWASP Data Validation Guide (https://www.owasp.org/index.php/Data_Validation#Data_Validation_Strategies).

White-listing has obvious benefits. You literally only accept data that matches a known list of valid values. This works great when you know what all the possible valid data can be. For example, let’s say you are expecting an integer value between 1 and 5. In that case it’s easy to say that it must be a value contained in this list: 1, 2, 3, 4, 5.
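In PHP, a white-list check like this can be a one-liner around in_array(). The function name isAllowedRating() is hypothetical, and the list holds strings on the assumption that the value arrives as request data.

```php
<?php
// Hypothetical helper: accept only values found in a list of known-good values.
function isAllowedRating($raw)
{
    // Request data arrives as strings, so the list holds strings.
    $allowed = array('1', '2', '3', '4', '5');

    // The third argument turns on strict comparison, which avoids
    // PHP's loose type juggling when matching against the list.
    return in_array($raw, $allowed, true);
}

var_dump(isAllowedRating('3')); // bool(true)
var_dump(isAllowedRating('6')); // bool(false)
```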

What if the value you’re expecting is a float between 1 and 5? The possible values have just become infinite. Obviously, white-listing isn’t going to help you anymore; you’ll have to settle for the next best thing: black-listing. With black-listing, you identify all the values that are not allowed. In this case, you can say that the incoming value must be >= 1 and <= 5. You may want to put other reasonable limitations on the data, say, on how many decimal places are allowed.
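A range check for the float case might look something like this. It is only a sketch: parseFloatInRange() is a hypothetical name, and the bounds are compared by hand rather than through filter options so that it does not depend on a particular PHP version.

```php
<?php
// Hypothetical helper: returns the float when it is well-formed and within
// [$min, $max], otherwise null.
function parseFloatInRange($raw, $min = 1.0, $max = 5.0)
{
    // First make sure the input is a float at all.
    $value = filter_var($raw, FILTER_VALIDATE_FLOAT);
    if ($value === false) {
        return null;
    }

    // Then reject everything outside the allowed range.
    if ($value < $min || $value > $max) {
        return null;
    }
    return $value;
}

var_dump(parseFloatInRange('2.5'));  // float(2.5)
var_dump(parseFloatInRange('5.01')); // NULL
```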

Sanitizing is so often used and so very flawed. Sanitizing is overused by PHP programmers to protect against SQL injection attacks (mysql_real_escape_string() anyone?). In order for this method to work, you would need to both have a perfect knowledge of all the possible bad data and be able to code against it. You end up making assumptions about how you think data will be interpreted in your system or database. In reality, everything you didn’t anticipate gets by you with sanitizing; and really, the people trying to get things by you can be pretty inventive.

Don’t rely on sanitizing. Just don’t do it. It will let you down like a plummeting elevator.

If you choose not to validate your data, fear the reaper.

Conclusion

In the real world, data is ugly, crazy, and untrustworthy. Your only hope of taming the data beast is to diligently, methodically validate your data. Strong data validation combined with rigorous business rules will ensure that the data you use is clear of security problems and as useable as possible.

In my next article, I’ll be looking at how to do basic data validation in PHP.

 


ResponseHound project update

ResponseHound has been an incredibly useful tool in my most recent work project. My team is building an application that uses a GWT (JavaScript) client-side app connecting to a PHP server. They communicate using JSON. As usual, some of the more stringent testing (unit) has been pushed to the side. ResponseHound gives us a way to validate that the entire server system is doing what it’s supposed to be doing for each incoming request.

I’ve been continually adding new features here and there as I find that I need them. I have not been pushing these out to the Git repository since I have not documented them well or fully tested them. I’ve also been throwing around some alternatives for writing tests. This might include a more OOP-based approach. I’m also eyeballing the possibility of XML tests. That seems a bit trickier.

Some of the features that I’ve added that need testing are:

  • Request options
    • Show the full request URL
    • Show all response data formatted
    • Show raw response data
  • Conditional tests
    • Test will only execute if another piece of data matches the condition
    • Match against null/notNull, a single value, or a set of values
  • Direct JSON request emulation
    • Allows passing JSON directly in for a request
    • For complex requests, this is very useful

Hopefully, I’ll be able to get these tested and ready for release soon. I also plan on putting together a full demo soon, complete with testing-rig controller.

I’m always open to suggestions or recommendations. Send me your comments.

You can find more info on ResponseHound in the wiki: http://sethmay.net/wiki/ResponseHound.


The Story of Spaz: How To Give Away Everything, Make No Money, & Still Win

ZendCon 2010 – Tuesday Morning, 10am UnCon Session Summary.

Presented by Ed Finkler  – @funkatron (funkatron.com)

http://getspaz.com

Ed Finkler has been working on Spaz since 2007. Spaz is an open source micro-blogging client. He joined the Twitter dev mailing list and became a moderator back in the days when Twitter had 6 employees. Initially, Spaz was written in REALbasic. There ended up being a lot of issues with this language (such as theming and inline linking).

Ed knew CSS and HTML really well (JavaScript, not so much). That led him to Apollo (which later became Adobe AIR). There was a lot to learn about JavaScript, such as Ajax and event handling. So he rebuilt the app. There was some initial interest in the app, especially since it was one of only two desktop apps for Twitter. He ended up winning some contests with Spaz, with prizes such as a computer and a chair. Publicity was welcome and abundant in the early days.

From there, things got a bit more complicated. Soon, Ed started to take some flak for the name, as it was a bit offensive in the U.K. (spaz is a derogatory name for someone with cerebral palsy). Feature requests could be a big downer as well. Big comparisons started to come up between his app and other Twitter apps. People like shiny things; it gets them excited. Most end users don’t care if something is open source or not. They care if it’s free or not. When people work with developers, they tend to treat them as nameless, faceless entities, not real people. It can be really hard not to get offended by the comments and feedback. People are unaware of your motivations, or they just don’t care.

Eventually, Twitter changed their authentication to OAuth. Most of the other free systems didn’t make the change. Spaz did, and so the user base tripled over a couple of days.

New features like image uploaders became the hot thing. Comparisons and feature requests to match other apps started coming in fast and furious. Since this was open source, all development was happening in Ed’s free time. It became impossible to keep up with the demands.

He was then approached by Palm to discuss using the Twitter app on their platform. He agreed and signed an NDA. This became very difficult, since sharing code is an intrinsic part of who he is. Even though he was told that he could open source the code, they still attempted to stop him the weekend before launch. They really just didn’t get it (today, they have done a much better job of embracing open development on their platform).

After this experience, Ed had to redefine what his definition of success was. He ended up writing a declaration of purpose to specifically define what the project was about and the goals it was meant to achieve. But he really needed help. Continued development couldn’t be done by just him. Two platforms, building a community, and the decision not to charge for it were all major factors in the need to bring in more people. Ed wasn’t even testing on a device, emulators only. Getting a device really helped increase motivation. “Eating your own dog food on a consistent basis really helps motivate you”. Using your own apps helps you focus on improvements.

Ed then spent most of 2010 cultivating a community. Originally, Spaz was hosted entirely on Google Code and Google Groups. It was then moved to GitHub, which allowed for a better social environment. Go where the community goes. A lot of developers, especially JavaScript developers, were already on GitHub. This made the project higher profile and easier to interact with. He also started to use TenderApp and Lighthouse for much better issue tracking. It made things easier and simpler for people.

Road-maps and milestones also became important tools for the Spaz project. They really helped the community to see what was going on and gave it focus. Hack-a-thons also really helped bring people together, even when people weren’t working on it. If you want people to work on your project, you need to give it to them in small, bite-size segments that they can sink their teeth into. If you say “work on whatever”, people will quickly get overwhelmed and back off, and no one will help.

Take away:

  • “Eating your own dog food on a consistent basis really helps motivate you”
  • Cultivating a community – Go where the community goes. Make it easy for the community to contribute and buy in.
  • Make good things in the right way.
  • Keep it pure: do it because you love doing it.


Cydia repo and rebuilding apt on iOS4

It took me quite a bit of work to locate the Cydia repo so that I could manually download packages. So here it is for future reference (for the most current version):

http://apt.saurik.com/dists/dude/main/binary-iphoneos-arm/debs/

or to the repo root:

http://apt.saurik.com/dists/

Installing apt7 took the following packages:

  • apt7_0.7.25.3-6_iphoneos-arm.deb
  • apt7-key_0.7.25.3-3_iphoneos-arm.deb
  • apt7-lib_0.7.25.3-9_iphoneos-arm.deb
  • apt7-ssl_0.7.25.3-3_iphoneos-arm.deb
  • berkeleydb_4.6.21-4_iphoneos-arm.deb
  • curl_7.19.4-6_iphoneos-arm.deb

Word of advice: don’t hose apt on your iPhone. It makes things suck.


An AllTest Workaround For PHPUnit

PHP + Unit Testing = Pain in the rear!

I have a fairly large library of PHP code that I maintain. It’s been one of my goals over the last year to implement unit testing for as much of this code as I can. The process has gone fairly well, but one BIG issue always seems to come up: when I attempt to group my tests into suites, class conflicts occur.

This is due to the fact that I have stubbed out many of the classes that are dependencies. These classes, obviously, have the exact same name as the classes they are replacing. If you include a test for object X and then include a test for object Y which requires X (which has been thoughtfully stubbed out), you get this fine error:

Fatal error: Cannot redeclare class X in Y on line N

I know there are various ways around this, many of which require a fair amount of refactoring. I just don’t have the time or energy to undertake that task right now. BUT I still want an AllTests suite that will allow me to run all my unit tests for my project at one time! Manually running hundreds of tests is no good. Manually running dozens of small suites is no good. Here is my workaround.

Goal

Go from hand-executing a ton of tests/suites to executing all of them in a single call. It would also be nice to end up with some readable output from the entire run.

1. We need an install of PHPUnit

Install PHPUnit from PEAR. You can find the instructions here: http://www.phpunit.de/manual/current/en/installation.html

I’m currently using v3.5. It won’t work with the PHPUnit that comes with Zend Studio. You also have the added bonus of using the most recent version, which can be very helpful, indeed.

2. Set up your suites as XML files

Let’s assume the following directory structure for this example:

/tests/unit/library/project/

Our next step is to use the XML configuration system for setting up PHPUnit test suites. I typically put my xml files in the same directory as the tests. Here is an example of one of my files:

library/Project/Class1Test.php library/Project/Class2Test.php ...

I use /tests/unit as my root directory for all things unit test. The bootstrap.php is located here and all my tests are referenced from that directory. You will notice that the bootstrap is referenced from the location of the AllTest.xml file.
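For illustration, a PHPUnit XML suite file along these lines would look roughly like the following. Only the two test file paths come from the example above; the suite name and the relative bootstrap path are assumptions, based on the suite file living in /tests/unit/library/project/ and the bootstrap in /tests/unit.

```xml
<!-- Sketch of an AllTest.xml file. The suite name and bootstrap path are assumed. -->
<phpunit bootstrap="../../bootstrap.php">
  <testsuites>
    <testsuite name="Project">
      <file>library/Project/Class1Test.php</file>
      <file>library/Project/Class2Test.php</file>
    </testsuite>
  </testsuites>
</phpunit>
```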

3. The bootstrap.php file

The bootstrap.php file is used to set up anything your testing environment commonly needs that can’t be done from your tests. This includes things like PHP INI directives and environment variables.

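<?php
// Add the unit test root (the directory containing this file) to the include path.
set_include_path(implode(PATH_SEPARATOR, array(
    realpath(dirname(__FILE__)),
    get_include_path(),
)));

// Add the project root, two levels above the unit test root.
$newPath = "../../";
set_include_path(implode(PATH_SEPARATOR, array(
    realpath(dirname(__FILE__) . "/" . $newPath),
    get_include_path(),
)));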

My file sets the unit test path and my project root path in the php include_path. Place this file in /tests/unit.

4. AllTest.bat file

Yes. I’m in a Windows environment. WISP to be specific, and all the joy and happy butterfly flowers that it brings. That said, batch files seemed to be a decent way to accomplish what I needed. PowerShell might have worked too, but I have no experience with it. So enter batch hell.

Rather than write out the entire file here, I’ll just summarize the basic functionality:

  1. Set environmental variables: working directory, timestamp, program root directory, config file location
  2. Read in the config file (unit.ini)
  3. Scan all the directories in the unit test folders looking for files named AllTest*.xml
  4. Execute phpunit on each XML file, using the --log-junit switch to generate an XML log file for each test.
  5. Execute my merge.php script to take the individual XML log files and merge them into a single XML file. The script also executes an XSL transformation on the log file to generate an HTML output file.
  6. Open the HTML output file in Firefox for viewing.
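Steps 3 and 4 could be sketched in PHP roughly as follows. This is not the batch file itself: the function names findSuiteFiles() and buildCommands() are hypothetical, and the sketch only builds the phpunit command lines rather than running them.

```php
<?php
// Hypothetical helper: recursively find suite files named AllTest*.xml.
function findSuiteFiles($root)
{
    $found = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && fnmatch('AllTest*.xml', $file->getFilename())) {
            $found[] = $file->getPathname();
        }
    }
    sort($found); // stable order, so log numbering is repeatable
    return $found;
}

// Hypothetical helper: build one phpunit command per suite file,
// each writing its own JUnit log (log-1.xml, log-2.xml, ...).
function buildCommands(array $suiteFiles, $logDir)
{
    $commands = array();
    foreach ($suiteFiles as $i => $suiteFile) {
        $log = $logDir . DIRECTORY_SEPARATOR . 'log-' . ($i + 1) . '.xml';
        $commands[] = 'phpunit --log-junit ' . escapeshellarg($log)
            . ' --configuration ' . escapeshellarg($suiteFile);
    }
    return $commands;
}
```

Each command string could then be handed to something like passthru(), one suite at a time. Running every suite in its own phpunit process is what sidesteps the “Cannot redeclare class” conflict between stubs.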

You will notice several files that I listed in there. Here’s what they are:

  1. unit.ini: a basic configuration file. This was an attempt to handle as much configuration as possible outside the batch file. Moderately successful.
  2. merge.php: a PHP script that reads in all the phpunit-generated log files, merges them together into a single XML log file and executes an XSL transformation on the final log file.
  3. plain.xsl: a modified version of the JUnit stylesheet. Used to transform the log.xml file into a log.html file for easy viewing.
  4. log-x.xml: the various XML files output by PHPUnit. One for each AllTest*.xml file that it encounters.
  5. log.xml: the merged version of the individual xml files.
  6. log.html: an easy to view version summary of the phpunit output
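The merge step could be sketched like this. It is not the merge.php from the download: the function name mergeJunitLogs() is hypothetical, and it assumes each log has a <testsuites> root element wrapping one or more <testsuite> elements.

```php
<?php
// Hypothetical helper: merge several JUnit-style XML logs into one document.
// Assumes each log has a <testsuites> root containing <testsuite> elements.
function mergeJunitLogs(array $xmlStrings)
{
    $merged = new DOMDocument('1.0', 'UTF-8');
    $root = $merged->createElement('testsuites');
    $merged->appendChild($root);

    foreach ($xmlStrings as $xml) {
        $log = new DOMDocument();
        if ($xml === '' || !$log->loadXML($xml)) {
            continue; // skip empty or malformed logs
        }
        // Copy every top-level <testsuite> into the merged document.
        foreach ($log->documentElement->childNodes as $node) {
            if ($node->nodeType === XML_ELEMENT_NODE && $node->nodeName === 'testsuite') {
                $root->appendChild($merged->importNode($node, true));
            }
        }
    }
    return $merged->saveXML();
}
```

The XSL step is left out of the sketch. PHP's XSLTProcessor class, which needs the xsl extension, would be the natural place for it.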

You can take a look at each of these files. Once everything is said and done, I end up with a very useful summary of my phpunit tests. It’s actually more useful than the output that is shown in Zend Studio (v8 beta2).

It might also be a good idea to create a customized version of your php.ini file that has at least the php_xsl extension turned on, as well as any other settings required for your environment.

Everything is pretty rough, but it works well enough. The same technique could easily be ported over to a shell script. It’s now saving me a lot of time and helps me quickly validate my entire library in a single execution.

Get the project files here:

phpUnitAllTestWorkAround_v1.0_20101022
