Archive | databases RSS feed for this section

Part I: Is There a Such Thing as Real World Haskell?

8 Sep

Part II:   How not to Start Your Haskell Program >

Here’s a bit of sardonic code, that I’d like to propose to any Haskell advocate out there.

data Works = Works | Does_not
computerApp a ::  Maybe a -> Works
computerApp a
     | isJust a = Works
     | otherwise = Does_not

I have been playing around with functional programming in Haskell.    I have to say that it has more than certainly improved my ability to code in other languages, and probably has reduced the number of bugs I have to fix after the fact.    On the other hand, it has driven me absolutely batty.

To be fair, I need to say that I am not a computer engineer.    I have a BA in English.    My Masters are in Public Administration and Information Management.      I engaged in Haskell code simply as a curiosity and a challenge.    I love math, and became curious about Monads and Lambda calculus.    I am probably not smart enough to be a great Haskell programmer.   However, I do understand two things.    1)  Not-smart-enough people can and want to participate in application development   2)  Coders, while making apps that do what they expect them to do, do not always understand (or care) about the sustainability and/or scalability of their code.

Web Development is an important test case.      Just about anyone, with a reasonable amount of time and effort, can learn to develop a website in PHP, probably supported by some content management system as Drupal or ModX.    Somewhere, their development goes overboard, the system does an upgrade to support some security risk or vulnerability, and ‘pop’ –>  all that likely un-documented and messy code goes nowhere and wheels need to be reinvented.

That’s why learning Haskell is probably a good idea.    Without getting into the code itself, it insists that a function always causes the same result to happen with any given input(s).    Once developed, the documentation pretty much always exists in a minimal form (via Type declarations).    So many bad habits would disappear if only people were forced into developing this way.

The problem, unfortunately, is that Haskell coding is confusing.     There is no popular development framework to use it.     Once you try to apply the examples provided in text books to real world development, things go wonky.    I won’t go into the many reasons why, but I do have an observation based on what I’ve seen in responses from various gurus to newbies like me.

It’s this ->   Users think computers do things.    Computer engineers think computers solve problems.     In Haskell terms, any interaction between users and engineers results in a type error.    Somewhere along the line, an IO() monad needs to be created to turn what engineers like about Haskell into something that users will like about it.

I would like to propose a management framework, similar to extreme programming, to manage the development of functional code for regular people.   While Programming it in Haskell is not a bad start, it uses a problem solving model, rather than a ‘how do you make the software do x’ model.    It focuses on mathematical abstractions rather than simple actions.     For instance, I would like to see a book that uses the development of a rogue-like rpg game in Haskell as an example.    Instead of worrying about efficient computation, abstractions about ‘laziness’ and recursive factorial examples, the writer would have to focus on managing complex (a tuple of a lists of tuples) types, worrying about random numbers and IO issues that are inherent to Haskell.   In approaching such a game, should I worry about creating newtypes first, or work from what I want main to do and fill in the gaps?

But while I make this suggestion, I really have no idea of what kind of advice I can offer your typical new-to-haskell coder.     But I have some hypotheses:

  • work from the main :: IO() first and build a framework of functions to develop your outputs.
  • possibly create type variables for each of your functions, making it equal in Type to a typical output you would like to see.    Then work backwards from there to create a lazy output, then involve possible recursion and so on.
  • use generic types (eg. Int, String, Char etc.) with comments first, then develop types to make your code more clear.
  • unit tests should include the System.IO.Unsafe module (cheating should be allowed when you are testing your code – let the learning happen when you are developing real code)

I’ll add what I can as I continue to learn more about coding in Haskell.    The bottom line is that I think more people should be coding in a language like Haskell, but they are unlikely to work with it if they end up spending a bajillion hours just to get it to choose randomly from a list of monsters (for example).   Especially when they can learn how to do the same in three minutes using an imperative language like Python.

For the greater good and more sustainable code overall, what high-level tips or approaches can you offer any newbie coders of Haskell, so they can develop without becoming absolutely bogged down in failure with their Haskell programs?

UPDATE:   After writing this, I found a great powerpoint tutorial by Graham Hutton that uses Hangman as an example of interactive Haskell code development.

What is a Database, really? Data Storage for Librarians

23 Oct

When a librarian talks about a “database” he or she usually means a website that will let you search for information not normally found in Google.   A techy would probably correct on this and say “actually, Google is a database too.”    And a real pedant would say something horrible like “actually, both of you are wrong.   What you are talking about are interfaces that USE a database to store, modify and retrieve data.”

*** Yawn! ***

The problem is that we have jargon based on jargon that gets passed around until it means nothing to nobody.    Here is my attempt to explain my pedant for you, and I will do it by answering the question:  “What are my options for storing data?”    Ooops, I’m hearing my inner pedant again.   What are my options for storing electronic data?” would be more precise.   Then I will explain why a relational database is one of the best options available for doing so.

So, how can you store data?   Here are the options:

A Text File

Good old text is still not a bad way to store data.   In fact, if you count scripting code like PHP, a heck of a lot of data is actually stored as plain text.    It’s all pretty simple.   You type some information in a text file and then save it on your harddrive.

If you want to retrieve that information, you can use a script to tell the computer to organize the information in a particular way.   This is called parsing aka manipulating data according to specific rules.

There’s only one problem:  you need ways to separate some kinds of information from others.   Here are some possibles:

  • spaces — this is not bad, except you will not be able to distinguish phrases from plain words.  (eg. “cracker jack in the box of tissues” parsed using only spaces would not help us make sense of the data).
  • commas — this is precisely what csv (comma separated values) does — separate each data cell with a comma.
  • indents/carriage returns — this is slightly better than the others, but still quite inadequate.   How will distinguish a paragraph from a list of things, for instance?

Text is good for basic data storage, but it does not give us much flexibility in its use.   For instance, what if we wanted to classify the data into groups like cats (siamese, manx, tabby)?   In order to do this, we need a more complex structure.

Structured Text

As you climb the data food chain, complex systems get developed to organize information.   Basically, this means you use symbols to help a computer understand how the data can be structured.   There are many subsets to structured text.

  • Mark-up:   Marked information is data that has marks or signals to let a computer distinguish one type of data from the other.   Python and C computer languages use a # to indicate that the information in the line following is intended to be read by humans only.  Javascript Object Notation (JSON) uses quotation marks to encase data, and squiggly brackets {} to show datasets that should be grouped.
  • A variable:   A variable happens when you give some data a name.   In PHP, $pie = 3.14; would assign the value 3.14 to the name “pie” which could be manipulated later on.
  • A string:   A string is a kind of variable that has text.   For instance in PHP you can say $myName = “Ryan”;.
  • An array:   An array is a type of variable that includes a list used by computer programs for later manipulation.     $myBooks = array(“the blue one”, “the red one”, “the yellow one”); would provide a numbered list that could be manipulated and organized.
  • An object:   Explaining objects in full requires that someone read up on object-oriented programming.   But one thing you can do with an object is store a broad taxonomy like animals->[cats->[siamese, manx, tabby], dogs->[collie, poodle->[toy, miniature, standard], doberman]].

Tree-based  Structures

Tree-based structures, or cluster models are a subset of “structured text” data storage models.   XML is probably the best example.   In tree-based stuctures, data is organized in “parents”, “siblings” and “children”.   Library-thesaurus addicts might understand these better as “broader, related and narrower” terms.   Take the following code:
<libraries>
   <library>
     <libraryname>Halifax Public Libraries</libraryname>
     <librarybranch>
       <librarybranchname>Alderney Gate Library</librarybranchname>
     </librarybranch>
   </library>
</libraries>

In this example, “librarybranch” is a child of “library” which is a child of “libraries.”  Conversely, “libraries” is the parent of “library” and so on.   If I wanted, I could add another child “library” within “libraries” and call it “New York Public Library” or “Ann Arbor Library” or whatnot and then add some respective branches.

The benefit to this kind of framework is that I use an XML parser to extract information such as “tell me all the branches of libraries that begin with the letter “A.”   It is useful because the information is grouped into nice categories (like “Libraries” and “branches”).

The downside to all this is the sheer verbosity of it all.   Say I add a parent to this tree file and call it “organizations.”   Then I add a whole list of other organization types like “hospitals,” “businesses,” and “schools.”   Depending on the scope of my data, this file could be huge in size and, even if you just want to find information about “libraries” you’d have to load up the entire file everytime you did a search (yes, programmers, I know there are ways around this, but they are generally quite complicated ways).

Also, tree-based structures do not always account for data anomalies.   For instance, a library could be a branch department in a hospital, business or school.   So, do you put the “library” underneath “school” or leave it on its own.   Or maybe you do both?    Ahh– but then you have to spend alot of time maintaining data in two, three or even more places.   In database terms, this is called redundancy.

The Relational Model

The relational model is better than a tree-model when your dataset is large and complicated.    The way it works is, instead of thinking in terms of “parents, siblings and children,” you think in term of relationships.   This provides alot of flexibility for your database design, and helps to reduce the amount of “places” where you will have to keep information updated.

How does it work?

I won’t go too far into database design, but there are some basic principles.   The first is called the “Primary Key.”   This means that every data object (such as a library) has a way of identifying itself in a unique way (usually through a number).    The second is called a “Foreign Key.”    The Foreign key is a way to associate one dataset (eg. libraries) with another dataset (eg. library branches, hospitals or businesses).   This association is called a relationship.    Once each entity/data object has its own unique identifier, you can make just about any relationship you want, either by creating a foreign key within your table.

Take the example of this table I’m going to call “Libraries”:

| ID   |  Type            |   Name                   |
______________________________________________________
| 1    | 1(public)        | Halifax Public Libraries |
| 2    | 1(public)        | New York Public Library  |
| 3    | 2(academic)      | Dalhousie Killam Library |
_______________________________________________________

And a related table “libraryTypes”:

| ID   |  LibraryTypeName            |
_____________________________
| 1    | Public                      |
| 2    | Academic                    |
| 3    | Special                     |
______________________________________

By using the unique identifier from table “LibraryTypes,” we are able to create a relationship between a library and a libraries purpose or type.    Once we have this relationship established, any data we include in the “Library Type” table, will automatically be associated with the library.   For instance, we could add a bit of data that defines what an “academic,” “public” or “special” library is, or define library associations that support/govern them.    Then, everytime a library is called “public” all the information pertaining to the “public” librarytype also belongs to that library.   This really helps reduce the amount of maintenance required for updating your database.

In more complicated relationships (called the Many-to-Many relationship), you might have to create a third table to associate two entities.      I won’t go into this too far, but the principle is the same — you create relationships among entities using the unique identifiers among the tables.

Within the relational database model, there are at least two different kinds of databases:

TEXT/XML Based

Text files can be organized in such a way as to accommodate some of the benefits of the relational model.   This is done by constructing keys within the XML files and associating two or more files together.    Once very popular, this is not very common anymore.   The advantage of this model is that text files are quite easy to share with others across a network.   Text files are also readable by humans.    The disadvantage is that, because the computer has to parse the text as well as compute the relationships among files, text models tend to be slower to operate, update and maintain.

Two examples of XML-based databases include Xindice and Sedna.

SQL/Binary-based Databases

The more common relational database type is a piece of software running on a server, rather than a set of text files.   They are generally accessed using a standard language called SQL (Structure Query Language), or more specifically SQL as supported by a popular scripting language like PHP, Python or Java.

What you choose in terms of an SQL-type database will depend on your needs.   Here are some examples of popular Open Source databases and why you would use them:

SQLite:   SQLite is excellent for a database that is updated by only a few people.    It is the simplest of all SQL database types, requires very little configuration and is stored in a single file, which makes it easy to backup, secure, share and manage.   Also, it is a public domain database, meaning there are hardly any restrictions on how it can be used and/or modified.    It’s major drawback is that because it comes in a single file only, it will not permit many updates from different people.   In short, SQLite would be horrible for a large scale wiki project.

MySQL:  MySQL is probably the most popular of open source databases.   It takes a little more effort to configure on a server, but it also provides more power for people to manage and update information.   MySQL handles multiple updates easily, so it is the database of choice for most CMS systems, including wikis.   THe Koha ILS uses MySQL as well.

PostgreSQL: PostgreSQL is the most powerful of databases IMHO, providing just about everything MYSQL does, with some added back-end configuration options that make it better for maintaining very large projects.   I highlight this one because it is the database behind the PINES Evergreen projects. 

 

Final Thoughts + It’s Not All Mutually Exclusive!:

This intention of this article has been to identify the ways data can be stored and to explain a little about why the Relational Database Model is so important.

That said, I neglected one important thing.   Almost any major web application will have a combination of all these types of data storage methods!    For example, a website may store its data in a MySQL database, but offer part of that database in an XML file that can be accessed by other websites.   This is called a “Web Service” or, more popularly, an API.    Configuration information for servers and computer languages like PHP is almost always stored in a plain text file.   In short, each data storage method has its own benefits and sacrifices — and that’s why it’s almost always better to diversify your data storage method.

Finally, for those who are interested in learning a little bit more about how a relational database works in the real work, I recommend the SQLZOO tutorials.  They cover SQL in pretty straight-forward ways, covering a wide range of services from Oracle to SQLlite.

Follow

Get every new post delivered to your Inbox.

Join 3,509 other followers