Collections

At this point, you have worked with collections a little bit. This chapter will delve into them in more detail. A collection is a group of records: it can be either a description of which models to fetch from the database or a list of records that are already in memory.

Iterating through Records

The most common thing to do with a collection is to get the records within it. Collections behave the same way as Python lists, so you can iterate through them just as you would any other list:

>>> from intro import *
>>> users = User.all()
>>> for user in users:
...     print user.get('username')
'john'
'jane'
...

Collection.records

Underneath the hood, this is actually calling the records method for iteration. The same could have been written as:

>>> from intro import *
>>> users = User.all()
>>> for user in users.records():
...     print user.get('username')
'john'
'jane'
...

Why is it important to know that? The first time a collection's records method is called, it checks whether the collection has already been loaded from the database or whether it needs to go and load itself. Every subsequent call returns the data that was loaded previously.
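
The load-once behaviour can be sketched without ORB at all. The example below is only an illustration: LazyRecords and fake_query are hypothetical names, and the "query" is a plain Python function rather than a real backend call.

```python
class LazyRecords(object):
    """Hypothetical stand-in for a collection that loads itself once."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that simulates a database query
        self._cache = None       # None means "not loaded yet"

    def records(self):
        # Load on first access, reuse the cached list afterwards.
        if self._cache is None:
            self._cache = list(self._fetch())
        return self._cache

    def __iter__(self):
        # Iteration goes through records(), as described above.
        return iter(self.records())


queries = []

def fake_query():
    queries.append('SELECT')     # remember that a "query" happened
    return ['john', 'jane']

users = LazyRecords(fake_query)
first_pass = [name for name in users]   # triggers the single load
second_pass = users.records()           # served from the cache
```

After both passes, queries holds a single entry, which is the point of the caching.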

Collection.iterate

Often, this is a perfectly acceptable way to iterate through a list of users. Sometimes, however, particularly with large data sets, you will want to speed up the process by loading only a subset of the records at a time. You can do this with the iterate method:

>>> from intro import *
>>> users = User.all()
>>> for user in users.iterate():
...     print user.get('username')
'john'
'jane'
...

Great. So what is different here?

In the call to records, all the records of the collection are fetched first, then iterated through.

In the call to iterate, the results are batched. The iterator accepts an optional batch keyword argument. As you loop through and the iterator reaches the end of its current batch, it selects the next subset of data. The default batch size is 100 records, so a new query is sent to the backend every 100 records. You can change the batch size with the batch keyword:

>>> from intro import *
>>> users = User.all()
>>> # select 10 users at a time vs. 100
>>> for user in users.iterate(batch=10):
...     print user.get('username')
'john'
'jane'
...
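
As a rough sketch of what batching means, the generator below pulls records in fixed-size chunks using a start and a limit. It is not ORB's implementation: the iterate function and the fetch_page helper are hypothetical, and the "table" is an in-memory list.

```python
def iterate(fetch_page, batch=100):
    """Yield records one at a time, fetching `batch` records per query."""
    start = 0
    while True:
        chunk = fetch_page(start, batch)   # one simulated backend query
        for record in chunk:
            yield record
        if len(chunk) < batch:             # short chunk: nothing left to fetch
            break
        start += batch


table = ['user%d' % i for i in range(25)]
calls = []

def fetch_page(start, limit):
    calls.append((start, limit))           # log each simulated query
    return table[start:start + limit]

names = list(iterate(fetch_page, batch=10))
```

With 25 records and a batch of 10, the sketch issues three queries, and only one batch is held at a time.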

Pagination

As with iteration, you can also limit the records you fetch via paging. Paging with ORB is very easy -- just define the size of the page and which page you want to get. Assume we have a database with 100 users, with IDs 1-100. Querying that using paging would look like:

>>> from intro import *
>>> users = User.all()
>>> print len(users)
100
>>> print users.pageCount(pageSize=25)
4
>>> page_1 = users.page(1, pageSize=25)
>>> page_2 = users.page(2, pageSize=25)
>>> print page_1.first().id(), page_2.first().id()
1 26

If you want to pre-define your page size, you can do so in the all or select methods as well:

>>> from intro import *
>>> users = User.all(pageSize=25)
>>> page_2 = users.page(2)
>>> print page_2.context().start, page_2.context().limit
25 25

A pre-defined page size is passed via the context, which we will cover in more detail later.

Start and Limit

Pagination is often convenient to work with, but what really drives it is the start and limit options. These determine which record number to start at and how many records to fetch. Paging works by saying "limit to pageSize, and start at (page number - 1) * pageSize", so page 2 with a pageSize of 25 starts at record 25. You can set these values directly by passing start and limit to select:

>>> from intro import *
>>> page_2 = User.select(start=25, limit=25)
>>> print page_2.context().start, page_2.context().limit
25 25

The above is equivalent to page 2 in the paging structure.
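
The arithmetic behind paging is small enough to write out. The helpers below are hypothetical (they are not ORB functions) and assume 1-based page numbers, which matches the examples in this chapter where page 2 with a pageSize of 25 starts at record 25.

```python
import math

def page_count(total, page_size):
    """Number of pages needed to hold `total` records."""
    return int(math.ceil(total / float(page_size)))

def page_bounds(page, page_size):
    """Translate a 1-based page number into a (start, limit) pair."""
    return (page - 1) * page_size, page_size


pages = page_count(100, 25)
page_1 = page_bounds(1, 25)
page_2 = page_bounds(2, 25)
```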

Slicing

In addition to paging and manually setting start/limit, collections also support Python's slicing syntax. Before a collection has fetched records from the backend, you can slice it down and that will update the start and limit values accordingly.

To represent page 2 again using slicing:

>>> from intro import *
>>> users = User.all()
>>> page_2 = users[25:50]
>>> print page_2.context().start, page_2.context().limit
25 25
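
One way to picture the slice translation is the helper below. It is a sketch under stated assumptions, not ORB's code: slice_to_start_limit is a hypothetical name, and stepped or negative slices are deliberately left out.

```python
def slice_to_start_limit(s):
    """Convert a simple slice object into a (start, limit) pair."""
    if s.step not in (None, 1):
        raise ValueError('stepped slices are not handled in this sketch')
    start = s.start or 0
    # An open-ended slice has no limit; otherwise the limit is the span.
    limit = None if s.stop is None else max(s.stop - start, 0)
    return start, limit


page_2 = slice_to_start_limit(slice(25, 50))     # same as users[25:50]
first_ten = slice_to_start_limit(slice(None, 10))
rest = slice_to_start_limit(slice(5, None))
```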

Reading values only

Sometimes, all you need are the values from a collection rather than full object models. You can improve performance by requesting only the data that your system needs from the backend. To do this, you can ask specifically for values -- or even just unique values -- through the collection.

>>> users = User.all()
>>> print users.values('username')
'john'
'jane'
...

If you request only a single column, you will get a list containing the values. If you request more than one column, you will get a list of lists back, where each inner list follows the order of the columns requested.

>>> users = User.all()
>>> print dict(users.values('id', 'username'))
{
  1: 'john',
  2: 'jane',
  ...
}

If you are looking at non-unique columns and just want a sense of all the unique values within a column, you can use the distinct method to fetch only unique entries:

>>> users = User.all()
>>> print sorted(users.values('first_name'))
'Jane'
'John'
'John'
...
>>> print sorted(users.distinct('first_name'))
'Jane'
'John'
...

As you can see, the values method will return all values for a column within your collection -- including the duplicate first name of 'John'. Using distinct, we just get one instance per value.
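
The shape of the return values can be mimicked with plain dictionaries. The values and distinct functions below are hypothetical stand-ins that operate on an in-memory list; they only illustrate the single-column versus multi-column result shapes and the removal of duplicates.

```python
def values(rows, *columns):
    """Single column: flat list. Several columns: list of lists."""
    if len(columns) == 1:
        return [row[columns[0]] for row in rows]
    return [[row[c] for c in columns] for row in rows]

def distinct(rows, column):
    """Return each value of `column` once, in first-seen order."""
    seen = []
    for value in values(rows, column):
        if value not in seen:
            seen.append(value)
    return seen


rows = [
    {'id': 1, 'username': 'john', 'first_name': 'John'},
    {'id': 2, 'username': 'jane', 'first_name': 'Jane'},
    {'id': 3, 'username': 'johnny', 'first_name': 'John'},
]

usernames = values(rows, 'username')
by_id = dict(values(rows, 'id', 'username'))   # pairs convert to a dict
first_names = sorted(distinct(rows, 'first_name'))
```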

Note:

If you request a reference by name from the collection (role in the example below), it will return an inflated model object unless you specify inflate=False. If you request the underlying field (role_id), you will get the raw reference value.

>>> users = User.all()
>>> users.values('role')
<intro.Role>
<intro.Role>
...
>>> users.values('role_id')
1
1
...

Grouping

As we just showed, you can create a dictionary based on the returned data from the values method. You can also use the built-in grouped method for a collection to group data via multiple columns into a nested tree structure.

>>> users = User.all()
>>> users.grouped('role')
{<intro.Role>: <orb.Collection>, <intro.Role>: <orb.Collection>, ..}

If you pass in the optional preload=True keyword, this method will fetch all the records in the database first, and then group them together. The resulting collections within the groups will be already loaded and no further queries will be made. This is useful if you are not loading a large set of data.

If you choose preload=False (which is the default), then only the values of the columns will be selected. The resulting collections within the group will be un-fetched collections with query logic built in to filter based on the given column values that group them. This will result in more queries, but smaller data sets, which could be useful for generating a tree for instance that would load children dynamically.
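
The trade-off between the two modes can be sketched with an in-memory list. The grouped function below is hypothetical and much simpler than the real method: with preload=True each group holds its records, and with preload=False each group holds only a filter that would be run later.

```python
from collections import OrderedDict

def grouped(rows, column, preload=False):
    groups = OrderedDict()
    if preload:
        # One pass over fully loaded rows; each group holds its records.
        for row in rows:
            groups.setdefault(row[column], []).append(row)
    else:
        # Only the column values are read; each group holds a deferred
        # filter standing in for an un-fetched collection.
        for row in rows:
            key = row[column]
            if key not in groups:
                groups[key] = lambda r, key=key: r[column] == key
    return groups


rows = [
    {'username': 'john', 'role': 'admin'},
    {'username': 'jane', 'role': 'editor'},
    {'username': 'jack', 'role': 'admin'},
]

eager = grouped(rows, 'role', preload=True)
lazy = grouped(rows, 'role')
# Evaluating a deferred group later costs one more "query" per group.
admins_later = [r for r in rows if lazy['admin'](r)]
```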

Retrieving Records

Short of iterating through the entire record set, there are a few different ways that you can retrieve a record from a collection.

Collection.first and Collection.last

Often, you will want to get the first and/or last record of a collection. Maybe you are just looking at the last comment that was made in a thread, or the original edit made to a record. The first and last methods will allow you to do this easily:

>>> comments = Comment.all()
>>> # get the first comment made
>>> first_comment = comments.first()
>>> # get the last comment made
>>> last_comment = comments.last()

How does the system know what is first or last? If not specified, it will default to ordering by the records' IDs. First will return the record with the lowest ID, while last will return the one with the highest.

This doesn't work for non-sequential IDs, however, or if you want to sort based on some other criteria. To manage this, simply provide the ordering information:

>>> comments = Comment.all()
>>> comments.ordered('+created_by').first()
>>> comments.ordered('+created_by').last()

The ordered method will return a copy of the collection with a new order context (see the Contexts page for more detail on ordering).

You could also pre-order the collection:

>>> comments = Comment.all(order='+created_by')
>>> comments.first()
>>> comments.last()
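
The ordering rules can be illustrated with a small in-memory sketch. The ordered, first and last functions below are hypothetical. The '+' prefix comes from the examples above; treating a '-' prefix as descending is an assumption made for this illustration.

```python
def ordered(rows, spec):
    """Sort rows by a spec such as '+created_by' ('-' assumed descending)."""
    descending = spec.startswith('-')
    field = spec.lstrip('+-')
    return sorted(rows, key=lambda row: row[field], reverse=descending)

def first(rows, spec='+id'):
    # Default ordering is by ID, as described above.
    result = ordered(rows, spec)
    return result[0] if result else None

def last(rows, spec='+id'):
    result = ordered(rows, spec)
    return result[-1] if result else None


comments = [
    {'id': 3, 'created_by': 'bob'},
    {'id': 1, 'created_by': 'carol'},
    {'id': 2, 'created_by': 'alice'},
]

lowest_id = first(comments)
highest_id = last(comments)
first_author = first(comments, '+created_by')
```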

Collection.at

You can also look up a record by its index in the collection, using either the at method or the collection's __getitem__ (square bracket) syntax:

>>> comments = Comment.all()
>>> first = comments.at(0)
>>> last = comments.at(-1)
>>> first = comments[0]
>>> last = comments[-1]

Warning:

This is more expensive than using first and last because it will query all the records and then return the record by index. You should only use this if you already have, or are going to get, all the records.

For instance, you may want to do something like:

>>> users = User.all()
>>> for i in range(len(users)):
...     user = users[i]
...     print user.get('username')
'john'
'jane'
...

In this case, the first time users[i] is evaluated, all the records are looked up. Every subsequent lookup by index is served from the already cached records, and no more database calls are made.

Modifying Records

After a collection has been defined, there are a number of ways to modify the records.

Collection.refine

The most common modification is to refine a selection. This lets you derive narrower selections from a base collection and reuse them.

>>> users = User.all()
>>> johns = users.refine(where=Q('first_name') == 'John')
>>> janes = users.refine(where=Q('first_name') == 'Jane')

The refined collection joins the query of the base collection with any modifications and is returned as a new collection, leaving the base collection unchanged.
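
The idea of joining a base query with a refinement can be sketched with plain predicates. Selection is a hypothetical class, and Python functions stand in for Q objects; the sketch only shows that each refine call combines conditions and leaves the original untouched.

```python
class Selection(object):
    """Hypothetical collection holding rows plus a filter predicate."""

    def __init__(self, rows, where=None):
        self._rows = rows
        self._where = where or (lambda row: True)

    def refine(self, where):
        base = self._where
        # Combine the existing condition with the new one (logical AND).
        return Selection(self._rows, lambda row: base(row) and where(row))

    def records(self):
        return [row for row in self._rows if self._where(row)]


rows = [
    {'first_name': 'John', 'active': True},
    {'first_name': 'John', 'active': False},
    {'first_name': 'Jane', 'active': True},
]

users = Selection(rows)
johns = users.refine(lambda row: row['first_name'] == 'John')
active_johns = johns.refine(lambda row: row['active'])
```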

Collection.ordered

>>> users = User.all()
>>> alpha = users.ordered('+username')

The ordered method will update the order by properties for the collection's context and return a newly updated collection set.

Collection.reversed

>>> users = User.all()
>>> alpha = users.ordered('+username')
>>> rev_alpha = alpha.reversed()

The reversed method will invert the order by property and return a new collection set in reverse order.
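
Inverting an order can be pictured as flipping the direction prefix on each ordering term. The reverse_order helper below is hypothetical, and it assumes that '-' marks descending order and that several terms may be separated by commas; neither detail is confirmed by this chapter.

```python
def reverse_order(spec):
    """Flip '+field' to '-field' and back for every term in the spec."""
    flipped = []
    for part in spec.split(','):
        part = part.strip()
        if part.startswith('-'):
            flipped.append('+' + part[1:])
        else:
            # A bare field name is treated as ascending.
            flipped.append('-' + part.lstrip('+'))
    return ','.join(flipped)


rev_alpha = reverse_order('+username')
back_again = reverse_order(rev_alpha)
```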

Editing Records

In addition to using collections as a means of selecting and retrieving records, they are also used for bulk creation, editing and deletion.

Collection initialization

So far, we've seen two ways of accessing collections -- from the Model selectors (all and select) and from collectors. You can also just create an empty collection.

>>> empty = orb.Collection()

When you create a new blank collection, it will be null by default (no information whatsoever).

>>> empty = orb.Collection()
>>> empty.isNull()
True

You can also initialize a collection with a list of models.

>>> user_a = User({'username': 'tim', 'password': 'pass1'})
>>> user_b = User({'username': 'tam', 'password': 'pass2'})
>>> users = orb.Collection([user_a, user_b])

What can you do with this? Working and iterating through these records in memory is really not much different than using a regular list. However, you can use this to save these records to your backend data store with a single save call now:

>>> users.save()

Calling the save method will create 2 new records in the data store.

Collection.add

In addition to initialization, you can dynamically add records to collections using the add method. One thing to note - this method is context aware, meaning based on where the collection came from, it will behave slightly differently.

>>> users = orb.Collection()

>>> users.add(User({'username': 'tim', 'password': 'pass1'}))
>>> users.add(User({'username': 'tam', 'password': 'pass2'}))

>>> users.save()

This example starts with a blank collection, and then adds users to a list in memory, and then saves. This is functionally the same as initializing the collection with a list of records.

When you access a collection via a collector, such as a ReverseLookup or Pipe, the add method will actually associate the given record in the data store directly.

>>> group = Group.byName('admins')
>>> john = User.byUsername('john')

>>> # add a new record through a pipe
>>> john.groups().add(group)

>>> # add a new record through a reverse lookup
>>> role = Role.byName('Software Engineer')
>>> role.users().add(john)

In these examples, the add method will dynamically associate the records through the collectors in the data store. These calls do not require a save afterwards.

For a pipe, a new record within the intermediary table is created (in this case, a new GroupUser record is created with the user and group relation). If the relation already exists a RecordNotFound will be raised.

For a reverse lookup, the add method will set the reference instance to the calling record. In this example, we have a Role model, a role reference on the User model, and a ReverseLookup that goes through the User.role column. Calling role.users().add(john) would set the role column on john to the calling role instance, and save the record. Again, this does not require saving afterwards.
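
The difference between the two collectors can be shown with plain dictionaries standing in for tables. This is an in-memory illustration only: pipe_add and reverse_lookup_add are hypothetical helpers, and group_user plays the role of the intermediary GroupUser table.

```python
group_user = []    # intermediary "GroupUser" rows

john = {'id': 1, 'username': 'john', 'role_id': None}
admins = {'id': 10, 'name': 'admins'}
engineer = {'id': 5, 'name': 'Software Engineer'}

def pipe_add(user, group):
    # A pipe add creates a row in the intermediary table.
    group_user.append({'user_id': user['id'], 'group_id': group['id']})

def reverse_lookup_add(role, user):
    # A reverse lookup add sets the reference column on the added record.
    user['role_id'] = role['id']


pipe_add(john, admins)
reverse_lookup_add(engineer, john)
```

Neither call creates a new user, group or role; only the association changes.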

Collection.create

The create method will act the same way as the add method, except instead of accepting a record, it accepts a dictionary of column/value pairs and will dynamically create a record vs. associate it.

For this method to work, you will need to have defined what model is associated with a collection. If the collection is returned as a part of a collector, then it will automatically be associated.

>>> # create a generic user
>>> users = orb.Collection(model=User)
>>> users.create({'username': 'tim', 'password': 'pass1'})

>>> # create a pipe record, and associate it
>>> user = User.byUsername('john')
>>> user.groups().create({'name': 'reporters'})

>>> # create a reversed lookup
>>> role = Role.byName('Software Engineer')
>>> role.users().create({'username': 'tam', 'password': 'pass2'})

In these examples, we first created a basic User record. The second example will create a Group called 'reporters', and then associate it with the user john. The final example will create a new user, automatically setting its role to the 'Software Engineer' record.

Creating vs. Updating

The save function on collection can be used for bulk updates in addition to bulk saves. If you are dealing with multiple records, some of which are new, and some of which are updates, the system will automatically separate them out for you and create two queries.

>>> users = User.all()
>>> users.at(0).set('username', 'peter')
>>> users.at(1).set('username', 'paul')
>>> users.add(User({'username': 'mary', 'password': 'pass2'}))
>>> users.save()

This will modify 2 records and create 1 -- with a total of 2 queries that will be made to the backend data store.
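
The split into two queries can be pictured as a simple partition. The plan_save function below is a hypothetical sketch: it assumes a record without an id is new and uses a made-up 'changed' flag to mark edited records, which is not how ORB tracks changes internally.

```python
def plan_save(records):
    """Split records into one bulk insert and one bulk update."""
    creates = [r for r in records if r.get('id') is None]
    updates = [r for r in records
               if r.get('id') is not None and r.get('changed')]
    plan = []
    if creates:
        plan.append(('insert', creates))
    if updates:
        plan.append(('update', updates))
    return plan


records = [
    {'id': 1, 'username': 'peter', 'changed': True},
    {'id': 2, 'username': 'paul', 'changed': True},
    {'id': 3, 'username': 'luke', 'changed': False},   # untouched, skipped
    {'id': None, 'username': 'mary', 'changed': True},
]

plan = plan_save(records)
```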

Deleting Records

Collection.delete

Bulk deletions can also be performed using collections. Again, depending on where the collection comes from, different deletion options will be available to you.

>>> users = User.select(where=Q('first_name') == 'John')
>>> users.delete()

Calling this will delete all the users whose first name is 'John' from the database in one go.

If you are working with the results of a collector (reverse lookup or pipe), you have a few more options for removing records.

Collection.remove

>>> # remove from a pipe
>>> g = Group.byName('editors')
>>> u = User(1)
>>> u.groups().remove(g)

>>> # remove from a reverse lookup
>>> r = Role(1)
>>> r.users().remove(u)

In this example, removing the 'editors' group from the pipe will not delete the group record. Rather, it will remove the record from the intermediary table. Both the user and group will still exist at the end of the remove.

The second example, removing from a reverse lookup, will not actually remove either the Role or the User record, but instead will set the User.role column value to null. This is how records are found for reverse lookups, so removing is simply setting the value to an empty value.

Collection.empty

This method is closer to Collection.remove than to Collection.delete. It removes the intermediary table records for pipes, disassociates reverse lookups from their references, and does nothing for empty collections.

>>> # empty a pipe
>>> u = User(1)
>>> u.groups().empty()

>>> # empty a reverse lookup
>>> r = Role(1)
>>> r.users().empty()

In these two examples, the first removes all GroupUser records where the user_id is 1; it does not remove any groups or users. The second does not remove any records at all. Instead, it updates all User records whose role_id is 1 and sets that column to None.

The next time groups() or users() is called for these records, the resulting collection will be empty.
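
The effect of remove and empty on the underlying data can be sketched with in-memory lists. The helpers below are hypothetical and only mirror the behaviour described above: pipes lose intermediary rows, and reverse lookups have their reference column cleared.

```python
group_user = [
    {'user_id': 1, 'group_id': 10},
    {'user_id': 1, 'group_id': 11},
    {'user_id': 2, 'group_id': 10},
]
users = [
    {'id': 1, 'role_id': 1},
    {'id': 2, 'role_id': 1},
    {'id': 3, 'role_id': 2},
]

def pipe_remove(user_id, group_id):
    # Drop one association; the user and group themselves are untouched.
    group_user[:] = [row for row in group_user
                     if (row['user_id'], row['group_id']) != (user_id, group_id)]

def pipe_empty(user_id):
    # Drop every association for this user.
    group_user[:] = [row for row in group_user if row['user_id'] != user_id]

def reverse_lookup_empty(role_id):
    # No rows are deleted; the reference column is cleared instead.
    for user in users:
        if user['role_id'] == role_id:
            user['role_id'] = None


pipe_remove(1, 10)
after_remove = list(group_user)
pipe_empty(1)
reverse_lookup_empty(1)
```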

How expensive are collections?

As we discussed in the Models chapter -- doing something like User.all() will return a collection of all the user records.

But what does that mean?

>>> from intro import *
>>> users = User.all()

At this point, the users variable does not actually have any records in it. It just has the knowledge that, when asked, it should go and fetch all User records and return them. Depending on the action you take next, ORB will optimize how it interacts with your backend store.

For instance, if all I care about is the number of users I have, and not the users themselves, doing:

>>> users = User.all()
>>> len(users)

Will query the database for just the count of the users based on the selection criteria (in this case, all records) and return that number.

But doing:

>>> users = User.all()
>>> list(users)
>>> len(users)

Will query the database for all the user records and cache the results. After that, calling len will not query the database at all -- it will determine that the records are already loaded, and will instead just return the length of the loaded list.
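
The two code paths for len can be sketched as follows. CountingCollection is a hypothetical class whose log attribute records which simulated query ran; it is meant only to show why the order of operations changes the cost.

```python
class CountingCollection(object):
    """Hypothetical collection that logs the queries it would issue."""

    def __init__(self, table):
        self._table = table
        self._loaded = None
        self.log = []

    def __len__(self):
        if self._loaded is not None:
            return len(self._loaded)   # already cached: no query at all
        self.log.append('COUNT')       # cheap count-only query
        return len(self._table)

    def __iter__(self):
        if self._loaded is None:
            self.log.append('SELECT')  # full fetch, cached afterwards
            self._loaded = list(self._table)
        return iter(self._loaded)


table = ['john', 'jane', 'jack']

count_only = CountingCollection(table)
n = len(count_only)

loaded = CountingCollection(table)
records = [user for user in loaded]
m = len(loaded)
```

In the first case only a count is logged; in the second, the single full fetch also answers the later len call.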

It is important to understand what functions are available for collections, but it is also important to keep in mind their relative expense.

Warning:

Each time Model.all or Model.select is called, a new collection object will be returned. This means all previous caching will not exist for the new instance.